Multi-modal tool results (reading images from tool result message)
AgentTools
I'm using a tool to let the LLM generate images. The problem I'm running into is that the LLM cannot "see" the image in the tool result. (I've tried with both OpenAI and Anthropic models.) For example, if I tell it to generate an image, it calls my tool and does so, but when I then ask it a question about the image, it cannot actually see the image and answer accurately.
This seems important, because using a tool seems to be the number one recommended way to have LLMs generate images in a chat.
I think there is a specific way these messages need to be structured for the LLM to be able to see an image generated by a tool call. I know that when you send an image in a user message, Mastra will structure the message in the right format for the LLM to see it. But I can't find any guidance in the Mastra documentation about how to do this when the image comes from a tool result, so I'm not sure what my tool response is supposed to look like.
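For context, here is the shape I'd expect the provider to need, based on Anthropic's Messages API docs: the tool result goes back in a user-role message as a tool_result content block whose content is itself an array of blocks, including an image block. (This is my assumption about what Mastra would need to emit under the hood, not confirmed Mastra behavior; the tool_use_id and base64 data below are placeholders.)

```typescript
// Sketch of an Anthropic-style multi-modal tool result message.
// Assumption: for the model to "see" the image, the tool result must be
// a content-block array (text + image), not a plain string.

type ImageBlock = {
  type: "image";
  source: { type: "base64"; media_type: string; data: string };
};

type TextBlock = { type: "text"; text: string };

type ToolResultBlock = {
  type: "tool_result";
  tool_use_id: string;
  content: Array<TextBlock | ImageBlock>;
};

// The message that returns the tool's output to the model.
const toolResultMessage = {
  role: "user" as const,
  content: [
    {
      type: "tool_result",
      // Must match the id from the model's tool_use block (placeholder here).
      tool_use_id: "toolu_placeholder",
      content: [
        { type: "text", text: "Here is the generated image:" },
        {
          type: "image",
          source: {
            type: "base64",
            media_type: "image/png",
            data: "<base64-encoded PNG bytes>", // placeholder
          },
        },
      ],
    } satisfies ToolResultBlock,
  ],
};

console.log(toolResultMessage.content[0].content.length); // 2 blocks
```

If a tool instead returns only a text summary (or a URL the model never fetches), the model has nothing visual to reason over, which would explain the behavior I'm seeing.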