Thanks, any feedback on my approach? I ask because the Vision API is mostly for images, not videos. I haven't tried the other way round though: can image-to-text (GPT-4V) and text-to-image (DALL-E 3) be made consistent with the new feature?