I mean, is video2text possible?
The workflow would be: extract frames from the video, run image2text on them with GPT-4V, combine the resulting text with prompt engineering, store it in an embedding/vector DB, fine-tune a model on those embeddings, then generate similar images with DALL-E 3, and use a third-party API to stitch them together?
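For what it's worth, the first half of that workflow (sample frames, caption each one, combine the captions into a prompt) could be sketched roughly like this. It is only a sketch under assumptions: `caption_frame` is a hypothetical callable standing in for whatever image2text call is used (for example a GPT-4V request), and the function names, the one-frame-per-second sampling, and the prompt wording are made up for illustration.

```python
from typing import Any, Callable, Iterable, List


def sample_frame_indices(total_frames: int, fps: float, every_seconds: float = 1.0) -> List[int]:
    """Pick one frame index per `every_seconds` of video (assumed sampling rule)."""
    step = max(1, round(fps * every_seconds))  # never step by less than one frame
    return list(range(0, total_frames, step))


def captions_to_summary_prompt(captions: Iterable[str]) -> str:
    """Combine per-frame captions into a single prompt for a text model."""
    lines = [f"Frame {i}: {caption}" for i, caption in enumerate(captions, start=1)]
    return "Summarize the video described by these frame captions:\n" + "\n".join(lines)


def video_to_text_prompt(frames: Iterable[Any], caption_frame: Callable[[Any], str]) -> str:
    """Caption every sampled frame, then build the combined prompt.

    `caption_frame` is hypothetical: plug in the real image2text call here.
    """
    return captions_to_summary_prompt(caption_frame(frame) for frame in frames)


if __name__ == "__main__":
    # Dummy captioner so the sketch runs without any API access.
    indices = sample_frame_indices(total_frames=100, fps=25.0)
    print(video_to_text_prompt(indices, lambda i: f"placeholder caption for frame {i}"))
```

The actual frame extraction, embedding storage, image generation, and stitching steps are left out here since they depend on which libraries and APIs end up being used.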