Runpod · 15mo ago
fcsikor

Format of video input for vLLM model LLaVA-NeXT-Video-7B-hf

Dear Discord members,

I have a question about using the vLLM template with the Hugging Face LLaVA-NeXT-Video-7B-hf model on text+video multi-modal input. Video input is a fairly new feature in the vLLM library, and I cannot find definitive information on how I should encode the input video so that the running model instance decodes it into the format it understands.

The online vLLM AI chatbot suggested a list of JPEG-encoded video frames, but that did not work. The vLLM GitHub repository gave me the impression that a NumPy array of decoded frames is the right format, but that does not work either.
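For reference, here is a minimal sketch of the NumPy-array format I tried, based on my reading of the vLLM GitHub examples. This is an assumption, not a confirmed working recipe: the array shape, dtype, and the `multi_modal_data` key are my interpretation, and the frame values below are synthetic placeholders rather than a decoded video.

```python
import numpy as np

# Assumed format (from the vLLM GitHub examples): decoded RGB frames as a
# single NumPy array of shape (num_frames, height, width, 3), dtype uint8.
# A real pipeline would decode these frames from the video file (e.g. with
# OpenCV or decord); here they are synthetic zeros for illustration only.
num_frames, height, width = 8, 336, 336
video = np.zeros((num_frames, height, width, 3), dtype=np.uint8)

# The array would then (presumably) be passed as multi-modal input, e.g.:
# llm.generate({"prompt": prompt, "multi_modal_data": {"video": video}})
print(video.shape, video.dtype)
```

This is exactly the kind of setup I could not get working, so corrections to the expected shape or key name are welcome.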

Has anyone had success in using this (or a similar) setup?

Thank you in advance,
Ferenc