Anyone know the easiest way to finetune your own VLLM for image captioning?

Anyone know the easiest way to finetune your own VLLM for image captioning? I've already got my own dataset, but there doesn't seem to be a straightforward way to actually run the finetuning itself. I know there are already some good captioners out there, but I want to finetune my own.
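Not an official recipe, but here's roughly how I'd sketch it with Hugging Face `transformers` and a BLIP captioning checkpoint. Everything here is an assumption on my part: the model (`Salesforce/blip-image-captioning-base`), the `(image_path, caption)` data format, and the hyperparameters — swap in whatever matches your setup.

```python
# Hedged sketch: fine-tune a BLIP captioner on (image_path, caption) pairs.
# Model name, data format, and hyperparameters are assumptions, not a
# definitive recipe.
from dataclasses import dataclass


@dataclass
class CaptionExample:
    image_path: str
    caption: str


class CaptionDataset:
    """Map-style dataset; works with a torch DataLoader via __len__/__getitem__."""

    def __init__(self, examples, processor, max_length=64):
        self.examples = examples
        self.processor = processor
        self.max_length = max_length

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        from PIL import Image

        ex = self.examples[idx]
        image = Image.open(ex.image_path).convert("RGB")
        # The processor tokenizes the caption and preprocesses the image together
        enc = self.processor(
            images=image,
            text=ex.caption,
            padding="max_length",
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt",
        )
        return {k: v.squeeze(0) for k, v in enc.items()}


def train(model, dataset, epochs=3, lr=5e-5, batch_size=8, device="cuda"):
    import torch
    from torch.utils.data import DataLoader

    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    model.to(device).train()
    for _ in range(epochs):
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            # BLIP returns a language-modeling loss when labels are passed
            out = model(
                pixel_values=batch["pixel_values"],
                input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["input_ids"],
            )
            out.loss.backward()
            optim.step()
            optim.zero_grad()


# Usage (downloads weights, so left as comments):
# from transformers import BlipProcessor, BlipForConditionalGeneration
# name = "Salesforce/blip-image-captioning-base"
# processor = BlipProcessor.from_pretrained(name)
# model = BlipForConditionalGeneration.from_pretrained(name)
# pairs = [CaptionExample("images/dog.jpg", "a dog running on the beach")]
# train(model, CaptionDataset(pairs, processor))
```

If you'd rather not hand-roll the loop, the same dataset plugs into `transformers.Trainer`, and PEFT/LoRA helps a lot if the full model doesn't fit in memory.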