Hey! I'm running an OCR pipeline using olmOCR (Qwen2-VL-7B) on an NVIDIA A40 (48GB) via RunPod Serverless.
I have a high-volume use case (PDFs as base64 bytes) where the queue often hits 100+ jobs. Right now, even with 3 workers active, each worker only processes one job at a time.
I’m looking to enable continuous batching so that a single worker can pick up and process multiple PDF jobs (e.g., 4 at a time) to fully utilize the A40’s VRAM.
Has anyone successfully implemented this for multimodal/OCR workloads on Serverless? I'm specifically curious whether I need a custom async handler to keep the worker from blocking on the first PDF, or whether the platform can natively feed multiple requests into the vLLM scheduler on a single worker instance. Any tips on getting a "wide" worker like this to play nice with the serverless queue would be huge!
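For context, here's the rough shape I'm considering (untested sketch): an async handler plus RunPod's `concurrency_modifier` hook so one worker holds several in-flight jobs, with vLLM's `AsyncLLMEngine` doing the continuous batching across them. The model name, input fields, and the PDF-page-to-image step are placeholders for my actual pipeline:

```python
# Untested sketch. Assumes vLLM's async engine and RunPod's concurrency_modifier
# hook; the checkpoint name and job input fields are placeholders. My real jobs
# are PDFs, so rendering pages to images happens upstream of this handler.
import base64
import io
import uuid

import runpod
from PIL import Image
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

# One engine per worker; vLLM's scheduler continuously batches every
# in-flight request we hand it, regardless of which job it came from.
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model="allenai/olmOCR-7B-0225-preview",  # assumption: swap in your checkpoint
        max_num_seqs=4,                          # cap concurrent sequences per step
        gpu_memory_utilization=0.90,
    )
)

async def handler(job):
    """Async handler: awaiting the engine yields the event loop, so the
    worker can pick up more jobs while this one is still generating."""
    # Hypothetical input fields: a pre-rendered page image and a prebuilt prompt.
    page_png = base64.b64decode(job["input"]["page_png_b64"])
    image = Image.open(io.BytesIO(page_png))

    results = engine.generate(
        {
            "prompt": job["input"]["prompt"],
            "multi_modal_data": {"image": image},
        },
        SamplingParams(max_tokens=4096, temperature=0.0),
        request_id=str(uuid.uuid4()),
    )
    final = None
    async for out in results:  # stream until the request finishes
        final = out
    return {"text": final.outputs[0].text}

def concurrency_modifier(current_concurrency):
    # Let this worker pull up to 4 jobs from the queue at once.
    return 4

runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
})
```

The idea is that each `await` on the engine releases the event loop, so jobs 2-4 can be accepted and fed into the vLLM scheduler while job 1 is still decoding. Is this the right approach, or is there a more native way to do it?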