Hi,

Inference with my Docker image is much faster locally on an RTX 4070 than on your RTX 3090 serverless worker. I was expecting a speedup, or at least the same speed. I'm using the NVIDIA NeMo diarization model: a 1-hour audio file takes 85 seconds to process on my 4070, but 160 seconds on the 3090 on RunPod, using the same image as the one deployed to your worker.

I also use `torch.multiprocessing` to spawn 2 processes that run in parallel: one for transcription with WhisperX and one for diarization. Are there any limitations on your side for running multiple parallel processes inside the same Docker container?
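For reference, a minimal sketch of this kind of two-process setup, with per-stage timing and a check of how many CPU cores the container can use. This is an illustration under assumptions, not the actual worker code: the stage functions are placeholders that only sleep, the helper names (`visible_cpus`, `timed`, `run_parallel`) are made up for this example, and the standard-library `multiprocessing` stands in for `torch.multiprocessing` (which wraps it) so the snippet runs without a GPU.

```python
import multiprocessing as mp
import os
import time


def visible_cpus():
    """Number of CPU cores this process is allowed to run on.

    Rough indicator only: it reflects CPU affinity, not CPU-time quotas.
    """
    if hasattr(os, "sched_getaffinity"):  # available on Linux
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def timed(fn):
    """Run fn() and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def transcribe_placeholder():
    time.sleep(0.2)  # placeholder for the WhisperX transcription call


def diarize_placeholder():
    time.sleep(0.2)  # placeholder for the NeMo diarization call


def _worker(name, fn, queue):
    # Runs inside the child process: time the stage and report back.
    queue.put((name, timed(fn)))


def run_parallel(stages):
    """Run each stage in its own spawned process; return {name: seconds}."""
    # "spawn" (or "forkserver") is needed when child processes use CUDA.
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    procs = [
        ctx.Process(target=_worker, args=(name, fn, queue))
        for name, fn in stages.items()
    ]
    for p in procs:
        p.start()
    # Drain the queue before joining so children are never blocked on put().
    results = dict(queue.get() for _ in procs)
    for p in procs:
        p.join()
    return results
```

Calling `run_parallel({"transcribe": transcribe_placeholder, "diarize": diarize_placeholder})` from under an `if __name__ == "__main__":` guard, with the placeholders swapped for the real stages, and printing `visible_cpus()` on both machines might help show whether one stage or the CPU allocation accounts for the gap.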