Deploying BLOOM on RunPod serverless vLLM using OpenAI compatibility — issue with CUDA?
Hey guys, I'm getting this error and I really need help:

```
InternalServerError                       Traceback (most recent call last)
Cell In[42], line 4
      1 def get_translation(claim, model=model):
      2     user_prompt = f"Translate the following claim into English: '{claim}'. You must always make sure your final response is prefixed with 'Translated Claim:' followed by the translated claim."
----> 4     response = client.chat.completions.create(
      5         model=model,
      6         messages=[
      7             {"role": "user", "content": user_prompt}
      8         ],
      9         temperature=0,
     10         #max_tokens=100,
     11     )
     12     if not response or not response.choices:
     13         print("Error: Model response is empty or malformed.")

InternalServerError: Error code: 500 - {'error': 'Error processing the request'}
```
For context, what I'm trying to do is run the bloom-7b1 model for inference from VS Code through RunPod, and I'm using the OpenAI compatibility layer to send requests. The function that triggers the error when called is the one shown in the traceback.
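In case it helps, here's roughly how my setup looks as a standalone script. The endpoint ID, API key, and served model name are placeholders for my real values, and `build_prompt` / `extract_translation` are just helpers I split out here to show how I build the message and parse the reply:

```python
# Placeholders for my actual RunPod serverless endpoint credentials.
RUNPOD_ENDPOINT_ID = "<endpoint-id>"
RUNPOD_API_KEY = "<api-key>"

def build_prompt(claim: str) -> str:
    """Build the user message asking for a translation with a fixed prefix."""
    return (
        f"Translate the following claim into English: '{claim}'. "
        "You must always make sure your final response is prefixed with "
        "'Translated Claim:' followed by the translated claim."
    )

def extract_translation(text: str) -> str:
    """Strip the required 'Translated Claim:' prefix from the model's reply."""
    prefix = "Translated Claim:"
    return text.split(prefix, 1)[1].strip() if prefix in text else text.strip()

def get_translation(claim: str, model: str = "bigscience/bloom-7b1") -> str:
    # Imported lazily so the pure helpers above work without the openai package.
    from openai import OpenAI

    # RunPod's serverless vLLM workers expose an OpenAI-compatible route
    # under /openai/v1 of the endpoint URL.
    client = OpenAI(
        base_url=f"https://api.runpod.ai/v2/{RUNPOD_ENDPOINT_ID}/openai/v1",
        api_key=RUNPOD_API_KEY,
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(claim)}],
        temperature=0,
    )
    if not response or not response.choices:
        raise RuntimeError("Model response is empty or malformed.")
    return extract_translation(response.choices[0].message.content)
```

The 500 comes back from the worker as soon as `chat.completions.create` is called, so the client-side code runs fine up to that point.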
I was looking into possibilities: could this be a CUDA version issue? bloom-7b1's documentation says it runs on CUDA 11.5, but vLLM only supports 11.8 and 12.x, and my serverless endpoint is currently on 12. If this is the problem, how can I make it work?