Hi, is there currently an outage affecting the Serverless API?

The requests are stuck "IN_QUEUE" forever...
27 Replies
haris
haris4mo ago
I've had similar issues when I provided an incorrect input body. Are you able to share the body you're using for your serverless endpoint, as well as the template you're using and any other information you think could be useful?
andyh3118
andyh31184mo ago
This is my endpoint: 1ifuoxegzxuhb4. We are using vLLM. I don't think the input body is wrong, though, because the same service has been running smoothly for 2-3 weeks already. Things started to become unstable over the weekend, and today is a full outage for us...
haris
haris4mo ago
Got it. If you look at your serverless endpoint after you send a request, are you able to see whether it has any workers running? We might not have any availability on the GPUs you've chosen.
andyh3118
andyh31184mo ago
You can see all the requests are pending
andyh3118
andyh31184mo ago
(screenshot attached)
andyh3118
andyh31184mo ago
Workers are running
haris
haris4mo ago
Odd
andyh3118
andyh31184mo ago
And are boosting correctly.
haris
haris4mo ago
I'll bring this up internally as I'm not too sure what the issue could be, give me a moment.
andyh3118
andyh31184mo ago
Thanks!
Alpay Ariyak
Alpay Ariyak4mo ago
Which Docker image are you using? Our vLLM worker?
andyh3118
andyh31184mo ago
Ah, sorry, I was wrong. It is not vLLM; we use our own ExLlama image.
River Snow
River Snow4mo ago
I think you need to debug your Docker image here; it appears to be broken.
andyh3118
andyh31184mo ago
OK. Any logs on your end that you can share (to indicate that it is broken)?
justin
justin4mo ago
Were you able to confirm ExLlama works on a GPU pod?
andyh3118
andyh31184mo ago
Ah... it has been working for 2-3 weeks (we used it very actively), same image/model.
justin
justin4mo ago
No new Docker builds? Interesting.
andyh3118
andyh31184mo ago
Based on the logs, the requests are not getting to the handler. This is our handler code:
import logging

import runpod

from app.exllamav2_common import boot_engine, generate

logger = logging.getLogger(__name__)


async def handler(job: dict):
    request_dict: dict = job.pop("input", {})

    configs_dict = request_dict.copy()

    full_response = ""
    for full_response in generate(configs_dict):
        yield {"text": full_response, "finished": False}

    yield {"text": full_response, "finished": True}


boot_engine()


def concurrency_modifier(current_concurrency):
    max_concurrency = 1
    return max(0, max_concurrency - current_concurrency)


runpod.serverless.start({
    "handler": handler,
    "return_aggregate_stream": True,
    # Note: concurrency_modifier only takes effect if registered here.
    "concurrency_modifier": concurrency_modifier,
})
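The streaming pattern in that handler can be exercised locally, outside the RunPod queue, to confirm the handler logic itself streams and terminates. A minimal sketch with a stubbed generate() (the real one lives in app.exllamav2_common and is not shown in the thread):

```python
import asyncio


def generate(configs: dict):
    # Stub standing in for the real engine generator: yields
    # progressively longer partial responses, like a streaming LLM.
    text = ""
    for token in ["Hello", " world"]:
        text += token
        yield text


async def handler(job: dict):
    # Same shape as the handler above: stream partials, then a final chunk.
    request_dict: dict = job.pop("input", {})
    full_response = ""
    for full_response in generate(request_dict):
        yield {"text": full_response, "finished": False}
    yield {"text": full_response, "finished": True}


async def main():
    # Consume the async generator the way the worker would.
    return [chunk async for chunk in handler({"input": {"prompt": "hi"}})]


chunks = asyncio.run(main())
print(chunks[-1])  # {'text': 'Hello world', 'finished': True}
```

If this runs cleanly but the deployed endpoint still stalls, the problem is more likely in the real engine or the platform than in the handler's control flow.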
justin
justin4mo ago
It says it starts generating, so I think it is reaching the handler.
andyh3118
andyh31184mo ago
hmm. you are correct.
justin
justin4mo ago
I guess two things here: 1) Maybe try creating a test endpoint with, say, 3 max workers and see if it works there. That way you'd at least isolate whether it's the original endpoint or your code (if both fail).
andyh3118
andyh31184mo ago
So it gets to the handler, but gets stuck 😆
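One way to narrow down a hang like this (a debugging sketch, not something from the thread): wrap the engine's generator with per-chunk timing, so a stall inside generate() shows up in the worker logs instead of the request silently sitting in progress. generate() is stubbed here; the real one would come from the engine module.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def generate(configs: dict):
    # Stub standing in for the real engine generator.
    yield "partial"
    yield "partial response"


def timed_stream(configs: dict, warn_after: float = 5.0):
    """Yield chunks from generate(), logging how long each one takes.

    If a chunk takes longer than warn_after seconds, emit a warning --
    a hang before the first chunk means the engine never produced output.
    """
    last = time.monotonic()
    for i, chunk in enumerate(generate(configs)):
        elapsed = time.monotonic() - last
        if elapsed > warn_after:
            logger.warning("chunk %d stalled for %.1fs", i, elapsed)
        else:
            logger.info("chunk %d arrived after %.2fs", i, elapsed)
        last = time.monotonic()
        yield chunk


chunks = list(timed_stream({"prompt": "hi"}))
print(len(chunks))  # 2
```

Dropping this wrapper around the real generate() in the handler would show whether the stall is before the first chunk (engine never starts) or between chunks (engine stops mid-stream).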
justin
justin4mo ago
2) I just tried with my LLM, which also uses an async generator, and it works fine: https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless/blob/main/handler.py
andyh3118
andyh31184mo ago
got it.
justin
justin4mo ago
So either you got a bad endpoint somehow, or something is off in your code or input.
andyh3118
andyh31184mo ago
Thanks, let me look into that. Could be an issue with ExLlamaV2.