Hi, is there currently an outage affecting the Serverless API?

The requests are stuck "IN_QUEUE" forever...
27 Replies
haris
haris4mo ago
I've had similar issues when I provided an incorrect input body. Are you able to share the body you're using for your serverless endpoint, as well as the template you're using and any other information you think could be useful?
andyh3118
andyh31184mo ago
This is my endpoint: 1ifuoxegzxuhb4. We are using vLLM. I don't think the input body is wrong, though, because the same service has been running smoothly for 2-3 weeks already. Things started to become unstable over the weekend, and today is a full outage for us...
haris
haris4mo ago
Got it. If you look at your serverless endpoint after you send a request, are you able to see whether it has any workers running? We might not have any availability on the GPUs you've chosen.
andyh3118
andyh31184mo ago
You can see all the requests are pending
andyh3118
andyh31184mo ago
(screenshot attached)
andyh3118
andyh31184mo ago
Workers are running
haris
haris4mo ago
Odd
andyh3118
andyh31184mo ago
And are boosting correctly.
haris
haris4mo ago
I'll bring this up internally as I'm not too sure what the issue could be, give me a moment.
andyh3118
andyh31184mo ago
Thanks!
Alpay Ariyak
Alpay Ariyak4mo ago
Which Docker image are you using? Our vLLM worker?
andyh3118
andyh31184mo ago
Ah, sorry, I was wrong. It is not vLLM; we use our own ExLlama image.
River Snow
River Snow4mo ago
I think you need to debug your Docker image here; it appears to be broken.
andyh3118
andyh31184mo ago
OK. Any logs on your end that you can share (to indicate that it is broken)?
justin
justin4mo ago
Were you able to confirm ExLlama works on a GPU pod?
andyh3118
andyh31184mo ago
Ah... it has been working for 2-3 weeks (we used it very actively), same image/model.
justin
justin4mo ago
No new Docker builds? Interesting.
andyh3118
andyh31184mo ago
Based on the logs, the requests are not getting to the handler. This is our handler code:
import logging

import runpod

from app.exllamav2_common import boot_engine, generate

logger = logging.getLogger(__name__)


async def handler(job: dict):
    request_dict: dict = job.pop("input", {})

    configs_dict = request_dict.copy()

    full_response = ""
    for full_response in generate(configs_dict):
        yield {"text": full_response, "finished": False}

    yield {"text": full_response, "finished": True}


boot_engine()


def concurrency_modifier(current_concurrency):
    max_concurrency = 1
    return max(0, max_concurrency - current_concurrency)


runpod.serverless.start({
    "handler": handler,
    "return_aggregate_stream": True,
    # Note: concurrency_modifier only takes effect if registered here.
    "concurrency_modifier": concurrency_modifier,
})
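The streaming pattern in that handler can be exercised locally, outside the RunPod queue, to confirm the handler logic itself streams and terminates. A minimal sketch with a stubbed generate() (the real one lives in app.exllamav2_common and is not shown in the thread):

```python
import asyncio


def generate(configs: dict):
    # Stub standing in for the real engine generator: yields
    # progressively longer partial responses, like a streaming LLM.
    text = ""
    for token in ["Hello", " world"]:
        text += token
        yield text


async def handler(job: dict):
    # Same shape as the handler above: stream partials, then a final chunk.
    request_dict: dict = job.pop("input", {})
    full_response = ""
    for full_response in generate(request_dict):
        yield {"text": full_response, "finished": False}
    yield {"text": full_response, "finished": True}


async def main():
    # Consume the async generator the way the worker would.
    return [chunk async for chunk in handler({"input": {"prompt": "hi"}})]


chunks = asyncio.run(main())
print(chunks[-1])  # {'text': 'Hello world', 'finished': True}
```

If this runs cleanly but the deployed endpoint still stalls, the problem is more likely in the real engine or the platform than in the handler's control flow.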
justin
justin4mo ago
It says it starts generating, so I think it is reaching the handler.
andyh3118
andyh31184mo ago
hmm. you are correct.
justin
justin4mo ago
I guess two things here: 1) Maybe try creating a test endpoint with, say, 3 max workers and see if it works there. That way you'd at least isolate whether it's the original endpoint or your code (if both fail).
andyh3118
andyh31184mo ago
So it gets to the handler, but gets stuck 😆
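One way to narrow down a hang like this (a debugging sketch, not something from the thread): wrap the engine's generator with per-chunk timing, so a stall inside generate() shows up in the worker logs instead of the request silently sitting in progress. generate() is stubbed here; the real one would come from the engine module.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def generate(configs: dict):
    # Stub standing in for the real engine generator.
    yield "partial"
    yield "partial response"


def timed_stream(configs: dict, warn_after: float = 5.0):
    """Yield chunks from generate(), logging how long each one takes.

    If a chunk takes longer than warn_after seconds, emit a warning --
    a hang before the first chunk means the engine never produced output.
    """
    last = time.monotonic()
    for i, chunk in enumerate(generate(configs)):
        elapsed = time.monotonic() - last
        if elapsed > warn_after:
            logger.warning("chunk %d stalled for %.1fs", i, elapsed)
        else:
            logger.info("chunk %d arrived after %.2fs", i, elapsed)
        last = time.monotonic()
        yield chunk


chunks = list(timed_stream({"prompt": "hi"}))
print(len(chunks))  # 2
```

Dropping this wrapper around the real generate() in the handler would show whether the stall is before the first chunk (engine never starts) or between chunks (engine stops mid-stream).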
justin
justin4mo ago
2) I just tried with my LLM, which also uses an async generator, and it works fine: https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless/blob/main/handler.py
andyh3118
andyh31184mo ago
got it.
justin
justin4mo ago
So either you got a bad endpoint somehow, or something is off in your code or input.
andyh3118
andyh31184mo ago
Thanks, let me look into that. Could be an issue with ExLlamaV2.