RunPod4mo ago
RobBalla

Serverless - 404 cannot return results

I'm getting the following error:
{"requestId": "sync-af9a620e-1480-4502-9287-640b30cfcdff-e1", "message": "Failed to return job results. | 404, message='Not Found', url=URL('https://api.runpod.ai/v2/mm8w337d46kypj/job-done/hevyjx14k6tl6p?gpu=NVIDIA+RTX+A4500&isStream=false')", "level": "ERROR"}
{"requestId": "sync-af9a620e-1480-4502-9287-640b30cfcdff-e1", "message": "Finished.", "level": "INFO"}
My workloads run fine, but the results are never returned, so the jobs get stuck in the queue. This is runpod v1.6.1. I have tried debugging in a live worker by doing an early return with a fixed result, but the same error persists. Please help!
14 Replies
flash-singh
flash-singh4mo ago
does that job still exist if you use /status?
RobBalla
RobBalla4mo ago
I thought I had deleted it, but it's still here (edit: on mobile, hard to do things)
{
"delayTime": 7206058,
"id": "sync-af9a620e-1480-4502-9287-640b30cfcdff-e1",
"retries": 1,
"status": "IN_PROGRESS"
}
I'm rebuilding against RunPod 1.4.2 now, because that was the last version I built that I know worked as I expect, so I can debug more easily. I've made some changes to my container but not to the worker, so I'm surprised by the breakage. I'm currently assuming it's my error somewhere, but why would that URL be a 404?
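For anyone reading a /status response like the one above, here is a minimal sketch of pulling out the useful fields. It assumes `delayTime` is reported in milliseconds (an assumption, not confirmed in this thread), and `summarize_status` is a hypothetical helper name, not part of the RunPod SDK.

```python
import json


def summarize_status(raw: str) -> dict:
    """Summarize a /status JSON payload (hypothetical helper)."""
    status = json.loads(raw)
    delay_ms = status.get("delayTime", 0)
    return {
        "id": status["id"],
        "status": status["status"],
        "retries": status.get("retries", 0),
        # Assumption: delayTime is in milliseconds.
        "delay_hours": round(delay_ms / 3_600_000, 2),
    }


raw = """{
  "delayTime": 7206058,
  "id": "sync-af9a620e-1480-4502-9287-640b30cfcdff-e1",
  "retries": 1,
  "status": "IN_PROGRESS"
}"""

summary = summarize_status(raw)
print(summary)
```

Under that assumption the job has been waiting roughly two hours while still IN_PROGRESS, which fits a result that was computed but never delivered.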
Brever
Brever4mo ago
@RobBalla I'm also getting the same error. In the URL that it's trying to hit, is the "hevyjx14k6tl6p" after /job-done the worker ID or the status ID?
RobBalla
RobBalla4mo ago
@Brever it's endpointid/job-done/workerid, and I have absolutely no idea what would cause this. I'm working backwards to work out what's happening. Probably a missing dependency, because everything else works. I've built against other runpod versions with the same issue, so it's clearly my fault, but it's taking some time to work out. Locally it works fine, which makes it even more awkward to figure out.
ashleyk
ashleyk4mo ago
@RobBalla if you return error as a dict instead of a str, then this kind of thing happens. I was also able to return error as a dict in older versions of the SDK, but it became a breaking change somewhere along the line, unfortunately. Now you need to return error as a str and put the dict stuff into output. I don't like this kind of breaking change in the SDK; it needs to be backwards compatible. I had to change my worker like this:
return {
    'error': f'A1111 status code: {response.status_code}',
    'output': response.json(),
    'refresh_worker': True
}
Previously it was just:
return {
    'error': response.json(),
    'refresh_worker': True
}
But then it broke in new SDK versions 😱 This also causes the job to show as COMPLETED (without any output) instead of FAILED 😱 Pretty critical oversight from RunPod IMHO. cc: @Justin Merrell
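The pattern described above (error as a string, structured detail under output) can be sketched as a small helper. `build_error_result` is a hypothetical name, and the fallback `{"raw": ...}` wrapping for non-JSON bodies is my own assumption rather than anything the SDK requires.

```python
import json


def build_error_result(status_code: int, body) -> dict:
    """Build a handler return value whose 'error' is always a str.

    Hypothetical helper: structured detail goes under 'output'.
    """
    if isinstance(body, (dict, list)):
        output = body
    else:
        # Assumption: wrap anything non-structured so 'output' stays JSON-friendly.
        output = {"raw": str(body)}
    return {
        "error": f"A1111 status code: {status_code}",
        "output": output,
        "refresh_worker": True,
    }


result = build_error_result(500, {"detail": "Internal Server Error"})
print(json.dumps(result))
```

A worker would return this dict from its handler in place of the older form that put `response.json()` directly under `error`.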
RobBalla
RobBalla4mo ago
Thanks @ashleyk, I'll try playing around with it, but it won't even let me return a simple string as a test. It's quite frustrating! I don't think it's a RunPod issue this time though. I think it's me, so I'm not blaming them, because even with an older SDK that used to work with this worker code it gives me the 404. I'm wondering if I've manipulated an environment variable somewhere that I shouldn't have. I'll have to poke around in the SDK, I think, to figure out what's missing. I assume it's a POST request going to that URL, although I found an old article that suggests a job $ID should follow the worker ID, and in my case there isn't one.
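The suspicion about a missing job ID can be checked mechanically. This is a rough sketch that splits the URL from the error log and reports what follows /job-done; the expected layout (endpoint ID, then job-done, then worker ID, then job ID) is taken from this thread, not from official documentation, and `job_done_segments` is a hypothetical helper.

```python
from urllib.parse import urlsplit


def job_done_segments(url: str) -> dict:
    """Split a job-done URL into the IDs around the 'job-done' segment."""
    parts = [p for p in urlsplit(url).path.split("/") if p]
    idx = parts.index("job-done")  # raises ValueError if the segment is absent
    after = parts[idx + 1:]
    return {
        "endpoint_id": parts[idx - 1],
        "worker_id": after[0] if len(after) > 0 else None,
        # Assumed layout: the job ID is the segment after the worker ID.
        "job_id": after[1] if len(after) > 1 else None,
    }


url = (
    "https://api.runpod.ai/v2/mm8w337d46kypj/job-done/hevyjx14k6tl6p"
    "?gpu=NVIDIA+RTX+A4500&isStream=false"
)
info = job_done_segments(url)
print(info)
```

For the logged URL the job ID slot comes back empty, which is consistent with the path ending at the worker ID.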
RobBalla
RobBalla4mo ago
Actually it's not that old and it's one of your answers (of course it is - they should pay you!) https://www.answeroverflow.com/m/1187367068643885126
serverless: any way to figure out what gpu type a job ran on? - RunPod
trying to get data on speeds across gpu types for our jobs, and i'm wondering if the api exposes this anywhere, and if not, what the best way to sort it out would be.
ashleyk
ashleyk4mo ago
Did you maybe hard-code a version of aiohttp into a requirements.txt file?
RobBalla
RobBalla4mo ago
No, the only installs for the serverless environment are runpod, python-magic, and whatever requirements they bring.
aiodns==3.1.1
aiohttp==3.9.1
aiohttp-retry==2.8.3
aiosignal==1.3.1
annotated-types==0.6.0
anyio==4.2.0
async-timeout==4.0.3
attrs==23.2.0
backoff==2.2.1
bcrypt==4.1.2
boto3==1.34.40
botocore==1.34.40
Brotli==1.1.0
certifi==2024.2.2
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
colorama==0.4.6
cryptography==42.0.2
dnspython==2.5.0
email-validator==2.1.0.post1
exceptiongroup==1.2.0
fastapi==0.109.2
frozenlist==1.4.1
h11==0.14.0
httpcore==1.0.2
httptools==0.6.1
httpx==0.26.0
idna==3.6
inquirerpy==0.3.4
itsdangerous==2.1.2
Jinja2==3.1.3
jmespath==1.0.1
MarkupSafe==2.1.5
multidict==6.0.5
orjson==3.9.13
paramiko==3.4.0
pfzy==0.3.4
prettytable==3.9.0
prompt-toolkit==3.0.43
py-cpuinfo==9.0.0
pycares==4.4.0
pycparser==2.21
pydantic==2.6.1
pydantic-extra-types==2.5.0
pydantic-settings==2.1.0
pydantic_core==2.16.2
PyNaCl==1.5.0
python-dateutil==2.8.2
python-dotenv==1.0.1
python-magic @ file:///home/conda/feedstock_root/build_artifacts/python-magic_1695670772669/work
python-multipart==0.0.9
PyYAML==6.0.1
requests==2.31.0
runpod==1.4.2
s3transfer==0.10.0
six==1.16.0
sniffio==1.3.0
starlette==0.36.3
tomli==2.0.1
tomlkit==0.12.3
tqdm==4.66.2
tqdm-loggable==0.2
typing_extensions==4.9.0
ujson==5.9.0
urllib3==2.0.7
uvicorn==0.27.1
uvloop==0.19.0
watchdog==4.0.0
watchfiles==0.21.0
wcwidth==0.2.13
websockets==12.0
yarl==1.9.4
That's with a version of the SDK that used to work, which is what suggests it's definitely my fault.
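As an alternative to pasting a full freeze, a quick way to check just the packages in question from inside a running worker is to ask the interpreter. This is a small sketch using only the standard library; `installed_versions` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


print(installed_versions(["runpod", "aiohttp", "aiohttp-retry"]))
```

Running it in the live worker rather than locally shows what the container actually resolved, which matters when nothing is pinned explicitly.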
ashleyk
ashleyk4mo ago
That's weird. I had an issue where an SDK version that used to work started causing problems, but come to think of it, it was causing the worker to run forever without shutting down, which is not what you are seeing. The root cause of that was a bug in a new version of aiohttp, but the bug has been resolved for a while now.
RobBalla
RobBalla4mo ago
Looks like my version is a couple of micro versions behind. I'll have to see if there's an issue there but I'll be spending some time logged into an active worker and shouting at it. Hopefully get to the bottom of it. Thank you for the pointers 🙏
ashleyk
ashleyk4mo ago
Hope you get to the bottom of it soon. I hate those issues where you revert back to something and expect it to work and then it doesn't 😱
Justin Merrell
Justin Merrell4mo ago
404 is a strange error. I am, however, looking into the error handling this morning @ashleyk
RobBalla
RobBalla4mo ago
Well, I know what is wrong with it and, as I suspected, it is my fault. I have a bash script that runs over the env vars at start and writes them to a file so they can be passed to supervisord processes that would otherwise not have access to them (because it locks its environment). Anyway, this script replaces $ID in the webhook post variable with '', so it obviously doesn't work. Annoying but easy to fix.
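The script itself isn't shown, so this is only a guess at the mechanism: if values are written unquoted or in double quotes, a shell that later sources the file will expand $ID to an empty string. A minimal sketch of one way to avoid that is to single-quote every value when rendering the file. `WEBHOOK_POST_URL` and the example URL are placeholders, not the real variable name or endpoint.

```python
import shlex


def render_env_file(env: dict) -> str:
    """Render env vars as 'export KEY=value' lines that are safe to source."""
    lines = []
    for key in sorted(env):
        # shlex.quote single-quotes values containing shell metacharacters,
        # so a literal "$ID" is not expanded when the file is sourced.
        lines.append(f"export {key}={shlex.quote(env[key])}")
    return "\n".join(lines) + "\n"


# Placeholder variable name and URL, for illustration only.
sample = {
    "WEBHOOK_POST_URL": "https://example.invalid/v2/ENDPOINT/job-done/WORKER/$ID?isStream=false",
}
rendered = render_env_file(sample)
print(rendered)
```

In a real worker the input would be `dict(os.environ)`; the sample dict just keeps the sketch self-contained.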