Status endpoint only returns "COMPLETED" but no answer to the question

I'm currently using the v2/model_id/status/run_id endpoint, and the result I get is as follows: {"delayTime": 26083, "executionTime": 35737, "id": **, "status": "COMPLETED"}. My stream endpoint works fine, but for my purposes I'd rather wait longer and retrieve the entire result at once. How am I supposed to do that? Thank you
Solution:
GitHub
exllama-runpod-serverless/handler.py at master · hommayushi3/exllam...
Contribute to hommayushi3/exllama-runpod-serverless development by creating an account on GitHub.
ashleyk
ashleyk4mo ago
What kind of endpoint are you running? This is an issue with your endpoint, not with the status API.
justin
justin4mo ago
Ur main issue is maybe not returning properly
justin
justin4mo ago
If u want reference to functions that I made to make a /run call, and just keep polling their status: https://github.com/justinwlin/runpod_whisperx_serverless_clientside_code/blob/main/runpod_client_helper.py
GitHub
runpod_whisperx_serverless_clientside_code/runpod_client_helper.py ...
Helper functions for Runpod to automatically poll my WhisperX API. Can be adapted to other use cases - justinwlin/runpod_whisperx_serverless_clientside_code
kingclimax7569
kingclimax75694mo ago
I was using runsync instead of run, is that incorrect? I changed it to run and now I'm receiving IN_QUEUE instead. So I'm supposed to keep polling that?
ashleyk
ashleyk4mo ago
Yes, /run is asynchronous, but changing it will most likely not make any difference; if it does, then /runsync is broken. Just tested, and both work fine for me.
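For reference, a quick sketch of the synchronous flavour being discussed: /runsync blocks and returns the job output in the same response. The endpoint ID and API key are placeholders, and the input shape assumes the same worker as the polling script further down.
import requests

RUNPOD_API_KEY = "***"   # placeholder
ENDPOINT_ID = "***"      # placeholder

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    json={"input": {"prompt": "List all of the US presidents."}},
    headers={"Authorization": f"Bearer {RUNPOD_API_KEY}"},
    timeout=120,
)
body = resp.json()
print(body.get("status"), body.get("output"))   # COMPLETED plus the worker's output, if the job finished in time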
justin
justin4mo ago
/run is great b/c with /runsync I find I get a network timeout :))) but certainly /runsync is also great if the job is short enough. Also, /run gives u a 30 min cache on RunPod's end to store ur answer, vs /runsync which I forget how long but I think it's <1 min, so I find the 30 min cache nice. Also u can add a webhook if u want it to call back to ur webhook when the response is done, instead of polling.
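A minimal sketch of the /run + webhook pattern mentioned above: submit the job asynchronously and let RunPod POST the finished result to your own URL instead of polling /status. The endpoint ID, API key, and callback URL are placeholders.
import requests

RUNPOD_API_KEY = "***"   # placeholder
ENDPOINT_ID = "***"      # placeholder

payload = {
    "input": {"prompt": "List all of the US presidents."},
    # RunPod calls this URL with the job status and output once the job finishes.
    "webhook": "https://example.com/runpod-callback",
}

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    json=payload,
    headers={"Authorization": f"Bearer {RUNPOD_API_KEY}"},
    timeout=30,
)
print(resp.json())   # e.g. {"id": "...", "status": "IN_QUEUE"}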
kingclimax7569
kingclimax75694mo ago
Yea im still not getting the output, just a value that says "COMPLETED"
import requests
import sys
import json
import time

bearer_token = "**"
endpoint_id = "**"

prompt = """
List me all of the US presidents?

"""

# Define the URL
url = f"https://api.runpod.ai/v2/{endpoint_id}/run"

# Define the headers
headers = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {bearer_token}'
}


system_message = """You are a helpful, respectful and honest assistant and chatbot."""
prompt_template = f'''[INST] <<SYS>>
{system_message}
<</SYS>>'''

# Add the initial user message
prompt_template += f'\n{prompt} [/INST]'

print("here")
request = {
    'prompt': prompt_template,
    'max_new_tokens': 4000,
    'temperature': 0.7,
    'top_k': 50,
    'top_p': 0.7,
    'repetition_penalty': 1.2,
    'batch_size': 8,
}

response = requests.post(url, json=dict(input=request), headers={
    "Authorization": f"Bearer {bearer_token}"
})
print(response.text)
response_json = json.loads(response.text)

job = response_json['id']

while True:
    status_url = f"https://api.runpod.ai/v2/{endpoint_id}/status/{response_json['id']}"
    get_status = requests.get(status_url, headers=headers)
    print("here", get_status.text)
    status_id = json.loads(get_status.text)['id']
    status = json.loads(get_status.text)['status']

    if status in ["IN_QUEUE", "IN_PROGRESS"]:
        time.sleep(20)
    else:
        if status == "COMPLETED":
            print({
                "status": "COMPLETED",
                "output": json.loads(get_status.text).get("output")
            })
        else:
            print("error")
ashleyk
ashleyk4mo ago
How do you get a network timeout with runsync? You are doing something wrong. It eventually goes to IN_QUEUE or IN_PROGRESS if the request takes too long; it doesn't time out.
kingclimax7569
kingclimax75694mo ago
response: {"delayTime":662,"executionTime":9823,"id":"1d227fac-78f9-4e22-bb2e-1ff79718704a-u1","status":"COMPLETED"}
ashleyk
ashleyk4mo ago
Yes, I knew it would not make a difference. Your worker is most likely throwing an error, and you are most likely putting a dict in the error key, which causes this to happen. error only accepts a str and not a dict; RunPod made a shitty breaking change to the SDK that causes this. So now you have to do something like:
{
    "error": "Some error message",
    "output": someDict
}
I had this exact same issue and had to change my error handling to fix it.
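A hypothetical handler sketch of the fix described here: keep "error" a plain string and put any structured details under "output". do_inference and its return shape are made up for illustration; a real handler would call the actual model instead.
import runpod

def do_inference(job_input):
    # placeholder for the real model call
    return {"text": "echo: " + job_input.get("prompt", "")}

def handler(event):
    try:
        return {"output": do_inference(event["input"])}
    except Exception as e:
        # A dict in "error" can make the status payload get dropped, so stringify
        # the error and keep any structured details under "output" instead.
        return {"error": str(e), "output": {"details": repr(e)}}

runpod.serverless.start({"handler": handler})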
kingclimax7569
kingclimax75694mo ago
Sorry where does this change need to be made? thank you for the response
ashleyk
ashleyk4mo ago
in your endpoint handler file
kingclimax7569
kingclimax75694mo ago
Sorry I don't think I've ever modified that file, do I need the runpod python package to use it? I only have an endpoint that I set up
ashleyk
ashleyk4mo ago
Are you using the vllm worker?
kingclimax7569
kingclimax75694mo ago
Im not sure, how can I find that out?
def generator_handler():
    bearer_token = "**"
    endpoint_id = "**"

    prompt = """
List me all of the US presidents?

"""

    # Define the URL
    url = f"https://api.runpod.ai/v2/{endpoint_id}/run"

    # Define the headers
    headers = {
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {bearer_token}'
    }

    system_message = """You are a helpful, respectful and honest assistant and chatbot."""
    prompt_template = f'''[INST] <<SYS>>
{system_message}
<</SYS>>'''

    # Add the initial user message
    prompt_template += f'\n{prompt} [/INST]'

    print("here")
    request = {
        'prompt': prompt_template,
        'max_new_tokens': 4000,
        'temperature': 0.7,
        'top_k': 50,
        'top_p': 0.7,
        'repetition_penalty': 1.2,
        'batch_size': 8,
    }

    response = requests.post(url, json=dict(input=request), headers={
        "Authorization": f"Bearer {bearer_token}"
    })
    print(response.text)
    response_json = json.loads(response.text)

    job = response_json['id']

    while True:
        status_url = f"https://api.runpod.ai/v2/{endpoint_id}/status/{response_json['id']}"
        get_status = requests.get(status_url, headers=headers)
        print("here", get_status.text)
        status_id = json.loads(get_status.text)['id']
        status = json.loads(get_status.text)['status']

        if status in ["IN_QUEUE", "IN_PROGRESS"]:
            time.sleep(20)
        else:
            if status == "COMPLETED":
                print("COMPLETED")
                return {
                    "error": "error 1",
                    "output": json.loads(get_status.text)
                }
            else:
                return {
                    "error": "error 2",
                    "output": json.loads(get_status.text)
                }

if __name__ == '__main__':
    runpod.serverless.start({
        "handler": generator_handler, # Required
    })
Not sure if that makes sense?
kingclimax7569
kingclimax75694mo ago
No description
kingclimax7569
kingclimax75694mo ago
I get that response repeatedly
JJonahJ
JJonahJ4mo ago
I can share my code, but as far as I can see from what you've posted, your output should be in the 'tokens' part of the JSON that you get back. Try just printing everything you get back. If it's completed, it should be there…
elif status == "COMPLETED":
    tokens = json_response['output'][0]['choices'][0]['tokens']
    return tokens
Here's the relevant part of mine. If the status is COMPLETED, the output you want is in 'tokens'. Hope this helps! ...So if I'm reading yours right, you'll want something like
LLM_response = json.loads(get_status.text)['tokens']
I think, lol …unless the problem really is that all you’re getting back is ’completed’ and no tokens at all anywhere. In which case forget all I said 😅
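A hedged sketch of pulling the text out of a COMPLETED /status response for a vLLM-style worker, following the output[0]['choices'][0]['tokens'] shape above. It reuses endpoint_id, headers, and job from the polling script earlier in the thread; the exact keys depend on which worker image the endpoint runs, so print the whole payload first.
import json
import requests

# endpoint_id, headers, and job come from the polling script above
status_url = f"https://api.runpod.ai/v2/{endpoint_id}/status/{job}"
payload = requests.get(status_url, headers=headers).json()

if payload.get("status") == "COMPLETED":
    print(json.dumps(payload.get("output"), indent=2))   # inspect the real shape first
    try:
        tokens = payload["output"][0]["choices"][0]["tokens"]   # vLLM-worker-style shape
        print("".join(tokens) if isinstance(tokens, list) else tokens)
    except (TypeError, KeyError, IndexError):
        print("No 'tokens' key here; this worker returns a different output shape.")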
kingclimax7569
kingclimax75694mo ago
I will try this, thank you. Sorry, I didn't see this earlier. Hey, the object I'm getting back doesn't have the "tokens" key. Did you use a handler function?
JJonahJ
JJonahJ4mo ago
I just used the ready made vllm endpoint. 🤷‍♂️ I’m not really the one to ask. 👀
Alpay Ariyak
Alpay Ariyak4mo ago
Hi @kingclimax7569 , what are you looking to deploy?
kingclimax7569
kingclimax75694mo ago
Hey, I already have a serverless endpoint deployed. I'm just trying to use the status endpoint to retrieve the entire result of a query, instead of using the stream endpoint to retrieve the results gradually.
Alpay Ariyak
Alpay Ariyak4mo ago
Is it for a LLM?
kingclimax7569
kingclimax75694mo ago
Yes
Alpay Ariyak
Alpay Ariyak4mo ago
Have you tried our https://github.com/runpod-workers/worker-vllm? We’re adding full OpenAI compatibility this week
GitHub
GitHub - runpod-workers/worker-vllm: The RunPod worker template for...
The RunPod worker template for serving our large language model endpoints. Powered by vLLM. - runpod-workers/worker-vllm
kingclimax7569
kingclimax75694mo ago
I'm sorry, how would that help? The problem seems to be with the RunPod endpoints, not the LLM.
justin
justin4mo ago
Ah, I think I know why. Do u have return_aggregate_stream set to true?
if mode_to_run in ["both", "serverless"]:
    runpod.serverless.start({
        "handler": handler,
        "concurrency_modifier": adjust_concurrency,
        "return_aggregate_stream": True,
    })
U prob need return_aggregate_stream = True, so that if u are streaming, the streaming results become available on /run. Also, I think he's just sharing that if u wanna use vLLM, RunPod has a pretty good setup if it's not a custom model.
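A rough server-side sketch of what's being described, assuming a generic worker: a generator handler that yields chunks, plus return_aggregate_stream=True so the chunks get stitched together and show up as the job output when polling later. generate_chunks is a stand-in for the real model call.
import runpod

def generate_chunks(prompt):
    # stand-in for a real streaming model call
    for word in ["Streaming", "chunks", "for:", prompt]:
        yield word + " "

def handler(event):
    prompt = event["input"]["prompt"]
    for chunk in generate_chunks(prompt):
        yield chunk          # each yield becomes one item on /stream

runpod.serverless.start({
    "handler": handler,
    "return_aggregate_stream": True,   # aggregate the stream so the full output is available after completion
})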
Alpay Ariyak
Alpay Ariyak4mo ago
It seems more like an issue in your worker code, because status should return the latest stream.
justin
justin4mo ago
https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless/blob/main/handler.py https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless/blob/main/stream_client_side.py Here is an example of my own handler.py that I wrote for my own custom LLM stuff. You can ignore all the bash stuff, that is just for my own sake, but I know this works great for streaming / retrieving, and I have the client-side code that I've tested and validated. https://docs.runpod.io/serverless/workers/handlers/handler-async Here is the documentation for when you want to stream / have your end result available aggregated under the /status endpoint once the stream is done, which is what I based my handler on.
kingclimax7569
kingclimax75694mo ago
Thank you so much, I'm gonna go through this and report back. Sorry, I'm confused here: I don't see you using RunPod's serverless endpoints, instead you're using OpenLLM? I added return_aggregate_stream in my code but to no avail. Can you see what's wrong in the code I posted? Even if I do this:
import runpod
import requests
import sys
import json
import time
import os

os.environ["RUNPOD_AI_API_KEY"] = "***"

def generator_handler():
    print("hi")


if __name__ == '__main__':
    runpod.serverless.start({
        "handler": generator_handler, # Required
        "return_aggregate_stream": True,
    })
kingclimax7569
kingclimax75694mo ago
I get this:
No description
justin
justin4mo ago
Sorry, I actually misunderstood ur question. I thought u wanted to deploy ur own LLM. submit_job_and_stream_output in https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless/blob/main/stream_client_side.py is maybe what you want? Sorry, I thought before, b/c of ur code, that u were sharing ur python handler. I don't use the runpod endpoints that often, but stream_client_side.py is probably something u can try. Ik it works for the way I defined stream, with yielding, so I imagine it should work for runpod.
kingclimax7569
kingclimax75694mo ago
It looks like in the check_job_status function you're doing what I need. But the difference is I'm only receiving {"delayTime": 26083, "executionTime": 35737, "id": **, "status": "COMPLETED"} back when I do that. And when I add the handler, I get this response
justin
justin4mo ago
Sorry, let me ask: is this an endpoint deployed by runpod or by u? I've been very confused by this; this looks like an LLM. I guess I am confused b/c u shared this: https://discord.com/channels/912829806415085598/1208117793925373983/1208143617814700032 which is different from the one below where u ask me: https://discord.com/channels/912829806415085598/1208117793925373983/1209961564199845999
justin
justin4mo ago
RunPod
Llama2 7B Chat
Retrieve Results & StatusNote: For information on how to check job status and retrieve results, please refer to our Status Endpoint Documentation.Streaming Token Outputs Make a POST request to the /llama2-7b-chat/run API endpoint.Retrieve the job ID.Make a GET request to /llama2-7b-chat/stream/{...
justin
justin4mo ago
But if this is ur code, like ur own deployed code: 1) U don't wrap the runpod.start in main. 2) Ur function is not a generator.
kingclimax7569
kingclimax75694mo ago
Yea I'm not sure at all how to use the handler function lol
justin
justin4mo ago
Ok
Alpay Ariyak
Alpay Ariyak4mo ago
Yeah, could you please elaborate what your endgoal is and we can go from there
kingclimax7569
kingclimax75694mo ago
I just want to be able to retrieve the entire result of a query at once, instead of streaming it
justin
justin4mo ago
What llm do u wanna use?
Alpay Ariyak
Alpay Ariyak4mo ago
What model would you like to deploy, quantized or not? If quantized, what quantization?
justin
justin4mo ago
I think the problem is ur code isn't correct 😅 but there is existing code u can just deploy and have working, and then only worry about calling it. The main problem is ur code isn't correctly defined as a generator, and also b/c u printed and did not return (or actually, it should be yield), the return aggregate stream sees nothing, which is why u get nothing.
kingclimax7569
kingclimax75694mo ago
if you look up further I posted the original code
def generator_handler():
    bearer_token = "**"
    endpoint_id = "**"

    prompt = """
List me all of the US presidents?

"""

    # Define the URL
    url = f"https://api.runpod.ai/v2/{endpoint_id}/run"

    # Define the headers
    headers = {
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {bearer_token}'
    }

    system_message = """You are a helpful, respectful and honest assistant and chatbot."""
    prompt_template = f'''[INST] <<SYS>>
{system_message}
<</SYS>>'''

    # Add the initial user message
    prompt_template += f'\n{prompt} [/INST]'

    print("here")
    request = {
        'prompt': prompt_template,
        'max_new_tokens': 4000,
        'temperature': 0.7,
        'top_k': 50,
        'top_p': 0.7,
        'repetition_penalty': 1.2,
        'batch_size': 8,
    }

    response = requests.post(url, json=dict(input=request), headers={
        "Authorization": f"Bearer {bearer_token}"
    })
    print(response.text)
    response_json = json.loads(response.text)

    job = response_json['id']

    while True:
        status_url = f"https://api.runpod.ai/v2/{endpoint_id}/status/{response_json['id']}"
        get_status = requests.get(status_url, headers=headers)
        print("here", get_status.text)
        status_id = json.loads(get_status.text)['id']
        status = json.loads(get_status.text)['status']

        if status in ["IN_QUEUE", "IN_PROGRESS"]:
            time.sleep(20)
        else:
            if status == "COMPLETED":
                print("COMPLETED")
                return {
                    "error": "error 1",
                    "output": json.loads(get_status.text)
                }
            else:
                return {
                    "error": "error 2",
                    "output": json.loads(get_status.text)
                }

if __name__ == '__main__':
    runpod.serverless.start({
        "handler": generator_handler, # Required
    })
justin
justin4mo ago
Yes, but this doesn't tell us what model u want. Do u want llama? mistral? and so on. And is there an end goal? Like, do u want a custom model, do u wanna do custom logic later, and so on. Also, this is weird b/c ur mixing up clientside and serverside code: ur defining a URL to call inside of what looks to be the handler definition, vs it should just be calling the model.
kingclimax7569
kingclimax75694mo ago
hommayushi3/exllama-runpod-serverless:latest is the docker container im using. If im mixing it up please correct it lol, im not sure exactly which server any of this is supposed to go on
justin
justin4mo ago
Ok great! So: 1) If u have no reason to deploy ur own model, I recommend using runpod's managed endpoint: https://doc.runpod.io/reference/llama2-13b-chat 2) I recommend u use runpod's vllm worker instead and deploy it for official support: https://github.com/runpod-workers/worker-vllm (Option #1, deploy it by just modifying env variables) 3) Use my model, which is UNOFFICIAL but what I use: https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless And I have a picture of how to set it up. Instead of serverlessllm u would point it at justinwlin/whatever I have the mistral name as
RunPod
Llama2 13B Chat
Retrieve Results & StatusNote: For information on how to check job status and retrieve results, please refer to our Status Endpoint Documentation.Streaming Token Outputs Make a POST request to the /llama2-13b-chat/run API endpoint.Retrieve the job ID.Make a GET request to /llama2-13b-chat/stream...
GitHub
GitHub - runpod-workers/worker-vllm: The RunPod worker template for...
The RunPod worker template for serving our large language model endpoints. Powered by vLLM. - runpod-workers/worker-vllm
GitHub
GitHub - justinwlin/Runpod-OpenLLM-Pod-and-Serverless: A repo for O...
A repo for OpenLLM to run pod. Contribute to justinwlin/Runpod-OpenLLM-Pod-and-Serverless development by creating an account on GitHub.
justin
justin4mo ago
Yeah, here is a tutorial: https://discord.com/channels/912829806415085598/948767517332107274/1209990744094408774 I recommend doing the tutorial first, cause I think u have how u define ur server code (what gets put on runpod) confused with what u call from ur computer or client
kingclimax7569
kingclimax75694mo ago
ok i don't know if there's something im missing here but when my company set up a serverless endpoint it was on your website
kingclimax7569
kingclimax75694mo ago
No description
kingclimax7569
kingclimax75694mo ago
im not sure what server im supposed to be uploading code to. All I want to do is query my existing endpoints so I can retrieve the result of a prompt all at once instead of streaming it
kingclimax7569
kingclimax75694mo ago
No description
kingclimax7569
kingclimax75694mo ago
Im supposed to have access to these endpoints; the one I am trying to get working is /status
justin
justin4mo ago
Okay this is a veryyyy diff situation then
kingclimax7569
kingclimax75694mo ago
yes im getting that impression lol I thought it was straight forward
justin
justin4mo ago
I didnt realize ur under a company, b/c that means u arent the one deploying it
justin
justin4mo ago
GitHub
exllama-runpod-serverless/handler.py at master · hommayushi3/exllam...
Contribute to hommayushi3/exllama-runpod-serverless development by creating an account on GitHub.
kingclimax7569
kingclimax75694mo ago
when I say company, it was just my boss who did it
justin
justin4mo ago
return_aggregate_stream has to be defined here, on the server
kingclimax7569
kingclimax75694mo ago
correct
justin
justin4mo ago
Yup, so u can tell ur boss to add the return aggregate stuff we talked about and u'll be able to get the end result in the future. The problem isnt something u on the clientside (who is calling the function) can fix; the problem is what got deployed. I thought the problem was: 1) u deployed it, 2) ur looking for a solution to make the deployment do what u want. But the problem is: 1) someone else deployed it, 2) u want a different behavior than what is defined
kingclimax7569
kingclimax75694mo ago
well I have full access so I can do it. Are you saying there was an issue when setting up the new endpoint? I could just set up a new one with new configs no?
justin
justin4mo ago
Yes, the runpod.serverless.start({"handler": inference}) needs to have the return aggregate stuff we talked about
justin
justin4mo ago
Needs to be runpod.serverless.start({"handler": inference, "return_aggregate_stream": True })
justin
justin4mo ago
GitHub
exllama-runpod-serverless/handler.py at master · hommayushi3/exllam...
Contribute to hommayushi3/exllama-runpod-serverless development by creating an account on GitHub.
kingclimax7569
kingclimax75694mo ago
Yea I was looking at that and was confused at where that goes
justin
justin4mo ago
yea hopefully is clear now
kingclimax7569
kingclimax75694mo ago
okay so does that code go somewhere in the config on the website??
Solution
justin
justin4mo ago
Okay… 1) What is deployed to runpod is: https://github.com/hommayushi3/exllama-runpod-serverless/blob/master/handler.py 2) U need to change the line I specified at the bottom of the file; u should have a copy of this github repo locally 3) U have to rebuild the image and redeploy it to runpod 4) When u call it in the future, it will work
GitHub
exllama-runpod-serverless/handler.py at master · hommayushi3/exllam...
Contribute to hommayushi3/exllama-runpod-serverless development by creating an account on GitHub.
kingclimax7569
kingclimax75694mo ago
Okay forking the repo and editing the file then deploying on runpod makes sense
kingclimax7569
kingclimax75694mo ago
Hey sorry to bother you again, got it deployed and I'm not getting the same error but I'm continuously getting "IN_QUEUE" as a response
justin
justin4mo ago
It means it's in queue unless ur thing is responding. If it's not responding, u gotta check ur own logs to see if it's crashing somewhere or something; gotta check the UI console on runpod
kingclimax7569
kingclimax75694mo ago
yea im trying to view the logs but apparently there aren't any available
kingclimax7569
kingclimax75694mo ago
No description
kingclimax7569
kingclimax75694mo ago
when building the container image I just reference my repo with the following format correct?: "<github-username>/exllama-runpod-serverless:latest"
kingclimax7569
kingclimax75694mo ago
No description
justin
justin4mo ago
yea, it looks like ur stuff is still initializing, so u gotta check the initialization logs. Also, it's usually not ur github username, it's usually ur *dockerhub* username; it should be pushed to dockerhub. If it's stuck initializing, it usually means u didnt push it, or u tagged it wrong, or wrong platform
kingclimax7569
kingclimax75694mo ago
ahh that makes more sense, I thought using git was too easy, but I couldn't find the original repo on dockerhub so I tried git. Ill give that a shot, thank you
justin
justin4mo ago
Just to reiterate: https://discord.com/channels/912829806415085598/948767517332107274/1209990744094408774 I highly recommend again to check out this tutorial, I think it will be helpful 🙂 The main thing is, if ur on a Mac, as I said in that thread, to append a --platform flag to your docker build command. Cause it goes through the process of taking code + building + shipping it to dockerhub + then using it on runpod
kingclimax7569
kingclimax75694mo ago
Thank you, I'll push it to docker hub and try it out. Does my docker hub repo need to be public? My endpoint still seems to be stuck on initializing
justin
justin4mo ago
If it is not public, u need to add docker registry credentials under settings, otherwise it is impossible for it to find it lol. If there is no reason to have it private, id just have it public too, personally, unless u bundled some sort of trade-secret sauce, but it seems ur just using normal llama and modified the handler.py a bit
kingclimax7569
kingclimax75694mo ago
Hey, I changed it to public right after I asked that and it's running now. Stupid question haha. Thank you for your patience bro
kingclimax7569
kingclimax75694mo ago
Same result unfortunately
No description
kingclimax7569
kingclimax75694mo ago
Same when I use the console
kingclimax7569
kingclimax75694mo ago
No description
justin
justin4mo ago
Can you share ur github repo? or ur handler.py
kingclimax7569
kingclimax75694mo ago
GitHub
exllama-runpod-serverless/handler.py at master · enpro-github/exlla...
For use with runpod. Contribute to enpro-github/exllama-runpod-serverless development by creating an account on GitHub.
justin
justin4mo ago
what input are u sending to it? weird. Try with /run so that the output is actually persisted for a bit longer. When u do it, do u see the results coming out when u click the stream button?
kingclimax7569
kingclimax75694mo ago
yup, im using run
kingclimax7569
kingclimax75694mo ago
No description
kingclimax7569
kingclimax75694mo ago
ill try the stream on sec
justin
justin4mo ago
I find it weird that its completed but ur UI isn't green
kingclimax7569
kingclimax75694mo ago
No description
kingclimax7569
kingclimax75694mo ago
results of stream. which part should be green? I assume it works for you?
justin
justin4mo ago
Nvm, I get it, b/c it wont turn green till the stream is done
No description
kingclimax7569
kingclimax75694mo ago
wtf why am I not getting that from /status lol
justin
justin4mo ago
what variable are u passing down? well, this is my own stuff. can i see ur input? Are you passing down a stream: True variable?
def inference(event) -> Union[str, Generator[str, None, None]]:
    logging.info(event)
    job_input = event["input"]
    if not job_input:
        raise ValueError("No input provided")

    prompt: str = job_input.pop("prompt_prefix", prompt_prefix) + job_input.pop("prompt") + job_input.pop("prompt_suffix", prompt_suffix)
    max_new_tokens = job_input.pop("max_new_tokens", 100)
    stream: bool = job_input.pop("stream", False)

    generator, default_settings = load_model()

    settings = copy(default_settings)
    settings.update(job_input)
    for key, value in settings.items():
        setattr(generator.settings, key, value)

    if stream:
        output: Union[str, Generator[str, None, None]] = generate_with_streaming(prompt, max_new_tokens)
        for res in output:
            yield res
    else:
        output_text = generator.generate_simple(prompt, max_new_tokens = max_new_tokens)
        yield output_text[len(prompt):]

runpod.serverless.start({"handler": inference, "return_aggregate_stream": True})
kingclimax7569
kingclimax75694mo ago
No description
kingclimax7569
kingclimax75694mo ago
you mean that?
justin
justin4mo ago
It seems like ur code needs it
stream: bool = job_input.pop("stream", False)
specifically this part. U might have been just running it as an all-in-one output. I find it weird that your output is still not persisted, but that is a great place to start
kingclimax7569
kingclimax75694mo ago
sorry what line am I supposed to change exactly? am I supposed to change it to equal True?
justin
justin4mo ago
{
    "input": {
        "prompt": "Hello, world! Today is a great day to",
        "stream": true
    }
}
I assume something like the above, since ur code is looking for a stream variable
if stream:
    output: Union[str, Generator[str, None, None]] = generate_with_streaming(prompt, max_new_tokens)
    for res in output:
        yield res
else:
    output_text = generator.generate_simple(prompt, max_new_tokens = max_new_tokens)
    yield output_text[len(prompt):]
Otherwise, it says ur just going to get it all in one shot. Other than that, idk why ur code is going wrong; u can refer to my code and try to break it down if you want: https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless/blob/main/handler.py My code is a bit weird cause I wrote it to work on both serverless / gpu pod depending on env variables, but yeah, other than that, I really dont know what else is going on with ur code. 1) Should do /run 2) Need to pass down a stream variable that is true
kingclimax7569
kingclimax75694mo ago
I actually want to wait to get it all in one shot
justin
justin4mo ago
ah, got it. Hm, I recommend maybe just doing a print statement on the output_text / what you are yielding, before you slice it out
output_text = generator.generate_simple(prompt, max_new_tokens = max_new_tokens)
yield output_text[len(prompt):]
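For example, a hypothetical debug version of those two lines (variable names taken from the exllama handler quoted above) that prints both the raw generation and the slice that actually gets yielded, so the worker logs show whether the slicing empties the string:
output_text = generator.generate_simple(prompt, max_new_tokens=max_new_tokens)
print("raw output_text:", repr(output_text))               # full generation, prompt included
print("sliced output:", repr(output_text[len(prompt):]))   # what actually gets yielded
yield output_text[len(prompt):]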
kingclimax7569
kingclimax75694mo ago
yea that sounds like a good idea
justin
justin4mo ago
I find it weird your code can work like this tbh lol idk how it is ending up in /stream
kingclimax7569
kingclimax75694mo ago
will I be able to see the print output in the logs
justin
justin4mo ago
Or what is this output? yea. Tbh, if ur just doing it all in one shot and u dont care about stream, u could just return
kingclimax7569
kingclimax75694mo ago
yea me neither I honestly just followed a tutorial and it used this
justin
justin4mo ago
directly, and not yield. Ah, very weird. Or if ur company might use stream then nvm. But yeah, honestly I cannot say; my best bet is u can look to my repo for guidance, ik mine works, and mine implements both stream / one shot. It isn't the same library, but the structure will be the same. Otherwise I really got no clue without diving deeper and debugging myself; u'll prob need to check through all that. U can probably just return directly is my guess too, for the one shot 🤔
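A sketch of the direct-return shape being suggested here, assuming no streaming is needed: a plain (non-generator) handler that returns the whole completion at once. generate_text is a placeholder for the real exllama call; the actual handler would keep its own model-loading and settings code.
import runpod

def generate_text(prompt, max_new_tokens):
    # placeholder for the real exllama generation call
    return prompt + " ...full completion..."

def handler(event):
    job_input = event["input"]
    prompt = job_input["prompt"]
    max_new_tokens = job_input.get("max_new_tokens", 100)
    output_text = generate_text(prompt, max_new_tokens)
    # return (not yield) the final string; it should then show up as the job output
    return output_text[len(prompt):]

runpod.serverless.start({"handler": handler})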
kingclimax7569
kingclimax75694mo ago
So I printed off the output:
kingclimax7569
kingclimax75694mo ago
No description
kingclimax7569
kingclimax75694mo ago
It does indeed look like it's actually generating the text result. Does handler.py need to return a dictionary or something? I think Ashley alluded to this; I noticed your handler.py does
justin
justin4mo ago
Hm, I think it's that runpod doesnt handle dictionaries well, but im not too sure. Maybe try json.dumps() on what ur yielding out, but honestly not too sure: yield json.dumps(xxxx) or something like this. But maybe make a new post and see if runpod staff can help u out with ur repo / handler, im not too sure what the issue may be. I dont return a dictionary, I yield back out a string, which gets put into the output by runpod automatically. Honestly, ur code doesnt seem too bad to me, so not too sure without deep diving myself and trying diff structures out
ashleyk
ashleyk4mo ago
RunPod error field does not handle dictionaries, output is fine.
justin
justin4mo ago
when ur printing, are u printing it with the output[len(prompt):] thing, or just printing the output_text? I wonder if maybe ur string slicing is yielding an empty string, but I find it weird it shows in stream for u but not normally. Sounds like something to ask runpod tho, if u can see it in /stream but not under /run. The only other advice I can give is to strip the code down to a simple example from the docs and build it back up until it breaks
kingclimax7569
kingclimax75694mo ago
Hey sorry for the late reply, im trying this out now but the logs keep ending here
kingclimax7569
kingclimax75694mo ago
No description
kingclimax7569
kingclimax75694mo ago
although im still getting the "COMPLETED" response, but sometimes the logs don't seem to finish all the way through
justin
justin4mo ago
To be honest, im not too sure; this seems like something u should ask runpod in a new question / investigate by playing around with different structures. I shared how my code works before and ik that works, but I dont know why urs doesnt; it looks about the same to me, which is why I said maybe u just gotta run it in a GPU pod, not just in serverless, and build it up step by step
kingclimax7569
kingclimax75694mo ago
hmm ok thanks again you helped a lot. I feel like im a lot closer
justin
justin4mo ago
Yeah, I would consider, if u dont need streaming, just doing a direct return. If u do a direct return there are fewer things to worry about, so dont yield
kingclimax7569
kingclimax75694mo ago
Tried that, but I havent tried returning the output text without the slicing, so ill try that next
justin
justin4mo ago
Yeah, I mean theoretically, even if this ends up not working, you should be able to do like: return "Hello World", and if that doesn't work something reallllllyyy is wrong somewhere and something is being missed, cause that'd be the most fundamental thing
kingclimax7569
kingclimax75694mo ago
Yea, that's exactly what I'm thinking; at the very least I should be able to affect the outcome, even if it means breaking it. If not, I'm just gonna use a new LLM lol
justin
justin4mo ago
if u fall to that point can just use mine xD
kingclimax7569
kingclimax75694mo ago
Will do lol
justin
justin4mo ago
https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless/blob/main/README.md Yeh, I mean my runpod stuff; as long as u set the env variable as my readme describes, u have the clientside code / docker image ready to go lol
GitHub
Runpod-OpenLLM-Pod-and-Serverless/README.md at main · justinwlin/Ru...
A repo for OpenLLM to run pod. Contribute to justinwlin/Runpod-OpenLLM-Pod-and-Serverless development by creating an account on GitHub.
justin
justin4mo ago
just gotta change the way ur prompting it if u want to have that system / user thing, but that is an easy thing u can preprompt it with anyways. gl gl lol