Status endpoint only returns "COMPLETED" but no answer to the question

I'm currently using the v2/model_id/status/run_id endpoint, and the result I get is as follows: {"delayTime": 26083, "executionTime": 35737, "id": **, "status": "COMPLETED"}. My stream endpoint works fine, but for my purposes I'd rather wait longer and retrieve the entire result at once. How am I supposed to do that? Thank you
Solution:
GitHub
exllama-runpod-serverless/handler.py at master · hommayushi3/exllam...
Contribute to hommayushi3/exllama-runpod-serverless development by creating an account on GitHub.
ashleyk
ashleyk4mo ago
What kind of endpoint are you running? This is an issue with your endpoint, not with the status API.
justin
justin4mo ago
Ur main issue is maybe not returning properly
justin
justin4mo ago
If u want reference to functions that I made to make a /run call, and just keep polling their status: https://github.com/justinwlin/runpod_whisperx_serverless_clientside_code/blob/main/runpod_client_helper.py
GitHub
runpod_whisperx_serverless_clientside_code/runpod_client_helper.py ...
Helper functions for Runpod to automatically poll my WhisperX API. Can be adapted to other use cases - justinwlin/runpod_whisperx_serverless_clientside_code
kingclimax7569
kingclimax75694mo ago
I was using runsync instead of run, is that incorrect? I changed it to run and now I'm receiving IN_QUEUE instead. So I'm supposed to keep polling that?
ashleyk
ashleyk4mo ago
Yes, /run is asynchronous, but changing it will most likely not make any difference; if it does, then /runsync is broken. Just tested, and both work fine for me.
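For reference, a quick sketch of the synchronous flavour being discussed: /runsync blocks and returns the job output in the same response. The endpoint ID and API key are placeholders, and the input shape assumes the same worker as the polling script further down.
import requests

RUNPOD_API_KEY = "***"   # placeholder
ENDPOINT_ID = "***"      # placeholder

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    json={"input": {"prompt": "List all of the US presidents."}},
    headers={"Authorization": f"Bearer {RUNPOD_API_KEY}"},
    timeout=120,
)
body = resp.json()
print(body.get("status"), body.get("output"))   # COMPLETED plus the worker's output, if the job finished in time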
justin
justin4mo ago
/run is great b/c with /runsync I find I get a network timeout :))) but certainly /runsync is also great if the job is short enough. Also, /run gives u a 30 min cache on RunPod's end to store ur answer, vs /runsync which I forget how long but I think it's <1 min, so I find the 30 min cache nice. Also u can add a webhook if u want it to call back to ur webhook when the response is done, instead of polling.
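A minimal sketch of the /run + webhook pattern mentioned above: submit the job asynchronously and let RunPod POST the finished result to your own URL instead of polling /status. The endpoint ID, API key, and callback URL are placeholders.
import requests

RUNPOD_API_KEY = "***"   # placeholder
ENDPOINT_ID = "***"      # placeholder

payload = {
    "input": {"prompt": "List all of the US presidents."},
    # RunPod calls this URL with the job status and output once the job finishes.
    "webhook": "https://example.com/runpod-callback",
}

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    json=payload,
    headers={"Authorization": f"Bearer {RUNPOD_API_KEY}"},
    timeout=30,
)
print(resp.json())   # e.g. {"id": "...", "status": "IN_QUEUE"}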
kingclimax7569
kingclimax75694mo ago
Yea im still not getting the output, just a value that says "COMPLETED"
import requests
import sys
import json
import time

bearer_token = "**"
endpoint_id = "**"

prompt = """
List me all of the US presidents?

"""

# Define the URL
url = f"https://api.runpod.ai/v2/{endpoint_id}/run"

# Define the headers
headers = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {bearer_token}'
}


system_message = """You are a helpful, respectful and honest assistant and chatbot."""
prompt_template = f'''[INST] <<SYS>>
{system_message}
<</SYS>>'''

# Add the initial user message
prompt_template += f'\n{prompt} [/INST]'

print("here")
request = {
    'prompt': prompt_template,
    'max_new_tokens': 4000,
    'temperature': 0.7,
    'top_k': 50,
    'top_p': 0.7,
    'repetition_penalty': 1.2,
    'batch_size': 8,
}

response = requests.post(url, json=dict(input=request), headers={
    "Authorization": f"Bearer {bearer_token}"
})
print(response.text)
response_json = json.loads(response.text)

job = response_json['id']

while True:
    status_url = f"https://api.runpod.ai/v2/{endpoint_id}/status/{response_json['id']}"
    get_status = requests.get(status_url, headers=headers)
    print("here", get_status.text)
    status_id = json.loads(get_status.text)['id']
    status = json.loads(get_status.text)['status']

    if status in ["IN_QUEUE", "IN_PROGRESS"]:
        time.sleep(20)
    else:
        if status == "COMPLETED":
            print({
                "status": "COMPLETED",
                "output": json.loads(get_status.text).get("output")
            })
        else:
            print("error")
ashleyk
ashleyk4mo ago
How do you get a network timeout with runsync? You are doing something wrong. It eventually goes to IN_QUEUE or IN_PROGRESS if the request takes too long; it doesn't time out.
kingclimax7569
kingclimax75694mo ago
response: {"delayTime":662,"executionTime":9823,"id":"1d227fac-78f9-4e22-bb2e-1ff79718704a-u1","status":"COMPLETED"}
ashleyk
ashleyk4mo ago
Yes, I knew it would not make a difference. Your worker is most likely throwing an error, and you are most likely putting a dict in the error key, which causes this to happen. error only accepts a str and not a dict; RunPod made a shitty breaking change to the SDK that causes this. So now you have to do something like:
{
    "error": "Some error message",
    "output": someDict
}
I had this exact same issue and had to change my error handling to fix it.
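A hypothetical handler sketch of the fix described here: keep "error" a plain string and put any structured details under "output". do_inference and its return shape are made up for illustration; a real handler would call the actual model instead.
import runpod

def do_inference(job_input):
    # placeholder for the real model call
    return {"text": "echo: " + job_input.get("prompt", "")}

def handler(event):
    try:
        return {"output": do_inference(event["input"])}
    except Exception as e:
        # A dict in "error" can make the status payload get dropped, so stringify
        # the error and keep any structured details under "output" instead.
        return {"error": str(e), "output": {"details": repr(e)}}

runpod.serverless.start({"handler": handler})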
kingclimax7569
kingclimax75694mo ago
Sorry where does this change need to be made? thank you for the response
ashleyk
ashleyk4mo ago
in your endpoint handler file
kingclimax7569
kingclimax75694mo ago
Sorry I don't think I've ever modified that file, do I need the runpod python package to use it? I only have an endpoint that I set up
ashleyk
ashleyk4mo ago
Are you using the vllm worker?
kingclimax7569
kingclimax75694mo ago
Im not sure, how can I find that out?
def generator_handler():
    bearer_token = "**"
    endpoint_id = "**"

    prompt = """
List me all of the US presidents?

"""

    # Define the URL
    url = f"https://api.runpod.ai/v2/{endpoint_id}/run"

    # Define the headers
    headers = {
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {bearer_token}'
    }

    system_message = """You are a helpful, respectful and honest assistant and chatbot."""
    prompt_template = f'''[INST] <<SYS>>
{system_message}
<</SYS>>'''

    # Add the initial user message
    prompt_template += f'\n{prompt} [/INST]'

    print("here")
    request = {
        'prompt': prompt_template,
        'max_new_tokens': 4000,
        'temperature': 0.7,
        'top_k': 50,
        'top_p': 0.7,
        'repetition_penalty': 1.2,
        'batch_size': 8,
    }

    response = requests.post(url, json=dict(input=request), headers={
        "Authorization": f"Bearer {bearer_token}"
    })
    print(response.text)
    response_json = json.loads(response.text)

    job = response_json['id']

    while True:
        status_url = f"https://api.runpod.ai/v2/{endpoint_id}/status/{response_json['id']}"
        get_status = requests.get(status_url, headers=headers)
        print("here", get_status.text)
        status_id = json.loads(get_status.text)['id']
        status = json.loads(get_status.text)['status']

        if status in ["IN_QUEUE", "IN_PROGRESS"]:
            time.sleep(20)
        else:
            if status == "COMPLETED":
                print("COMPLETED")
                return {
                    "error": "error 1",
                    "output": json.loads(get_status.text)
                }
            else:
                return {
                    "error": "error 2",
                    "output": json.loads(get_status.text)
                }

if __name__ == '__main__':
    runpod.serverless.start({
        "handler": generator_handler, # Required
    })
Not sure if that makes sense?
kingclimax7569
kingclimax75694mo ago
No description
kingclimax7569
kingclimax75694mo ago
I get that response repeatedly
JJonahJ
JJonahJ4mo ago
I can share my code, but as far as I can see from what you've posted, your output should be in the 'tokens' part of the JSON that you get back. Try just printing everything you get back. If it's completed, it should be there…
elif status == "COMPLETED":
    tokens = json_response['output'][0]['choices'][0]['tokens']
    return tokens
Here's the relevant part of mine. If the status is COMPLETED, the output you want is in 'tokens'. Hope this helps! ...So if I'm reading yours right, you'll want something like
LLM_response = json.loads(get_status.text)['tokens']
I think, lol …unless the problem really is that all you’re getting back is ’completed’ and no tokens at all anywhere. In which case forget all I said 😅
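A hedged sketch of pulling the text out of a COMPLETED /status response for a vLLM-style worker, following the output[0]['choices'][0]['tokens'] shape above. It reuses endpoint_id, headers, and job from the polling script earlier in the thread; the exact keys depend on which worker image the endpoint runs, so print the whole payload first.
import json
import requests

# endpoint_id, headers, and job come from the polling script above
status_url = f"https://api.runpod.ai/v2/{endpoint_id}/status/{job}"
payload = requests.get(status_url, headers=headers).json()

if payload.get("status") == "COMPLETED":
    print(json.dumps(payload.get("output"), indent=2))   # inspect the real shape first
    try:
        tokens = payload["output"][0]["choices"][0]["tokens"]   # vLLM-worker-style shape
        print("".join(tokens) if isinstance(tokens, list) else tokens)
    except (TypeError, KeyError, IndexError):
        print("No 'tokens' key here; this worker returns a different output shape.")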
kingclimax7569
kingclimax75694mo ago
I will try this, thank you. Sorry, I didn't see this earlier. Hey, the object I'm getting back doesn't have the "tokens" key. Did you use a handler function?
JJonahJ
JJonahJ4mo ago
I just used the ready made vllm endpoint. 🤷‍♂️ I’m not really the one to ask. 👀
Alpay Ariyak
Alpay Ariyak4mo ago
Hi @kingclimax7569 , what are you looking to deploy?
kingclimax7569
kingclimax75694mo ago
Hey, I already have a serverless endpoint deployed. I'm just trying to use the status endpoint to retrieve the entire result of a query, instead of using the stream endpoint to retrieve the results gradually.
Alpay Ariyak
Alpay Ariyak4mo ago
Is it for a LLM?
kingclimax7569
kingclimax75694mo ago
Yes
Alpay Ariyak
Alpay Ariyak4mo ago
Have you tried our https://github.com/runpod-workers/worker-vllm? We’re adding full OpenAI compatibility this week
GitHub
GitHub - runpod-workers/worker-vllm: The RunPod worker template for...
The RunPod worker template for serving our large language model endpoints. Powered by vLLM. - runpod-workers/worker-vllm
kingclimax7569
kingclimax75694mo ago
I'm sorry, how would that help? The problem seems to be with the RunPod endpoints, not the LLM.
justin
justin4mo ago
Ah, I think I know why. Do u have return_aggregate_stream set to true?
if mode_to_run in ["both", "serverless"]:
    runpod.serverless.start({
        "handler": handler,
        "concurrency_modifier": adjust_concurrency,
        "return_aggregate_stream": True,
    })
U prob need return_aggregate_stream = True, so that if u are streaming, the streaming results become available on /run. Also, I think he's just sharing that if u wanna use vLLM, RunPod has a pretty good setup if it's not a custom model.
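A rough server-side sketch of what's being described, assuming a generic worker: a generator handler that yields chunks, plus return_aggregate_stream=True so the chunks get stitched together and show up as the job output when polling later. generate_chunks is a stand-in for the real model call.
import runpod

def generate_chunks(prompt):
    # stand-in for a real streaming model call
    for word in ["Streaming", "chunks", "for:", prompt]:
        yield word + " "

def handler(event):
    prompt = event["input"]["prompt"]
    for chunk in generate_chunks(prompt):
        yield chunk          # each yield becomes one item on /stream

runpod.serverless.start({
    "handler": handler,
    "return_aggregate_stream": True,   # aggregate the stream so the full output is available after completion
})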
Alpay Ariyak
Alpay Ariyak4mo ago
It seems more like an issue in your worker code, because status should return the latest stream.
justin
justin4mo ago
https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless/blob/main/handler.py https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless/blob/main/stream_client_side.py Here is an example of my own handler.py that I wrote for my own custom LLM stuff. You can ignore all the bash stuff, that is just for my own sake, but I know this works great for streaming / retrieving, and I have the client-side code that I've tested and validated. https://docs.runpod.io/serverless/workers/handlers/handler-async Here is the documentation for when you want to stream / have your end result available aggregated under the /status endpoint once the stream is done, which is what I based my handler on.
kingclimax7569
kingclimax75694mo ago
Thank you so much, I'm gonna go through this and report back. Sorry, I'm confused here: I don't see you using RunPod's serverless endpoints, instead you're using OpenLLM? I added return_aggregate_stream in my code but to no avail. Can you see what's wrong in the code I posted? Even if I do this:
import runpod
import requests
import sys
import json
import time
import os

os.environ["RUNPOD_AI_API_KEY"] = "***"

def generator_handler():
    print("hi")


if __name__ == '__main__':
    runpod.serverless.start({
        "handler": generator_handler, # Required
        "return_aggregate_stream": True,
    })
kingclimax7569
kingclimax75694mo ago
I get this:
No description
justin
justin4mo ago
Sorry, I actually misunderstood ur question. I thought u wanted to deploy ur own LLM. submit_job_and_stream_output in https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless/blob/main/stream_client_side.py is maybe what you want? Sorry, I thought before, b/c of ur code, that u were sharing ur python handler. I don't use the runpod endpoints that often, but stream_client_side.py is probably something u can try. Ik it works for the way I defined stream, with yielding, so I imagine it should work for runpod.
kingclimax7569
kingclimax75694mo ago
It looks like in the check_job_status function you're doing what I need. But the difference is I'm only receiving {"delayTime": 26083, "executionTime": 35737, "id": **, "status": "COMPLETED"} back when I do that. And when I add the handler, I get this response
justin
justin4mo ago
Sorry, let me ask: is this an endpoint deployed by runpod or by u? I've been very confused by this; this looks like an LLM. I guess I am confused b/c u shared this: https://discord.com/channels/912829806415085598/1208117793925373983/1208143617814700032 which is different from the one below where u ask me: https://discord.com/channels/912829806415085598/1208117793925373983/1209961564199845999
justin
justin4mo ago
RunPod
Llama2 7B Chat
Retrieve Results & StatusNote: For information on how to check job status and retrieve results, please refer to our Status Endpoint Documentation.Streaming Token Outputs Make a POST request to the /llama2-7b-chat/run API endpoint.Retrieve the job ID.Make a GET request to /llama2-7b-chat/stream/{...
justin
justin4mo ago
But if this is ur code, like ur own deployed code: 1) U don't wrap the runpod.start in main. 2) Ur function is not a generator.
kingclimax7569
kingclimax75694mo ago
Yea I'm not sure at all how to use the handler function lol
justin
justin4mo ago
Ok
Alpay Ariyak
Alpay Ariyak4mo ago
Yeah, could you please elaborate what your endgoal is and we can go from there
kingclimax7569
kingclimax75694mo ago
I just want to be able to retrieve the entire result of a query at once, instead of streaming it
justin
justin4mo ago
What llm do u wanna use?
Alpay Ariyak
Alpay Ariyak4mo ago
What model would you like to deploy, quantized or not? If quantized, what quantization?
justin
justin4mo ago
I think the problem is ur code isn't correct 😅 but there is existing code u can just deploy and have working, and then only worry about calling it. The main problem is ur code isn't correctly defined as a generator, and also b/c u printed and did not return (or actually, it should be yield), the return aggregate stream sees nothing, which is why u get nothing.
kingclimax7569
kingclimax75694mo ago
if you look up further I posted the original code
def generator_handler():
    bearer_token = "**"
    endpoint_id = "**"

    prompt = """
List me all of the US presidents?

"""

    # Define the URL
    url = f"https://api.runpod.ai/v2/{endpoint_id}/run"

    # Define the headers
    headers = {
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {bearer_token}'
    }

    system_message = """You are a helpful, respectful and honest assistant and chatbot."""
    prompt_template = f'''[INST] <<SYS>>
{system_message}
<</SYS>>'''

    # Add the initial user message
    prompt_template += f'\n{prompt} [/INST]'

    print("here")
    request = {
        'prompt': prompt_template,
        'max_new_tokens': 4000,
        'temperature': 0.7,
        'top_k': 50,
        'top_p': 0.7,
        'repetition_penalty': 1.2,
        'batch_size': 8,
    }

    response = requests.post(url, json=dict(input=request), headers={
        "Authorization": f"Bearer {bearer_token}"
    })
    print(response.text)
    response_json = json.loads(response.text)

    job = response_json['id']

    while True:
        status_url = f"https://api.runpod.ai/v2/{endpoint_id}/status/{response_json['id']}"
        get_status = requests.get(status_url, headers=headers)
        print("here", get_status.text)
        status_id = json.loads(get_status.text)['id']
        status = json.loads(get_status.text)['status']

        if status in ["IN_QUEUE", "IN_PROGRESS"]:
            time.sleep(20)
        else:
            if status == "COMPLETED":
                print("COMPLETED")
                return {
                    "error": "error 1",
                    "output": json.loads(get_status.text)
                }
            else:
                return {
                    "error": "error 2",
                    "output": json.loads(get_status.text)
                }

if __name__ == '__main__':
    runpod.serverless.start({
        "handler": generator_handler, # Required
    })
justin
justin4mo ago
Yes, but this doesn't tell us what model u want. Do u want llama? mistral? and so on. And is there an end goal? Like, do u want a custom model, do u wanna do custom logic later, and so on. Also, this is weird b/c ur mixing up clientside and serverside code: ur defining a URL to call inside of what looks to be the handler definition, vs it should just be calling the model.
kingclimax7569
kingclimax75694mo ago
hommayushi3/exllama-runpod-serverless:latest is the docker container im using. If im mixing it up please correct it lol, im not sure exactly which server any of this is supposed to go on
justin
justin4mo ago
Ok great! So: 1) If u have no reason to deploy ur own model, I recommend using runpod's managed endpoint: https://doc.runpod.io/reference/llama2-13b-chat 2) I recommend u use runpod's vllm worker instead and deploy it for official support: https://github.com/runpod-workers/worker-vllm (Option #1, deploy it by just modifying env variables) 3) Use my model, which is UNOFFICIAL but what I use: https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless And I have a picture of how to set it up. Instead of serverlessllm u would point it at justinwlin/whatever I have the mistral name as
RunPod
Llama2 13B Chat
Retrieve Results & StatusNote: For information on how to check job status and retrieve results, please refer to our Status Endpoint Documentation.Streaming Token Outputs Make a POST request to the /llama2-13b-chat/run API endpoint.Retrieve the job ID.Make a GET request to /llama2-13b-chat/stream...
GitHub
GitHub - runpod-workers/worker-vllm: The RunPod worker template for...
The RunPod worker template for serving our large language model endpoints. Powered by vLLM. - runpod-workers/worker-vllm
GitHub
GitHub - justinwlin/Runpod-OpenLLM-Pod-and-Serverless: A repo for O...
A repo for OpenLLM to run pod. Contribute to justinwlin/Runpod-OpenLLM-Pod-and-Serverless development by creating an account on GitHub.
justin
justin4mo ago
Yeah, here is a tutorial: https://discord.com/channels/912829806415085598/948767517332107274/1209990744094408774 I recommend doing the tutorial first, cause I think u have how u define ur server code (what gets put on runpod) confused with what u call from ur computer or client
kingclimax7569
kingclimax75694mo ago
ok i don't know if there's something im missing here but when my company set up a serverless endpoint it was on your website
kingclimax7569
kingclimax75694mo ago
No description
kingclimax7569
kingclimax75694mo ago
im not sure what server im supposed to be uploading code to. All I want to do is query my existing endpoints so I can retrieve the result of a prompt all at once instead of streaming it
kingclimax7569
kingclimax75694mo ago
No description
kingclimax7569
kingclimax75694mo ago
Im supposed to have access to these endpoints; the one I am trying to get working is /status
justin
justin4mo ago
Okay this is a veryyyy diff situation then
kingclimax7569
kingclimax75694mo ago
yes im getting that impression lol I thought it was straight forward
justin
justin4mo ago
I didnt realize ur under a company, b/c that means u arent the one deploying it
justin
justin4mo ago
GitHub
exllama-runpod-serverless/handler.py at master · hommayushi3/exllam...
Contribute to hommayushi3/exllama-runpod-serverless development by creating an account on GitHub.
kingclimax7569
kingclimax75694mo ago
when I say company, it was just my boss who did it
justin
justin4mo ago
return_aggregate_stream has to be defined here, on the server
kingclimax7569
kingclimax75694mo ago
correct
justin
justin4mo ago
Yup, so u can tell ur boss to add the return aggregate stuff we talked about and u'll be able to get the end result in the future. The problem isnt something u on the clientside (who is calling the function) can fix; the problem is what got deployed. I thought the problem was: 1) u deployed it, 2) ur looking for a solution to make the deployment do what u want. But the problem is: 1) someone else deployed it, 2) u want a different behavior than what is defined
kingclimax7569
kingclimax75694mo ago
well I have full access so I can do it. Are you saying there was an issue when setting up the new endpoint? I could just set up a new one with new configs no?
justin
justin4mo ago
Yes, the runpod.serverless.start({"handler": inference}) needs to have the return aggregate stuff we talked about
justin
justin4mo ago
Needs to be runpod.serverless.start({"handler": inference, "return_aggregate_stream": True })
justin
justin4mo ago
GitHub
exllama-runpod-serverless/handler.py at master · hommayushi3/exllam...
Contribute to hommayushi3/exllama-runpod-serverless development by creating an account on GitHub.
kingclimax7569
kingclimax75694mo ago
Yea I was looking at that and was confused at where that goes
justin
justin4mo ago
yea hopefully is clear now
kingclimax7569
kingclimax75694mo ago
okay so does that code go somewhere in the config on the website??
Solution
justin
justin4mo ago
Okay… 1) What is deployed to runpod is: https://github.com/hommayushi3/exllama-runpod-serverless/blob/master/handler.py 2) U need to change the line I specified at the bottom of the file; u should have a copy of this github repo locally 3) U have to rebuild the image and redeploy it to runpod 4) When u call it in the future, it will work
GitHub
exllama-runpod-serverless/handler.py at master · hommayushi3/exllam...
Contribute to hommayushi3/exllama-runpod-serverless development by creating an account on GitHub.
kingclimax7569
kingclimax75694mo ago
Okay forking the repo and editing the file then deploying on runpod makes sense
kingclimax7569
kingclimax75694mo ago
Hey sorry to bother you again, got it deployed and I'm not getting the same error but I'm continuously getting "IN_QUEUE" as a response
justin
justin4mo ago
It means it's in queue unless ur thing is responding. If it's not responding, u gotta check ur own logs to see if it's crashing somewhere or something; gotta check the UI console on runpod
kingclimax7569
kingclimax75694mo ago
yea im trying to view the logs but apparently there aren't any available
kingclimax7569
kingclimax75694mo ago
No description
kingclimax7569
kingclimax75694mo ago
when building the container image I just reference my repo with the following format correct?: "<github-username>/exllama-runpod-serverless:latest"
kingclimax7569
kingclimax75694mo ago
No description
justin
justin4mo ago
yea, it looks like ur stuff is still initializing, so u gotta check the initialization logs. Also, it's usually not ur github username, it's usually ur *dockerhub* username; it should be pushed to dockerhub. If it's stuck initializing, it usually means u didnt push it, or u tagged it wrong, or wrong platform
kingclimax7569
kingclimax75694mo ago
ahh that makes more sense, I thought using git was too easy, but I couldn't find the original repo on dockerhub so I tried git. Ill give that a shot, thank you
justin
justin4mo ago
Just to reiterate: https://discord.com/channels/912829806415085598/948767517332107274/1209990744094408774 I highly recommend again to check out this tutorial, I think it will be helpful 🙂 The main thing is, if ur on a Mac, as I said in that thread, to append a --platform flag to your docker build command. Cause it goes through the process of taking code + building + shipping it to dockerhub + then using it on runpod
kingclimax7569
kingclimax75694mo ago
Thank you, I'll push it to docker hub and try it out. Does my docker hub repo need to be public? My endpoint still seems to be stuck on initializing
justin
justin4mo ago
If it is not public, u need to add docker registry credentials under settings, otherwise it is impossible for it to find it lol. If there is no reason to have it private, id just have it public too, personally, unless u bundled some sort of trade-secret sauce, but it seems ur just using normal llama and modified the handler.py a bit
kingclimax7569
kingclimax75694mo ago
Hey, I changed it to public right after I asked that and it's running now. Stupid question haha. Thank you for your patience bro
kingclimax7569
kingclimax75694mo ago
Same result unfortunately
No description
kingclimax7569
kingclimax75694mo ago
Same when I use the console
kingclimax7569
kingclimax75694mo ago
No description
justin
justin4mo ago
Can you share ur github repo? or ur handler.py
kingclimax7569
kingclimax75694mo ago
GitHub
exllama-runpod-serverless/handler.py at master · enpro-github/exlla...
For use with runpod. Contribute to enpro-github/exllama-runpod-serverless development by creating an account on GitHub.
justin
justin4mo ago
what input are u sending to it? weird. Try with /run so that the output is actually persisted for a bit longer. When u do it, do u see the results coming out when u click the stream button?
kingclimax7569
kingclimax75694mo ago
yup, im using run
kingclimax7569
kingclimax75694mo ago
No description
kingclimax7569
kingclimax75694mo ago
ill try the stream on sec
justin
justin4mo ago
I find it weird that its completed but ur UI isn't green
kingclimax7569
kingclimax75694mo ago
No description
kingclimax7569
kingclimax75694mo ago
results of stream. which part should be green? I assume it works for you?
justin
justin4mo ago
Nvm, I get it, b/c it wont turn green till the stream is done
No description
kingclimax7569
kingclimax75694mo ago
wtf why am I not getting that from /status lol
justin
justin4mo ago
what variable are u passing down? well, this is my own stuff. can i see ur input? Are you passing down a stream: True variable?
def inference(event) -> Union[str, Generator[str, None, None]]:
    logging.info(event)
    job_input = event["input"]
    if not job_input:
        raise ValueError("No input provided")

    prompt: str = job_input.pop("prompt_prefix", prompt_prefix) + job_input.pop("prompt") + job_input.pop("prompt_suffix", prompt_suffix)
    max_new_tokens = job_input.pop("max_new_tokens", 100)
    stream: bool = job_input.pop("stream", False)

    generator, default_settings = load_model()

    settings = copy(default_settings)
    settings.update(job_input)
    for key, value in settings.items():
        setattr(generator.settings, key, value)

    if stream:
        output: Union[str, Generator[str, None, None]] = generate_with_streaming(prompt, max_new_tokens)
        for res in output:
            yield res
    else:
        output_text = generator.generate_simple(prompt, max_new_tokens = max_new_tokens)
        yield output_text[len(prompt):]

runpod.serverless.start({"handler": inference, "return_aggregate_stream": True})
kingclimax7569
kingclimax75694mo ago
No description
kingclimax7569
kingclimax75694mo ago
you mean that?
justin
justin4mo ago
It seems like ur code needs it
stream: bool = job_input.pop("stream", False)
specifically this part. U might have been just running it as an all-in-one output. I find it weird that your output is still not persisted, but that is a great place to start
kingclimax7569
kingclimax75694mo ago
sorry what line am I supposed to change exactly? am I supposed to change it to equal True?
justin
justin4mo ago
{
    "input": {
        "prompt": "Hello, world! Today is a great day to",
        "stream": true
    }
}
I assume something like the above, since ur code is looking for a stream variable
if stream:
    output: Union[str, Generator[str, None, None]] = generate_with_streaming(prompt, max_new_tokens)
    for res in output:
        yield res
else:
    output_text = generator.generate_simple(prompt, max_new_tokens = max_new_tokens)
    yield output_text[len(prompt):]
Otherwise, it says ur just going to get it all in one shot. Other than that, idk why ur code is going wrong; u can refer to my code and try to break it down if you want: https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless/blob/main/handler.py My code is a bit weird cause I wrote it to work on both serverless / gpu pod depending on env variables, but yeah, other than that, I really dont know what else is going on with ur code. 1) Should do /run 2) Need to pass down a stream variable that is true
kingclimax7569
kingclimax75694mo ago
I actually want to wait to get it all in one shot
justin
justin4mo ago
ah, got it. Hm, I recommend maybe just doing a print statement on the output_text / what you are yielding, before you slice it out
output_text = generator.generate_simple(prompt, max_new_tokens = max_new_tokens)
yield output_text[len(prompt):]
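For example, a hypothetical debug version of those two lines (variable names taken from the exllama handler quoted above) that prints both the raw generation and the slice that actually gets yielded, so the worker logs show whether the slicing empties the string:
output_text = generator.generate_simple(prompt, max_new_tokens=max_new_tokens)
print("raw output_text:", repr(output_text))               # full generation, prompt included
print("sliced output:", repr(output_text[len(prompt):]))   # what actually gets yielded
yield output_text[len(prompt):]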
kingclimax7569
kingclimax75694mo ago
yea that sounds like a good idea
justin
justin4mo ago
I find it weird your code can work like this tbh lol idk how it is ending up in /stream
kingclimax7569
kingclimax75694mo ago
will I be able to see the print output in the logs
justin
justin4mo ago
Or what is this output? yea. Tbh, if ur just doing it all in one shot and u dont care about stream, u could just return
kingclimax7569
kingclimax75694mo ago
yea me neither I honestly just followed a tutorial and it used this
justin
justin4mo ago
directly, and not yield. Ah, very weird. Or if ur company might use stream then nvm. But yeah, honestly I cannot say; my best bet is u can look to my repo for guidance, ik mine works, and mine implements both stream / one shot. It isn't the same library, but the structure will be the same. Otherwise I really got no clue without diving deeper and debugging myself; u'll prob need to check through all that. U can probably just return directly is my guess too, for the one shot 🤔
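A sketch of the direct-return shape being suggested here, assuming no streaming is needed: a plain (non-generator) handler that returns the whole completion at once. generate_text is a placeholder for the real exllama call; the actual handler would keep its own model-loading and settings code.
import runpod

def generate_text(prompt, max_new_tokens):
    # placeholder for the real exllama generation call
    return prompt + " ...full completion..."

def handler(event):
    job_input = event["input"]
    prompt = job_input["prompt"]
    max_new_tokens = job_input.get("max_new_tokens", 100)
    output_text = generate_text(prompt, max_new_tokens)
    # return (not yield) the final string; it should then show up as the job output
    return output_text[len(prompt):]

runpod.serverless.start({"handler": handler})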
kingclimax7569
kingclimax75694mo ago
So I printed off the output:
kingclimax7569
kingclimax75694mo ago
No description
kingclimax7569
kingclimax75694mo ago
It does indeed look like it's actually generating the text result. Does handler.py need to return a dictionary or something? I think Ashley alluded to this; I noticed your handler.py does
justin
justin4mo ago
Hm, I think it's that runpod doesnt handle dictionaries well, but im not too sure. Maybe try json.dumps() on what ur yielding out, but honestly not too sure: yield json.dumps(xxxx) or something like this. But maybe make a new post and see if runpod staff can help u out with ur repo / handler, im not too sure what the issue may be. I dont return a dictionary, I yield back out a string, which gets put into the output by runpod automatically. Honestly, ur code doesnt seem too bad to me, so not too sure without deep diving myself and trying diff structures out
ashleyk
ashleyk4mo ago
RunPod error field does not handle dictionaries, output is fine.
justin
justin4mo ago
when ur printing, are u printing it with the output[len(prompt):] thing, or just printing the output_text? I wonder if maybe ur string slicing is yielding an empty string, but I find it weird it shows in stream for u but not normally. Sounds like something to ask runpod tho, if u can see it in /stream but not under /run. The only other advice I can give is to strip the code down to a simple example from the docs and build it back up until it breaks
kingclimax7569
kingclimax75694mo ago
Hey sorry for the late reply, im trying this out now but the logs keep ending here
kingclimax7569
kingclimax75694mo ago
No description
kingclimax7569
kingclimax75694mo ago
although im still getting the "COMPLETED" response, but sometimes the logs don't seem to finish all the way through
justin
justin4mo ago
To be honest, im not too sure; this seems like something u should ask runpod in a new question / investigate by playing around with different structures. I shared how my code works before and ik that works, but I dont know why urs doesnt; it looks about the same to me, which is why I said maybe u just gotta run it in a GPU pod, not just in serverless, and build it up step by step
kingclimax7569
kingclimax75694mo ago
hmm ok thanks again you helped a lot. I feel like im a lot closer
justin
justin4mo ago
Yeah, I would consider, if u dont need streaming, just doing a direct return. If u do a direct return there are fewer things to worry about, so dont yield
kingclimax7569
kingclimax75694mo ago
Tried that, but I havent tried returning the output text without the slicing, so ill try that next
justin
justin4mo ago
Yeah, I mean theoretically, even if this ends up not working, you should be able to do like: return "Hello World", and if that doesn't work something reallllllyyy is wrong somewhere and something is being missed, cause that'd be the most fundamental thing
kingclimax7569
kingclimax75694mo ago
Yea, that's exactly what I'm thinking; at the very least I should be able to affect the outcome, even if it means breaking it. If not, I'm just gonna use a new LLM lol
justin
justin4mo ago
if u fall to that point can just use mine xD
kingclimax7569
kingclimax75694mo ago
Will do lol
justin
justin4mo ago
https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless/blob/main/README.md Yeh, I mean my runpod stuff; as long as u set the env variable as my readme describes, u have the clientside code / docker image ready to go lol
GitHub
Runpod-OpenLLM-Pod-and-Serverless/README.md at main · justinwlin/Ru...
A repo for OpenLLM to run pod. Contribute to justinwlin/Runpod-OpenLLM-Pod-and-Serverless development by creating an account on GitHub.
justin
justin4mo ago
just gotta change the way ur prompting it if u want to have that system / user thing, but that is an easy thing u can preprompt it with anyways. gl gl lol