esp.py · 6d ago

400 error on Load balancing endpoint

Hello, first time here. I am using the llama.cpp server image to host a model via a load-balancing serverless endpoint. The worker is running, and I can see in the logs that my server is up, but when I try to hit the endpoint it returns a 400 error. Here is how I am making the request:
import requests

RUNPOD_API_KEY = "..."  # my RunPod API key

headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer " + RUNPOD_API_KEY,
}

data = {
    "prompt": [
        {"role": "system", "content": ""},
        # just try to limit the characters
        {"role": "user", "content": "who are you? I am trying to connect to you"},
    ],
    "n_predict": 512,
    "temperature": 0.3,
    "top_k": 40,
    "top_p": 0.90,
    "stopped_eos": True,
    "repeat_penalty": 1.05,
    "stop": [
        "assistant",
        "<|im_end|>",
    ],
    "seed": 42,
}

BASE_URL = "https://id.api.runpod.ai/completion"

response = requests.post(
    BASE_URL,
    headers=headers,
    json=data,
    timeout=3000,
)
This request takes about two minutes and then returns a 400 error. For more context, I am running the following image: ghcr.io/ggerganov/llama.cpp:server
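In case the payload shape matters: my understanding is that llama.cpp's native /completion route expects "prompt" as a single string (chat-style role/content messages would go to /v1/chat/completions instead), so a string-prompt version of the same request would look roughly like the sketch below. The "id" host is just the placeholder for my endpoint ID, and the parameter values are the same ones as above, not confirmed correct.

import requests

RUNPOD_API_KEY = "..."  # RunPod API key (placeholder)

headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer " + RUNPOD_API_KEY,
}

# Same request, but with "prompt" as one formatted string instead of a list
# of role/content dicts, which is what llama.cpp's /completion expects as far
# as I can tell. "stopped_eos" is dropped here because I believe it is a field
# in llama.cpp's response, not a request parameter.
data = {
    "prompt": "who are you? I am trying to connect to you",
    "n_predict": 512,
    "temperature": 0.3,
    "top_k": 40,
    "top_p": 0.90,
    "repeat_penalty": 1.05,
    "stop": ["assistant", "<|im_end|>"],
    "seed": 42,
}

# "id" is a placeholder for the endpoint ID
response = requests.post(
    "https://id.api.runpod.ai/completion",
    headers=headers,
    json=data,
    timeout=3000,
)
print(response.status_code, response.text)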
2 Replies
Unknown User · 5d ago (message not public)
Poddy · 5d ago
@esp.py
Escalated To Zendesk
The thread has been escalated to Zendesk!
Ticket ID: #26369