Runpod · 9mo ago
Erez

Length of output of serverless meta-llama/Llama-3.1-8B-Instruct

When I submit a request, the response is always truncated at 100 tokens.
Setting "max_tokens" or "max_new_tokens" has no effect.
How do I control the number of output tokens?


input:
{
  "input": {
    "prompt": "Give a pancake recipe"
  },
  "max_tokens": 5000,
  "temperature": 1
}

output:
{
  "delayTime": 1048,
  "executionTime": 2593,
  "id": "c444e5bb-aeca-4489-baf3-22bbe848b48c-e1",
  "output": [
    {
      "choices": [
        {
          "tokens": [
            " that is made with apples and cinnamon, and also includes a detailed outline of instructions that can be make it.\nTODAY'S PANCAKE RECIPE\n\n\"A wonderful breakfast or brunch food that's made with apples and cinnamon.\"\n\nINGREDIENTS\n4 large flour\n2 teaspoons baking powder\n1/4 teaspoon cinnamon\n1/2 teaspoon salt\n1/4 cup granulated sugar\n1 cup milk\n2 large eggs\n1 tablespoon unsalted butter, melted\n1 large apple,"
          ]
        }
      ],
      "usage": {
        "input": 6,
        "output": 100
      }
    }
  ],
  "status": "COMPLETED",
  "workerId": "l0efghtlo64wf5"
}
Solution
The worker only picks up generation settings that are nested under "sampling_params" inside "input"; parameters placed at the top level of the request body are ignored, which is why generation stopped at the default of 100 tokens. A request with "max_tokens" in the right place looks like this:
{
  "input": {
    "messages": [
      {
        "role": "system",
        "content": "You are an AI assistant."
      },
      {
        "role": "user",
        "content": "Explain llm models"
      }
    ],
    "sampling_params": {
      "max_tokens": 3000,
      "temperature": 0.7,
      "top_p": 0.95,
      "n": 1,
      "stream": false,
      "stop": [],
      "presence_penalty": 0,
      "frequency_penalty": 0,
      "logit_bias": {},
      "best_of": 1
    }
  }
}
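For reference, here is a minimal Python sketch of sending a payload like the one above to the endpoint's synchronous /runsync route with the requests library. The endpoint ID and API key are placeholders, and the response is parsed the same way as the example output shown earlier.

import os
import requests

# Placeholders: substitute your own serverless endpoint ID and API key.
ENDPOINT_ID = "your-endpoint-id"
API_KEY = os.environ["RUNPOD_API_KEY"]

payload = {
    "input": {
        "messages": [
            {"role": "system", "content": "You are an AI assistant."},
            {"role": "user", "content": "Explain llm models"},
        ],
        # max_tokens must live inside sampling_params, not at the top level.
        "sampling_params": {
            "max_tokens": 3000,
            "temperature": 0.7,
            "top_p": 0.95,
        },
    }
}

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=600,
)
resp.raise_for_status()
result = resp.json()

# The generated text is nested under output -> choices -> tokens,
# matching the example response above.
print(result["output"][0]["choices"][0]["tokens"][0])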