Text-generation-inference on serverless endpoints
Hi, I don't have much experience with either LLMs or Python, so I always just use the image 'ghcr.io/huggingface/text-generation-inference:latest' and run my models on Pods. Now I want to try serverless endpoints, but I don't know how to launch text-generation-inference there. Can someone give me some tips, or point me to docs that could help?
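
For context, this is roughly how I start it on a Pod today (the flags and model name are just placeholders based on the TGI README, not my exact setup):

```
# Example TGI launch on a Pod; swap in your own model and volume path
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id HuggingFaceH4/zephyr-7b-beta
```

Once it is running, I just send requests to port 8080. I'm not sure how (or whether) this same container and launch command translate to a serverless endpoint, which is what I'm asking about.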



