update worker-vllm to vLLM 0.5.0
vLLM was just bumped to 0.5.0, with significant features ready for production. @Alpay Ariyak
FP8 is very significant, but so are speculative decoding and prefix caching (see the configuration sketch after this list).
- FP8 support is ready for testing. Quantizing a portion of the model weights to 8-bit floating point gives roughly a 1.5x inference speedup.
- Added OpenAI Vision API support. Currently only LLaVA and LLaVA-NeXT are supported.
- Speculative Decoding and Automatic Prefix Caching are also ready for testing; the plan is to turn them on by default in upcoming releases.
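For reference, a minimal sketch of how these features can be turned on through vLLM's Python engine arguments. The model name and draft model below are illustrative assumptions, not something worker-vllm prescribes, and how worker-vllm ultimately exposes these options (e.g. via its environment variables) still needs to be decided.

```python
# Minimal sketch: opting into the vLLM 0.5.0 features from the Python API.
# Model choices are placeholders; adjust to whatever worker-vllm is serving.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model, not prescribed here
    quantization="fp8",                            # FP8 weight quantization (ready for testing)
    enable_prefix_caching=True,                    # Automatic Prefix Caching (opt-in for now)
    speculative_model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # hypothetical draft model
    num_speculative_tokens=5,                      # how many tokens the draft model proposes
    use_v2_block_manager=True,                     # speculative decoding needs the v2 block manager
)

outputs = llm.generate(["Hello, world!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

The same knobs exist as CLI/engine flags (`--quantization fp8`, `--enable-prefix-caching`, `--speculative-model`, `--num-speculative-tokens`), so worker-vllm could forward them from its existing config path.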
