Casper.

update worker-vllm to vllm 0.5.0

vLLM just got bumped to 0.5.0, with significant features ready for production. @Alpay Ariyak

FP8 is very significant, but so are speculative decoding and prefix caching.

- FP8 support is ready for testing. By quantizing a portion of the model weights to 8-bit floating point, inference speed gets a 1.5x boost.
- OpenAI Vision API support has been added. Currently only LLaVA and LLaVA-NeXT are supported.
- Speculative Decoding and Automatic Prefix Caching are also ready for testing; the plan is to turn them on by default in upcoming releases (see the sketch after this list).
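A minimal sketch of how these opt-in flags can be enabled through vLLM's offline Python API, assuming vLLM >= 0.5.0; the model name is only an example, this is not the worker-vllm configuration itself, and FP8 requires hardware with FP8 support:

```python
# Minimal sketch: FP8 weight quantization plus automatic prefix caching,
# both opt-in engine arguments in vLLM 0.5.0. Model name is an example.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    quantization="fp8",              # quantize weights to 8-bit floating point
    enable_prefix_caching=True,      # reuse KV cache across shared prompt prefixes
)

outputs = llm.generate(
    ["Explain FP8 quantization in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

Speculative decoding is enabled the same way, via the `speculative_model` and `num_speculative_tokens` engine arguments (this release also expects `use_v2_block_manager=True` when using it).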
Solution
For sure, already in progress!