update worker-vllm to vLLM 0.5.0
vLLM was just bumped to 0.5.0, with significant features ready for production. @Alpay Ariyak
FP8 is very significant, but so are speculative decoding and prefix caching (see the configuration sketch after this list).
- FP8 support is ready for testing. Quantizing a portion of the model weights to 8-bit floating point gives roughly a 1.5x inference speedup.
- Added OpenAI Vision API support. Currently only LLaVA and LLaVA-NeXT are supported.
- Speculative Decoding and Automatic Prefix Caching are also ready for testing; the plan is to turn them on by default in upcoming releases.
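For reference, a minimal sketch of how these features can be turned on through vLLM's Python engine arguments. The model name and draft model below are illustrative assumptions, not something worker-vllm prescribes, and how worker-vllm ultimately exposes these options (e.g. via its environment variables) still needs to be decided.

```python
# Minimal sketch: opting into the vLLM 0.5.0 features from the Python API.
# Model choices are placeholders; adjust to whatever worker-vllm is serving.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model, not prescribed here
    quantization="fp8",                            # FP8 weight quantization (ready for testing)
    enable_prefix_caching=True,                    # Automatic Prefix Caching (opt-in for now)
    speculative_model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # hypothetical draft model
    num_speculative_tokens=5,                      # how many tokens the draft model proposes
    use_v2_block_manager=True,                     # speculative decoding needs the v2 block manager
)

outputs = llm.generate(["Hello, world!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

The same knobs exist as CLI/engine flags (`--quantization fp8`, `--enable-prefix-caching`, `--speculative-model`, `--num-speculative-tokens`), so worker-vllm could forward them from its existing config path.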
