It's entirely use-case based though. Serving at scale is not the same as an end-user device. On an end-user device you mostly care about single-request speed, but in a large-scale deployment you need to handle many requests concurrently. Although they're a touch old, see the benchmarks for 70B models here, for example. With 1 concurrent user, llama.cpp wins, but with 32 concurrent users, vLLM wins by a large margin (40 req/min vs 172). https://github.com/ggerganov/llama.cpp/discussions/6730
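
If you want to sanity-check this for your own setup, a rough sketch of that kind of measurement: fire a fixed number of requests at an OpenAI-compatible completions endpoint (both vLLM and llama.cpp's server expose one) with a configurable concurrency limit and compute requests per minute. The endpoint URL, model name, prompt, and request counts below are placeholders, not anything from the linked benchmarks.

```python
# Minimal throughput sketch: send TOTAL_REQUESTS requests with at most
# CONCURRENCY in flight at once, then report requests per minute.
# ENDPOINT and MODEL are placeholders for whatever server you're testing.
import asyncio
import time

import httpx

ENDPOINT = "http://localhost:8000/v1/completions"  # placeholder URL
MODEL = "your-70b-model"                            # placeholder model name
CONCURRENCY = 32                                    # try 1 vs 32 and compare
TOTAL_REQUESTS = 64

async def one_request(client: httpx.AsyncClient, sem: asyncio.Semaphore) -> None:
    # The semaphore caps how many requests are in flight simultaneously.
    async with sem:
        await client.post(
            ENDPOINT,
            json={"model": MODEL, "prompt": "Hello", "max_tokens": 128},
            timeout=600,
        )

async def main() -> None:
    sem = asyncio.Semaphore(CONCURRENCY)
    start = time.monotonic()
    async with httpx.AsyncClient() as client:
        await asyncio.gather(
            *(one_request(client, sem) for _ in range(TOTAL_REQUESTS))
        )
    elapsed = time.monotonic() - start
    print(f"{TOTAL_REQUESTS / elapsed * 60:.1f} req/min at concurrency {CONCURRENCY}")

asyncio.run(main())
```

Run it once with CONCURRENCY = 1 and once with 32 against each backend; the single-user numbers tend to favor llama.cpp while the batched numbers are where vLLM's continuous batching pays off.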