vLLM and Triton were a response to a fast-growing ecosystem, and the production inference server that ultimately wins will not be written in Python.