
Cloudflare AI Gateway has a 100-second origin timeout. But when using reasoning models like Gemini 2.5 Pro without streaming output, it's easy to exceed that 100-second limit.

This results in an HTTP 524 error, which is a real headache.
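The practical way around it is to request streaming output, so the upstream starts sending bytes long before the 100-second mark. Below is a minimal sketch, assuming a Google AI Studio route through the gateway; the account ID, gateway ID, and API key are placeholders, and the exact gateway path should be checked against your own gateway's settings.

```ts
// Hedged sketch: call Gemini 2.5 Pro via AI Gateway with SSE streaming,
// so the response begins well before the 100-second origin timeout.
// ACCOUNT_ID, GATEWAY_ID, and GEMINI_API_KEY are placeholders (assumptions).
const ACCOUNT_ID = "<your-cloudflare-account-id>";
const GATEWAY_ID = "<your-gateway-id>";
const GEMINI_API_KEY = "<your-gemini-api-key>";

// Assumed gateway URL shape for the Google AI Studio provider; verify the
// provider segment and API version against your gateway dashboard.
const url =
  `https://gateway.ai.cloudflare.com/v1/${ACCOUNT_ID}/${GATEWAY_ID}` +
  `/google-ai-studio/v1beta/models/gemini-2.5-pro:streamGenerateContent?alt=sse`;

const response = await fetch(url, {
  method: "POST",
  headers: {
    "content-type": "application/json",
    "x-goog-api-key": GEMINI_API_KEY,
  },
  body: JSON.stringify({
    contents: [{ role: "user", parts: [{ text: "Write a long, detailed essay." }] }],
  }),
});

// Read the SSE stream chunk by chunk. Because bytes keep arriving, there is
// no single 100+ second wait for a complete response body.
const reader = response.body!.getReader();
const decoder = new TextDecoder();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  process.stdout.write(decoder.decode(value, { stream: true }));
}
```

The key difference is the `:streamGenerateContent?alt=sse` endpoint instead of `:generateContent`: the non-streaming call makes the gateway wait for the entire (possibly multi-minute) reasoning response, while streaming turns it into a steady trickle of chunks.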