Well, I thought that maybe it has to load the model into memory the first time it runs on a worker, which is why the first request takes so long and the following requests are faster.
I'm using Llama via api.cloudflare.com with Bearer auth. Where can I check my usage? If I exceed the 10,000 free Neurons, am I charged automatically, or does it just stop working?
Check the model page(s) (https://developers.cloudflare.com/workers-ai/models/llama-3.1-8b-instruct-fast#Parameters). Beware that, for the most part, models within a category list the same set of parameters on their pages, whether or not they actually support them all. E.g. @hf/nousresearch/hermes-2-pro-mistral-7b doesn't support temperature or seed, even though those parameters are listed on its model page. (I reported it in workers-ai a while ago but never got a response, so no idea whether it's intended to be that way.) Also, expect some quirks: e.g. a subset of the models will break if you set max_tokens to 597 or higher (reported in workers-ai as well).
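For reference, a request with those parameters looks roughly like this. This is a minimal Python sketch, not official client code: the account ID, token, and prompt are placeholders you'd substitute, and the endpoint path follows the pattern documented for the Workers AI REST API.

```python
import json
import urllib.request

# Placeholders -- substitute your own account ID, API token, and model slug.
ACCOUNT_ID = "YOUR_ACCOUNT_ID"
API_TOKEN = "YOUR_API_TOKEN"
MODEL = "@cf/meta/llama-3.1-8b-instruct-fast"


def build_request(prompt: str, max_tokens: int = 256, temperature: float = 0.6):
    """Build a POST request for the Workers AI run endpoint.

    Per the quirks above: some models silently ignore temperature/seed even
    though the docs list them, and some break when max_tokens is 597+.
    """
    url = f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/{MODEL}"
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_request("Why is the first request slow?", max_tokens=512)
# urllib.request.urlopen(req)  # uncomment to actually send the request
```

Keeping max_tokens well below 597 sidesteps the breakage mentioned above until it's fixed.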