RuntimeError: memory access out of bounds
Some of my Rust queue consumers throw exceptions very frequently before even starting the queue event handler (see https://github.com/cloudflare/workers-rs/issues/374), or they time out without anything happening. It somehow feels like there's an issue outside of my code. Any idea what could cause this? I've tried instrumenting for a core dump, but recordCoredump in wasm-coredump expects a request object, see https://github.com/cloudflare/wasm-coredump/issues/3.
I'm using wrangler 3.6.0, workers-rs 0.0.18, compatibility date 2023-08-15, and the following settings for the queue consumer:
During deployment I do get the following warning:
The warnings just indicate that anything > 1 MiB has slow cold starts, that's about it - that said, I don't think I've ever had a workers-rs project get that big.
I'd usually say you should be using https://github.com/rustwasm/console_error_panic_hook but if you're not getting to the point of your handler running then it might not do much.
You can register it in the start event.
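Roughly what that registration might look like: a minimal sketch, assuming the console_error_panic_hook crate is added as a dependency (the project's actual setup isn't shown in this thread):

```rust
use worker::*;

// Runs once when the Wasm module is instantiated, before any handler fires.
#[event(start)]
pub fn start() {
    // Forward Rust panics to console.error so they show up in the Workers logs.
    console_error_panic_hook::set_once();
}
```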
I understood though that the console error panic hook further increases the size, but I can indeed try this, although there might be no panic to log.
It increases it by a few kb at best iirc
https://discord.com/channels/595317990191398933/1101864360185442387/1101867823820709980
Unfortunately it doesn't help. I get the following log:
The Queue and timeout after 120s lines are both logged from my entry.mjs, while the Rust code doesn't log anything. It doesn't fail every time, and redeploying sometimes fixes it, sometimes it doesn't. The request is not defined error stems from the recordCoredump call, so it isn't the actual reason it fails; it just times out without doing anything.
So you get nothing logged from your #[event(queue)] handler? Have you tried logging in the #[event(start)] handler?
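For reference, a sketch of what early logging in the queue handler could look like (assuming the worker crate's queue feature; the real handler and message type aren't shown in this thread, so MyMessage is a stand-in):

```rust
use serde::{Deserialize, Serialize};
use worker::*;

// Stand-in for whatever the real queue payload looks like.
#[derive(Deserialize, Serialize)]
struct MyMessage {
    id: String,
}

#[event(queue)]
pub async fn queue(batch: MessageBatch<MyMessage>, _env: Env, _ctx: Context) -> Result<()> {
    // Log before doing any real work, to confirm the handler is reached at all.
    console_log!("queue handler reached, {} message(s)", batch.messages()?.len());
    Ok(())
}
```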
WASM observability on Workers isn't the best, but it is just WASM run by V8 at its core, and workers-rs is just a lot of wasm-bindgen and esbuild to abstract those away from you.
Nothing gets logged indeed. Not in the #[event(queue)] handler, nor the #[event(start)] function.
I'm assuming this isn't happening in a fresh, plain workers-rs project? I'd suggest looking through the issues on the wasm-bindgen repo, but they're very non-descript and usually just projects having their own issues.
I don't have it in all of my queue consumers either, just in 2 of them, but intermittently.
Unfortunately there's no way to check memory usage in Workers, and I don't know how the Queue consumers differ, but a typical Worker invoked via fetch can be pretty long-lived (upwards of 20+ hours sometimes).
I've peeked around the WASM/Rust Discords & GH orgs for memory access out of bounds and it's pretty much as generic as described - ideally there'd be a stack trace or something to give you more of a hint, but I guess that's part of what wasm-coredump would help with when it supports other handlers.
FWIW, you can probably just pass new Request("https://example.com") to the request parameter of recordCoredump.
There's nothing special about the request, it's just there to give a URL/headers for identifying what request it's associated with.
You could add headers to identify the queue/schedule run if you wanted.
thx i'll give that a try to see if I can get the core dump to work. btw as soon as i comment out the function call that does most of the work in my queue consumer it executes fine, although it does very little. it massively shrinks the uploaded size of the wasm file.
I think the core dump is not going to work well because as soon as I do a dev build, the size of my wasm binary becomes too large. I already managed to shrink it by a factor of 15 by no longer using chrono-tz's parse function and only supporting a couple of specific timezones.
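Roughly what that timezone shortcut might look like: a sketch assuming only a handful of zones are needed (the actual list in this project isn't shown):

```rust
use chrono_tz::Tz;

// Map only the zone names this worker actually needs, instead of relying on
// chrono-tz's FromStr parser and the large name table it pulls in.
fn resolve_tz(name: &str) -> Option<Tz> {
    match name {
        "Europe/Amsterdam" => Some(Tz::Europe__Amsterdam),
        "America/New_York" => Some(Tz::America__New_York),
        "UTC" => Some(Tz::UTC),
        _ => None,
    }
}
```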
What I did just notice though is that as soon as a queue event leads to an exception, all following executions no longer log from the Rust code.
The particular exception I see is caused by too many KV invocations:
After this exception my process doesn't stop until it times out 99 seconds later (with the 120s timeout I use). This might happen because I run multiple async tasks concurrently with try_join_all, although I would expect the try_join_all to let the error bubble up.
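For context, a sketch of the kind of fan-out in question (the task function is hypothetical; assumes the futures crate). try_join_all short-circuits on the first Err it sees, but it can only propagate failures that come back as Err; a panic inside one of the futures never becomes an Err, which in Wasm can leave the invocation hanging instead of failing cleanly:

```rust
use futures::future::try_join_all;
use worker::*;

// Hypothetical per-item task; the real work in this consumer isn't shown.
async fn handle_item(item: String) -> Result<()> {
    console_log!("processing {item}");
    // KV calls etc. would go here. Returning Err bubbles up through the join;
    // panicking here does not.
    Ok(())
}

async fn handle_batch(items: Vec<String>) -> Result<()> {
    // All tasks run concurrently; the first Err short-circuits the join.
    try_join_all(items.into_iter().map(handle_item)).await?;
    Ok(())
}
```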
This could be the race condition discussed in https://blog.cloudflare.com/wasm-coredumps/ due to a panic not rejecting the promise.
is there a way to kill the instance on a panic and ensure the next run is a fresh instance?
Ah, here's a ticket for my issue! https://github.com/cloudflare/workers-rs/issues/166
Although according to that issue it's been fixed by an update to wasm-bindgen, I might have another dependency bringing in a faulty version.
Cause a 1102 Resources Exceeded to reset the Worker instance, aka exceed CPU/RAM.
lol. that sounds like a blunt approach. love it. I'll first try to avoid the panic that causes this if it turns out to be within my reach.
A single Worker can do 1,000 in-house calls, i.e. KV, R2, etc.
You would want to reduce your consumer's batch size so that it doesn't hit that.
Yup, working on that, but when it does I don't want it to panic. Might be https://github.com/zebp/worker-kv/blob/3c53503d21248b0b00ac3d7802a94848f2e22178/src/builder.rs#L174 that throws a panic on the limit reached error.
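One hedged idea (not from the thread) for turning that hard stop into an ordinary error: keep a counter around the consumer's KV calls and return an Err before the subrequest limit is reached, rather than letting the builder panic. A rough sketch; the budget and structure are purely illustrative:

```rust
use worker::*;

// Illustrative guard: counts KV operations so the consumer can fail with a
// normal Err (and let the batch be retried) instead of panicking at the limit.
struct KvBudget {
    used: u32,
    max: u32,
}

impl KvBudget {
    fn new(max: u32) -> Self {
        Self { used: 0, max }
    }

    // Call this before each KV operation, e.g. `budget.spend()?; kv.get(...)`.
    fn spend(&mut self) -> Result<()> {
        if self.used >= self.max {
            return Err(Error::RustError("KV call budget exhausted".into()));
        }
        self.used += 1;
        Ok(())
    }
}
```

The budget would need to leave headroom for the consumer's other subrequests.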
It's usually all done in the handler
I seem to have stabilized things by reducing redundant list operations on KV, so it's been stable for quite a while now.
many many thanks for all your help! I'll add some more detail to the github issues and close them if it remains solved.