RuntimeError: memory access out of bounds

Some of my Rust queue consumers throw exceptions very frequently before even starting the queue event handler, see https://github.com/cloudflare/workers-rs/issues/374. Other times they time out without anything happening. It feels like the issue is outside of my code. Any idea what could cause this? I've tried instrumenting for a core dump, but recordCoredump in wasm-coredump expects a request object, see https://github.com/cloudflare/wasm-coredump/issues/3.
Jorrit Salverda (OP), 3y ago
I'm using wrangler 3.6.0, workers-rs 0.0.18, compatibility date 2023-08-15, and the following settings for the queue consumer:
[[queues.consumers]]
queue = "***"
max_batch_size = 1
max_concurrency = 1
max_retries = 0
max_batch_timeout = 0
During deployment I do get the following warning:
Total Upload: 15789.56 KiB / gzip: 6358.20 KiB
▲ [WARNING] We recommend keeping your script less than 1MiB (1024 KiB) after gzip. Exceeding past this can affect cold start time
kian, 3y ago
The warning just indicates that anything over 1 MiB may have slower cold starts, that's about it. That said, I don't think I've ever had a workers-rs project get that big. I'd usually say you should be using https://github.com/rustwasm/console_error_panic_hook, but if you're not getting to the point of your handler running then it might not do much. You can register it in the start event.
Jorrit Salverda (OP), 3y ago
I understood that the console error panic hook increases the size even further, but I can indeed try it, although there might be no panic to log.
Jorrit Salverda (OP), 3y ago
Unfortunately it doesn't help. I get the following log:
{
  "outcome": "exception",
  "scriptName": "***",
  "diagnosticsChannelEvents": [],
  "exceptions": [
    {
      "name": "ReferenceError",
      "message": "request is not defined",
      "timestamp": 1693580862741
    }
  ],
  "logs": [
    {
      "message": [
        "Queue"
      ],
      "level": "log",
      "timestamp": 1693580742711
    },
    {
      "message": [
        "timeout after 120s"
      ],
      "level": "error",
      "timestamp": 1693580862711
    }
  ],
  "eventTimestamp": 1693580742709,
  "event": {
    "batchSize": 1,
    "queue": "location-trigger-aggregate"
  },
  "id": 0
}
The "Queue" and "timeout after 120s" messages are both logged from my entry.mjs, while the Rust code doesn't log anything. It doesn't fail every time; redeploying sometimes fixes it and sometimes doesn't. The "request is not defined" error stems from the recordCoredump call, so it isn't the actual reason for the failure. The Worker just times out without doing anything.
kian, 3y ago
So you get nothing logged from your #[event(queue)] handler? Have you tried logging in the #[event(start)] handler? WASM observability on Workers isn't the best, but it is just WASM run by V8 at its core, and workers-rs is just a lot of wasm-bindgen and esbuild to abstract those away from you.
Jorrit Salverda (OP), 3y ago
Nothing gets logged indeed, neither in the #[event(queue)] handler nor in the #[event(start)] function.
kian, 3y ago
I'm assuming this isn't happening in a fresh, plain workers-rs project? I'd suggest looking through the issues on the wasm-bindgen repo, but they're very nondescript and usually just projects having their own issues.
Jorrit Salverda (OP), 3y ago
I don't have it in all of my queue consumers either, just in 2 of them, and only intermittently.
kian, 3y ago
Unfortunately there's no way to check memory usage in Workers, and I don't know how Queue consumers differ, but a typical Worker invoked via fetch can be pretty long-lived (upwards of 20 hours sometimes). I've peeked around the WASM/Rust Discords and GitHub orgs for "memory access out of bounds" and the error is pretty much as generic as it sounds. Ideally there'd be a stack trace or something to give you more of a hint, but I guess that's part of what wasm-coredump would help with once it supports other handlers.
FWIW, you can probably just pass new Request("https://example.com") as the request parameter of recordCoredump. There's nothing special about the request; it's just there to give a URL/headers for identifying which request the dump is associated with. You could add headers to identify the queue/schedule run if you wanted.
new Request("https://example.com", {
  headers: {
    "x-queue-id": "whatever"
  }
});
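A minimal sketch of the suggestion above: wrapping queue batch metadata in a placeholder Request so a dump can be traced back to a queue run. The helper name syntheticRequestForBatch and the x-queue-name / x-batch-size headers are illustrative assumptions, and how the result is handed to recordCoredump depends on wasm-coredump's API, which is not shown here.

```javascript
// Hypothetical helper (not part of wasm-coredump or workers-rs):
// builds a placeholder Request whose headers identify the queue run.
function syntheticRequestForBatch(queueName, batchSize) {
  return new Request("https://example.com", {
    headers: {
      // Header names are made up for illustration; pick whatever helps you search.
      "x-queue-name": queueName,
      "x-batch-size": String(batchSize),
    },
  });
}

// Values taken from the log earlier in the thread.
const request = syntheticRequestForBatch("location-trigger-aggregate", 1);
console.log(request.headers.get("x-queue-name"), request.headers.get("x-batch-size"));
```

The URL itself carries no meaning here; only the headers distinguish one queue run from another.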
Jorrit Salverda (OP), 3y ago
Thx, I'll give that a try to see if I can get the core dump to work. Btw, as soon as I comment out the function call that does most of the work in my queue consumer, it executes fine, although it then does very little. It also massively shrinks the uploaded size of the wasm file. I think the core dump is not going to work well, because as soon as I do a dev build my wasm binary becomes too large. I already managed to shrink it by a factor of 15 by no longer using chrono-tz's parse function and only supporting a couple of specific timezones.
What I did just notice, though, is that as soon as a queue event leads to an exception, all following executions no longer log from the Rust code. The particular exception I see is caused by too many KV invocations:
js error: JsValue(Error: Too many API requests by single worker invocation.\nError: Too many API requests by single worker invocation.
After this exception my process doesn't stop until it times out 99 seconds later (with the 120s timeout I use). This might happen because I run multiple async tasks concurrently with:
let group_by_futures: Vec<_> = MeasurementGroupBy::iter()
    .map(|group_by| {
        self.execute_by_group(
            &location,
            &filtered_assets,
            start,
            &timezone,
            group_by,
        )
    })
    .collect();

let kv_requests_vec = try_join_all(group_by_futures)
    .await
    .map_err(map_to_boxed_error)?;
I would expect try_join_all to let the error bubble up, though. This could be the race condition discussed in https://blog.cloudflare.com/wasm-coredumps/, caused by a panic not rejecting the promise. Is there a way to kill the instance on a panic and ensure the next run is a fresh instance?
Ah, here's a ticket for my issue: https://github.com/cloudflare/workers-rs/issues/166. According to that issue it's been fixed by an update to wasm-bindgen, but I might have another dependency bringing in a faulty version.
kian, 3y ago
Cause a 1102 Resources Exceeded error to reset the Worker instance, i.e. exceed CPU/RAM.
Jorrit Salverda (OP), 3y ago
Lol, that sounds like a blunt approach. Love it. I'll first try to avoid the panic that causes this, if it turns out to be within my reach.
kian, 3y ago
A single Worker invocation can make 1,000 in-house calls (KV, R2, etc.). You would want to reduce your consumer's batch so that it doesn't hit that limit.
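One way to avoid running into that limit unexpectedly is to count calls yourself and stop early with an ordinary error. This is a minimal sketch, assuming the 1,000-call figure mentioned above. SubrequestBudget and withBudget are hypothetical names, not part of any Workers API, and the runtime does its own accounting independently of this counter.

```javascript
// Hypothetical guard: tracks how many in-house calls (KV, R2, ...) one invocation has made.
class SubrequestBudget {
  constructor(limit = 1000) {
    this.limit = limit; // assumed per-invocation limit, taken from the thread
    this.used = 0;
  }

  // Records `count` calls and returns true if they fit; otherwise returns false
  // and leaves the counter unchanged.
  tryAcquire(count = 1) {
    if (this.used + count > this.limit) return false;
    this.used += count;
    return true;
  }

  get remaining() {
    return this.limit - this.used;
  }
}

// Runs `operation` only while budget remains; otherwise rejects with a regular
// error that the caller can handle (for example by retrying the message later).
async function withBudget(budget, operation) {
  if (!budget.tryAcquire()) {
    throw new Error("subrequest budget exhausted");
  }
  return operation();
}

// Usage: create one budget per invocation and route every KV call through it.
const budget = new SubrequestBudget(2);
withBudget(budget, async () => "first call").then(console.log);
```

Creating the budget inside the handler matters, because a Worker instance can be reused across invocations and a module-level counter would never reset.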
Jorrit Salverda (OP), 3y ago
Yup, working on that, but when it does hit the limit I don't want it to panic. It might be https://github.com/zebp/worker-kv/blob/3c53503d21248b0b00ac3d7802a94848f2e22178/src/builder.rs#L174 that panics on the limit-reached error.
kian, 3y ago
It's usually all done in the handler
Jorrit Salverda (OP), 3y ago
I seem to have stabilized things by reducing redundant list operations on KV, so it has been stable for quite a while now. Many, many thanks for all your help! I'll add some more detail to the GitHub issues and close them if it remains solved.
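The thread doesn't say how the redundant list operations were removed. One possible approach, sketched here under that caveat, is to memoize list results per prefix for the duration of a single invocation. memoizeList and fakeList are hypothetical stand-ins; a real KV list call would take the place of fakeList.

```javascript
// Hypothetical memoizer: caches the promise per prefix, so repeated lookups
// within one invocation share a single underlying list call.
// Note: a rejected promise is cached too; evict the entry if you want retries.
function memoizeList(listFn) {
  const cache = new Map();
  return (prefix) => {
    if (!cache.has(prefix)) {
      cache.set(prefix, listFn(prefix));
    }
    return cache.get(prefix);
  };
}

// Stand-in for a KV list operation; counts how often it is actually invoked.
let listCalls = 0;
const fakeList = async (prefix) => {
  listCalls += 1;
  return [`${prefix}a`, `${prefix}b`];
};

const cachedList = memoizeList(fakeList);
const firstLookup = cachedList("assets/");
const secondLookup = cachedList("assets/"); // served from the cache
console.log(firstLookup === secondLookup, listCalls);
```

The memoized function is best created inside the handler so the cache is dropped when the invocation ends and stale listings are not carried over to the next queue message.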
Shivek, 2w ago
I built a global rate limiter using Durable Objects. The rate limiter runs in a workflow that embeds chunks into a vector space. Each workflow could have hundreds of embeddings, and I'm hitting the rate limit :/
