Windmill•7mo ago
IceCactus

crash

windmilldev-db-1 | 2023-11-29 19:07:23.202 UTC [1] LOG: server process (PID 439) was terminated by signal 9: Killed
windmilldev-db-1 | 2023-11-29 19:07:23.202 UTC [1] DETAIL: Failed process was running: UPDATE queue
windmilldev-db-1 | SET running = true
windmilldev-db-1 | , started_at = coalesce(started_at, now())
windmilldev-db-1 | , last_ping = now()
windmilldev-db-1 | , suspend_until = null
windmilldev-db-1 | WHERE id = (
windmilldev-db-1 |     SELECT id
windmilldev-db-1 |     FROM queue
windmilldev-db-1 |     WHERE running = false AND scheduled_for <= now() AND tag = ANY($1)
windmilldev-db-1 |     ORDER BY priority DESC NULLS LAST, scheduled_for, created_at
windmilldev-db-1 |     FOR UPDATE SKIP LOCKED
windmilldev-db-1 |     LIMIT 1
windmilldev-db-1 | )
windmilldev-db-1 | RETURNING *

Any idea what is causing this? Self hosted.
45 Replies
Sindre
Sindre•7mo ago
Signal 9 means the process received a kill signal. A quick search for signal 9 and Docker suggests that you do not have enough memory. https://duckduckgo.com/?q=signal+9+docker&t=ffab&ia=web
IceCactus
IceCactus•7mo ago
Hmm, it's been working fine, I wonder why I all of a sudden don't have enough memory. I've got 8 gigs of RAM, and when I do docker compose up the RAM skyrockets to the limit.
Sindre
Sindre•7mo ago
Have you set any resource limits on the containers? See the docker-compose in windmill as an example:
Sindre
Sindre•7mo ago
windmill/docker-compose.yml at main · windmill-labs/windmill
IceCactus
IceCactus•7mo ago
I think it has to do with the migration to the new windmill version. I changed memory for workers to 1024, and now it seems that my workers are constantly restarting.
Sindre
Sindre•7mo ago
Can you try to stop all workers, only have the db and 1 instance of the server running, and see if you get the same error?
IceCactus
IceCactus•7mo ago
No errors, and hardly any ram use
Sindre
Sindre•7mo ago
And in the server logs, do you see any info about the migration? Could you please start one worker and show us the logs?
IceCactus
IceCactus•7mo ago
windmilldev-windmill_server-1 | 2023-11-29T20:04:55.629048Z INFO windmill: Last migration version: Some(20231128105015). Starting potential migration of the db if first connection on a new windmill version (can take a while depending on the migration) ...
windmilldev-windmill_server-1 | 2023-11-29T20:04:55.635642Z INFO windmill: Completed potential migration of the db. Last migration version: Some(20231128105015)
windmilldev-windmill_server-1 | 2023-11-29T20:04:55.635695Z INFO windmill:
windmilldev-windmill_server-1 | ##############################
windmilldev-windmill_server-1 | Windmill Community Edition v1.216.0-31-g72bb15f6a
windmilldev-windmill_server-1 | ##############################
Sindre
Sindre•7mo ago
Seems like the migration is ok. Let's hope the worker logs have an error in them.
IceCactus
IceCactus•7mo ago
Looks like it's just repeating this
IceCactus
IceCactus•7mo ago
IceCactus
IceCactus•7mo ago
The db keeps saying connection to client lost. I just told docker compose to use image 1.216.0 and it seems to be working fine. Went back to main and it's doing it again.
Sindre
Sindre•7mo ago
strange. try again tomorrow then. maybe ruben can help if it still fails tomorrow on master.
IceCactus
IceCactus•7mo ago
I keep getting this output over and over again:

windmilldev-db-1 | 2023-11-30 15:43:29.785 UTC [188] LOG: could not send data to client: Connection reset by peer
windmilldev-db-1 | 2023-11-30 15:43:29.785 UTC [188] STATEMENT: UPDATE queue
windmilldev-db-1 | SET running = true
windmilldev-db-1 | , started_at = coalesce(started_at, now())
windmilldev-db-1 | , last_ping = now()
windmilldev-db-1 | , suspend_until = null
windmilldev-db-1 | WHERE id = (
windmilldev-db-1 |     SELECT id
windmilldev-db-1 |     FROM queue
windmilldev-db-1 |     WHERE running = false AND scheduled_for <= now() AND tag = ANY($1)
windmilldev-db-1 |     ORDER BY priority DESC NULLS LAST, scheduled_for, created_at
windmilldev-db-1 |     FOR UPDATE SKIP LOCKED
windmilldev-db-1 |     LIMIT 1
windmilldev-db-1 | )
windmilldev-db-1 | RETURNING *
windmilldev-db-1 | 2023-11-30 15:43:30.055 UTC [188] FATAL: connection to client lost

Just upgraded to 1.218 and still the same issue. It seems the worker is using memory, the OS is killing it and it keeps restarting. I get the above message from the db over and over. The worker exits with code 137.
IceCactus
IceCactus•7mo ago
Sindre
Sindre•7mo ago
Try to ping ruben for help
IceCactus
IceCactus•7mo ago
@rubenf
rubenf
rubenf•7mo ago
how big is your vm ?
IceCactus
IceCactus•7mo ago
8 gig 2 cpu
rubenf
rubenf•7mo ago
do you have anything else than windmill on it ?
IceCactus
IceCactus•7mo ago
no, just windmill
rubenf
rubenf•7mo ago
is the worker crashing after executing a job?
IceCactus
IceCactus•7mo ago
i only have 1 job that runs on the hour that uses like < 3mb
rubenf
rubenf•7mo ago
hmm ok so not that
IceCactus
IceCactus•7mo ago
htop is showing I'm using 633mb of 7.75g right now. I've limited it to only 1 worker and 1 native worker at the moment. If you look at the log output, it seems to be occurring with a listen event from the db.
rubenf
rubenf•7mo ago
Does the crash happen at start or after the job gets executed?
IceCactus
IceCactus•7mo ago
rubenf
rubenf•7mo ago
So it seems to crash upon deserializing the job row, and based on the fact that it's an OOM, for some reason that job row is huge?
IceCactus
IceCactus•7mo ago
It's possible, I was parsing some very large CSV files.
rubenf
rubenf•7mo ago
Can you psql into the db and look up the jobs in the queue? Ah yeah, don't do that as an input, pull them from within the job.
IceCactus
IceCactus•7mo ago
Well, it wasn't set up as a job, nor was the file the input, just the filename, but it's worked for days so I'm not sure why it would cause an issue now.
rubenf
rubenf•7mo ago
Yes, that's what I'm saying, don't take a big file as input directly. Is the file bigger than it was before? Anyway, the only way to debug this is to psql in and inspect your queue table.
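(A minimal inspection sketch for that suggestion, assuming the queue table in this Windmill version has script_path, args and logs columns; run \d queue first to confirm your actual schema.)

```sql
-- Hypothetical query to find oversized jobs in the queue; the column names
-- script_path, args and logs are assumptions, check \d queue for your schema.
SELECT id,
       script_path,
       created_at,
       running,
       pg_column_size(args) AS args_bytes,  -- on-disk size of the job arguments
       pg_column_size(logs) AS logs_bytes   -- on-disk size of the accumulated logs
FROM queue
ORDER BY pg_column_size(logs) DESC NULLS LAST
LIMIT 20;
```

Any row whose args or logs run into the hundreds of megabytes is a likely candidate for the OOM described above.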
IceCactus
IceCactus•7mo ago
Ok, so I was using JavaScript in Bun to process a large CSV file; it was using streams and the file got to about 230mb.
rubenf
rubenf•7mo ago
You can also look up the jobs through the server. Was that file taken as an input or output as a result?
IceCactus
IceCactus•7mo ago
It was read through a stream, modified, then written back to a stream. So the file comes from a shared folder on the VM and is then written back out to a shared folder on the VM.
rubenf
rubenf•7mo ago
Yeah, that wouldn't impact the queue table, so I would just suggest looking at what's inside the queue table.
IceCactus
IceCactus•7mo ago
Ok, I'll take a look. Would a large log output cause an issue when testing a script during editing? So like, console.log() many, many times.
rubenf
rubenf•7mo ago
Depends how many is many. More than 1 billion, yes; around a million, probably not.
IceCactus
IceCactus•7mo ago
Should be less than a million. I'll look at the table and see if I see anything. Well, using pgAdmin I can't view any rows on that table, I get a bad request error. Can I truncate that table?
rubenf
rubenf•7mo ago
I would recommend playing around with that table in the sql playground of pgadmin or use psql
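(If specific rows turn out to be the culprit, a narrower cleanup than truncating the whole queue might look like this sketch; the id value is a placeholder, take real ids from the inspection query above.)

```sql
-- Sketch: delete only the offending jobs instead of truncating the whole queue.
BEGIN;
DELETE FROM queue
WHERE id IN ('00000000-0000-0000-0000-000000000000');  -- placeholder, replace with real job ids
COMMIT;
```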
IceCactus
IceCactus•7mo ago
I think I got it fixed. There were 4 records in the table, two of which were from the script we talked about. The logs between the two were about 100mb. I just deleted the two records and all seems well now.
rubenf
rubenf•7mo ago
if that's the reason, then I probably have a fix for this to fail in less spectacular ways
IceCactus
IceCactus•7mo ago
I can't verify it, but I think what happened is I had run the script without limiting the number of rows processed. So it ran for 30 seconds or so and generated huge output since I was debugging. I hit the cancel button and the browser crashed for a bit, then came back. I'm guessing the output was too big. The table size was 1.94mb and the TOAST size was 904mb, so something large is in there. When I looked at the queue table and saw the script in there, I just deleted the two rows since that script was just a test and not scheduled or anything. Thx for the help.
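(For reference, the table-vs-TOAST split mentioned above can be measured with standard Postgres size functions; this is only a sketch, and the middle number also includes the small free-space and visibility maps, not just TOAST.)

```sql
-- Approximate breakdown of where the queue table's disk usage lives.
SELECT pg_size_pretty(pg_relation_size('queue'))                          AS main_table,
       pg_size_pretty(pg_table_size('queue') - pg_relation_size('queue')) AS toast_approx,
       pg_size_pretty(pg_total_relation_size('queue'))                    AS total_incl_indexes;
```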
rubenf
rubenf•7mo ago
Yeah, so the fix will be to not try to pull the full log but truncate it at a certain length. This is a special code path where the job has to be retried, and pulling the full logs when the logs are huge doesn't seem right. That's why I needed the investigation; it's really hard to guess what the issues are 🙂
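(An illustration of the idea only, not the actual Windmill patch: on that retry path, fetch a bounded prefix of the logs instead of the whole column, so a multi-hundred-megabyte log can't blow up the worker's memory. The 20000-character cap and the logs column name are assumptions for the example.)

```sql
-- Sketch: pull only a capped prefix of the logs when re-reading a job row.
SELECT id,
       left(logs, 20000) AS logs_preview  -- arbitrary example cap of ~20 kB
FROM queue
WHERE id = '00000000-0000-0000-0000-000000000000';  -- placeholder job id
```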