RunPod · 4mo ago
andmtn

"Too many open files in system"

I am running many cpu3c-2-4 pods in the RO region, all working off the same volume, and I keep running into a "Too many open files" error. The error only happens in CPU pods, and only when many different pods are working with many different files, such as during large apt-get installs and large tar/gzip operations. I have tried setting ulimit -n [LARGE_NUMBER], but this does not fix the error. Any ideas?
13 Replies
flash-singh · 4mo ago
Will try to reproduce this. If you have an easy command to reproduce it, please share.
andmtn · 4mo ago
@flash-singh Hmm, yeah, the error is occurring inconsistently. Running fine at the moment...
Something interesting: after the "Too many open files" error occurs, running lsof -u root | wc -l also errors with:
lsof: can't fopen(/proc/mounts, "r"): No such file or directory
I then try ps -e and get an error telling me to run mount -t proc proc /proc. I run this, and then lsof starts working again.
During a large gzip, before the error, lsof -u root | wc -l returns about 344 open files, so nowhere near the limit set by ulimit.
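One detail that may be relevant to the numbers above: on Linux, "Too many open files" is the message for EMFILE (the per-process limit that ulimit -n controls), while "Too many open files in system", as in the thread title, is the message for ENFILE (the system-wide file table). Whether that distinction explains this case is not established in the thread. The following is a minimal sketch, with hypothetical function names, for comparing both figures from inside a pod. It tolerates /proc being unmounted, as observed above.

```python
import os
import resource


def parse_file_nr(text):
    """Parse the contents of /proc/sys/fs/file-nr.

    The file holds three integers: allocated file handles, allocated but
    unused handles, and the system-wide maximum (fs.file-max).
    """
    allocated, unused, maximum = (int(field) for field in text.split())
    return {"allocated": allocated, "unused": unused, "max": maximum}


def fd_report(proc_root="/proc"):
    """Collect per-process and system-wide file descriptor figures."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    report = {"soft_limit": soft, "hard_limit": hard}
    try:
        fd_dir = os.path.join(proc_root, "self", "fd")
        report["open_in_this_process"] = len(os.listdir(fd_dir))
        with open(os.path.join(proc_root, "sys", "fs", "file-nr")) as handle:
            report["system"] = parse_file_nr(handle.read())
    except OSError as exc:
        # /proc may be missing, as seen in the thread after the error hits.
        report["proc_error"] = str(exc)
    return report


if __name__ == "__main__":
    print(fd_report())
```

If the "allocated" figure is close to "max" while the per-process count is small, the pressure is system-wide and raising ulimit -n alone would not be expected to help.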
kopyl · 4mo ago
@flash-singh can a host machine's limit influence a Docker container's limit? If you have more than one pod on a host, then it would make sense.
flash-singh · 4mo ago
No, CPU pods run in a VM, which is why you can use Docker in CPU pods. I would have to debug more to see why that error occurs, since the container has root access.
kopyl · 4mo ago
By the way, why don’t you want to provide GPU pods as a VM too?
flash-singh · 4mo ago
We plan to. We initially launched without doing that, which is why they currently don't work the same way. Also, GPUs in a VM require more work than CPUs.
andmtn · 4mo ago
CPU pods have been running fine this morning after switching from runpod/base:0.5.1-cpu to ubuntu:latest
flash-singh · 4mo ago
@Merrell do we use ubuntu:latest for runpod/base:0.5.1-cpu?
ashleyk · 4mo ago
GitHub
containers/official-templates/base/docker-bake.hcl at main · runpod...
🐳 | Dockerfiles for the RunPod container images used for our official templates. - runpod/containers
Justin Merrell · 4mo ago
ubuntu:20.04. Would latest be preferred?
flash-singh · 4mo ago
We can run tests. The big issue here is the "Too many open files" error.
justin · 3mo ago
@flash-singh / @Merrell Just wanted to note that I came across this issue while helping my friend with CPU Pods, when she was processing a bunch of images using Keras. We were unable to set ulimit too.
I can't share the data because it's for private research, but I wanted to flag that this basically makes TensorFlow / Keras not that useful on CPU Pods. We were processing about 9000 images with Keras.
I can DM for more info if necessary. I'm not sure how to easily share a repro because of the number of files and the time it takes to run the script.
We ran it three times on CPU Pods, trying various ways to unblock it, and just ended up moving to GPU Pods, where it is not an issue.
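For a Python workload like the one described, the soft limit can also be raised from inside the process, up to the hard limit, without calling ulimit in the shell. This is only a sketch under the assumption that the per-process limit is the bottleneck, which the thread does not confirm. The function name raise_soft_nofile is hypothetical.

```python
import resource


def raise_soft_nofile(desired):
    """Try to raise the soft open-file limit to `desired`, capped at the hard limit.

    Returns the soft limit in effect afterwards. The limit is never lowered.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # A process may raise its soft limit only as far as its hard limit.
    target = desired if hard == resource.RLIM_INFINITY else min(desired, hard)
    if soft != resource.RLIM_INFINITY and target > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        except (ValueError, OSError):
            # The kernel refused the change; keep the existing limit.
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    print("soft limit now:", raise_soft_nofile(65536))
```

Calling this at the top of the script, before any images are opened, shows whether the limit can be changed at all in that environment.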
flash-singh · 3mo ago
Got it, will run some tests. On GPU pods we set higher limits; on CPU pods we don't set any limit, so maybe there's a default limit applied.
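One way to see which default a running process actually received is to read the "Max open files" row of /proc/<pid>/limits on Linux. The sketch below uses hypothetical helper names and assumes /proc is mounted. It returns the soft and hard values as strings, because either can be the word "unlimited".

```python
def parse_max_open_files(limits_text):
    """Return (soft, hard) from the 'Max open files' row, or None if absent."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Row layout: Max open files <soft> <hard> files
            fields = line.split()
            return fields[3], fields[4]
    return None


def read_max_open_files(pid="self"):
    """Read the open-file limits of a process; None if /proc is unavailable."""
    try:
        with open(f"/proc/{pid}/limits") as handle:
            return parse_max_open_files(handle.read())
    except OSError:
        return None


if __name__ == "__main__":
    print("this process:", read_max_open_files())
    print("PID 1:", read_max_open_files(1))
```

Comparing the values for PID 1 and for the failing process on a CPU pod and a GPU pod would show whether the two pod types really start with different limits.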