2 days in a row, akmods NVIDIA broken, out of space
2 days in a row, akmods has been having issues running out of space
so... there's not an easy (or any) way to free space on the runner, because we run in a container, thus we can't run anything directly on the host
I am starting to look at what actually could have changed to trigger this problem
the issue, running out of space, occurs:
1) only on nvidia akmods builds
2) seemingly for all our Fedora kernels (bazzite, main, longterm, coreos) tho CentOS kernels seem fine
3) during the "Test Image" step
@M2 may be able to confirm
but I believe the "Test Image" step (which runs "just test" in the devcontainer) does the bulk of the work in the build... it runs the build-prep and test-prep shell scripts
it actually builds the akmods kmod binaries
makes sure they are signed, and then tests the install of those packages and that the signatures are good
really, "just test" duplicates a lot of the same effort as "just build", but it uses layer caching, so if "just test" runs first, then "just build" can run very quickly, and the only extra behavior there is to tag the localhost/akmods-image-namehere
Yepp exactly
so, where we are running out of space is AFTER the test-prep.sh script runs, and a layer is being preserved/copied back to the container store
but i'm not sure why this started 2 days ago
there've been no code changes
and this is happening both on F41 and F2
so the common elements seem to be:
1) the github runners themselves
2) changes to nvidia packages on negativo17
Or could be something with failing to cache hit somehow.
sure, but its very specific to nvidia kmods
success and fail did not change based on nvidia driver version... the last good build of CoreOS-stable kernel's nvidia was the same version as the broken ones
nvidia-driver 580.82.09-1.fc42
i see no changes in negativo17's nvidia or multimedia repos which could be related along this timeline
honestly, i think the github runners just have less space than before
though, i'm shocked this image takes 22GB
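One way to test the "runners have less space" theory is to log the free space from inside the build before the heavy step. A minimal sketch, assuming Python is available where the check runs; the function names are hypothetical and the 22 GiB default only echoes the image size mentioned above:

```python
import shutil

GIB = 1024 ** 3


def free_space_report(path="/"):
    """Return (total, used, free) in GiB for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return tuple(round(v / GIB, 2) for v in (usage.total, usage.used, usage.free))


def has_headroom(path="/", needed_gib=22.0):
    """True if the filesystem holding `path` has at least `needed_gib` GiB free.

    The 22.0 default is illustrative only, taken from the image size
    mentioned in the thread; it is not a measured requirement.
    """
    return shutil.disk_usage(path).free / GIB >= needed_gib


if __name__ == "__main__":
    total, used, free = free_space_report("/")
    print(f"total={total} GiB used={used} GiB free={free} GiB")
    print("enough room for the image:", has_headroom("/"))
```

Running this in a container reports the filesystem the container sees, which may not be the same as the host's container store.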
Problem occurs AFTER this RUN completes and the layer is being copied:
https://github.com/ublue-os/akmods/blob/main/Containerfile.in#L208
BEFORE COPYing the check-signatures.sh
instead of using a devcontainer, you could have used the test image instead
instead of having two containers
the pr broke and is not running the checks anymore
so, to use fancy just features instead of a simple bash script, you introduce a 3-4gb container image and block the maximize space script
and its rust too, so we cant pull it from github
because it will take an eternity to compile
Are you talking about this PR https://github.com/ublue-os/akmods/pull/407 ? It seems to contain no changes, but it also looks like you're removing stuff to test
i switched to a branch and im doing manual runs
i think i figured it out
how to buy us some time at least
this fixes it
i ran the bazzite nvidia kmod and it worked
Ah, I wonder what weakdeps it was pulling in to push it past the limit
i dont know what happened but it was borderline
some firmware and stuff got removed
and composefs
pipewire and mesa too
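The actual change is not shown in this thread. As a sketch of the general idea only, assuming the fix amounts to turning off dnf's install_weak_deps option for the install, a helper that builds such a command could look like this (the function name is hypothetical, and nvidia-driver is used just as an example package):

```python
def dnf_install_cmd(packages, weak_deps=False):
    """Build a dnf install argv, optionally skipping weak dependencies.

    Assumption: weak deps are disabled with dnf's install_weak_deps
    option. The real fix in the akmods repo may differ.
    """
    cmd = ["dnf", "install", "-y"]
    if not weak_deps:
        # Skip weak dependencies so only hard requirements are installed.
        cmd.append("--setopt=install_weak_deps=False")
    cmd.extend(packages)
    return cmd


if __name__ == "__main__":
    print(" ".join(dnf_install_cmd(["nvidia-driver"])))
```

With weak deps off, dnf installs only hard requirements, which is how things like extra firmware, pipewire and mesa can drop out of the image.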
do some acks and let's get this over with
LGTM. Just needs another ack
@Kyle Gospo ack it or I'll yolo it
Thank you for diving in. I had overlooked the weak deps. That’s an important change in this context regardless of any other long term questions.
Also thank you for diving in. A lot of real life hit yesterday and I wasn’t able to get back to this.
np