2 days in a row, akmods NVIDIA broken, out of space

2 days in a row, akmods has been having issues running out of space
16 Replies
bsherman (OP) · 4w ago
so... there's not an easy (or any) way to free space on the runner, because we run in a container, thus we can't run anything directly on the host
I am starting to look at what actually could have changed to trigger this problem
the issue, running out of space, occurs:
1) only on nvidia akmods builds
2) seemingly for all our Fedora kernels (bazzite, main, longterm, coreos), though CentOS kernels seem fine
3) during the "Test Image" step
@M2 may be able to confirm, but I believe the "Test Image" step (which runs `just test` in the devcontainer) does the bulk of the work in the build...
it runs the build-prep and test-prep shell scripts
it actually builds the akmods kmod binaries
makes sure they are signed,
and then tests the install of those packages and that the signatures are good
really, `just test` will duplicate a lot of the same effort as `just build`, but it uses layer caching, so if `just test` runs first, then `just build` can run very quickly, and the only extra behavior there is to tag the localhost/akmods-image-namehere
M2 · 4w ago
Yep, exactly
bsherman (OP) · 4w ago
so, where we are running out of space is AFTER the test-prep.sh script runs, while a layer is being preserved/copied back to the container store
but i'm not sure why this started 2 days ago
there've been no code changes, and this is happening on both F41 and F2
so the common elements seem to be:
1) the github runners themselves
2) changes to nvidia packages on negativo17
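To see where the space actually goes while a layer is being committed, one option is to measure the container store directly. This is a minimal sketch, not something from the akmods repo: `dir_size_bytes` is a hypothetical helper, and the store path is an assumption (a common rootful podman default) that may differ inside the devcontainer.

```python
import os
import stat


def dir_size_bytes(root: str) -> int:
    """Sum the sizes of regular files under root, without following symlinks.

    Hard-linked files are counted once per link, so treat the result as an
    upper bound rather than exact on-disk usage.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
            if stat.S_ISREG(st.st_mode):
                total += st.st_size
    return total


if __name__ == "__main__":
    # ASSUMPTION: illustrative store location; adjust for the actual runner.
    store = "/var/lib/containers/storage"
    print(f"{store}: {dir_size_bytes(store) / 1024**3:.2f} GiB")
```

Running it before and after the failing step would show how much the store grows during the layer copy.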
M2 · 4w ago
Or it could be something failing to get a cache hit somehow.
bsherman (OP) · 4w ago
sure, but it's very specific to nvidia kmods
success and failure did not change based on nvidia driver version... the last good build of the CoreOS-stable kernel's nvidia was the same version as the broken ones: nvidia-driver 580.82.09-1.fc42
i see no changes in negativo17's nvidia or multimedia repos which could be related along this timeline
honestly, i think the github runners just have less space than before
though, i'm shocked this image takes 22GB
Problem occurs AFTER this RUN completes and the layer is being copied: https://github.com/ublue-os/akmods/blob/main/Containerfile.in#L208
BEFORE COPYing the check-signatures.sh
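If the runners really do have less free space than before, a quick headroom check ahead of the heavy step would make that visible in the logs. A minimal sketch, assuming Python is available in the build environment; `free_gib` and `has_headroom` are hypothetical names, and the 22 GiB threshold is just the image size mentioned above, not a measured requirement.

```python
import shutil

GIB = 1024 ** 3


def free_gib(path: str = ".") -> float:
    """Return free space, in GiB, on the filesystem that contains path."""
    return shutil.disk_usage(path).free / GIB


def has_headroom(path: str, required_gib: float) -> bool:
    """True if the filesystem containing path has at least required_gib free."""
    return free_gib(path) >= required_gib


if __name__ == "__main__":
    # ASSUMPTION: 22 GiB is taken from the image size quoted in the thread.
    required = 22.0
    print(f"free: {free_gib('.'):.1f} GiB, need about {required:.0f} GiB")
    if not has_headroom(".", required):
        print("warning: the build may run out of space")
```

Logging this at the start of each run would show whether the available space changed between the last good build and the first failing one.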
antheas · 4w ago
instead of using a devcontainer, you could have used the test image, instead of having two containers
the pr broke and is not running the checks anymore
so, to use fancy just features instead of a simple bash script, you introduce a 3-4gb container image and block the maximize space script
and it's rust too, so we can't pull it from github because it will take an eternity to compile
ledif · 4w ago
Are you talking about this PR? https://github.com/ublue-os/akmods/pull/407
It seems to contain no changes, but it also looks like you're removing stuff to test
antheas · 4w ago
i switched to a branch and i'm doing manual runs
i think i figured out how to buy us some time at least
antheas · 4w ago
this fixes it
i ran the bazzite nvidia kmod and it worked
ledif · 4w ago
Ah, I wonder what weakdeps it was pulling in to push it past the limit
antheas · 4w ago
i don't know what happened but it was borderline
some firmware and stuff got removed, and composefs, pipewire and mesa too
do some acks and let's get this over with
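The thread does not show the actual change, but skipping weak dependencies in dnf is usually done with the `install_weak_deps` option. A minimal sketch of what such an install command looks like, assuming the fix had roughly this shape; `dnf_install_argv` is a hypothetical helper and the package name is only illustrative.

```python
import shlex


def dnf_install_argv(packages, weak_deps=False):
    """Build a dnf install command line, skipping weak deps unless asked."""
    argv = ["dnf", "install", "-y"]
    if not weak_deps:
        # Stops dnf from pulling in Recommends/Supplements of the packages.
        argv.append("--setopt=install_weak_deps=False")
    argv.extend(packages)
    return argv


if __name__ == "__main__":
    # ASSUMPTION: package name chosen for illustration only.
    print(shlex.join(dnf_install_argv(["akmod-nvidia"])))
```

Comparing the transaction summary with and without the option is one way to see which weak deps (firmware, mesa, pipewire and so on) were pushing the image past the limit.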
ledif · 4w ago
LGTM. Just needs another ack
antheas · 4w ago
@Kyle Gospo ack it or I'll yolo it
bsherman (OP) · 3w ago
Thank you for diving in. I had overlooked the weak deps. That’s an important change in this context regardless of any other long term questions. A lot of real life hit yesterday and I wasn’t able to get back to this.
antheas · 3w ago
np
