eGPU crashes randomly and Rog Ally X needs a force reboot

I'm running an RX 6800 XT in an AOOSTAR AG 02 connected to my Rog Ally X. I've used all-ways-egpu methods 2 and 3 and I'm very happy with the setup and performance, except for some random eGPU crashes. I've tried different games and different game settings with no luck. Whenever the crash happens I have to restart the Rog Ally X. About 10 seconds before I lose the display the games get very slow (around 5-10 frames per second), although the performance overlay keeps saying I'm running at 60+ fps. The GPU stays under 60C and VRAM usage never goes above 12GB (out of 16). I stress tested it and it does just fine, so I don't think the GPU is faulty. I have attached dmesg and journalctl logs. It seems like the Thunderbolt controller thinks the eGPU was disconnected briefly? These are my kargs:
rhgb quiet root=UUID=71350456-f59b-44b7-bd9e-eb986329111c rootflags=subvol=root rw ostree=/ostree/boot.1/default/e91de861bae555263d9c9f2e37f233242b11a8a402626ba6e831b602d6fd650f/0 amdgpu.gttsize=12192 amdgpu.sg_display=0 bluetooth.disable_ertm=1 preempt=full amdgpu.ppfeaturemask=0xfff7ffff amdgpu.runpm=0 thunderbolt.host_reset=0 vt.global_cursor_default=0 pcie_aspm=off dcdebugmask=0x10 amdgpu.gfxoff=0
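Since the thread goes on to add and remove several of these kargs, it can help to compare two command lines mechanically. The following is a minimal sketch, not part of the original post: `parse_kargs` and `diff_kargs` are hypothetical helper names, the two sample strings are shortened illustrations rather than the poster's full command line, and the naive whitespace split assumes no quoted values containing spaces.

```python
from typing import Optional


def parse_kargs(cmdline: str) -> dict[str, Optional[str]]:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params: dict[str, Optional[str]] = {}
    for token in cmdline.split():
        # Split on the first "=" only, so root=UUID=... keeps its full value.
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def diff_kargs(old: str, new: str):
    """Return (added, removed, changed) parameter names between two command lines."""
    a, b = parse_kargs(old), parse_kargs(new)
    added = sorted(b.keys() - a.keys())
    removed = sorted(a.keys() - b.keys())
    changed = sorted(k for k in a.keys() & b.keys() if a[k] != b[k])
    return added, removed, changed


# Shortened, illustrative command lines (assumption: not the full kargs above).
before = "quiet amdgpu.runpm=0 thunderbolt.host_reset=0 pcie_aspm=off"
after = "quiet amdgpu.runpm=0 thunderbolt.host_reset=1 pci=nommconf"

added, removed, changed = diff_kargs(before, after)
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```

On a running system the current command line can be read from `/proc/cmdline` and passed to `parse_kargs` in the same way.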
22 Replies
Zetarancio
Zetarancio5mo ago
Set the correct clock and memory limits using LACT.
mindxpert
mindxpertOP5mo ago
I've got LACT because I set up some custom profiles for fan control (I was thinking 80C+ might be the reason why it was crashing). I'm not sure what the clock speed and memory limits should be. I think those are the defaults (attached screenshot). Should I look at the manufacturer's site to check what the usual values are? Or am I supposed to set them lower than suggested so that it does not crash?
mindxpert
mindxpertOP5mo ago
I tried these settings based on the GPU specs, but the device got very slow and I had to force a reboot. I think it doesn't like the min memory clock.
Core Clock: 2000 to 2360MHz
Memory Clock: 2000 to 2000MHz
I ended up just updating my Max GPU Clock from 2444 to 2350. The manufacturer lists the max as 2360. The Min GPU Clock and the memory clocks were left as is.
Core Clock: 500 to 2350MHz
Memory Clock: 1348 to 2000MHz
I've had no luck with the clocks I set above. Would you mind sharing what your suggestion would be? Have you had an issue like this that you resolved by setting clock speeds? From my understanding, the 2444MHz that was set as default is too high, as 2360MHz is the boost clock that my card supports, but even after setting it 10MHz lower, to 2350MHz, I'm still having blackouts.
Zetarancio
Zetarancio5mo ago
I fixed my eGPU RX 6800 by simply setting the limits according to the specs... I see your wattage limit is wrong. Did you add thunderbolt.host_reset=0? Btw, check your wattage limits.
mindxpert
mindxpertOP5mo ago
Yeah, this is the GPU I have: https://www.techpowerup.com/gpu-specs/sapphire-nitro-rx-6800-xt.b8324 and it lists 2360MHz as the max. I tried setting it to 2350MHz. The power usage limit seems to be broken in LACT (it seems to show the APU limits). I can see that it is pulling way more than that, up to 270W when the GPU is at 100%. Yeah, I had to add thunderbolt.host_reset=0 because the all-ways-egpu dev suggested it to help with auto-switching to the eGPU after reboots. Is it okay to leave it like that?
Zetarancio
Zetarancio5mo ago
I am not using any of the parameters that you added after ppfeaturemask. I added pci=nommconf. I experienced the power limit issue about one year ago and I solved it by removing the option in HHD to write TDP to /sys. You can read more here: https://universal-blue.discourse.group/t/ayaneo-geek-1s-2s-linux-bazzite-support-is-already-almost-there-lets-add-them-to-the-officially-supported-devices/1046/36
Zetarancio
Zetarancio5mo ago
GitHub: AMD Performance Fixes (ewagner12/all-ways-egpu, "Configure eGPU as primary under Linux Wayland desktops")
mindxpert
mindxpertOP5mo ago
I'll have a look at pci=nommconf and play around with the kernel args later today when I'm home. Would there be any issue with HHD not being able to write to /sys? I can deactivate it to try, but hopefully HHD will still be able to apply TDP in handheld mode. After a reboot my external display would not display anything, but the sound would come out of it. I was looking at this similar issue, where the dev suggested the kargs, and they made reboots always switch to the external display: https://github.com/ewagner12/all-ways-egpu/issues/42#issuecomment-2764261679
Zetarancio
Zetarancio5mo ago
It will. You just will not be able to apply the limit in the Steam performance overlay. Would you try rerunning the all-ways-egpu setup? If you use pci=nommconf you need to rerun it.
mindxpert
mindxpertOP5mo ago
So far I removed all the kargs that I had added recently while trying to solve the GPU crashes, and added pci=nommconf. Rebooting does not automatically switch to the external monitor (but sound through eARC does). I'll try running all-ways-egpu again now and see how it goes. Methods 2 and 3 were not able to switch to the external monitor after boot.
Apr 23 23:25:16 fedora systemd[1]: Starting all-ways-egpu-boot-vga.service - Configure eGPU as primary using boot_vga under Wayland desktops...
Apr 23 23:25:16 fedora all-ways-egpu-entry.sh[3022]: No eGPU detected, retry 1
Apr 23 23:25:17 fedora all-ways-egpu-entry.sh[3022]: No eGPU detected, retry 2
I'll try adding the host reset back in and see if that will fix it again (I used to have this before that karg)
"I experienced the power limit issue about one year ago and I solved it by removing the option in HHD to write TDP to /sys"
I did this and can now see the slider show the correct values. Do I set it to the max? The Max is 332W, while clicking Default sets it to 289W. Re-adding thunderbolt.host_reset=0 makes the external monitor the default after a reboot. I have to stick with it.
Zetarancio
Zetarancio5mo ago
Ok. Then you should test with and without nommconf.
mindxpert
mindxpertOP5mo ago
Played for about an hour and a half with no crashes. This is with nommconf on. Will try again tomorrow. If it does not crash then I’m not removing nommconf or trying without it anymore. Thanks for the help!
Zetarancio
Zetarancio5mo ago
Ok. Report here in case you need something else, close the thread when needed
mindxpert
mindxpertOP4mo ago
I've got to play a bit more with my setup. So far I haven't seen any crashes. But pci=nommconf caused my system to freeze upon waking from sleep: after wake it would not display anything on the internal or external display until a force restart. Buttons were also unresponsive, so I could not put it to sleep again. Now sleep/wake are working again. I'll play some more in the upcoming days and conclude whether the wattage + clock limits have fixed my crashes. Appreciate your help!
I got freezes when waking up today without pci=nommconf, so that might not be the culprit for the freezes. I will not be re-adding it until I figure out the freeze, just in case. I don't think the LACT watt/clock changes could have affected that, and I'm seeing some HHD exceptions, so I started another post on that.
I think I have a better understanding of these crashes. They only happen if the device has been sleeping and I then wake it up and start playing. Within 20 minutes, I'll get a freeze. I can feel when the freeze is about to happen because in the last 20 seconds or so it starts to slow down, dropping to about 5 fps (although the performance overlay says 60+). If I shut down the device and turn it on, I can play fine; I tested up to 3 hours with no freezes at all. Once I put it to sleep and wake it up, the freeze will happen for sure within the next 20-30 minutes. I want to get this fixed because I usually just put the device to sleep between short gaming sessions and don't want to spend 2-3 minutes getting to the game from a cold start. I'll try to collect some logs when that happens again.
From what I've seen, LACT reports that it cannot pick up the GPU to set clocks after a wake-up, but I can see the correct clocks being set by it in the overlay. Maybe I'll give this a try without LACT and see if the freeze still happens.
Zetarancio
Zetarancio4mo ago
Ahhhh... if it's related to sleeping I am afraid it will not be fixed. I have always had trouble with that since 2017, with many eGPU and GPU combinations. There are always sync issues between the devices.
mindxpert
mindxpertOP4mo ago
Do you turn it off after you are done? What I find odd is that it doesn't freeze immediately after waking up. Why can it play fine for half an hour and then decide it's no good? I won't give up and will see if I can pull something meaningful from the logs.
Zetarancio
Zetarancio4mo ago
I will test it out in the next few weeks. Keep me posted if you find something meaningful. The only thing I could come up with is idle=nomwait. This should help with sleep on AMD. Can you reproduce the issue? Would you be able to grab a log when this happens?
mindxpert
mindxpertOP4mo ago
I'll give that a try later today. I have some logs that I can pull out once I'm home. Thanks!
mindxpert
mindxpertOP4mo ago
Just had a freeze again with idle=nomwait. It should be in the last minute or so of this log file.
Zetarancio
Zetarancio4mo ago
The issue seems to arise from:
May 05 23:53:03 bazzite kernel: amdgpu 0000:08:00.0: amdgpu: Failed to export SMU metrics table!
May 05 23:53:03 bazzite kernel: amdgpu 0000:08:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
Can you test with amdgpu.smu_metrics=0? If it happens again, please try with thunderbolt.host_reset=1.
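To check whether the same SMU messages show up before every freeze, a saved kernel log can be scanned for them. This is a rough sketch under stated assumptions: the two patterns are copied from the messages quoted above, `find_smu_errors` is a hypothetical helper name, and the log is assumed to have been saved to a text file beforehand (for example with `journalctl -k -b -1 > kernel.log` for the previous boot).

```python
import re

# Patterns taken from the two kernel messages quoted in the thread.
SMU_PATTERNS = (
    re.compile(r"Failed to export SMU metrics table"),
    re.compile(r"SMU: response:0x[0-9A-Fa-f]+ for index:\d+"),
)


def find_smu_errors(lines):
    """Yield (line_number, line) for each line matching one of the SMU patterns."""
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in SMU_PATTERNS):
            yield number, line.rstrip("\n")


# Sample input: the two lines quoted above plus one unrelated line from earlier.
sample = [
    "May 05 23:53:03 bazzite kernel: amdgpu 0000:08:00.0: amdgpu: Failed to export SMU metrics table!",
    "May 05 23:53:03 bazzite kernel: amdgpu 0000:08:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?",
    "Apr 23 23:25:16 fedora all-ways-egpu-entry.sh[3022]: No eGPU detected, retry 1",
]

for number, line in find_smu_errors(sample):
    print(number, line)
```

For a real log, an open file object can be passed in place of `sample`, since the function only iterates over lines.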
mindxpert
mindxpertOP4mo ago
kernel: amdgpu: unknown parameter 'smu_metrics' ignored
Seems like smu_metrics is not a param for amdgpu on this kernel version. I'll try again with thunderbolt.host_reset=1, but the last time the eGPU did not output to the external display on reboot.
I'm not able to get my device to recognize the eGPU with thunderbolt.host_reset=1, so I'm reverting it back to 0.
I also tried amdgpu.ppfeaturemask=0xfffd3fff based on some research I was doing, and it made things worse: getting the device to sleep was hit and miss. Sometimes it would freeze while trying to go to sleep, so I reverted it back to 0xfff7ffff.
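The two ppfeaturemask values mentioned here differ in only a few bits, which is easier to see when the masks are decoded. The snippet below is a small sketch with hypothetical helper names (`cleared_bits`, `differing_bits`); it only reports bit positions and does not try to name the power features behind them, which are defined in the kernel source.

```python
def cleared_bits(mask: int, width: int = 32) -> list[int]:
    """Return the positions (0 = least significant) of bits that are 0 in mask."""
    return [bit for bit in range(width) if not (mask >> bit) & 1]


def differing_bits(a: int, b: int, width: int = 32) -> list[int]:
    """Return the positions where the two masks disagree."""
    return [bit for bit in range(width) if ((a ^ b) >> bit) & 1]


current = 0xfff7ffff  # value the poster reverted to
tried = 0xfffd3fff    # value the poster tested

print(cleared_bits(current))           # [19]
print(cleared_bits(tried))             # [14, 15, 17]
print(differing_bits(current, tried))  # [14, 15, 17, 19]
```

So the tested mask turns bit 19 back on and turns bits 14, 15 and 17 off relative to the value that was kept.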
Zetarancio
Zetarancio4mo ago
Ok. It's something related to the SMU, but I am having trouble thinking of anything else.
