eGPU crashes randomly and Rog Ally X needs a force reboot

I'm running an RX 6800 XT with an AOOSTAR AG 02 connected to my Rog Ally X.

I've used all-ways-egpu method 2 and 3 and I'm very happy with the setup and performance, except for some random eGPU crashes. I've tried different games and different game settings with no luck. Whenever the crash happens I have to restart the Rog Ally X. About 10 seconds before I lose the display, the games tend to get very slow (like 5-10 frames per second), although the performance overlay keeps saying I'm running at 60+ fps.

The GPU is under 60C and the VRAM usage never goes above 12GB (out of 16). I tried stress testing it and it does just fine so I don't think the GPU is faulty.

I have attached dmesg and journalctl logs. It seems like the Thunderbolt controller thinks the eGPU was disconnected briefly? These are my kargs:

rhgb quiet root=UUID=71350456-f59b-44b7-bd9e-eb986329111c rootflags=subvol=root rw ostree=/ostree/boot.1/default/e91de861bae555263d9c9f2e37f233242b11a8a402626ba6e831b602d6fd650f/0 amdgpu.gttsize=12192 amdgpu.sg_display=0 bluetooth.disable_ertm=1 preempt=full amdgpu.ppfeaturemask=0xfff7ffff amdgpu.runpm=0 thunderbolt.host_reset=0 vt.global_cursor_default=0 pcie_aspm=off dcdebugmask=0x10 amdgpu.gfxoff=0
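To confirm which of the kargs above the running kernel actually booted with, the command line can be read back from /proc/cmdline. The following is a minimal sketch, not a definitive tool: the helper names `parse_kargs` and `read_active_kargs` are hypothetical, and the parser only splits on whitespace, so quoted values containing spaces are not handled.

```python
def parse_kargs(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    kargs = {}
    for token in cmdline.split():
        # Split on the first "=" only, so root=UUID=... keeps its full value.
        name, sep, value = token.partition("=")
        kargs[name] = value if sep else None
    return kargs


def read_active_kargs(path="/proc/cmdline"):
    """Parse the command line of the currently running kernel."""
    with open(path) as f:
        return parse_kargs(f.read())
```

Calling `read_active_kargs().get("thunderbolt.host_reset")` after a reboot would show whether a change to the kargs has actually taken effect.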

dmseg_log_2025-04-22_14-42-48.log (166.45 KB)

journal_log_2025-04-22_14-42-48.log (513.03 KB)

Zetarancio•4/22/25, 1:49 PM

Set the correct clock and memory limits using lact

mindxpertOP•4/22/25, 2:21 PM

I've got LACT because I set up some custom profiles for fan control (I was thinking 80C+ might be the reason it was crashing).

I'm not sure what the clock speed and memory limits should be. I think those are the defaults (attached screenshot). Should I look at the manufacturer's site to check what the usual values are? Or am I supposed to set them lower than suggested so that it does not crash?

mindxpertOP•4/22/25, 2:43 PM

I tried these settings based on the GPU specs, but the device got very slow and I had to force a reboot. I think it doesn't like the min memory clock.

Core Clock: 2000 to 2360 MHz
Memory Clock: 2000 to 2000 MHz

mindxpertOP•4/22/25, 3:08 PM

I ended up just changing my Max GPU Clock from 2444 to 2350. The manufacturer lists the max as 2360. The Min GPU Clock and the memory clocks were left as is.

Core Clock: 500 to 2350 MHz
Memory Clock: 1348 to 2000 MHz

mindxpertOP•4/22/25, 9:20 PM

I've had no luck with the clocks I set above. Would you mind sharing what your suggestion would be? Have you had an issue like this that you resolved by setting clock speeds? From my understanding, the 2444MHz that was set as default is too high, as 2360MHz is the boost clock that my card supports, but even after lowering it by 10 to 2350MHz I'm still having blackouts.

Zetarancio•4/22/25, 9:32 PM

I fixed my eGPU RX 6800 by simply setting the limits according to the specs...
I see your wattage limit is wrong.

Zetarancio•4/22/25, 9:33 PM

Did you add thunderbolt.host_reset=0?

Btw, check your wattage limits.

mindxpertOP•4/22/25, 10:00 PM

Yeah this is the GPU I have https://www.techpowerup.com/gpu-specs/sapphire-nitro-rx-6800-xt.b8324 and it lists 2360MHz as the max. I tried setting it to 2350MHz.

The power usage limit seems to be broken in LACT (it seems to show the APU limits). I can see that the GPU is pulling way more than that, up to 270W when it is at 100%.

mindxpertOP•4/22/25, 10:01 PM

Yeah, I had to add thunderbolt.host_reset=0 because the all-ways-egpu dev suggested it to help with auto-switching to the eGPU after reboots. Is it okay to leave it like that?

Zetarancio•4/22/25, 11:58 PM

I am not using any of the parameters that you added after featuremask.
I added pci=nommconf.
I experienced the power limit issue about a year ago and I solved it by removing the option in HHD to write TDP to /sys

You can read more here
https://universal-blue.discourse.group/t/ayaneo-geek-1s-2s-linux-bazzite-support-is-already-almost-there-lets-add-them-to-the-officially-supported-devices/1046/36

Zetarancio•4/23/25, 12:00 AM

I don't see the host reset suggestion here
https://github.com/ewagner12/all-ways-egpu/wiki/AMD-Performance-Fixes

mindxpertOP•4/23/25, 10:38 AM

I'll have a look at the pci option and play around with the kernel args later today when I'm home. Would there be any issue with HHD not being able to write to /sys? I can deactivate it to try, but hopefully HHD will still be able to apply TDP in handheld mode.

mindxpertOP•4/23/25, 10:40 AM

After a reboot my external display would not display anything, but the sound would come out of it. I was looking at this similar issue, where the dev suggested the karg, and adding it made reboots always switch to the external display: https://github.com/ewagner12/all-ways-egpu/issues/42#issuecomment-2764261679

Zetarancio•4/23/25, 8:45 PM

It will. You just won't be able to apply the limit in the Steam performance overlay.

Zetarancio•4/23/25, 8:46 PM

Would you try rerunning the all-ways-egpu setup?

Zetarancio•4/23/25, 8:47 PM

If you used pci=nommconf you need to rerun it

mindxpertOP•4/23/25, 9:23 PM

So far I have removed all the kargs that I had recently added while trying to solve the GPU crashes, and added pci=nommconf. Rebooting does not automatically switch to the external monitor (but sound through eARC does). I'll try running all-ways-egpu again now and see how it goes.

mindxpertOP•4/23/25, 9:27 PM

Method 2 and 3 were not able to switch to external monitor after boot.

Apr 23 23:25:16 fedora systemd[1]: Starting all-ways-egpu-boot-vga.service - Configure eGPU as primary using boot_vga under Wayland desktops...
Apr 23 23:25:16 fedora all-ways-egpu-entry.sh[3022]: No eGPU detected, retry 1
Apr 23 23:25:17 fedora all-ways-egpu-entry.sh[3022]: No eGPU detected, retry 2

I'll try adding the host reset back in and see if that fixes it again (I used to have this problem before adding that karg)
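The service name in the log above mentions boot_vga, which the kernel exposes as a per-device attribute under sysfs for VGA-class PCI devices. As a rough way to see which GPUs are present and which one is currently flagged, something like the sketch below could be used. It assumes the standard /sys/bus/pci/devices layout; the function name `list_boot_vga` is hypothetical, and the value read back may reflect whatever override all-ways-egpu has applied rather than the firmware's original choice.

```python
from pathlib import Path


def list_boot_vga(pci_root="/sys/bus/pci/devices"):
    """Map PCI address -> True/False for every device exposing boot_vga."""
    result = {}
    for attr in sorted(Path(pci_root).glob("*/boot_vga")):
        # The attribute contains "1" for the boot VGA device and "0" otherwise.
        result[attr.parent.name] = attr.read_text().strip() == "1"
    return result
```

If the eGPU's PCI address is missing from the result entirely, the card was not enumerated at that point, which would be consistent with the "No eGPU detected" retries.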

mindxpertOP•4/23/25, 9:34 PM

"I experienced the powerlimit issue like one year ago and I solved by removing the option in hdd to write Tdp to /sys"

I did this and can now see the slider show the correct values. Do I set it to the max? The Max is 332W, while clicking Default sets it to 289W.
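To cross-check that the slider is showing the eGPU's limits rather than the APU's, the power cap values can also be read straight from the amdgpu hwmon directory, which reports them in microwatts. This is only a sketch: the function name `read_power_caps` is hypothetical, the hwmon path has to be looked up per system (something like /sys/class/drm/card1/device/hwmon/hwmon5, where both numbers are placeholders), and not every kernel exposes every attribute, so missing files are skipped.

```python
from pathlib import Path


def read_power_caps(hwmon_dir):
    """Return the available power1_cap* values in watts (sysfs uses microwatts)."""
    caps = {}
    for name in ("power1_cap", "power1_cap_default", "power1_cap_max"):
        attr = Path(hwmon_dir) / name
        if attr.exists():
            caps[name] = int(attr.read_text().strip()) / 1_000_000
    return caps
```

This only reads the values; it does not say which setting is the right one for a given card.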

mindxpertOP•4/23/25, 9:36 PM

Re-adding thunderbolt.host_reset=0 makes the external monitor the default after a reboot. I have to stick with it.

Zetarancio•4/23/25, 10:16 PM

Ok. Then you should test with and without nommconf

mindxpertOP•4/23/25, 11:00 PM

Played for about an hour and a half with no crashes. This is with nommconf on. Will try again tomorrow. If it does not crash then I’m not removing nommconf or trying without it anymore.

mindxpertOP•4/23/25, 11:00 PM

Thanks for the help!

Zetarancio•4/26/25, 12:53 PM

Ok. Report here in case you need something else, close the thread when needed

mindxpertOP•4/26/25, 7:35 PM

I've gotten to play a bit more with my setup. So far I haven't seen any crashes. However, pci=nommconf caused my system to freeze upon waking from sleep: after wake it would not display anything on the internal or external display until a force restart. The buttons were also unresponsive, so I could not put it to sleep again. With it removed, sleep/wake are working again. I'll play some more in the upcoming days and conclude whether the wattage + clock limits have fixed my crashes. Appreciate your help!

mindxpertOP•4/27/25, 7:37 PM

I got freezes when waking up today without pci=nommconf, so that might not be the culprit for the freezes. I will not be re-adding it until I figure out the freeze, just in case. I don't think the LACT watt/clock changes could have affected that, and I'm seeing some HHD exceptions, so I started another post on that.

mindxpertOP•5/1/25, 12:17 PM

I think I have a better understanding of these crashes. They only happen if the device has been sleeping and I then wake it up and start playing. Within 20 minutes, I'll get a freeze. I can feel when the freeze is about to happen, because in the last 20 seconds or so it will start to slow down, getting down to around 5 fps (although the performance overlay says 60+).

If I shut down the device and turn it on, then I can play fine; I tested up to 3 hours with no freezes at all. Once I put it to sleep and wake it up, the freeze will happen for sure within the next 20-30 minutes.

mindxpertOP•5/1/25, 12:18 PM

I want to get this fixed because I usually just put the device to sleep between short gaming sessions and don't want to spend 2-3 minutes to get to the game from a cold start.

mindxpertOP•5/1/25, 12:20 PM

I'll try to collect some logs when that happens again. From what I've seen, LACT says it cannot pick up the GPU to set clocks after a wake-up, but I can see the correct clocks being applied in the overlay. Maybe I'll give this a try without LACT and see if the freeze still happens.
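Once logs are collected, the "freeze within 20-30 minutes of waking" pattern could be checked by measuring the time between the last resume message and the first error line. The sketch below assumes syslog-style timestamps like the journal excerpts in this thread ("Apr 23 23:25:16 ..."). Both markers are parameters, because the exact text is an assumption: "PM: suspend exit" is what the kernel usually logs on resume, and the error marker is whatever shows up near the freeze. The function name `seconds_between` is hypothetical, and runs that cross midnight are not handled since the timestamps carry no year.

```python
from datetime import datetime


def seconds_between(lines, start_marker, end_marker):
    """Seconds from the last start_marker line to the first end_marker line."""

    def stamp(line):
        # Syslog-style prefix, e.g. "May 05 23:53:03"; the year is not recorded.
        return datetime.strptime(line[:15], "%b %d %H:%M:%S")

    start = None
    for line in lines:
        if start_marker in line:
            start = stamp(line)  # keep the most recent resume seen so far
        elif end_marker in line:
            if start is None:
                return None  # error occurred without a preceding resume
            return (stamp(line) - start).total_seconds()
    return None
```

A result of `None` for a crash log would mean the error happened without any earlier resume in that log, which would argue against the sleep theory.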

Zetarancio•5/1/25, 12:49 PM

Ahhhh...if it's related to sleeping I am afraid it will not be fixed. I always had trouble with that since 2017 with many eGPU and GPU combinations

Zetarancio•5/1/25, 12:50 PM

There are always sync issues between the devices

mindxpertOP•5/1/25, 8:08 PM

Do you turn it off after you are done?

mindxpertOP•5/1/25, 8:11 PM

What I find odd is that it doesn't freeze immediately after waking up. Why can it play fine for half an hour and then decide it's no good? I won't give up and will see if I can pull something meaningful from the logs.

Zetarancio•5/1/25, 11:49 PM

I will test it out in the next few weeks. Keep me posted if you find something meaningful

Zetarancio•5/5/25, 1:15 AM

The only thing I could come up with is idle=nomwait. This should help with AMD sleep issues. Can you reproduce the issue? Would you be able to grab a log when this happens?

mindxpertOP•5/5/25, 7:06 AM

I'll give that a try later today. I have some logs that I can pull out once I'm home. Thanks!

mindxpertOP•5/5/25, 9:58 PM

Just had a freeze again with idle=nomwait. It should be in the last minute or so of this log file.

freeze.txt (1.33 MB)

Zetarancio•5/5/25, 10:33 PM

The issue seems to arise from:

May 05 23:53:03 bazzite kernel: amdgpu 0000:08:00.0: amdgpu: Failed to export SMU metrics table!
May 05 23:53:03 bazzite kernel: amdgpu 0000:08:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?

Can you test with amdgpu.smu_metrics=0?
If it happens again, please try with thunderbolt.host_reset=1.
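For anyone scanning a large log for the same failure, the SMU line quoted above can be picked apart with a regular expression. This is a sketch built only from the format of that one quoted line, so other kernel versions may word it differently; `SMU_RE` and `parse_smu_error` are hypothetical names. An all-ones response (0xFFFFFFFF) is also what a PCIe read typically returns when the device is no longer reachable, which would fit the earlier guess of a brief disconnect, though the log line alone does not prove that.

```python
import re

# Pattern derived from the single log line quoted in this thread.
SMU_RE = re.compile(
    r"amdgpu (?P<pci>[0-9a-f:.]+): amdgpu: SMU: "
    r"response:(?P<response>0x[0-9A-Fa-f]+) for index:(?P<index>\d+) "
    r"param:(?P<param>0x[0-9A-Fa-f]+) message:(?P<message>\w+)"
)


def parse_smu_error(line):
    """Extract the fields of an SMU response error line, or None if no match."""
    m = SMU_RE.search(line)
    if m is None:
        return None
    info = m.groupdict()
    info["index"] = int(info["index"])
    # Flag the all-ones pattern separately so it is easy to filter on.
    info["all_ones"] = int(info["response"], 16) == 0xFFFFFFFF
    return info
```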

mindxpertOP•5/6/25, 6:04 PM

kernel: amdgpu: unknown parameter 'smu_metrics' ignored

Seems like smu_metrics is not a param for amdgpu on this kernel version.
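A quick way to check a parameter name before adding it as a karg is to look for it under /sys/module/<module>/parameters. The sketch below does that; `module_has_param` is a hypothetical helper. One caveat: parameters that a module declares without sysfs permissions do not appear there, so a missing entry is a strong hint rather than proof, and `modinfo -p amdgpu` gives the fuller list.

```python
from pathlib import Path


def module_has_param(module, param, sys_module="/sys/module"):
    """True if the loaded module exposes the given parameter in sysfs."""
    return (Path(sys_module) / module / "parameters" / param).exists()
```

For example, `module_has_param("amdgpu", "runpm")` could be compared against `module_has_param("amdgpu", "smu_metrics")` on the same machine.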

mindxpertOP•5/6/25, 6:05 PM

I'll try again with thunderbolt.host_reset=1, but the last time the eGPU did not output to the external display on reboot.

mindxpertOP•5/6/25, 7:10 PM

I'm not able to get my device to recognize the eGPU with thunderbolt.host_reset=1, so I'm reverting it back to 0.

I also tried amdgpu.ppfeaturemask=0xfffd3fff based on some research I was doing, and it made things worse: getting the device to sleep was hit and miss. Sometimes it would freeze while trying to go to sleep, so I reverted it back to 0xfff7ffff.

Zetarancio•5/7/25, 4:27 PM

Ok. It's something related to the SMU, but I'm having trouble thinking of anything else to try
