I'm looking to do a fine-tune with fp8, but the Dr. setting for it uses shared memory and is slow, at around 10s/it. From speaking to kohya, though, it seems that using shared memory is intended behavior.

