As with anything, it depends. The most important things are a) keep your GPU busy, as always, and b) use a dedicated transfer queue if you really need the throughput.
Nvidia drivers (used to? still do? it was this way a year ago) only use half the available PCIe bandwidth on non-transfer queues.
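For reference, here's roughly what picking that queue looks like. This is a minimal sketch assuming a valid VkPhysicalDevice; families that support transfer but not graphics/compute are the dedicated copy-engine path:

```c++
#include <vulkan/vulkan.h>
#include <vector>
#include <cstdint>

uint32_t findTransferQueueFamily(VkPhysicalDevice phys) {
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(phys, &count, nullptr);
    std::vector<VkQueueFamilyProperties> families(count);
    vkGetPhysicalDeviceQueueFamilyProperties(phys, &count, families.data());

    // Prefer a family that supports transfer but NOT graphics/compute;
    // on Nvidia this maps to the dedicated copy engine.
    for (uint32_t i = 0; i < count; ++i) {
        VkQueueFlags f = families[i].queueFlags;
        if ((f & VK_QUEUE_TRANSFER_BIT) &&
            !(f & (VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT)))
            return i;
    }
    // Fall back to any transfer-capable family.
    for (uint32_t i = 0; i < count; ++i)
        if (families[i].queueFlags & VK_QUEUE_TRANSFER_BIT)
            return i;
    return UINT32_MAX; // no transfer-capable family (shouldn't happen)
}
```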
Regarding buffers, I always went the route of having as many 1 MiB host-mapped buffers as I could, or one buffer as big as the hardware supported (yes, some platforms don't even have 1 MiB of host-mappable space), and used that as the hole I stuffed all data through. I designed everything around unblocking that buffer as soon as possible, since it's the main bottleneck.
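A minimal sketch of that staging setup, assuming a valid device/phys and a hypothetical findMemoryType helper (not shown) that returns a memory type index matching the given property flags:

```c++
// One persistently mapped 1 MiB host-visible staging buffer.
VkBufferCreateInfo bi = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
bi.size = 1 << 20; // 1 MiB staging window
bi.usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT;
bi.sharingMode = VK_SHARING_MODE_EXCLUSIVE;
VkBuffer staging;
vkCreateBuffer(device, &bi, nullptr, &staging);

VkMemoryRequirements req;
vkGetBufferMemoryRequirements(device, staging, &req);

VkMemoryAllocateInfo ai = { VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO };
ai.allocationSize = req.size;
// HOST_VISIBLE only; coherency is handled with manual flushes (see below).
ai.memoryTypeIndex = findMemoryType(phys, req.memoryTypeBits,
                                    VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT);
VkDeviceMemory mem;
vkAllocateMemory(device, &ai, nullptr, &mem);
vkBindBufferMemory(device, staging, mem, 0);

void* mapped = nullptr;
vkMapMemory(device, mem, 0, VK_WHOLE_SIZE, 0, &mapped); // keep mapped forever
```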
I think I also saw some gains on modern hardware by using larger blocks and manually flushing the written ranges (which wastes some space but may improve throughput, because the driver knows exactly what to transfer). I eventually moved to always flushing the mapped range manually: host-coherent memory is often a subset of host-visible memory, and there's not much reason to require coherency in the first place.
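And a sketch of that manual-flush path, assuming the allocation is large enough that the rounded range stays in bounds; atom here is VkPhysicalDeviceLimits::nonCoherentAtomSize:

```c++
// Flush exactly the written range so the driver knows what changed.
// Flushed ranges must be aligned to nonCoherentAtomSize.
void flushWrite(VkDevice device, VkDeviceMemory mem,
                VkDeviceSize offset, VkDeviceSize size,
                VkDeviceSize atom /* nonCoherentAtomSize */) {
    VkMappedMemoryRange range = { VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE };
    range.memory = mem;
    range.offset = offset - (offset % atom);           // round start down
    VkDeviceSize end = ((offset + size + atom - 1) / atom) * atom; // round end up
    range.size = end - range.offset;
    vkFlushMappedMemoryRanges(device, 1, &range);
}
```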