As with anything, it depends. The most important things are a) keep your GPU busy, as always, and b) use a dedicated transfer queue if you really need the throughput.
Nvidia drivers (used to? still do? it was this way a year ago) only use half the available PCIe bandwidth on non-transfer queues.
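For reference, here's roughly what picking that queue looks like. This is a minimal sketch assuming a valid VkPhysicalDevice; families that support transfer but not graphics/compute are the dedicated copy-engine path:

```c++
#include <vulkan/vulkan.h>
#include <vector>
#include <cstdint>

uint32_t findTransferQueueFamily(VkPhysicalDevice phys) {
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(phys, &count, nullptr);
    std::vector<VkQueueFamilyProperties> families(count);
    vkGetPhysicalDeviceQueueFamilyProperties(phys, &count, families.data());

    // Prefer a family that supports transfer but NOT graphics/compute;
    // on Nvidia this maps to the dedicated copy engine.
    for (uint32_t i = 0; i < count; ++i) {
        VkQueueFlags f = families[i].queueFlags;
        if ((f & VK_QUEUE_TRANSFER_BIT) &&
            !(f & (VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT)))
            return i;
    }
    // Fall back to any transfer-capable family.
    for (uint32_t i = 0; i < count; ++i)
        if (families[i].queueFlags & VK_QUEUE_TRANSFER_BIT)
            return i;
    return UINT32_MAX; // no transfer-capable family (shouldn't happen)
}
```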
Regarding buffers, I always went the route of having as many 1 MiB host-mapped buffers as I could, or one buffer as big as the hardware supported (yes, some platforms don't even have 1 MiB of host-mappable space), and used that as the hole I stuffed all data through. I designed everything around unblocking that buffer as soon as possible, since it's the main bottleneck.
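A minimal sketch of that staging setup, assuming a valid device/phys and a hypothetical findMemoryType helper (not shown) that returns a memory type index matching the given property flags:

```c++
// One persistently mapped 1 MiB host-visible staging buffer.
VkBufferCreateInfo bi = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
bi.size = 1 << 20; // 1 MiB staging window
bi.usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT;
bi.sharingMode = VK_SHARING_MODE_EXCLUSIVE;
VkBuffer staging;
vkCreateBuffer(device, &bi, nullptr, &staging);

VkMemoryRequirements req;
vkGetBufferMemoryRequirements(device, staging, &req);

VkMemoryAllocateInfo ai = { VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO };
ai.allocationSize = req.size;
// HOST_VISIBLE only; coherency is handled with manual flushes (see below).
ai.memoryTypeIndex = findMemoryType(phys, req.memoryTypeBits,
                                    VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT);
VkDeviceMemory mem;
vkAllocateMemory(device, &ai, nullptr, &mem);
vkBindBufferMemory(device, staging, mem, 0);

void* mapped = nullptr;
vkMapMemory(device, mem, 0, VK_WHOLE_SIZE, 0, &mapped); // keep mapped forever
```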
I think I also saw some gains on modern hardware by using larger blocks and manually flushing the written ranges (which wastes some space but may improve throughput, because the driver knows exactly what to transfer). I eventually moved to always flushing the mapped range manually: host-coherent memory is often a subset of host-visible memory, and there's not much reason to require coherency in the first place.
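And a sketch of that manual-flush path, assuming the allocation is large enough that the rounded range stays in bounds; atom here is VkPhysicalDeviceLimits::nonCoherentAtomSize:

```c++
// Flush exactly the written range so the driver knows what changed.
// Flushed ranges must be aligned to nonCoherentAtomSize.
void flushWrite(VkDevice device, VkDeviceMemory mem,
                VkDeviceSize offset, VkDeviceSize size,
                VkDeviceSize atom /* nonCoherentAtomSize */) {
    VkMappedMemoryRange range = { VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE };
    range.memory = mem;
    range.offset = offset - (offset % atom);           // round start down
    VkDeviceSize end = ((offset + size + atom - 1) / atom) * atom; // round end up
    range.size = end - range.offset;
    vkFlushMappedMemoryRanges(device, 1, &range);
}
```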