in most engines you have a bunch of small transfers, which you would typically batch and connect with semaphores (or timeline semaphores if you support them), and then wait on the last one in the chain with a fence. You can use events in the middle as a lightweight way of detecting that a specific transfer is already done (if you don't have timeline semaphores)
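the event trick looks roughly like this, as a sketch (the per-transfer `VkEvent` and the helper names are illustrative, not from any particular engine):

```c
#include <vulkan/vulkan.h>

// After recording a transfer, set an event so the host can cheaply poll
// whether this specific copy has finished (no fence per transfer needed).
void record_copy_with_event(VkCommandBuffer cmd, VkBuffer src, VkBuffer dst,
                            VkBufferCopy region, VkEvent done)
{
    vkCmdCopyBuffer(cmd, src, dst, 1, &region);
    vkCmdSetEvent(cmd, done, VK_PIPELINE_STAGE_TRANSFER_BIT);
}

// Host side: lightweight non-blocking "is it done yet?" check.
int transfer_done(VkDevice device, VkEvent done)
{
    return vkGetEventStatus(device, done) == VK_EVENT_SET;
}
```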
well, you could export a Vulkan fence as a win32 object and then do unholy things with it, but from the spec I don't think you're guaranteed that the returned win32 handle supports usage as a wait handle (and NVIDIA specifically doesn't support this, apparently)
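for reference, the export path would be roughly this sketch (assuming VK_KHR_external_fence_win32 is enabled; the WaitForSingleObject part at the end is exactly what the spec doesn't promise):

```c
#define VK_USE_PLATFORM_WIN32_KHR
#include <windows.h>
#include <vulkan/vulkan.h>

HANDLE export_fence_handle(VkDevice device, VkFence *outFence)
{
    // Create a fence that is exportable as an opaque win32 handle.
    VkExportFenceCreateInfo exportInfo = {
        .sType = VK_STRUCTURE_TYPE_EXPORT_FENCE_CREATE_INFO,
        .handleTypes = VK_EXTERNAL_FENCE_HANDLE_TYPE_OPAQUE_WIN32_BIT,
    };
    VkFenceCreateInfo fenceInfo = {
        .sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO,
        .pNext = &exportInfo,
    };
    vkCreateFence(device, &fenceInfo, NULL, outFence);

    // Pull the win32 handle out (extension function, so load it first).
    PFN_vkGetFenceWin32HandleKHR getFenceWin32Handle =
        (PFN_vkGetFenceWin32HandleKHR)
            vkGetDeviceProcAddr(device, "vkGetFenceWin32HandleKHR");
    VkFenceGetWin32HandleInfoKHR getInfo = {
        .sType = VK_STRUCTURE_TYPE_FENCE_GET_WIN32_HANDLE_INFO_KHR,
        .fence = *outFence,
        .handleType = VK_EXTERNAL_FENCE_HANDLE_TYPE_OPAQUE_WIN32_BIT,
    };
    HANDLE handle = NULL;
    getFenceWin32Handle(device, &getInfo, &handle);

    // The unholy part: nothing in the spec guarantees this handle is
    // actually waitable, so this may or may not work per driver:
    // WaitForSingleObject(handle, INFINITE);
    return handle;
}
```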
Also, regarding transfers, that's exactly why timeline semaphores are such an amazing feature. You just make a central semaphore, and whenever you transfer something you give it an ID (by incrementing a counter) and "enqueue" the transfer task by wrapping your GPU work in a starting "wait for the semaphore to reach ID - 1" and, at the end, advancing the semaphore to ID. Then you a) get free synchronization with little to no work, b) can check whether a specific transfer has completed via just "is the semaphore counter >= ID", AND c) even get pretty realistic loading bars: if the start of a level load was ID = 185 and the end is ID = 1835, just poll the semaphore counter and show the progress bar as (counter - 185) / (1835 - 185)
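a rough sketch of that whole pattern, assuming Vulkan 1.2 timeline semaphores (the helper names and the ID bookkeeping are mine):

```c
#include <vulkan/vulkan.h>

// One central timeline semaphore, starting at 0.
VkSemaphore create_timeline(VkDevice device)
{
    VkSemaphoreTypeCreateInfo typeInfo = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO,
        .semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE,
        .initialValue = 0,
    };
    VkSemaphoreCreateInfo info = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
        .pNext = &typeInfo,
    };
    VkSemaphore sem;
    vkCreateSemaphore(device, &info, NULL, &sem);
    return sem;
}

// Enqueue transfer number `id`: wait until the timeline reaches id - 1,
// run the commands, then advance the timeline to id.
void enqueue_transfer(VkQueue queue, VkSemaphore timeline,
                      VkCommandBuffer cmd, uint64_t id)
{
    uint64_t waitValue = id - 1, signalValue = id;
    VkTimelineSemaphoreSubmitInfo timelineInfo = {
        .sType = VK_STRUCTURE_TYPE_TIMELINE_SEMAPHORE_SUBMIT_INFO,
        .waitSemaphoreValueCount = 1,
        .pWaitSemaphoreValues = &waitValue,
        .signalSemaphoreValueCount = 1,
        .pSignalSemaphoreValues = &signalValue,
    };
    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_TRANSFER_BIT;
    VkSubmitInfo submit = {
        .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .pNext = &timelineInfo,
        .waitSemaphoreCount = 1,
        .pWaitSemaphores = &timeline,
        .pWaitDstStageMask = &waitStage,
        .commandBufferCount = 1,
        .pCommandBuffers = &cmd,
        .signalSemaphoreCount = 1,
        .pSignalSemaphores = &timeline,
    };
    vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE);
}

// "Has transfer `id` completed?" and the loading-bar fraction.
int transfer_complete(VkDevice device, VkSemaphore timeline, uint64_t id)
{
    uint64_t value = 0;
    vkGetSemaphoreCounterValue(device, timeline, &value);
    return value >= id;
}

float load_progress(VkDevice device, VkSemaphore timeline,
                    uint64_t startId, uint64_t endId)
{
    uint64_t value = 0;
    vkGetSemaphoreCounterValue(device, timeline, &value);
    if (value <= startId) return 0.0f;
    if (value >= endId)   return 1.0f;
    return (float)(value - startId) / (float)(endId - startId);
}
```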
timeline semaphores are a feature of the VK_KHR_timeline_semaphore extension (promoted to core in Vulkan 1.2)... But if you don't support oldish devices, almost all drivers have backported them, as they require nearly no hardware support
memory transfer is pretty easy to abstract away, and you can always replace the timeline semaphore with either a fence plus one submission per transfer, or a semaphore + event pair per transfer (and you can batch the transfers into bigger submissions)
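the fence fallback, sketched (one fence per batched submission; the helper names are illustrative):

```c
#include <vulkan/vulkan.h>

// Without timeline semaphores: one fence per (batched) submission,
// polled or waited on from the host.
VkFence submit_batch(VkDevice device, VkQueue queue, VkCommandBuffer cmd)
{
    VkFenceCreateInfo fenceInfo = {
        .sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO,
    };
    VkFence fence;
    vkCreateFence(device, &fenceInfo, NULL, &fence);

    VkSubmitInfo submit = {
        .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .commandBufferCount = 1,
        .pCommandBuffers = &cmd,
    };
    vkQueueSubmit(queue, 1, &submit, fence);
    return fence;
}

// Non-blocking "is this batch done?" check...
int batch_done(VkDevice device, VkFence fence)
{
    return vkGetFenceStatus(device, fence) == VK_SUCCESS;
}

// ...or a blocking wait at the end of the chain.
void wait_batch(VkDevice device, VkFence fence)
{
    vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
}
```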
it depends, as with anything. The most important things are a) keep your GPU busy, as always, and b) use a dedicated transfer queue if you really need perf. Nvidia drivers (used to? still do? a year ago it was this way) only use half the available PCIe bandwidth on non-transfer queues.
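picking the dedicated transfer queue family is the usual flag check, something like this (a sketch; the fallback policy is up to you):

```c
#include <vulkan/vulkan.h>
#include <stdint.h>

// Find a dedicated transfer queue family: TRANSFER set, GRAPHICS and
// COMPUTE clear. On discrete GPUs this is typically the DMA engine.
uint32_t find_transfer_queue_family(VkPhysicalDevice gpu)
{
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, NULL);
    VkQueueFamilyProperties props[32];
    if (count > 32) count = 32;
    vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, props);

    for (uint32_t i = 0; i < count; i++) {
        VkQueueFlags f = props[i].queueFlags;
        if ((f & VK_QUEUE_TRANSFER_BIT) &&
            !(f & (VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT)))
            return i;  // dedicated transfer family
    }
    return UINT32_MAX;  // none found; fall back to the graphics queue
}
```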
Regarding buffers, I always went the route of having as many 1 MB host-mapped buffers as I could, or one buffer as big as the hardware supported (yes, some platforms don't have 1 MB of host-mapped space), and then used that as my hole to stuff data through. I designed everything around unblocking that buffer as soon as possible, since it's the main bottleneck. I think I also saw some gains on modern hardware by using larger blocks and manually flushing the written ranges (which wastes some space but may improve throughput, because the driver knows exactly what to transfer). I think I eventually moved to always manually flushing the written data range, since host-coherent memory was often a subset of host-visible memory and there's not much reason to make it coherent in the first place
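the manual-flush part looks roughly like this, assuming a persistently mapped non-coherent staging buffer and a power-of-two nonCoherentAtomSize (the `stage_data` helper and its parameters are mine):

```c
#include <vulkan/vulkan.h>
#include <string.h>

// Copy into the mapped staging buffer, then flush exactly the written
// range so the driver knows what actually changed. `mapped` is the
// pointer from vkMapMemory, `atom` is limits.nonCoherentAtomSize.
void stage_data(VkDevice device, VkDeviceMemory memory, void *mapped,
                VkDeviceSize offset, const void *src, VkDeviceSize size,
                VkDeviceSize atom)
{
    memcpy((char *)mapped + offset, src, size);

    // Flush ranges must be aligned to nonCoherentAtomSize (clamping the
    // end to the allocation size is elided here for brevity).
    VkDeviceSize begin = offset & ~(atom - 1);
    VkDeviceSize end   = (offset + size + atom - 1) & ~(atom - 1);
    VkMappedMemoryRange range = {
        .sType  = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE,
        .memory = memory,
        .offset = begin,
        .size   = end - begin,
    };
    vkFlushMappedMemoryRanges(device, 1, &range);
}
```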
pretty sure that's just the best way to do it these days. Unfortunately there haven't been many improvements in sub-allocation performance, which I think could be better, since this approach hands the coherence problem off to the driver without any of the downsides of a fixed block size
iirc they also have some automatic decompression stuff that can align compressed data to specific block sizes, so the unpacked data fits exactly into one of their GPU transfer blocks?