in most engines you have a bunch of small transfers, which you would typically batch and connect with semaphores (or timeline semaphores if you support them), and then wait on the last one in the chain with a fence. You can use events in the middle as a lightweight way of detecting that a specific transfer is already done (if you don't have timeline semaphores)
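the event trick looks roughly like this, as a sketch (the per-transfer `VkEvent` and the helper names are illustrative, not from any particular engine):

```c
#include <vulkan/vulkan.h>

// After recording a transfer, set an event so the host can cheaply poll
// whether this specific copy has finished (no fence per transfer needed).
void record_copy_with_event(VkCommandBuffer cmd, VkBuffer src, VkBuffer dst,
                            VkBufferCopy region, VkEvent done)
{
    vkCmdCopyBuffer(cmd, src, dst, 1, &region);
    vkCmdSetEvent(cmd, done, VK_PIPELINE_STAGE_TRANSFER_BIT);
}

// Host side: lightweight non-blocking "is it done yet?" check.
int transfer_done(VkDevice device, VkEvent done)
{
    return vkGetEventStatus(device, done) == VK_EVENT_SET;
}
```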
well, you could export a Vulkan fence as a win32 object and then do unholy things with it, but from the spec I don't think you're guaranteed that the returned win32 handle supports usage as a wait handle (and NVIDIA specifically doesn't support this, apparently)
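for reference, the export path would be roughly this sketch (assuming VK_KHR_external_fence_win32 is enabled; the WaitForSingleObject part at the end is exactly what the spec doesn't promise):

```c
#define VK_USE_PLATFORM_WIN32_KHR
#include <windows.h>
#include <vulkan/vulkan.h>

HANDLE export_fence_handle(VkDevice device, VkFence *outFence)
{
    // Create a fence that is exportable as an opaque win32 handle.
    VkExportFenceCreateInfo exportInfo = {
        .sType = VK_STRUCTURE_TYPE_EXPORT_FENCE_CREATE_INFO,
        .handleTypes = VK_EXTERNAL_FENCE_HANDLE_TYPE_OPAQUE_WIN32_BIT,
    };
    VkFenceCreateInfo fenceInfo = {
        .sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO,
        .pNext = &exportInfo,
    };
    vkCreateFence(device, &fenceInfo, NULL, outFence);

    // Pull the win32 handle out (extension function, so load it first).
    PFN_vkGetFenceWin32HandleKHR getFenceWin32Handle =
        (PFN_vkGetFenceWin32HandleKHR)
            vkGetDeviceProcAddr(device, "vkGetFenceWin32HandleKHR");
    VkFenceGetWin32HandleInfoKHR getInfo = {
        .sType = VK_STRUCTURE_TYPE_FENCE_GET_WIN32_HANDLE_INFO_KHR,
        .fence = *outFence,
        .handleType = VK_EXTERNAL_FENCE_HANDLE_TYPE_OPAQUE_WIN32_BIT,
    };
    HANDLE handle = NULL;
    getFenceWin32Handle(device, &getInfo, &handle);

    // The unholy part: nothing in the spec guarantees this handle is
    // actually waitable, so this may or may not work per driver:
    // WaitForSingleObject(handle, INFINITE);
    return handle;
}
```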
Also, regarding transfers, that's exactly why timeline semaphores are such an amazing feature. You just make a central semaphore, and whenever you transfer something you give it an ID (by incrementing a counter) and "enqueue" the transfer task by wrapping your GPU work in a starting "wait for the semaphore to reach ID - 1" and, at the end, advancing the semaphore to ID. Then you a) get free synchronization with little to no work, b) can check whether a specific transfer has completed via just "is the semaphore counter >= ID", AND c) even get pretty realistic loading bars: if the start of a level load was ID = 185 and the end is ID = 1835, just poll the semaphore counter and show the progress bar as (counter - 185) / (1835 - 185)
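a rough sketch of that whole pattern, assuming Vulkan 1.2 timeline semaphores (the helper names and the ID bookkeeping are mine):

```c
#include <vulkan/vulkan.h>

// One central timeline semaphore, starting at 0.
VkSemaphore create_timeline(VkDevice device)
{
    VkSemaphoreTypeCreateInfo typeInfo = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO,
        .semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE,
        .initialValue = 0,
    };
    VkSemaphoreCreateInfo info = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
        .pNext = &typeInfo,
    };
    VkSemaphore sem;
    vkCreateSemaphore(device, &info, NULL, &sem);
    return sem;
}

// Enqueue transfer number `id`: wait until the timeline reaches id - 1,
// run the commands, then advance the timeline to id.
void enqueue_transfer(VkQueue queue, VkSemaphore timeline,
                      VkCommandBuffer cmd, uint64_t id)
{
    uint64_t waitValue = id - 1, signalValue = id;
    VkTimelineSemaphoreSubmitInfo timelineInfo = {
        .sType = VK_STRUCTURE_TYPE_TIMELINE_SEMAPHORE_SUBMIT_INFO,
        .waitSemaphoreValueCount = 1,
        .pWaitSemaphoreValues = &waitValue,
        .signalSemaphoreValueCount = 1,
        .pSignalSemaphoreValues = &signalValue,
    };
    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_TRANSFER_BIT;
    VkSubmitInfo submit = {
        .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .pNext = &timelineInfo,
        .waitSemaphoreCount = 1,
        .pWaitSemaphores = &timeline,
        .pWaitDstStageMask = &waitStage,
        .commandBufferCount = 1,
        .pCommandBuffers = &cmd,
        .signalSemaphoreCount = 1,
        .pSignalSemaphores = &timeline,
    };
    vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE);
}

// "Has transfer `id` completed?" and the loading-bar fraction.
int transfer_complete(VkDevice device, VkSemaphore timeline, uint64_t id)
{
    uint64_t value = 0;
    vkGetSemaphoreCounterValue(device, timeline, &value);
    return value >= id;
}

float load_progress(VkDevice device, VkSemaphore timeline,
                    uint64_t startId, uint64_t endId)
{
    uint64_t value = 0;
    vkGetSemaphoreCounterValue(device, timeline, &value);
    if (value <= startId) return 0.0f;
    if (value >= endId)   return 1.0f;
    return (float)(value - startId) / (float)(endId - startId);
}
```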
timeline semaphores are a feature of the VK_KHR_timeline_semaphore extension (promoted to core in Vulkan 1.2)... But if you don't support oldish devices, almost all drivers have backported them, as they require nearly no hardware support
memory transfer is pretty easy to abstract away, and you can always replace the timeline semaphore with either a fence plus one submission per transfer, or a semaphore + event pair per transfer (and you can batch the transfers into bigger submissions)
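the fence fallback, sketched (one fence per batched submission; the helper names are illustrative):

```c
#include <vulkan/vulkan.h>

// Without timeline semaphores: one fence per (batched) submission,
// polled or waited on from the host.
VkFence submit_batch(VkDevice device, VkQueue queue, VkCommandBuffer cmd)
{
    VkFenceCreateInfo fenceInfo = {
        .sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO,
    };
    VkFence fence;
    vkCreateFence(device, &fenceInfo, NULL, &fence);

    VkSubmitInfo submit = {
        .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .commandBufferCount = 1,
        .pCommandBuffers = &cmd,
    };
    vkQueueSubmit(queue, 1, &submit, fence);
    return fence;
}

// Non-blocking "is this batch done?" check...
int batch_done(VkDevice device, VkFence fence)
{
    return vkGetFenceStatus(device, fence) == VK_SUCCESS;
}

// ...or a blocking wait at the end of the chain.
void wait_batch(VkDevice device, VkFence fence)
{
    vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
}
```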
it depends, as with anything. The most important things are a) keep your GPU busy, as always, and b) use a dedicated transfer queue if you really need perf. Nvidia drivers (used to? still do? a year ago it was this way) only use half the available PCIe bandwidth on non-transfer queues.
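picking the dedicated transfer queue family is the usual flag check, something like this (a sketch; the fallback policy is up to you):

```c
#include <vulkan/vulkan.h>
#include <stdint.h>

// Find a dedicated transfer queue family: TRANSFER set, GRAPHICS and
// COMPUTE clear. On discrete GPUs this is typically the DMA engine.
uint32_t find_transfer_queue_family(VkPhysicalDevice gpu)
{
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, NULL);
    VkQueueFamilyProperties props[32];
    if (count > 32) count = 32;
    vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, props);

    for (uint32_t i = 0; i < count; i++) {
        VkQueueFlags f = props[i].queueFlags;
        if ((f & VK_QUEUE_TRANSFER_BIT) &&
            !(f & (VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT)))
            return i;  // dedicated transfer family
    }
    return UINT32_MAX;  // none found; fall back to the graphics queue
}
```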
Regarding buffers, I always went the route of having as many 1 MB host-mapped buffers as I could, or one buffer as big as the hardware supported (yes, some platforms don't have 1 MB of host-mapped space), and then used that as my hole to stuff data through. I designed everything around unblocking that buffer as soon as possible, since it's the main bottleneck. I think I also saw some gains on modern hardware by using larger blocks and manually flushing the written ranges (which wastes some space but may improve throughput, because the driver knows exactly what to transfer). I think I eventually moved to always manually flushing the written data range, since host-coherent memory was often a subset of host-visible memory and there's not much reason to make it coherent in the first place
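the manual-flush part looks roughly like this, assuming a persistently mapped non-coherent staging buffer and a power-of-two nonCoherentAtomSize (the `stage_data` helper and its parameters are mine):

```c
#include <vulkan/vulkan.h>
#include <string.h>

// Copy into the mapped staging buffer, then flush exactly the written
// range so the driver knows what actually changed. `mapped` is the
// pointer from vkMapMemory, `atom` is limits.nonCoherentAtomSize.
void stage_data(VkDevice device, VkDeviceMemory memory, void *mapped,
                VkDeviceSize offset, const void *src, VkDeviceSize size,
                VkDeviceSize atom)
{
    memcpy((char *)mapped + offset, src, size);

    // Flush ranges must be aligned to nonCoherentAtomSize (clamping the
    // end to the allocation size is elided here for brevity).
    VkDeviceSize begin = offset & ~(atom - 1);
    VkDeviceSize end   = (offset + size + atom - 1) & ~(atom - 1);
    VkMappedMemoryRange range = {
        .sType  = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE,
        .memory = memory,
        .offset = begin,
        .size   = end - begin,
    };
    vkFlushMappedMemoryRanges(device, 1, &range);
}
```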
pretty sure that's just the best way to do it these days. Unfortunately there haven't been many improvements in sub-allocation performance, which I think could be better, since this approach hands the coherence problem off to the driver without any of the downsides of a fixed block size
iirc they also have some automatic decompression stuff that can align compressed data to specific block sizes, so the unpacked data fits exactly into one of their GPU transfer blocks?