RunPod•4mo ago
panos.firbas

How should I store/load my data for network storage?

Hi, I've been keeping my data in an SQL database, which is excruciatingly slow on RunPod with network storage.
But I don't see any obvious alternative... In what type of file could my data live on disk, in order for it to be loaded fast in RunPod's network storage scenario?
70 Replies
justin
justin•4mo ago
How much data are you loading? How are you loading it? Why not a CSV / JSON file or something? Must it be SQL? Is this millions of rows? etc.
ashleyk
ashleyk•4mo ago
Anything read from network storage is as slow as constipation, network storage is garbage and unusable
panos.firbas
panos.firbas•4mo ago
Hi, my data is not text, it's genomics, so right now I'm saving numpy arrays in an SQL database. It's about 200 GB.
justin
justin•4mo ago
Yeah, the way I see network storage is as an external hard drive lol. Do you need to load all 200 GB at once? Does the data change?
ashleyk
ashleyk•4mo ago
An external hard drive is MUCH faster than network storage
panos.firbas
panos.firbas•4mo ago
It does not change; I need to read it once per epoch, ideally in random order.
ashleyk
ashleyk•4mo ago
Network storage is like sending stuff via a pigeon, but then again a pigeon will probably even be faster
justin
justin•4mo ago
How much data are you loading per epoch?
panos.firbas
panos.firbas•4mo ago
All of it, eventually.
justin
justin•4mo ago
How are you reading it? Is your data indexed properly? So just to confirm: you have an SQLite database of about 200 GB, and you are randomly selecting chunks of data for epoch cycles?
ashleyk
ashleyk•4mo ago
You can maybe try chatting with @JM about getting some kind of custom storage for your use case, but you may have to commit to a certain level of spend or something.
panos.firbas
panos.firbas•4mo ago
I did a test, a script that just loads all the records from the DB. On my system it's doing something like 6000/second; on RunPod it was doing 2/second.
justin
justin•4mo ago
Honestly, you could probably just divide your data straight up into files of N x Epoch_Size in your network storage. Create a hashmap of all the file names, randomly pick one, remove it from the hashmap, and load that file for the next X cycles. And in the background, in a different thread, you can also load the next file into a variable for the next epoch cycle. Something like the sketch below.
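Rough, untested sketch of that idea, assuming the arrays have already been pre-split into .npy shard files under some directory (the directory name and shard layout here are just placeholders):

import os
import queue
import random
import threading

import numpy as np

SHARD_DIR = "/workspace/shards"  # placeholder: wherever the pre-split .npy shards live

def prefetch_shards(shard_paths, q):
    # Background thread: load shards ahead of time so the next one is already in memory
    for path in shard_paths:
        q.put(np.load(path))  # blocks while the queue is full
    q.put(None)  # sentinel: no more shards

def shuffled_shard_stream():
    shard_paths = [os.path.join(SHARD_DIR, f) for f in os.listdir(SHARD_DIR)]
    random.shuffle(shard_paths)  # new random order every epoch

    q = queue.Queue(maxsize=2)  # keep at most 2 shards in memory at a time
    threading.Thread(target=prefetch_shards, args=(shard_paths, q), daemon=True).start()

    while True:
        shard = q.get()
        if shard is None:
            break
        yield shard

# usage: for shard in shuffled_shard_stream(): train_on(shard)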
panos.firbas
panos.firbas•4mo ago
I shuffle the main index and query the DB index by index. Every new epoch is a new shuffle.
ashleyk
ashleyk•4mo ago
Not surprising, network storage is unbearably slow
justin
justin•4mo ago
I feel that something about this is wrong... What does "shuffle the main index" mean? But if it were me, I'd do this; you might not even need to load anything in the background. I'd just divide the data straight up into plain files and avoid SQLite. SQLite, with all the query time, could also be straight up hurting you just due to the way the searching works over 200 GB.
ashleyk
ashleyk•4mo ago
Both have to load from disk, so I don't see how it will make any difference
panos.firbas
panos.firbas•4mo ago
Essentially:
order = np.arange(lenofdatabase) + 1
np.random.shuffle(order)
then for i in order: SELECT ... WHERE id = i and fetch one row
justin
justin•4mo ago
SQLite has to keep a tree over its data, and walking the indexes one by one in a shuffled order... yeah, this is probably bad
panos.firbas
panos.firbas•4mo ago
but my data is not text!
justin
justin•4mo ago
It looks like it's doing a full table scan. Ah, what is the data though? It's just np arrays, right?
ashleyk
ashleyk•4mo ago
numpy arrays he said
justin
justin•4mo ago
You can serialize it into something
panos.firbas
panos.firbas•4mo ago
Yep, numpy arrays
justin
justin•4mo ago
Honestly, you could do some sort of optimization around that SQL query, like preloading batches of random indexes
panos.firbas
panos.firbas•4mo ago
That can't be faster than loading an array from the disk!
justin
justin•4mo ago
This looks like a random selection over the full database every time. Well, if you're loading from a SQLite database or a plain file, both are from disk, but yeah, I get the point. I think we can go the route of fixing the SQL query
panos.firbas
panos.firbas•4mo ago
Aren't databases incredibly fast at fetching record i from a table when i is a proper index? That's what I'm doing, in a specific order
justin
justin•4mo ago
Let me read over this SQL query and what you're trying to do some more and think xD, it's been a bit since I've worked with it
ashleyk
ashleyk•4mo ago
Network storage is so slow you are probably better off storing your data in a cloud MongoDB or something and fetching from there, I am sure it will be much faster
panos.firbas
panos.firbas•4mo ago
And as I said, this system is blazing fast on my machine. So what if I don't use network storage, but instead spawn a pod with the needed space and just move the data in there at 'runtime'?
ashleyk
ashleyk•4mo ago
Yeah, because your system isn't using network storage, which is at least 20 times slower than normal disk
justin
justin•4mo ago
Yeah, that would work too. I guess you can do a small sample size for testing on a CPU pod or something. The issue I worry about is that with 200 GB worth of data, you are sending off an individual SQL statement per index. But as ashleyk said, maybe MongoDB / PlanetScale etc. would be good, though I'd also worry about the cost you'd incur with that much data
panos.firbas
panos.firbas•4mo ago
Enter my other problem with all this. It looks like my work is blocking TCP ports, but they won't acknowledge it. So I can't scp the data from work; I have to do it from home at 10 Mbps 🙃
ashleyk
ashleyk•4mo ago
Doesn't your work allow normal port 22 though? Then you can get a cheap VM from Scaleway / DigitalOcean / Linode etc. and use it as a jump box from your work to RunPod
justin
justin•4mo ago
My immediate thought is that you could, as you already are doing:
1) Create an array of random indexes, so something like: [x, y, z, a, b, c, ...]
2) Do a batch fetch instead of individual indexes, pulling like 20 at a time: SELECT * FROM your_table WHERE id IN (1, 2, 3, 4, 5);
3) Then in the background, during the epoch, load another 20 indexes (or whatever amount) so they are immediately in memory for when the epoch is done.
This could be harder than what I just stated haha, because now you need to do parallel work, but Python's standard library provides producer/consumer patterns for this sort of thing. I think this is better than what you do now, because right now you are pulling individually in synchronous order, I assume.
Benefits:
1) You aren't sending individual index queries over 200 GB.
2) You are loading in the background during each epoch, so the next batch is ready.
Essentially the background thread just adds to the queue whenever the queue is below some size X, so data is always ready in memory, and it keeps checking whether it needs to add to the queue. Yeah, I think something like this would probably work (rough sketch below). Even though SQL is fast, sending 200 GB worth of individual SQL queries is also just inefficient (even if network storage is slow too haha). Or the batch processing by itself might be good enough; could be worth testing on its own. Orrrrr, as you said, just make something big enough to hold your dataset on the container lol, and see if it's worth going that route too
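Something like this, roughly. Untested, and the table name, column names, and dtype below are placeholders; adjust them to your actual schema:

import random
import sqlite3

import numpy as np

# assumption: a table records(id INTEGER PRIMARY KEY, arr BLOB) where each BLOB
# is a serialized numpy array -- swap in your real table/column names and dtype
def batched_random_fetch(db_path, n_rows, batch_size=256):
    conn = sqlite3.connect(db_path)
    ids = list(range(1, n_rows + 1))
    random.shuffle(ids)  # new shuffle per epoch

    for start in range(0, len(ids), batch_size):
        batch_ids = ids[start:start + batch_size]
        placeholders = ",".join("?" * len(batch_ids))
        rows = conn.execute(
            f"SELECT id, arr FROM records WHERE id IN ({placeholders})",
            batch_ids,
        ).fetchall()
        for _id, blob in rows:
            yield np.frombuffer(blob, dtype=np.float32)  # placeholder dtype

    conn.close()

# usage: for arr in batched_random_fetch("/workspace/data.db", n_rows=500_000): ...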
panos.firbas
panos.firbas•4mo ago
Hey, thanks for all this. I'll get back to it ASAP, but I need to attend to some scary bureaucracy right now!
justin
justin•4mo ago
Also, just btw, about making individual SQL queries like that: SQL tables underneath the hood are usually some sort of tree, which is why it can take a long time, probably Requests * log(N) time. That's why it can make sense to multithread it or batch process it (just to reduce the overhead of requests across the network to the drive; SQLite is probably also doing optimizations when you batch process multiple indexes at once vs one by one).
Okay, so a summary of stuff:
1) Individual SQLite queries like that can be slow because you are doing (number of indexes) * log(N) requests, since SQLite uses a tree under the hood. That's N log N overall, which is very slow.
2) You can make it faster by batch processing, so you get something like (N / batch_size) * log(N) requests, and SQL can probably do some optimizations under the hood so you aren't constantly making tons of round trips back and forth.
3) You can use a producer/consumer pattern, where one thread constantly keeps a thread-safe queue filled to some size X, so your main thread can just keep pulling new data from the queue to run the epoch on. That way the trip to refill the queue happens in parallel with the epoch work.
Gl gl
panos.firbas
panos.firbas•4mo ago
I think I had actually tried the batch SELECT at some point in the past and didn't see any big improvement in speed
justin
justin•4mo ago
Got it, so your best bet is probably still batching + the producer/consumer pattern then. ChatGPT example 😆
import queue
import threading
import time

# Function to simulate data production (enqueueing)
def producer(q, items):
    for item in items:
        print(f"Producing {item}")
        q.put(item)  # Add item to the queue
        time.sleep(1)  # Simulate time-consuming production

# Function to simulate data consumption (dequeueing)
def consumer(q):
    while True:
        item = q.get()  # Remove and return an item from the queue
        if item is None:
            break  # None is used as a signal to stop the consumer
        print(f"Consuming {item}")
        q.task_done()  # Signal that a formerly enqueued task is complete

# Create a FIFO queue
q = queue.Queue()

# List of items to be produced
items = [1, 2, 3, 4, 5]

# Start producer thread
producer_thread = threading.Thread(target=producer, args=(q, items))
producer_thread.start()

# Start consumer thread
consumer_thread = threading.Thread(target=consumer, args=(q,))
consumer_thread.start()

# Wait for all items to be produced
producer_thread.join()

# Signal the consumer to terminate
q.put(None)

# Wait for the consumer to finish processing
consumer_thread.join()
This probably isn't fully correct from what I read, but you get the idea
panos.firbas
panos.firbas•4mo ago
Yeah, I'm familiar with queues in Python
justin
justin•4mo ago
Also, I would actually say that splitting into plain files is faster than an SQL tree here, because creating a hashmap of your file names and reading them is O(N) selection time as you randomly choose and remove from the hashmap, whereas a tree introduces additional overhead to query the SQL table. Since you are going to read through the entire 200 GB anyway, a tree underneath the SQL table isn't necessary. I know you said it's numpy arrays, but serializing and reading them probably won't be that bad. You can try it on a smaller dataset, but just my two cents; I wouldn't know without testing, but in theory an SQL database still has overhead even with indexes because it's usually a B-tree underneath the hood. Both still read from disk, so really your cost is serialization + deserialization + query time, and query time is probably your largest cost right now (plus network storage is slow). But yeah xD, sorry for the long text, hopefully some avenues to look down
panos.firbas
panos.firbas•4mo ago
I could just dump the np arrays as files. If I'm reading files from the disk it should still be much faster than deserializing, plus much less space
justin
justin•4mo ago
Yeah, you could try it on a small dataset vs SQLite: see what the cost is to query over SQLite, vs just keeping a hashmap of all the file names, selecting one at random, removing it from the hashmap, and reading that file. Something like the quick comparison below
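For example, a rough way to time the two approaches on a small sample (untested; the DB path, table name, column name, and file directory are placeholders):

import os
import random
import sqlite3
import time

import numpy as np

def time_sqlite_fetch(db_path, ids):
    # Time fetching each id individually from SQLite (placeholder schema: records(id, arr))
    conn = sqlite3.connect(db_path)
    start = time.perf_counter()
    for i in ids:
        conn.execute("SELECT arr FROM records WHERE id = ?", (i,)).fetchone()
    conn.close()
    return time.perf_counter() - start

def time_file_fetch(file_dir):
    # Time loading the same sample as individual .npy files picked in random order
    paths = [os.path.join(file_dir, f) for f in os.listdir(file_dir)]
    random.shuffle(paths)
    start = time.perf_counter()
    for p in paths:
        np.load(p)
    return time.perf_counter() - start

# usage (placeholders): time_sqlite_fetch("sample.db", list(range(1, 1001)))
#                       time_file_fetch("sample_npy/")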
panos.firbas
panos.firbas•4mo ago
But I would also probably run into OS problems; I don't think 100k or so files would make the filesystem happy
justin
justin•4mo ago
Yeah... haha. You could batch them up into fewer, larger files, or if it's on the container storage and not in a network volume, it could be better.
panos.firbas
panos.firbas•4mo ago
More like 500k files, actually
justin
justin•4mo ago
Could be the answer is to just move outside of /workspace lolol. Actually, with that many files on a network drive, you'll actually fill it up
panos.firbas
panos.firbas•4mo ago
Yeah, maybe I just stuff it in the Docker image
justin
justin•4mo ago
*Probably not stuff it in the Docker image, because you can't build one that big, but you can keep it on network storage, zip it, and copy it over to outside /workspace
panos.firbas
panos.firbas•4mo ago
But I can't both have a network volume AND lots of space in /, can I?
justin
justin•4mo ago
You can. I've had a 400 GB container storage + a network storage before. I had a 250 GB dataset and I wanted to unzip it, so I made a 400 GB container storage to move stuff onto, by just moving my file from under /workspace over to /container
panos.firbas
panos.firbas•4mo ago
Ah, the container disk option?
justin
justin•4mo ago
Yeah. Container disk is stuff that will be reset on the start/stop of the pod, essentially everything outside /workspace
panos.firbas
panos.firbas•4mo ago
OK, that's the solution then
justin
justin•4mo ago
But it is directly on the computer too
panos.firbas
panos.firbas•4mo ago
I spawn with a big container disk and move the data there
justin
justin•4mo ago
Yeah, and then optimize from there
panos.firbas
panos.firbas•4mo ago
In /scratch or something, work there, then move it out
justin
justin•4mo ago
You can probably try it on a smaller dataset first before you go moving the full 200 GB. If you do end up moving it, you can also try to chunk and parallelize the copying, so you aren't just synchronously moving 200 GB over. Something like the sketch below
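Rough idea for the parallel copy (untested; the source and destination paths are placeholders):

import os
import shutil
from concurrent.futures import ThreadPoolExecutor

SRC = "/workspace/shards"  # placeholder: data on the network volume
DST = "/scratch/shards"    # placeholder: container disk destination

def copy_one(name):
    # Copy a single file from the network volume to the container disk
    shutil.copy2(os.path.join(SRC, name), os.path.join(DST, name))
    return name

os.makedirs(DST, exist_ok=True)

# a handful of parallel copies can hide some of the network-storage latency
with ThreadPoolExecutor(max_workers=8) as pool:
    for done in pool.map(copy_one, os.listdir(SRC)):
        print(f"copied {done}")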
panos.firbas
panos.firbas•4mo ago
Yep, although that move SHOULD be fast
justin
justin•4mo ago
Guess two avenues then, as needed:
Option 1: Container disk > batching > producer/consumer pattern
Option 2: Container disk > filesystem files > randomly select
Gl gl, hopefully it works out.
Just as an FYI, if you have a bunch of mini files in your network storage it can actually eat up network storage space (in case you ever wonder about creating 100k files in network storage). Network storage has a weird block thing, mentioned in a different thread: there is some "minimum file size", so even if you have, say, 100 GB worth of tiny files, if they are below that block size each file eats up a minimum amount of space, meaning you could be eating up 200 GB in network storage. (Forgot where I read it, but one of the staff said it when people complained about network storage eating up more space than the files should.) *Not an issue on container disk though, or anything that isn't a network drive
panos.firbas
panos.firbas•4mo ago
Thanks a lot!! So yeah, I'm running it now with the data in / and it's going fast, so that's the solution
justin
justin•4mo ago
Nice
panos.firbas
panos.firbas•4mo ago
It's still a little slower on the A100 than on my 3090 for a small model, is that to be expected?
justin
justin•4mo ago
Try to do:
nvidia-smi
šŸ‘ļø just make sure is correct haha sanity check not sure tho
panos.firbas
panos.firbas•4mo ago
hehe A100 80GB PCIe
justin
justin•4mo ago
Ok yeah, looks good. Just that I saw a bug yesterday where someone got assigned the wrong one; it seemed like a one-off, but now I'm careful if I see an unexpected performance drop lol. But I can't say tbh haha, maybe on a small model the A100 isn't as efficient at utilizing all the power. Btw, are you fine tuning or training? But gl gl, maybe you can get away with a weaker GPU haha, who knows though
panos.firbas
panos.firbas•4mo ago
Looks like the 3090 has more 'cores', so it makes some sense that one is faster and the other is bigger. I'm pretraining now
ashleyk
ashleyk•4mo ago
Depends on cloud type, region, etc. Also there are no cores, it's vCPUs, and a vCPU is a thread, not a core, so basically 2 vCPUs are the equivalent of 1 CPU core