RunPod•4mo ago
panos.firbas

How should I store/load my data for network storage?

Hi, I've been keeping my data in an SQL database, which is excruciatingly slow on RunPod with network storage.
But I don't see any obvious alternative... In what type of file could my data live on disk, in order for it to be loaded fast in RunPod's network storage scenario?
70 Replies
justin
justin•4mo ago
How much data are you loading? How are you loading it? Why not a CSV / JSON file or something? Must it be SQL? Is this millions of rows? etc.
ashleyk
ashleyk•4mo ago
Anything read from network storage is as slow as constipation, network storage is garbage and unusable
panos.firbas
panos.firbas•4mo ago
Hi, my data is not text, it's genomics, so right now I'm saving numpy arrays in an SQL database. It's about 200 GB.
justin
justin•4mo ago
Yeah, the way I see network storage is as an external hard drive lol. Do you need to load all 200 GB at once? Does the data change?
ashleyk
ashleyk•4mo ago
An external hard drive is MUCH faster than network storage
panos.firbas
panos.firbas•4mo ago
It does not change; I need to read it once per epoch, ideally in random order.
ashleyk
ashleyk•4mo ago
Network storage is like sending stuff via a pigeon, but then again a pigeon will probably even be faster
justin
justin•4mo ago
How much data are you loading per epoch?
panos.firbas
panos.firbas•4mo ago
All of it, eventually.
justin
justin•4mo ago
How are you reading it? Is your data indexed properly? So just to confirm: you have an SQLite database of about 200 GB, and you are randomly selecting chunks of data for epoch cycles?
ashleyk
ashleyk•4mo ago
You can maybe try chatting with @JM about getting some kind of custom storage for your use case, but you may have to commit to a certain level of spend or something.
panos.firbas
panos.firbas•4mo ago
I did a test, a script that just loads all the records from the DB. On my system it's doing something like 6000/second; on RunPod it was doing 2/second.
justin
justin•4mo ago
Honestly, you could probably just divide your data straight up into files of N x Epoch_Size in your network storage. Create a hashmap of all the file names, randomly pick one, remove it from the hashmap, and load that file for the next X cycles. And in the background, in a different thread, you can also load the next file into a variable for the next epoch cycle. Something like the sketch below.
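Rough, untested sketch of that idea, assuming the arrays have already been pre-split into .npy shard files under some directory (the directory name and shard layout here are just placeholders):

import os
import queue
import random
import threading

import numpy as np

SHARD_DIR = "/workspace/shards"  # placeholder: wherever the pre-split .npy shards live

def prefetch_shards(shard_paths, q):
    # Background thread: load shards ahead of time so the next one is already in memory
    for path in shard_paths:
        q.put(np.load(path))  # blocks while the queue is full
    q.put(None)  # sentinel: no more shards

def shuffled_shard_stream():
    shard_paths = [os.path.join(SHARD_DIR, f) for f in os.listdir(SHARD_DIR)]
    random.shuffle(shard_paths)  # new random order every epoch

    q = queue.Queue(maxsize=2)  # keep at most 2 shards in memory at a time
    threading.Thread(target=prefetch_shards, args=(shard_paths, q), daemon=True).start()

    while True:
        shard = q.get()
        if shard is None:
            break
        yield shard

# usage: for shard in shuffled_shard_stream(): train_on(shard)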
panos.firbas
panos.firbas•4mo ago
I shuffle the main index and query the DB index by index. Every new epoch is a new shuffle.
ashleyk
ashleyk•4mo ago
Not surprising, network storage is unbearably slow
justin
justin•4mo ago
I feel that something about this is wrong... What does "shuffle the main index" mean? But if it were me, I'd do this; you might not even need to load anything in the background. I'd just divide the data straight up into plain files and avoid SQLite. SQLite, with all the query time, could also be straight up hurting you just due to the way the searching works over 200 GB.
ashleyk
ashleyk•4mo ago
Both have to load from disk, so I don't see how it will make any difference
panos.firbas
panos.firbas•4mo ago
Essentially:
order = np.arange(lenofdatabase) + 1
np.random.shuffle(order)
then for i in order: SELECT ... WHERE id = i and fetch one row
justin
justin•4mo ago
SQLite has to keep a tree over its data, and walking the indexes one by one in a shuffled order... yeah, this is probably bad
panos.firbas
panos.firbas•4mo ago
but my data is not text!
justin
justin•4mo ago
It looks like it's doing a full table scan. Ah, what is the data though? It's just np arrays, right?
ashleyk
ashleyk•4mo ago
numpy arrays he said
justin
justin•4mo ago
You can serialize it into something
panos.firbas
panos.firbas•4mo ago
Yep, numpy arrays
justin
justin•4mo ago
Honestly, you could do some sort of optimization around that SQL query, like preloading batches of random indexes
panos.firbas
panos.firbas•4mo ago
That can't be faster than loading an array from the disk!
justin
justin•4mo ago
This looks like a random selection over the full database every time. Well, if you're loading from a SQLite database or a plain file, both are from disk, but yeah, I get the point. I think we can go the route of fixing the SQL query
panos.firbas
panos.firbas•4mo ago
Aren't databases incredibly fast at fetching record i from a table when i is a proper index? That's what I'm doing, in a specific order
justin
justin•4mo ago
Let me read over this SQL query and what you're trying to do some more and think xD, it's been a bit since I've worked with it
ashleyk
ashleyk•4mo ago
Network storage is so slow you are probably better off storing your data in a cloud MongoDB or something and fetching from there, I am sure it will be much faster
panos.firbas
panos.firbas•4mo ago
And as I said, this system is blazing fast on my machine. So what if I don't use network storage, but instead spawn a pod with the needed space and just move the data in there at 'runtime'?
ashleyk
ashleyk•4mo ago
Yeah, because your system isn't using network storage, which is at least 20 times slower than normal disk
justin
justin•4mo ago
Yeah, that would work too. I guess you can do a small sample size for testing on a CPU pod or something. The issue I worry about is that with 200 GB worth of data, you are sending off an individual SQL statement per index. But as ashleyk said, maybe MongoDB / PlanetScale etc. would be good, though I'd also worry about the cost you'd incur with that much data
panos.firbas
panos.firbas•4mo ago
Enter my other problem with all this. It looks like my work is blocking TCP ports, but they won't acknowledge it. So I can't scp the data from work; I have to do it from home at 10 Mbps 🙃
ashleyk
ashleyk•4mo ago
Doesn't your work allow normal port 22 though? Then you can get a cheap VM from Scaleway / DigitalOcean / Linode etc. and use it as a jump box from your work to RunPod
justin
justin•4mo ago
My immediate thought is that you could, as you already are doing:
1) Create an array of random indexes, so something like: [x, y, z, a, b, c, ...]
2) Do a batch fetch instead of individual indexes, pulling like 20 at a time: SELECT * FROM your_table WHERE id IN (1, 2, 3, 4, 5);
3) Then in the background, during the epoch, load another 20 indexes (or whatever amount) so they are immediately in memory for when the epoch is done.
This could be harder than what I just stated haha, because now you need to do parallel work, but Python's standard library provides producer/consumer patterns for this sort of thing. I think this is better than what you do now, because right now you are pulling individually in synchronous order, I assume.
Benefits:
1) You aren't sending individual index queries over 200 GB.
2) You are loading in the background during each epoch, so the next batch is ready.
Essentially the background thread just adds to the queue whenever the queue is below some size X, so data is always ready in memory, and it keeps checking whether it needs to add to the queue. Yeah, I think something like this would probably work (rough sketch below). Even though SQL is fast, sending 200 GB worth of individual SQL queries is also just inefficient (even if network storage is slow too haha). Or the batch processing by itself might be good enough; could be worth testing on its own. Orrrrr, as you said, just make something big enough to hold your dataset on the container lol, and see if it's worth going that route too
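Something like this, roughly. Untested, and the table name, column names, and dtype below are placeholders; adjust them to your actual schema:

import random
import sqlite3

import numpy as np

# assumption: a table records(id INTEGER PRIMARY KEY, arr BLOB) where each BLOB
# is a serialized numpy array -- swap in your real table/column names and dtype
def batched_random_fetch(db_path, n_rows, batch_size=256):
    conn = sqlite3.connect(db_path)
    ids = list(range(1, n_rows + 1))
    random.shuffle(ids)  # new shuffle per epoch

    for start in range(0, len(ids), batch_size):
        batch_ids = ids[start:start + batch_size]
        placeholders = ",".join("?" * len(batch_ids))
        rows = conn.execute(
            f"SELECT id, arr FROM records WHERE id IN ({placeholders})",
            batch_ids,
        ).fetchall()
        for _id, blob in rows:
            yield np.frombuffer(blob, dtype=np.float32)  # placeholder dtype

    conn.close()

# usage: for arr in batched_random_fetch("/workspace/data.db", n_rows=500_000): ...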
panos.firbas
panos.firbas•4mo ago
Hey, thanks for all this. I'll get back to it ASAP, but I need to attend to some scary bureaucracy right now!
justin
justin•4mo ago
Also, just btw, about making individual SQL queries like that: SQL tables underneath the hood are usually some sort of tree, which is why it can take a long time, probably Requests * log(N) time. That's why it can make sense to multithread it or batch process it (just to reduce the overhead of requests across the network to the drive; SQLite is probably also doing optimizations when you batch process multiple indexes at once vs one by one).
Okay, so a summary of stuff:
1) Individual SQLite queries like that can be slow because you are doing (number of indexes) * log(N) requests, since SQLite uses a tree under the hood. That's N log N overall, which is very slow.
2) You can make it faster by batch processing, so you get something like (N / batch_size) * log(N) requests, and SQL can probably do some optimizations under the hood so you aren't constantly making tons of round trips back and forth.
3) You can use a producer/consumer pattern, where one thread constantly keeps a thread-safe queue filled to some size X, so your main thread can just keep pulling new data from the queue to run the epoch on. That way the trip to refill the queue happens in parallel with the epoch work.
Gl gl
panos.firbas
panos.firbas•4mo ago
I think I had actually tried the batch SELECT at some point in the past and didn't see any big improvement in speed
justin
justin•4mo ago
Got it, so your best bet is probably still batching + the producer/consumer pattern then. ChatGPT example 😆
import queue
import threading
import time

# Function to simulate data production (enqueueing)
def producer(q, items):
    for item in items:
        print(f"Producing {item}")
        q.put(item)  # Add item to the queue
        time.sleep(1)  # Simulate time-consuming production

# Function to simulate data consumption (dequeueing)
def consumer(q):
    while True:
        item = q.get()  # Remove and return an item from the queue
        if item is None:
            break  # None is used as a signal to stop the consumer
        print(f"Consuming {item}")
        q.task_done()  # Signal that a formerly enqueued task is complete

# Create a FIFO queue
q = queue.Queue()

# List of items to be produced
items = [1, 2, 3, 4, 5]

# Start producer thread
producer_thread = threading.Thread(target=producer, args=(q, items))
producer_thread.start()

# Start consumer thread
consumer_thread = threading.Thread(target=consumer, args=(q,))
consumer_thread.start()

# Wait for all items to be produced
producer_thread.join()

# Signal the consumer to terminate
q.put(None)

# Wait for the consumer to finish processing
consumer_thread.join()
This probably isn't fully correct from what I read, but you get the idea
panos.firbas
panos.firbas•4mo ago
Yeah, I'm familiar with queues in Python
justin
justin•4mo ago
Also, I would actually say that splitting into plain files is faster than an SQL tree here, because creating a hashmap of your file names and reading them is O(N) selection time as you randomly choose and remove from the hashmap, whereas a tree introduces additional overhead to query the SQL table. Since you are going to read through the entire 200 GB anyway, a tree underneath the SQL table isn't necessary. I know you said it's numpy arrays, but serializing and reading them probably won't be that bad. You can try it on a smaller dataset, but just my two cents; I wouldn't know without testing, but in theory an SQL database still has overhead even with indexes because it's usually a B-tree underneath the hood. Both still read from disk, so really your cost is serialization + deserialization + query time, and query time is probably your largest cost right now (plus network storage is slow). But yeah xD, sorry for the long text, hopefully some avenues to look down
panos.firbas
panos.firbas•4mo ago
I could just dump the np arrays as files. If I'm reading files from the disk it should still be much faster than deserializing, plus much less space
justin
justin•4mo ago
Yeah, you could try it on a small dataset vs SQLite: see what the cost is to query over SQLite, vs just keeping a hashmap of all the file names, selecting one at random, removing it from the hashmap, and reading that file. Something like the quick comparison below
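For example, a rough way to time the two approaches on a small sample (untested; the DB path, table name, column name, and file directory are placeholders):

import os
import random
import sqlite3
import time

import numpy as np

def time_sqlite_fetch(db_path, ids):
    # Time fetching each id individually from SQLite (placeholder schema: records(id, arr))
    conn = sqlite3.connect(db_path)
    start = time.perf_counter()
    for i in ids:
        conn.execute("SELECT arr FROM records WHERE id = ?", (i,)).fetchone()
    conn.close()
    return time.perf_counter() - start

def time_file_fetch(file_dir):
    # Time loading the same sample as individual .npy files picked in random order
    paths = [os.path.join(file_dir, f) for f in os.listdir(file_dir)]
    random.shuffle(paths)
    start = time.perf_counter()
    for p in paths:
        np.load(p)
    return time.perf_counter() - start

# usage (placeholders): time_sqlite_fetch("sample.db", list(range(1, 1001)))
#                       time_file_fetch("sample_npy/")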
panos.firbas
panos.firbas•4mo ago
But I would also probably run into OS problems; I don't think 100k or so files would make the filesystem happy
justin
justin•4mo ago
Yeah... haha. You could batch them up into fewer, larger files, or if it's on the container storage and not in a network volume, it could be better.
panos.firbas
panos.firbas•4mo ago
More like 500k files, actually
justin
justin•4mo ago
Could be the answer is to just move outside of /workspace lolol. Actually, with that many files on a network drive, you'll actually fill it up
panos.firbas
panos.firbas•4mo ago
Yeah, maybe I just stuff it in the Docker image
justin
justin•4mo ago
*Probably not stuff it in the Docker image, because you can't build one that big, but you can keep it on network storage, zip it, and copy it over to outside /workspace
panos.firbas
panos.firbas•4mo ago
But I can't both have a network volume AND lots of space in /, can I?
justin
justin•4mo ago
You can. I've had a 400 GB container storage + a network storage before. I had a 250 GB dataset and I wanted to unzip it, so I made a 400 GB container storage to move stuff onto, by just moving my file from under /workspace over to /container
panos.firbas
panos.firbas•4mo ago
Ah, the container disk option?
justin
justin•4mo ago
Yeah. Container disk is stuff that will be reset on the start/stop of the pod, essentially everything outside /workspace
panos.firbas
panos.firbas•4mo ago
OK, that's the solution then
justin
justin•4mo ago
But it is directly on the computer too
panos.firbas
panos.firbas•4mo ago
I spawn with a big container disk and move the data there
justin
justin•4mo ago
Yeah, and then optimize from there
panos.firbas
panos.firbas•4mo ago
In /scratch or something, work there, then move it out
justin
justin•4mo ago
You can probably try it on a smaller dataset first before you go moving the full 200 GB. If you do end up moving it, you can also try to chunk and parallelize the copying, so you aren't just synchronously moving 200 GB over. Something like the sketch below
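Rough idea for the parallel copy (untested; the source and destination paths are placeholders):

import os
import shutil
from concurrent.futures import ThreadPoolExecutor

SRC = "/workspace/shards"  # placeholder: data on the network volume
DST = "/scratch/shards"    # placeholder: container disk destination

def copy_one(name):
    # Copy a single file from the network volume to the container disk
    shutil.copy2(os.path.join(SRC, name), os.path.join(DST, name))
    return name

os.makedirs(DST, exist_ok=True)

# a handful of parallel copies can hide some of the network-storage latency
with ThreadPoolExecutor(max_workers=8) as pool:
    for done in pool.map(copy_one, os.listdir(SRC)):
        print(f"copied {done}")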
panos.firbas
panos.firbas•4mo ago
Yep, although that move SHOULD be fast
justin
justin•4mo ago
Guess two avenues then, as needed:
Option 1: Container disk > batching > producer/consumer pattern
Option 2: Container disk > filesystem files > randomly select
Gl gl, hopefully it works out.
Just as an FYI, if you have a bunch of mini files in your network storage it can actually eat up network storage space (in case you ever wonder about creating 100k files in network storage). Network storage has a weird block thing, mentioned in a different thread: there is some "minimum file size", so even if you have, say, 100 GB worth of tiny files, if they are below that block size each file eats up a minimum amount of space, meaning you could be eating up 200 GB in network storage. (Forgot where I read it, but one of the staff said it when people complained about network storage eating up more space than the files should.) *Not an issue on container disk though, or anything that isn't a network drive
panos.firbas
panos.firbas•4mo ago
Thanks a lot!! So yeah, I'm running it now with the data in / and it's going fast, so that's the solution
justin
justin•4mo ago
Nice
panos.firbas
panos.firbas•4mo ago
It's still a little slower on the A100 than on my 3090 for a small model, is that to be expected?
justin
justin•4mo ago
Try to do:
nvidia-smi
šŸ‘ļø just make sure is correct haha sanity check not sure tho
panos.firbas
panos.firbas•4mo ago
hehe A100 80GB PCIe
justin
justin•4mo ago
Ok yeah, looks good. Just that I saw a bug yesterday where someone got assigned the wrong one; it seemed like a one-off, but now I'm careful if I see an unexpected performance drop lol. But I can't say tbh haha, maybe on a small model the A100 isn't as efficient at utilizing all the power. Btw, are you fine tuning or training? But gl gl, maybe you can get away with a weaker GPU haha, who knows though
panos.firbas
panos.firbas•4mo ago
Looks like the 3090 has more 'cores', so it makes some sense that one is faster and the other is bigger. I'm pretraining now
ashleyk
ashleyk•4mo ago
Depends on cloud type, region, etc. Also there are no cores, it's vCPUs, and a vCPU is a thread, not a core, so basically 2 vCPUs are the equivalent of 1 CPU core