Splitting a query with range()

I have a Gremlin query that starts simple (one label) and then branches out along many different paths to collect unrelated pieces of information (i.e., I need to follow all of those paths). I'm considering using range() to break that query into smaller chunks of, say, 1k rows, instead of processing the whole set of labelled vertices in one pass. Of course, I'll have to run the query several times, but I expect each run to be faster and to fit better in memory. Maybe I'll also avoid some steep performance degradation by keeping the load small enough.

Does that sound like a good idea?

My usual concern with such partitioning is that the common part of the query (before the range()) is executed several times, which limits the potential speedup. In the current case, it is merely a hasLabel() plus some property collection.

g.V().hasLabel('xxx').sideEffect().sideEffect().map()

vs

g.V().hasLabel('xxx').range(x, x + 1000).sideEffect().sideEffect().map()

Solution
I have often used range() steps to break up queries; it can be a useful technique, but it does come with several caveats.

The most important point is that this will only work if your database guarantees that the common part of your query always produces results in a consistent order. The default implementation of range(x, x + 1000) will first iterate and discard the first x results, then pass along the next 1000. If the result ordering changes on each execution, you will essentially be taking a random sample of 1000 results each time, instead of progressing batch by batch.
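For example, one way to make the ordering deterministic (assuming your graph provider supports ordering by element id reasonably efficiently; 'xxx' stands in for your label) is to add an explicit order() step before the range():

g.V().hasLabel('xxx').order().by(id).range(x, x + 1000).sideEffect().sideEffect().map()

Note that the order() step itself has a cost, since the full set of 'xxx' vertices has to be sorted before the range can be applied; some providers may already return results in a stable order without it.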

You already mentioned the performance concern of the common part of the query being executed each time. Due to the way range() is implemented, this penalty is proportional to x: minimal when x is small (almost no results are skipped), larger when x is large (many results need to be produced and then discarded). Results will depend greatly on your DB and your data, but in general, if the left-hand side of the query is fast and efficient in your DB and the right-hand side is slow and complex, then this technique works quite well.
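To make the batch-by-batch idea concrete, here is a minimal Gremlin Console (Groovy) sketch of a driver loop, assuming a deterministic ordering as discussed above; the batchSize value is arbitrary and valueMap() stands in for whatever sideEffect()/map() chain you actually run:

// hypothetical driver loop; relies on a stable result order
batchSize = 1000
x = 0
while (true) {
    batch = g.V().hasLabel('xxx').order().by(id).
                  range(x, x + batchSize).valueMap().toList()
    if (batch.isEmpty()) break
    // ... process this batch of up to 1000 results ...
    x += batchSize
}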

I've mostly used such queries in the form of g.V().range(x, x + 1000).foo()..., and the results have generally been acceptable for my purposes. Assuming that your database is able to efficiently look up vertices by label and that there are no ordering concerns there, your proposed approach seems reasonable to me. Other alternatives would be to filter on vertex ids (this depends largely on what type/structure your graph uses for ids) or to add some sort of metadata to your graph to help with partitioning; the id-based variant is sketched below.
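As a sketch of the id-based alternative, assuming the graph uses ordered numeric vertex ids (whether this is efficient depends entirely on the provider; low and high are hypothetical window bounds you would advance on each run):

// hypothetical id-window filter; assumes numeric, ordered vertex ids
g.V().hasLabel('xxx').has(id, between(low, high)).sideEffect().sideEffect().map()

This avoids the skip-and-discard cost of range(), since no results before the window need to be produced, but it requires knowing how the ids are distributed so that each window yields a reasonable batch size.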