Cursor based pagination on gremlin queries
I wanted to ask whether there's a way of implementing cursor based pagination on gremlin queries in a relatively performant way.
For example, if I were to run the following query:
g.V().hasLabel('person').order().by('created', asc)
.limit(50)
, to get the first 50 results for all nodes with label "person" and ordered by the created attribute, and then save the highest created property for this batch of results, and then run
g.V().hasLabel('person').has('created', gt(last_highest_created_property).order().by('created', asc)
.limit(50)
, would this effectively get me the next 50 results without having to traverse the first 50 again?11 Replies
Hello @marti.sant which database are you using? Different providers could have unique ways of handling ordering of results.
I'm using Neptune, but I was hoping there would be a way to construct a gremlin query using the .limit() function without retraversing parts of the graph that were asked for in the previous pagination request.
It seems it is possible with the query cache: https://docs.aws.amazon.com/neptune/latest/userguide/gremlin-results-cache.html
You can get cached results paginated, in blocks.
Caching query results in Amazon Neptune Gremlin - Amazon Neptune
Overview of using the query results cache with Gremlin.
In regards to Gremlin itself, ordering before a limit or range should give you consistent results, barring changes to the data between queries
I see, what do you mean by barring? I also more so wanted to know whether the gremlin query itself would optimize things (like if the second gremlin query retraverses the first 50 elements that I queried first or whether it wouldn't and therefore have better performance for pagination)
barring meaning as long as the data doesn't change between the queries
Looking at your second query I believe it will still process the first 50 in order to evaluate the
gt
predicateGotcha, so is there no way to do this type of pagination that wouldn't re-traverse nodes in Gremlin? I'm looking for an effective way to paginate queries at the moment
I don't think gremlin itself provides pagination specifically but the Neptune query cache linked above could help
Makes sense. Another question I had was does the .limit() method traverse the whole graph or just traverse through the limit specified? For example, if I were to specified
g.V().limit(100)
would gremlin only traverse 100 nodes and stop after that ?As has been mentioned, it really depends on the provider that you're using. Apache Tinkerpop is a reference implementation for graph databases. Different providers store data and index it in different ways. Gremlin is just a higher level query language and each provider is left to their own in how to optimize the various steps and query patterns in Gremlin to efficiently fetch data from their storage implementation.
For Neptune, the underlying storage model uses a series of quads (https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-data-model.html). These quads are indexed on vertex ID, edge ID, and property key/value. So a
g.V().limit(100)
query only needs to fetch the first 100 vertices in the index constructed on vertex ID. It doesn't need to fetch all vertices because of how the data is indexed/stored.@marti.sant i think you have some good answers for how efficient pagination would work with Neptune given the query cache, i'm not sure how much more background you want, but the peculiarities of pagination in general is discussed in https://jayanta-mondal.medium.com/the-curious-case-of-pagination-for-gremlin-queries-d6fd9518620 which begins with a stackoverflow answer i provided here: https://stackoverflow.com/questions/39826983/how-to-perform-pagination-in-gremlin
Medium
The curious case of Pagination for Gremlin queries
Why TinkerPop graph databases do not support pagination? What can we do as application developers? What else can we achieve as a…