May I suggest a new topic-channel for us? Like "really-big-data" or "pagination"?

Related to How to get total count for pagination with Gremlin query using Java Apache TinkerPop, and having read the recommended links on how to paginate the results of a query, I am wondering how to manage the large sets of traversers and the large side-effect collections that a query may construct as the graph is visited, when the paths yield relatively large datasets even after being wisely filtered. For example, what is advised if one really does need to group all the followers of Taylor Swift (i.e. some exemplary uber-set) by first name, and wants to bag that result for a later phase of the query, rather than as the final collection consumed by some external REST client? Yes, the final collecting step can easily be paginated as advised - but what about all that earlier processing?
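To make this concrete, here is a minimal sketch of the kind of query I mean, run against an embedded TinkerGraph with a made-up schema (person vertices, follows edges, a firstName property - all hypothetical names):

```java
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

import java.util.Map;

public class GroupFollowersSketch {
    public static void main(String[] args) {
        GraphTraversalSource g = TinkerGraph.open().traversal();

        // Tiny stand-in dataset; in the real case the 'follows' in-edges
        // would number in the hundreds of thousands or more.
        g.addV("person").property("name", "Taylor Swift").as("ts")
         .addV("person").property("firstName", "Alice").as("a")
         .addV("person").property("firstName", "Bob").as("b")
         .addE("follows").from("a").to("ts")
         .addE("follows").from("b").to("ts")
         .iterate();

        // group() is a barrier step: every follower traverser must be
        // collected before anything downstream of it can proceed.
        Map<Object, Object> byFirstName = g.V().has("person", "name", "Taylor Swift")
                .in("follows")
                .<Object, Object>group()
                .by("firstName")
                .next();

        System.out.println(byFirstName); // map of firstName -> list of follower vertices
    }
}
```

That in("follows") fan-out is the part I am worried about, since the group() barrier has to hold every follower in memory before anything downstream runs.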

What should we be thinking when we anticipate 500,000 traversers, or 10x that, heading into a group()-by()-bye! barrier / collecting stage?

Other than "Punt" or "Run away!"?
Solution
There is no feature in Gremlin that will handle this for you automatically, but the drivers do let you stream results back instead of collecting them all at once, which can help mitigate transferring large result sets. If you are using Amazon Neptune, it also has a query results cache to assist with paging: https://docs.aws.amazon.com/neptune/latest/userguide/gremlin-results-cache.html#gremlin-results-cache-paginating
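As a rough sketch of what streaming looks like with the TinkerPop Java driver (host, port, batch size, and query here are placeholders): results arrive from the server in batches and can be consumed incrementally via ResultSet.stream() rather than materialized as one list.

```java
import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;
import org.apache.tinkerpop.gremlin.driver.ResultSet;

public class StreamingSketch {
    public static void main(String[] args) {
        // Hypothetical endpoint; point this at your own Gremlin Server or Neptune cluster.
        Cluster cluster = Cluster.build("localhost")
                .port(8182)
                .resultIterationBatchSize(64) // server sends results back in batches of 64
                .create();
        Client client = cluster.connect();
        try {
            ResultSet results = client.submit(
                    "g.V().hasLabel('person').values('firstName')");

            // stream() consumes results as batches arrive rather than
            // collecting the full result set on the client first.
            results.stream().forEach(r -> System.out.println(r.getString()));

            // On Neptune, the linked docs describe a per-query results-cache hint
            // (hint name assumed from that page; verify for your engine version), e.g.:
            //   client.submit("g.with('Neptune#enableResultCache', true)" +
            //                 ".V().hasLabel('person').values('firstName').range(0, 100)");
        } finally {
            client.close();
            cluster.close();
        }
    }
}
```

Note that streaming only mitigates the client-side transfer; the server still has to hold the full group() side effect in memory while the barrier step runs.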