Efficient degree computation for traversals of big graphs
Hello,
We're trying to use Neptune with Gremlin for a fairly big (XX m nodes) graph, and our queries usually have to filter out low-degree vertices at some point, for both efficiency and product reasons. According to the profiler, this operation takes the brunt of our computation time. At the same time, the vertex degree is something that could be precomputed on the database (we don't even need it to be 100% accurate; a "recent" snapshot would be good enough), which would significantly speed up the query.
Does anyone here have a trick up their sleeve for degree computation that works well, either as an ad-hoc snippet in a query or as a nice way to precompute it on the graph?
The snippet we currently use:
.filter_(inE("reported_by").count().is_(gt(REPORTS_MINIMUM)))

Solution
I would first mention that the profile() step in Gremlin is different than the Neptune Profile API. The latter is going to provide a great deal more info, including whether or not the entire query is being optimized by Neptune: https://docs.aws.amazon.com/neptune/latest/userguide/gremlin-profile-api.html
If you have tens of millions of nodes, you could use Neptune Analytics (NA) to do the degree calculations. Then extract the degree properties from NA, delete the NA graph, and bulk load those values back into Neptune Database. We're working to make this round-trip process more seamless, but it isn't too hard to automate in its current form.
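To illustrate the "bulk load those values back" step: a minimal sketch of turning an id-to-degree mapping (as you might export it from Neptune Analytics) into a vertex CSV in the Neptune Gremlin bulk-load format (`~id` column plus a typed property column). The `degree` property name is just a choice for this example, not anything Neptune requires:

```python
import csv
import io

def degrees_to_bulk_load_csv(degrees):
    """Render an id -> degree mapping as a Neptune Gremlin bulk-loader
    vertex CSV that sets a 'degree' property (Int) keyed by ~id."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    # Header row: vertex id column plus a typed property column.
    writer.writerow(["~id", "degree:Int"])
    for vertex_id, degree in degrees.items():
        writer.writerow([vertex_id, degree])
    return buf.getvalue()

# Hypothetical snapshot, as it might come back from a degree
# computation in Neptune Analytics.
snapshot = {"v1": 12, "v2": 3}
print(degrees_to_bulk_load_csv(snapshot))
```

The resulting file can be dropped in S3 and pointed at by the Neptune loader; since the `~id`s already exist, the load just attaches the property to the existing vertices.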
The Gremlin profile feature in Neptune runs a specified traversal, collects metrics about the run, and produces a profile report.
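To make the payoff concrete, here is a plain-Python illustration (no Neptune required) of why the precomputed snapshot helps: the live filter re-counts incident edges for every vertex on every query, while the snapshot version is a single property lookup. The edge list, vertex ids, and threshold are all made up for the example:

```python
REPORTS_MINIMUM = 2  # hypothetical threshold, as in the original query

# Toy graph: (source, target, label) edges.
edges = [
    ("r1", "v1", "reported_by"),
    ("r2", "v1", "reported_by"),
    ("r3", "v1", "reported_by"),
    ("r4", "v2", "reported_by"),
]
vertices = ["v1", "v2", "v3"]

def live_filter(vs):
    """Per-query work: the equivalent of inE("reported_by").count()
    evaluated for each candidate vertex."""
    return [v for v in vs
            if sum(1 for _, t, lbl in edges
                   if t == v and lbl == "reported_by") > REPORTS_MINIMUM]

# Precomputed snapshot: built once (e.g. in Neptune Analytics),
# then reused cheaply by every query until the next refresh.
snapshot = {}
for _, t, lbl in edges:
    if lbl == "reported_by":
        snapshot[t] = snapshot.get(t, 0) + 1

def snapshot_filter(vs):
    """Per-query work: one dictionary/property lookup per vertex."""
    return [v for v in vs if snapshot.get(v, 0) > REPORTS_MINIMUM]

print(live_filter(vertices))      # ['v1']
print(snapshot_filter(vertices))  # ['v1']
```

On the Gremlin side, once the degree has been bulk loaded as a property, the original `.filter_(inE("reported_by").count().is_(gt(REPORTS_MINIMUM)))` step can be replaced by a plain property predicate along the lines of `.has("degree", gt(REPORTS_MINIMUM))`, which avoids touching the edges at query time.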