Limiting .path() results to a number of valid starting vertices
Hey folks, for context, we're using AWS Neptune and Neptune Notebook for visualisation. We would like to visualise neighbourhoods of data that match given criteria:
1. We would like an exact number of neighbourhoods, e.g. 5 distinct neighbourhoods
2. We only want to consider neighbourhoods that have a particular node in their tree
Example:
Let's say our 'root' of the neighbourhood is any `Foo` vertex, and we want to graph the neighbourhoods which include a `Bar` vertex in their tree (through some explicit traversal). The important point here is that a neighbourhood starting from a `Foo` might not contain a `Bar` vertex in its tree, in which case we want to skip this one and find another. We know that this query will graph all valid neighbourhoods:
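(The query itself was elided from this thread; a minimal sketch of the shape being described, where the edge labels `a`, `b`, `c` are illustrative placeholders, not our real schema:)

```gremlin
// Sketch only -- edge labels 'a', 'b', 'c' are placeholders.
// Every walk that ends at a Bar yields one path, i.e. one valid neighbourhood.
g.V().hasLabel('Foo').
  out('a').in('b').out('c').hasLabel('Bar').
  path()
```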
But let's say we want exactly 5 valid neighbourhoods instead. If we write `g.V().hasLabel("Foo").limit(5)...`, then we are not guaranteed that all 5 `Foo` vertices will actually lead to a `Bar`; sometimes our traversal never makes it to a `Bar` from one of the randomly chosen `Foo` starting vertices, and we are left with fewer than 5 neighbourhoods. Placing the limit at the end, e.g. `...out("c").hasLabel("Bar").limit(5)`, filters the actual paths returned rather than the starting vertices.
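(For illustration, the two limit placements side by side, as a sketch with the same placeholder edge labels; neither guarantees 5 distinct valid roots:)

```gremlin
// limit at the start: 5 Foo roots picked blindly, some may never reach a Bar
g.V().hasLabel('Foo').limit(5).
  out('a').in('b').out('c').hasLabel('Bar').path()

// limit at the end: the first 5 result paths, which may share a starting vertex
g.V().hasLabel('Foo').
  out('a').in('b').out('c').hasLabel('Bar').path().limit(5)
```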
The way we've made this work is to perform a seemingly redundant filter at the beginning to validate that we're starting from a `Foo` that definitely leads to a `Bar`, but there must be a simpler way of expressing it:
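(That query was elided; a hedged reconstruction of the "redundant filter" pattern being described, placeholder labels again:)

```gremlin
// The where() walks the same pattern once just to validate the root, then the
// walk is repeated a second time to build the paths we actually display.
g.V().hasLabel('Foo').
  where(out('a').in('b').out('c').hasLabel('Bar')).
  limit(5).
  out('a').in('b').out('c').hasLabel('Bar').
  path()
```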
We have experimented with grouping by the starting vertex and limiting this, but it seems this does not execute lazily and first collects all neighbourhoods before grouping and limiting:
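(That experiment was also elided; a guess at its shape. Note that `group()` is a barrier step, which is why it cannot stream lazily:)

```gremlin
// Hypothetical sketch: group the result paths by starting vertex, then keep 5
// groups. group() collects everything before limit() ever applies.
g.V().hasLabel('Foo').as('root').
  out('a').in('b').out('c').hasLabel('Bar').
  path().
  group().by(select('root')).
  unfold().limit(5)
```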
I think this might be a trivial question, but I can try to provide some sample data to work with if needed.
Thanks all!

Solution:
Just realized that the approach I suggested with `project()` doesn't need deduplication - just need to `unfold()` the collection to get back to the original form:

```gremlin
g.V().project('s','e').
  by().
  by(out().hasLabel('software').path().fold()).
  filter(select('e').unfold())
```
good question and this is a common problem. it often feels redundant to traverse the same vertices/edges more than once. depending on the nature of your filter and your graph structure that might not be too expensive (like maybe checking a property on an incident edge or adjacent vertex) but in your case you basically have three steps of `out().in().out()` to deal with, which could be a ton of data to traverse (but only you know how many paths that will produce). let's see if someone with more @neptune query optimization experience than i can help with this one.

Thanks! Yeah, for this query in particular performance is not a concern but rather the brevity of the query itself. Maintaining these types of queries becomes error-prone as they get more complex.
I have a feeling that some magical combination of `local` or `sack` can reduce this to a single traversal. The last code example does work as expected, but it's not lazy.

I may just be having a hard time grokking this in my head, and I may not be understanding this correctly. Something like the following...
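(The snippet was elided from this thread; presumably something along these lines, placeholder labels again:)

```gremlin
// Guess at the query being sketched: stop after the first 5 traversers reach
// a Bar, then reconstruct just those traversers' paths.
g.V().hasLabel('Foo').
  out('a').in('b').out('c').hasLabel('Bar').
  limit(5).
  path()
```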
... would result in the first 5 traversers that hit a final vertex with a label of `Bar` being returned, all other traversers cancelled, and a set of paths generated based on those 5 results.

you possibly won't get the entire neighborhood of the starting vertex though: note that `==>[v[4],v[3]]` is missing

Ok, that's what I was missing.... you want the neighborhood that contains a vertex with that label.
(always amazing to me how the "modern" graph has a way of being able to model most graph problems)
Yep, this is the key problem I'm facing
As far as the neighborhood goes, are you only filtering at the 4th level of the hierarchy?
We might want to query at any level of the hierarchy, but our primary concern is controlling the number of neighbourhoods that are displayed regardless of how the neighbourhood is constructed. This is an example query we're giving to our graph clients to understand the shape of the data, but we'd like to give them a query that always returns 5 of these neighbourhoods.
Does that help?
Yes, that is helpful.
Would have to think more about how to do this in Gremlin (and specifically optimized for Neptune). Interestingly enough, this is likely easier expressed in openCypher. Which is sort of the case for these pattern-matchey style of queries.
Along those lines, using the approach that you have of:
sort of follows the same construct. You have to be able to search the graph first and then go back and find all of the paths that you may have missed during your initial search.
Interesting, I actually thought about OpenCypher for this, but since it's in a Neptune Notebook which only supports Gremlin it's probably not useful for our customers.
Is there a way you could think to make this evaluate lazily:
Could `local` + some filter step like `coalesce` or `choose` maybe help? This may just be a theoretical limitation of how Gremlin queries are executed. This is all really helpful by the way, thanks for the discussion so far.

> since it's in a Neptune Notebook which only supports Gremlin

the Notebooks have OC support (`%%oc` magic).
Yeah, the `group()` step sort of blocks that as it needs to perform the aggregation.

there is no way to stop `group()` once it gets going, though i've often wanted to have a condition for that
maybe the following could inspire something:
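(The example itself was elided; given the accepted solution, it was presumably a `project()` over TinkerPop's "modern" toy graph, something like:)

```gremlin
// A guess at the elided example: pair each vertex with its 'software'
// neighbours and keep only the vertices that actually have one.
g.V().project('s','e').
  by().
  by(out().hasLabel('software').fold()).
  filter(select('e').unfold())
```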
that gets both paths for `v[4]`. it might not be a form you can use though.

you could capture the `path()` in there i guess, at the risk of later having to do some deduplication:
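(That variant was also elided; presumably it was the same `project()` query with `path()` captured inside the second `by()` modulator, i.e. the form that became the accepted solution:)

```gremlin
// path() captured inside by(); fold() keeps the full list of walks per vertex
g.V().project('s','e').
  by().
  by(out().hasLabel('software').path().fold()).
  filter(select('e').unfold())
```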
i feel like the approach where you filter up front though is the most readable and, depending on your data, it might actually be an inexpensive check.

There is an example in Practical Gremlin of using `group` in side-effect mode to capture n results at each depth. If that sounds helpful I can dig up the link.

https://kelvinlawrence.net/book/PracticalGremlin.html#depthlimit
Just realized that the approach I suggested with `project()` doesn't need deduplication - just need to `unfold()` the collection to get back to the original form:
Looks like that's what we need! Awesome, thanks @spmallette and everyone else for your input! (please mark this as the answer, I can't do it myself it seems)
`project()` doesn't act as a barrier in the same way that `group()` does, I suppose. I guess the traversers fill this in as they go rather than at the end of the traversal.
The `.filter(select('e').unfold())` serves to remove paths that couldn't be constructed, i.e. invalid neighbourhoods?

no. that `filter()` just drops the surrounding `Map` to just yield the paths. scroll back up to the example without that and you can see what you'd have otherwise