Limiting .path() results to a number of valid starting vertices

Hey folks, for context, we're using AWS Neptune, and Neptune Notebook for visualisation. We would like to visualise neighbourhoods of data with a given criteria: 1. We would like an exact number of neighbourhoods, e.g. 5 distinct neighbourhoods 2. We only want to consider neighbourhoods that have a particular node in its tree Example: Let's say our 'root' of the neighbourhood is any Foo vertex, and we want to graph the neighbourhoods which include a Bar vertex in its tree (through some explicit traversal). The important point here is that a neighbourhood starting from Foo might not contain a vertex Bar in its tree, in which case we want to skip this one and find another. We know that this query will graph all valid neighbourhoods:
g.V().hasLabel("Foo")
.out("a")
.in("b")
.out("c").hasLabel("Bar")
.path()
g.V().hasLabel("Foo")
.out("a")
.in("b")
.out("c").hasLabel("Bar")
.path()
But let's say we want exactly 5 valid neighbourhoods instead,. If we write g.V().hasLabel("Foo").limit(5)..., then we are not guaranteed that all 5 Foo vertices will actually lead to Bar, sometimes our traversal never makes it to a Bar from one of the randomly chosen Foo starting vertices, and we are left with fewer than 5 neighbourhoods. Placing it at the end, e.g. ....out("c").hasLabel("Bar").limit(5), filters the actual paths returned rather than by the starting . The way we've made this work is to perform a seemingly redundant filter at the beginning to validate that we're starting from a Foo that definitely leads to Bar, but there must be a simpler way of expressing it:
g.V().where(hasLabel("Foo")
.out("a")
.in("b")
.out("c").hasLabel("Bar"))
.limit(5)
.out("a")
.in("b")
.out("c").hasLabel("Bar")
.path()
g.V().where(hasLabel("Foo")
.out("a")
.in("b")
.out("c").hasLabel("Bar"))
.limit(5)
.out("a")
.in("b")
.out("c").hasLabel("Bar")
.path()
We have experimented with grouping by the starting vertex and limiting this, but it seems this does not execute lazily and first collects all neighbourhoods before grouping and limiting:
g.V().hasLabel("Foo").as("start")
.out("a")
.in("b")
.out("c").hasLabel("Bar")
.path()
.group().by(select("start").values("id"))
.select(values).unfold()
.limit(5)
g.V().hasLabel("Foo").as("start")
.out("a")
.in("b")
.out("c").hasLabel("Bar")
.path()
.group().by(select("start").values("id"))
.select(values).unfold()
.limit(5)
I think this might be a trivial question, but I can try to provide some sample data to work with if needed. Thanks all!
Solution:
Just realized that the approach I suggested with project() doesn't need deduplication - just need to unfold() the collection to get back to the original form: ```gremlin> g.V().project('s','e'). ......1> by(). ......2> by(out().hasLabel('software').path().fold()). ......3> filter(select('e').unfold())....
Jump to solution
17 Replies
spmallette
spmallette17mo ago
good question and this is a common problem. it often feels redundant to traverse the same vertices/edges more than once. depending on the nature of your filter and your graph structure that might not be too expensive (like maybe checking a property on an incident edge or adjacent vertex) but in your case you basically have three steps of out().in().out() to deal with which could be a ton of data to traverse (but only you know how many paths that will produce). let's see if someone with more @neptune query optimization experience than i can help with this one.
jessea
jessea17mo ago
Thanks! Yeah, for this query in particular performance is not a concern but rather the brevity of the query itself. Maintaining these types of queries becomes error-prone as they get more complex. I have a feeling that some magical combination of local or sack can reduce this to a single traversal. The last code example does work as expected, but it's not lazy.
triggan
triggan17mo ago
I maybe just having a hard time grokking this in my head, and I may not be understanding this correctly. Something like the following...
g.V().hasLabel("Foo").
.out("a")
.in("b")
.out("c").hasLabel("Bar")
.limit(5)
.path()
g.V().hasLabel("Foo").
.out("a")
.in("b")
.out("c").hasLabel("Bar")
.limit(5)
.path()
... would result in the first 5 traversers that hit a final vertex with a label of Bar would then get returned, all other traversers canceled, and the a set of paths generated based on those 5 results.
spmallette
spmallette17mo ago
you possibly won't get the entire neighborhood of the starting vertex though:
gremlin> g.V().out().hasLabel('software').path()
==>[v[1],v[3]]
==>[v[4],v[5]]
==>[v[4],v[3]]
==>[v[6],v[3]]
gremlin> g.V().out().hasLabel('software').limit(1).path()
==>[v[1],v[3]]
gremlin> g.V().out().hasLabel('software').limit(2).path()
==>[v[1],v[3]]
==>[v[4],v[5]]
gremlin> g.V().out().hasLabel('software').path()
==>[v[1],v[3]]
==>[v[4],v[5]]
==>[v[4],v[3]]
==>[v[6],v[3]]
gremlin> g.V().out().hasLabel('software').limit(1).path()
==>[v[1],v[3]]
gremlin> g.V().out().hasLabel('software').limit(2).path()
==>[v[1],v[3]]
==>[v[4],v[5]]
note that ==>[v[4],v[3]] is missing
triggan
triggan17mo ago
Ok, that's what I was missing.... you want the neighborhood that contains a vertex with that label.
spmallette
spmallette17mo ago
(always amazing to me how the "modern" graph has a way of being able to model most graph problems)
jessea
jessea17mo ago
Yep, this is the key problem I'm facing
triggan
triggan17mo ago
As far as the neighborhood goes, are you only filtering at the 4th level of the hierarchy?
jessea
jessea17mo ago
We might want to query at any level of the hierarchy, but our primary concern is controlling the number of neighbourhoods that are displayed regardless of how the neighbourhood is constructed. This is an example query we're giving to our graph clients to understand the shape of the data, but we'd like to give them a query that always returns 5 of these neighbourhoods. Does that help?
triggan
triggan17mo ago
Yes, that is helpful. Would have to think more about how to do this in Gremlin (and specifically optimized for Neptune). Interestingly enough, this is likely easier expressed in openCypher. Which is sort of the case for these pattern-matchey style of queries.
MATCH (a1:Foo)-[:a]->()<-[:b]-()-[:c]->(a2:Bar)
WITH a1 LIMIT 5
MATCH p = (a1)-[:a]->()<-[:b]-()-[:c]->()
RETURN p
MATCH (a1:Foo)-[:a]->()<-[:b]-()-[:c]->(a2:Bar)
WITH a1 LIMIT 5
MATCH p = (a1)-[:a]->()<-[:b]-()-[:c]->()
RETURN p
Along those lines, using the approach that you have of:
g.V().where(hasLabel("Foo")
.out("a")
.in("b")
.out("c").hasLabel("Bar"))
.limit(5)
.out("a")
.in("b")
.out("c"). //hasLabel("Bar") - not sure why this is needed if you want the full neighborhood?
.path()
g.V().where(hasLabel("Foo")
.out("a")
.in("b")
.out("c").hasLabel("Bar"))
.limit(5)
.out("a")
.in("b")
.out("c"). //hasLabel("Bar") - not sure why this is needed if you want the full neighborhood?
.path()
sort of follows the same construct. You have to be able to search the graph first and then go back and find all of the paths that you may have missed during your initial search.
jessea
jessea17mo ago
Interesting, I actually thought about OpenCypher for this, but since it's in a Neptune Notebook which only supports Gremlin it's probably not useful for our customers. Is there a way you could think to make this evaluate lazily:
g.V().hasLabel("Foo").as("start")
.out("a")
.in("b")
.out("c").hasLabel("Bar")
.path()
.group().by(select("start").values("id")) // This seems to act as a barrier, requiring full traversal
.select(values).unfold()
.limit(5)
g.V().hasLabel("Foo").as("start")
.out("a")
.in("b")
.out("c").hasLabel("Bar")
.path()
.group().by(select("start").values("id")) // This seems to act as a barrier, requiring full traversal
.select(values).unfold()
.limit(5)
Could local + some filter step like coalesce or choose maybe help? This may just be a theoretical limitation of how Gremlin queries are executed. This is all really helpful by the way, thanks for the discussion so far.
triggan
triggan17mo ago
since it's in a Neptune Notebook which only supports Gremlin - the Notebooks have OC support (%%oc magic). Yeah, the group() step sort of blocks that as it needs to perform the aggregation.
spmallette
spmallette17mo ago
there is no way to stop group() once it gets going though i've often wanted to have a condition for that maybe the following could inspire something:
gremlin> g.V().project('s','e').
......1> by().
......2> by(out().hasLabel('software').fold()).
......3> filter(select('e').unfold()).
......4> limit(2)
==>[s:v[1],e:[v[3]]]
==>[s:v[4],e:[v[5],v[3]]]
gremlin> g.V().project('s','e').
......1> by().
......2> by(out().hasLabel('software').fold()).
......3> filter(select('e').unfold()).
......4> limit(2)
==>[s:v[1],e:[v[3]]]
==>[s:v[4],e:[v[5],v[3]]]
that gets both paths for v[4] it might not be a form you can use though you could capture the path() in there i guess at the risk of later having to do some deduplication i guess:
gremlin> g.V().project('s','e').
......1> by().
......2> by(out().hasLabel('software').path().fold()).
......3> filter(select('e').unfold()).
......4> limit(2)
==>[s:v[1],e:[[v[1],v[3]]]]
==>[s:v[4],e:[[v[4],v[5]],[v[4],v[3]]]]
gremlin> g.V().project('s','e').
......1> by().
......2> by(out().hasLabel('software').path().fold()).
......3> filter(select('e').unfold()).
......4> limit(2)
==>[s:v[1],e:[[v[1],v[3]]]]
==>[s:v[4],e:[[v[4],v[5]],[v[4],v[3]]]]
i feel like the approach where you filter up front though is the most readable and depending on your data it might actually be an inexpensive check.
kelvinl2816
kelvinl281617mo ago
There is an example in Practical Gremlin of using group in side effect mode to capture n results at each depth. If that sounds helpful I can dig up the link. https://kelvinlawrence.net/book/PracticalGremlin.html#depthlimit
Solution
spmallette
spmallette17mo ago
Just realized that the approach I suggested with project() doesn't need deduplication - just need to unfold() the collection to get back to the original form:
gremlin> g.V().project('s','e').
......1> by().
......2> by(out().hasLabel('software').path().fold()).
......3> filter(select('e').unfold()).
......4> limit(2).
......5> select('e').unfold()
==>[v[1],v[3]]
==>[v[4],v[5]]
==>[v[4],v[3]]
gremlin> g.V().project('s','e').
......1> by().
......2> by(out().hasLabel('software').path().fold()).
......3> filter(select('e').unfold()).
......4> limit(2).
......5> select('e').unfold()
==>[v[1],v[3]]
==>[v[4],v[5]]
==>[v[4],v[3]]
jessea
jessea17mo ago
Looks like that's what we need! Awesome, thanks @spmallette and everyone else for your input! (please mark this as the answer, I can't do it myself it seems) Project doesn't act as a barrier in the same way that group does, I suppose. I guess since gremlins fill this as they go rather than at the end of traversal The .filter(select('e').unfold()) serves to remove paths that couldn't be constructed, i.e. invalid neighbourhoods?
spmallette
spmallette17mo ago
no. that filter() just drops the surround Map to just yield the paths. scroll back up to the example with out that and you can see what you'd have otherwise