I met a man with seven wives, each of which had seven sacks.
I met a man with seven wives, each of which had seven sacks. Now, suppose I have shipping container that can hold up to 500 items and I need to inform a number of men that they and their families can board my ship because I know that all the items in their families' sacks will fit it the container. There may be a few empty spaces, but I can't tell a man that he and his family can board if any of their items would overflow the container. How do I construct a query which selects men as long as all the items of in the 7 sacks of their 7 wives will fit. Here's the challenge: I don't know how many items are in each sack until the family is considered. I ask the men and their families to line up and then I board men until the container is nearly full or exactly full and where any of the items of the next family would assuredly not fit.
Let's say we have Vertices for Item, Sack, Wife, and Man and Relations marriedTo, hasSack, and hasItem.
Let's say we rank order Man by Lastname.
Breaking the St Ives rhyme, let's say that the number of wives per Man is variable as well as the number of Sacks per Wife and Items per Sack - but I want the to select Men from the queue until the count of Items from M+ -> W+ -> S+ -> I+ would exceed 500.
Got an idea?
5 Replies
Here's the advice that ChatGPT suggests:
g.V() // start with a traversal
.sideEffect( // use sideEffect to maintain state
__.aggregate('items') // aggregate items in a side-effect step
.by( // specify what to aggregate
// your criteria for what to aggregate goes here
)
.limit( // limit the aggregation based on a condition
__.unfold() // unfold the aggregated items
.count() // count them
.is(lt(threshold)) // continue if count is less than threshold
)
)
.cap('items') // retrieve the aggregated items
If this is not hallucinatory, yes, it has the gist of the idea.
the Set "V" we replace with the ordered set of Man and the by() clause is out( marriedTo ).out( hasSack ).out( hasItem ).
threshold was chosen to be 500.
Those in the audience who have Gremlin stanzas burned in the brains can tell that the ChatGPT attempt is a hallucination. The language doesn't support the statement. The attempt is useful though for illustrating the idea.
How do we fetch from a collection, taking items from that collection, until a predicate fails where the predicate is based on an external sideEffect variable whose value is adjusted by traversals flowing through the path to the predicate?
Here is a far more executable attempt.
g.withSideEffect( 'budget', [500] ).E().has( 'Relation', 'container', {{thread_key}} ).
has('modifiedDate',lte( {{Btimestamp}} ) ).
group().
by( 'sKey' ).
by( fold().as( 'versions', 'remaining' ).
select( 'versions', 'remaining' ).
by( sideEffect( count(local). // current cost, the number of Relations for the current sKey
math( 'budget - ' ).
by( unfold().tail( 1 ) ). // the latest remaining budget
by( unfold().tail( 1 ) ). // the cost extracted from its collection
store( 'budget' ) ).
identity()
).
by( cap( 'budget' ).unfold().tail(1) ) ).
select( values ).unfold().filter( select( 'remaining' ).is( gt( 0 ) ) )
Here, the construct "unfold().tail(1)" is used because I cannot figure out the syntax for a sideEffect variable that is anything but a set. Forced to work with a set, I implemented mutability of a budget amount by pushing the reduced value of the budget to the end of the set as its new value. Therefore, the tail end of that set is always to latest updated value of the budget.
This almost works but is not quite right because I am not yet passing the right object to the count(local) clause and my cost count is always only one even if the values collecion for the group-key has more than one element.
This variation gets the "cost" count correct - I leave it to my future self to understand why unfold.count works and count(local) does not.
Also: the performance hotspot is that "math" step. Ideally Gremlin might offer a postfix method call approach to mathematical operations vs (or in addition to) the script parsed tactic (as cool as that is). For example: add().by( 'x' ).by( 'y' ) versus select( 'x' ).math( '_ + y' ).by( select( y ) )
- or whatever
Talking to myself. 😠Anyway, we see how I solved this with a modifiable sideEffect. I think this would be more intuitive with a sack. But I don't yet understand the details about sacks per traversal, merging sacks, and such details. I need a global-ish sack that is able to span barriers - because the query is intentionally going to traverse, filter, accumulate, and move on. If a sack is limited to just one phase of querying between barriers then that would not suffice.sorry - this one seemed a little deep and i wasn't able to get my head into it given other things i've been doing.
😉
Although the concept doesn't work yet with a sack, the query is much simpler looking if I replace the sideEffect with a sack. There, the sack begins with 500 and each time through the value-grouping, the sack is sack(minus) reduced by the length of the value chain.
btw, it would help if you could provide a bit of sample data (
g.addV()...addE()
) to test with. makes the thinking a bit easier when we have something concrete to work withHowever, I cannot seem to make the modified sack value persist across the value grouping by clause.
let's see: the graph has Edges with label "Relation" each with a property of "container". The g.E().has() merely selects a set of relevant Edges. Then, each Edge also has a "modifiedDate" property and an "sKey" property. Now, "sKey" is not unique. The Edges also have a "version" property. So I group these edges that have a modified date less than some chosen timestamp and then group them by their non-unique sKeys. This will yield a hashmap of sKeys to edges. I want the count of edges per sKey as a cost figure.
Once I have this formulation, I want to take while the sum of the cost is less than a global budget from the set of sKeys.
I seek some way to implement a stream-like takeWhile where the predicate is comparing against a value that is growing (or shrinking) with each take.
The query above actually works. But I think there is a better with either a sack tactic or with some how to repeat filter until variable-condition.
I think of this as a kind of general "shipping logistics" query where I want to arrange content according to some scheme and to then take from until a condition is reached (first-come, first-served, until no space/food/money left)
I don't want to alter the graph with a new property - like the tank draining example that Kelvin (I think) added to the Recipes site. Also, that example terminates when Gremlin finds himself in a no-out-edge Vertex. How would the problem be solved if, instead, he should visit tanks until he has collected 63 gallons from the drainable tanks?
another way to try would be an "elastic filter". We filter a collection over and over adding one more entry until the sum of a property (or traversal that yields a value) exceeds the predicate condition. Something like
repeat().filter( limit( loops ).select( "cost" ).sum().is( lt( budget ) ) ).until()
or whatever would be the parsable syntax.
my working example above makes a collection that I can decorate and I decorate each key-grouped values entry by the remaining availabe budget. At the end, then, I reject all the key entries whose decorated (decreasing) amount exceed 0. I'm left with the entries that can be chosen without exceeding the budget limit.
I asked my new Gremlin sous chef - AI bot in awesome g.dot.V - to cook up the following. Try this:
and concluding with
Gives the audience the idea, anyway, even if it were to prove not to parse.
Here's a more expressive variant:
The "aggregate" builds a queue of all per-key costs see so far and so the accumulated cost at the encounter of the current key is the sum of the queue of costs plus the current cost.
Then, after the grouping is finished, we filter to select all entries whose accumulated cost is less than some budget amount.
This is cleaner because there's no preamble to set up a sideEffect nor a sack and there's no need to pick the tail off the growing queue. We just re-sum the queue each key.
This form, I think, is even better. Rather than decorate the collection during the grouping, the filter operation can also accumulate the cost. Ideally, there would be a filterUntil (a "takeWhile") step.