Isolated vertices vs connected vertices with no join benefit

Is there any downside to storing an isolated vertex with references to other nodes? Creating relationships makes the query more complicated than it needs to be, but storing references to other vertices seems like an anti-pattern/smell. The relationship between nodes is defined as follows:
g.addV('batch').property(id,'batch-a').as('a').
addV('batch').property(id,'batch-b').as('b').
addV('reusable').property(id, 123).as('r1').
addV('reusable').property(id, 123).as('r2').
addE('link').from('a').to('r').
addE('link').from('b').to('r2').iterate()
g.addV('connected').property('property', 'abc').as('connected-a').addE('relates_to_batch').from('connected-a').to('batch-a').addE('related_to_reusable').from('connected-a').to('r1').iterate()
Querying this looks like:
g.V('batch-a').project('batchId', 'reusableId', 'connected').by(T.id).by(__.in('relates_to_batch').out("relates_to_reusable").id()).by(__.in("relates_to_batch").elementMap().fold()).toList()
With an isolated vertex:
g.addV('isolated').property('property', 'abc').property('batchId', 'batch-a').property('reusableId', 'r2').iterate()
Querying in this way produces the same result:
g.V().hasLabel('isolated').has('batchId', 'batch-a').elementMap().toList()
Please let me know if this doesn't make sense
spmallette 396d ago
had trouble getting the sample data scripts to run in Gremlin Console. I'm further worried that the sample data isn't really representative of your data structure with all the different edge and vertex labels that are kinda mixed together. could we clean it up a bit before I dig into this, especially since others might be trying to follow along to learn? after fixing syntax errors and step label references i have this:
g = TinkerGraph.open().traversal()
g.addV('batch').property(id,'batch-a').as('a').
addV('batch').property(id,'batch-b').as('b').
addV('reusable').property(id, 123).as('r1').
addV('reusable').property(id, 321).as('r2').
addV('connected').property('property', 'abc').as('connected-a').
addE('relates_to_batch').from('connected-a').to('a').
addE('related_to_reusable').from('connected-a').to('r1').
addE('link').from('a').to('r1').
addE('link').from('b').to('r2').iterate()
is that the structure? if so, is it reasonable for us to have all these diverse vertex/edge labels in there? can they be more simplified? like, is there really a "link" or should those be "relates_to_reusable"? and could you define "connected"? is that just meant to be one of the vertices that is part of a "batch" that isn't one that you would reuse among batches? also, are you using a graph that allows id assignment...if not, maybe we'd do better to use a property key instead of the T.id for vertex identifiers
Legendary 396d ago
Yeah sorry! I didn’t actually try out the scripts before posting but was just trying to demonstrate the use case! The data structure is accurate and I think this is probably part of the problem; the “relates_to” connections are so loosely related, which is kind of why I want to remove the edges entirely. The link between “batch” and “reusable” is concrete and has to stay. The importance behind the “connected” vertex is the content. For example, I might have another “connected” vertex that relates to a “batch-c” but still to “r2”, but the “content” property would be entirely different. I need a way to get that “content” for the combination of “batch” and “reusable”. Yes, whatever needs to change is still possible to change. E.g. if property keys make more sense, it’s doable. If reassigning ids makes more sense, that’s also doable 👍🏼
spmallette 396d ago
ok, i think i see it now. so in full:
gremlin> g = TinkerGraph.open().traversal()
==>graphtraversalsource[tinkergraph[vertices:0 edges:0], standard]
gremlin> g.addV('batch').property(id,'batch-a').as('a').
......1> addV('batch').property(id,'batch-b').as('b').
......2> addV('reusable').property(id, 123).as('r1').
......3> addV('reusable').property(id, 321).as('r2').
......4> addV('connected').property('property', 'abc').as('connected-a').
......5> addE('relates_to_batch').from('connected-a').to('a').
......6> addE('related_to_reusable').from('connected-a').to('r1').
......7> addE('link').from('a').to('r1').
......8> addE('link').from('b').to('r2').iterate()
gremlin> g.V().hasLabel('isolated').has('batchId', 'batch-a').elementMap().toList()
==>[id:6,label:isolated,reusableId:123,property:abc,batchId:batch-a]
gremlin> g.V('batch-a').
......1> project('batchId', 'reusableId', 'connected').
......2> by(T.id).
......3> by(__.in('relates_to_batch').out("related_to_reusable").id()).
......4> by(__.in("relates_to_batch").elementMap().fold())
==>[batchId:batch-a,reusableId:123,connected:[[id:0,label:connected,property:abc]]]
gremlin> g.V().hasLabel('isolated').has('batchId', 'batch-a').elementMap().toList()
==>[id:6,label:isolated,reusableId:123,property:abc,batchId:batch-a]
hmm - "reusableId" isn't in the one with project(). ok - another little mistype. updated.
Legendary 396d ago
Yep looks good!
spmallette 396d ago
needed another edit for the "reusableId" to match up. in your real model, could a batch have multiple reusables?
Legendary 396d ago
Yes. On average there are 5 reusables per batch.
spmallette 396d ago
in this example the reusable vertices are all directly connected to batch - is there often more hierarchy between them - like, more than one step away?
Legendary 396d ago
No it’s a direct connection between batch and reusable. There are additional edges coming off reusable but I’m not sure how relevant that is
spmallette 396d ago
could you point out where this approach "created weird and complex queries"? our example so far feels fairly graphy to me, but after working with our friend Gremlin for over a decade i have a different definition of "weird and complex" 👽
Legendary 396d ago
hahaha I think what seemed "weird" to me:
g.V('batch-a').
  project('batchId', 'reusableId', 'connected').
    by(T.id).
    by(__.in('relates_to_batch').out("related_to_reusable").id()).
    by(__.in("relates_to_batch").elementMap().fold())
Having to traverse in and out to get the data I was after. I was wondering about the implications of just storing that data as property values. Especially since the links are so loosely related, it seems like there could be a benefit to removing the edges entirely. We've discussed a lot here, and a lot of it has probably been lost in translation haha. Also something like:
g.V().hasLabel('isolated').has('batchId', 'batch-a').elementMap().toList()
is a lot more straightforward than the above, imo. I'm also not sure about the performance implications of each of these, and I'm interested to know whether the latter is considered an anti-pattern or code smell.
Solution
spmallette 396d ago
We're basically talking about denormalization here. Denormalization is as common for graphs as it is for relational data structures, and it comes with the same drawbacks. Is this a case for denormalization? Based on this simple example, I'd say "no", because you really just have a single hop or so to collect what you need and you're done. But I also don't know any other statistics about your data structure or your other expected query patterns, so it's hard to say with certainty that you shouldn't denormalize. That said, if denormalization is the answer, do you denormalize to a wholly disconnected vertex? I still don't think I'd recommend that based on what I know.
Your most painful traversal is a two-hop of __.in('relates_to_batch').out("related_to_reusable") to get an id or perhaps multiple ids. Consider adding a property of "reusableIds" to "batch-a" and storing a List of the ids there (or use multi-properties, https://tinkerpop.apache.org/docs/current/reference/#vertex-properties). That seems like the most natural model to me, since "isolated" is really just a "batch" with properties containing various things it's connected to. It seems better to me not to introduce an "isolated" concept for that, and instead to denormalize to a thing that is actually part of your graph and connected.
g.V().hasLabel('batch').has('batchId', 'batch-a').elementMap().toList()
I'm still hesitant to say denormalize at all though. I suppose you can do it for ease of querying, but it's more often done for performance reasons. I could also be missing more context with this advice, but I think I'll stick with what I've posted here as an answer.
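One rough way to see the trade-off, sketched in plain Python rather than Gremlin (this is a stand-in model, not TinkerPop code): the vertex and edge data mirror the corrected sample graph from earlier in the thread, while the helper names `in_`, `out`, `reusable_ids_by_traversal` and `denormalize` are made up for illustration. It shows the two-hop read, the same value copied onto the batch as a "reusableIds" list, and how that copy can go stale.

```python
# Plain-Python stand-in for the sample graph in this thread (not Gremlin).
# Vertices are dicts keyed by id; edges are (label, from_id, to_id) tuples.
vertices = {
    "batch-a": {"label": "batch"},
    "batch-b": {"label": "batch"},
    123: {"label": "reusable"},
    321: {"label": "reusable"},
    "connected-a": {"label": "connected", "property": "abc"},
}
edges = [
    ("relates_to_batch", "connected-a", "batch-a"),
    ("related_to_reusable", "connected-a", 123),
    ("link", "batch-a", 123),
    ("link", "batch-b", 321),
]


def in_(vertex_id, edge_label):
    """Ids of vertices that have an edge_label edge pointing at vertex_id."""
    return [src for label, src, dst in edges if label == edge_label and dst == vertex_id]


def out(vertex_id, edge_label):
    """Ids of vertices that vertex_id points at through an edge_label edge."""
    return [dst for label, src, dst in edges if label == edge_label and src == vertex_id]


def reusable_ids_by_traversal(batch_id):
    """Normalized read: batch <- relates_to_batch - connected - related_to_reusable -> reusable."""
    return [r for c in in_(batch_id, "relates_to_batch") for r in out(c, "related_to_reusable")]


def denormalize(batch_id):
    """Copy the two-hop result onto the batch vertex as a 'reusableIds' list."""
    vertices[batch_id]["reusableIds"] = reusable_ids_by_traversal(batch_id)


denormalize("batch-a")
print(reusable_ids_by_traversal("batch-a"))  # [123]
print(vertices["batch-a"]["reusableIds"])    # [123]

# The usual denormalization drawback: the copy is stale until it is rewritten.
edges.append(("related_to_reusable", "connected-a", 321))
print(reusable_ids_by_traversal("batch-a"))  # [123, 321]
print(vertices["batch-a"]["reusableIds"])    # [123]
```

The denormalized read is a single property lookup on a vertex that is still part of the graph, but every write that changes the edges also has to refresh the list.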
Legendary 396d ago
The issue with adding reusableIds to batch-a is that we are only really concerned with the ids that are connected via both relates_to_batch and related_to_reusable through that connected vertex. For example, we might have a connected vertex that relates_to_batch via batch-a but is related_to_reusable via r2. These ids can't be contained within the same batch, since this is specific to the combination of batch and reusable, so I'm not too sure how it would work by adding the list to batch-a. I hope this clarifies the questions you had about denormalization. Not sure if you wanna discuss this further, but happy to move on for now 🙂
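A small plain-Python sketch of that concern (again a stand-in, not Gremlin): the rows, ids and content values below are hypothetical, loosely following the batch-a / batch-c / r2 example mentioned earlier. A per-batch list keeps the reusable ids but loses which content belongs to which batch and reusable pair, whereas keying by the pair keeps it addressable.

```python
from collections import defaultdict

# Hypothetical rows standing in for 'connected' vertices: each one is tied to
# exactly one batch and one reusable, and carries its own content.
connected = [
    {"id": "c1", "batch": "batch-a", "reusable": "r1", "content": "abc"},
    {"id": "c2", "batch": "batch-a", "reusable": "r2", "content": "xyz"},
    {"id": "c3", "batch": "batch-c", "reusable": "r2", "content": "something else"},
]

# A per-batch list keeps the ids but drops which content goes with which pair.
reusable_ids = defaultdict(list)
for c in connected:
    reusable_ids[c["batch"]].append(c["reusable"])

# Keying by the (batch, reusable) pair keeps the content addressable.
content_by_pair = {(c["batch"], c["reusable"]): c["content"] for c in connected}

print(reusable_ids["batch-a"])             # ['r1', 'r2']
print(content_by_pair[("batch-a", "r2")])  # xyz
print(content_by_pair[("batch-c", "r2")])  # something else
```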
spmallette 396d ago
without following that example too closely (and maybe i have to in order to properly answer you), i guess my question is what's the difference between you doing:
g.addV('isolated').property('property', 'abc').
property('batchId', 'batch-a').property('reusableId', '123')
and
g.addV('batch').property('property', 'abc').
property('batchId', 'batch-a').property('reusableId', '123').
addE().... // connected to all the other graph structure
Do you not get the same ease of querying using the second approach? I just don't see what "isolated" is doing for you that is special. it looks like a way to quickly get information about a "batch" but in some separate structure.
Legendary 396d ago
Those examples are practically the same. I'm also not tied to either approach, I think I'm just trying to understand what is the most "correct" way of doing things. I guess this kind of goes back to my original question about storing vertex references in another vertex? I think the term batch here makes this a bit confusing cos we are already using batch for batch-a. Is this meant to symbolize the same object? I think it's supposed to be connected if we wanted to tie this closer to the example. Curious to know what the edge would connect to? Is it the batch that contains batch-a? Sorry if we are going in circles!
spmallette 396d ago
yeah, i think we have a few circles. it's ok though... what confused me was your saying that "I'm not too sure how it would work by adding the list to batch-a", so my reply was to clarify that if you understand how to do the "isolated" way that you are proposing, then i'm just saying put it on your "batch", which is connected to everything else, instead of using an isolated vertex. this is just a mechanism for denormalization and i'd say an "isolated" vertex isn't a pattern to follow in this case. denormalize within your graph structure to preserve the main part of why you chose a graph in the first place - for its connections
Legendary 396d ago
Right okay, I think I get you now. I think it's probably important to note that there is information in that external vertex that is useful to the use case other than the batchId and reusableId -- content was just one example, so I think it makes sense for it to be its own vertex. Based on what you're saying, I think it probably makes sense just to stick with the relates_to_batch and relates_to_reusable example we spoke about earlier.
spmallette 396d ago
ok, well feel free to ask more questions if you get stuck or to just drop a note in #open-sharing as you continue with your work to let us know how you're doing. i'll probably mark this question as answered with my point about denormalization. that's probably the most general answer here for folks to learn from. thanks for the conversation :gremlin_smile: (as i use that as excuse to try out my new emoji) 🙂
Legendary 396d ago
Thanks so much! I really appreciate your time and patience 🙂