Idempotent upsert, is that possible?

For our project, we need to insert vertexes and edges at a very high pace using Spark Streaming. After many tests, we found an approach that seems very promising. In our context we can get sporadic vertex collisions, so instead of checking for the existence of a vertex before inserting it, we decided to use a custom ID: a hash generated from the vertex properties. The ID is derived deterministically from the property values, so if two vertexes have the same property values, they get the same hash. With this approach we don't check at all; we just insert the vertex again, and the existing vertex is overwritten exactly in place. Looking at the HBase storage layout, the result looks correct, and in additional tests we couldn't find any counterexample.

We wonder whether this approach could be dangerous. Performance-wise, by removing the transactions with the existence checks, we reach more than 600K vertex insertions per second. Any comment on this?
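To make the idea concrete, here is a minimal sketch in Java (hypothetical names, not the poster's actual code), assuming a TinkerPop-compatible graph configured to accept user-supplied string vertex IDs: the ID is derived deterministically from the sorted property values, so re-inserting the same logical vertex targets the same storage row.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.Map;
import java.util.TreeMap;

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.T;

public final class DeterministicVertexId {

    // Stable ID from the vertex's property values: sorting the keys makes
    // the hash independent of property insertion order.
    static String hashId(String label, Map<String, Object> props) throws Exception {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        sha.update(label.getBytes(StandardCharsets.UTF_8));
        for (Map.Entry<String, Object> e : new TreeMap<>(props).entrySet()) {
            sha.update((e.getKey() + "=" + e.getValue() + ";").getBytes(StandardCharsets.UTF_8));
        }
        return HexFormat.of().formatHex(sha.digest());
    }

    // Blind insert: no read-before-write, no locking. Two calls with the
    // same label and properties produce the same ID, so the second write
    // lands on the same storage row as the first.
    static void upsertVertex(GraphTraversalSource g, String label, Map<String, Object> props) throws Exception {
        var t = g.addV(label).property(T.id, hashId(label, props));
        for (Map.Entry<String, Object> e : props.entrySet()) {
            t = t.property(e.getKey(), e.getValue());
        }
        t.iterate();
    }
}
```

Two calls with the same label and properties produce the same ID and the same property cells, which matches the "overwritten exactly in place" behavior described above. Whether string IDs are accepted depends on the backend and version, so treat the ID type as an assumption.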
rngcntr · 5mo ago
The approach of using a hash as a custom ID sounds promising. What kind of hash do you use? At a pace of 600k vertices per second, I would be concerned about hash collisions appearing after only a few hours of operation.
dgreco · 5mo ago
We use SHA-256. BTW, we tested the same approach with the Cassandra backend and got the same result, so it could be a generic recipe for enabling fast streaming of vertexes and edges.
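(For a rough sense of the collision risk with SHA-256: by the birthday bound, the probability of at least one collision among n random 256-bit hashes is about n^2 / 2^257. A full year of sustained 600K inserts per second gives n ≈ 600,000 × 86,400 × 365 ≈ 1.9×10^13, so the collision probability is on the order of 10^-51, which is negligible. The case to keep in mind is not a hash collision but two genuinely distinct records carrying exactly the same property values: by construction they map to the same vertex, which is the intended dedup behavior here.)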
rngcntr · 5mo ago
With Cassandra/ScyllaDB backends, we encountered ghost vertices when multiple clients issued writes to the same vertex at the same time. Doesn't that happen to you? Things may have changed, though; our research on that topic was a few years back.
dgreco · 5mo ago
Did it happen when creating vertexes with the same custom-defined ID? This should work only if you insert vertexes with the same ID.
rngcntr · 5mo ago
Custom vertex IDs weren't a thing at the time, but it was the same issue when multiple edges were added to the same vertex simultaneously. Maybe @Florian Hockmann can tell us more.
dgreco · 5mo ago
Using the internal ID, this mechanism doesn't work: you don't have an idempotent upsert, so you would need to check for the existence of the vertex (by some property) before inserting. In a highly concurrent scenario with an eventually consistent backend like Cassandra, I think the risk of getting ghost nodes is very high. Moreover, keeping transactions on (bulkloading = false) potentially synchronizes all the writers to ensure consistency. A possible solution could be to implement a single-writer model, where equal nodes are always written by the same writer process. I think it's possible; we thought of a potential implementation based on Spark Streaming. The open question is how to scale the insertion. With the solution we implemented, we reached almost 500K vertex insertions per second on our cluster.
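A minimal sketch of that single-writer idea with the Spark Java API (the vertex_id column is a hypothetical stand-in for the hash-derived ID; this is not the poster's implementation):

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public final class SingleWriterSketch {

    static void writeSingleWriter(Dataset<Row> vertices) {
        // Hash-partitioning by the derived ID puts all rows with the same
        // vertex ID into the same partition, so exactly one writer task owns
        // each vertex and concurrent writers never race on the same row.
        Dataset<Row> partitioned = vertices.repartition(col("vertex_id"));

        // The cast disambiguates the Java lambda between Spark's Scala and
        // Java foreachPartition overloads.
        partitioned.foreachPartition((ForeachPartitionFunction<Row>) rows ->
            rows.forEachRemaining(row -> {
                // Write each vertex here, e.g. the blind upsert sketched
                // earlier in the thread, batched per partition for throughput.
            })
        );
    }
}
```

The shuffle introduced by repartition is the price of the single-writer guarantee; throughput then scales with the number of partitions, since partitions write independently.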
Florian Hockmann
You've mentioned that this also works for edges. So, you can first insert vertex A with an edge to vertex B, and some time later insert vertex A again (same properties and thus same vertex ID), now with an edge to vertex C, without deleting the edge between A and B? I'm just wondering because if the second insert simply overwrote the first, I would expect the edge between A and B to be deleted, or to end up as a "half-edge" which exists only from B to A but not in the opposite direction.
dgreco · 5mo ago
Yes, it seems to be working; we tested all the different scenarios, and I think that this is strictly related to how data are stored in the backend. We observed the same behavior using Cassandra.
Bo · 4mo ago
> Any comment on this?
Well done. I don't see a problem except for the extremely low chance of a hash collision; as long as you can tolerate that probability, you should be fine.
dgreco · 4mo ago
Thank you so much