Idempotent upsert, is that possible?

For our project, we need to insert vertexes and edges at a very high pace using Spark Streaming. After many tests, we found an approach that seems very promising. In our context we can get sporadic vertex collisions, so instead of checking for the existence of a vertex before inserting it, we decided to use a custom ID: a hash generated from the vertex properties. The ID is derived deterministically from the property values, so if two vertexes have the same property values, they get the same hash. With this approach we don't check at all; we just insert the vertex again, and the existing vertex is overwritten exactly in place. Looking at the HBase storage layout, the result looks correct, and in additional tests we couldn't find any counterexample.

We wonder whether this approach could be dangerous. Performance-wise, by removing the transactions with the existence checks, we reach more than 600K vertex insertions per second. Any comment on this?
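To make the idea concrete, here is a minimal sketch in Java (hypothetical names, not the poster's actual code), assuming a TinkerPop-compatible graph configured to accept user-supplied string vertex IDs: the ID is derived deterministically from the sorted property values, so re-inserting the same logical vertex targets the same storage row.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.Map;
import java.util.TreeMap;

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.T;

public final class DeterministicVertexId {

    // Stable ID from the vertex's property values: sorting the keys makes
    // the hash independent of property insertion order.
    static String hashId(String label, Map<String, Object> props) throws Exception {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        sha.update(label.getBytes(StandardCharsets.UTF_8));
        for (Map.Entry<String, Object> e : new TreeMap<>(props).entrySet()) {
            sha.update((e.getKey() + "=" + e.getValue() + ";").getBytes(StandardCharsets.UTF_8));
        }
        return HexFormat.of().formatHex(sha.digest());
    }

    // Blind insert: no read-before-write, no locking. Two calls with the
    // same label and properties produce the same ID, so the second write
    // lands on the same storage row as the first.
    static void upsertVertex(GraphTraversalSource g, String label, Map<String, Object> props) throws Exception {
        var t = g.addV(label).property(T.id, hashId(label, props));
        for (Map.Entry<String, Object> e : props.entrySet()) {
            t = t.property(e.getKey(), e.getValue());
        }
        t.iterate();
    }
}
```

Two calls with the same label and properties produce the same ID and the same property cells, which matches the "overwritten exactly in place" behavior described above. Whether string IDs are accepted depends on the backend and version, so treat the ID type as an assumption.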
rngcntr · 5mo ago
The approach of using a hash as a custom ID sounds promising. What kind of hash do you use? At a pace of 600k vertices per second, I would be concerned about hash collisions appearing after only a few hours of operation.
dgreco · 5mo ago
We use SHA-256. BTW, we tested the same approach with the Cassandra backend and got the same result, so it could be a generic recipe for enabling fast streaming of vertexes and edges.
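(For a rough sense of the collision risk with SHA-256: by the birthday bound, the probability of at least one collision among n random 256-bit hashes is about n^2 / 2^257. A full year of sustained 600K inserts per second gives n ≈ 600,000 × 86,400 × 365 ≈ 1.9×10^13, so the collision probability is on the order of 10^-51, which is negligible. The case to keep in mind is not a hash collision but two genuinely distinct records carrying exactly the same property values: by construction they map to the same vertex, which is the intended dedup behavior here.)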
rngcntr · 5mo ago
With Cassandra/ScyllaDB backends, we encountered ghost vertices when multiple clients issued writes to the same vertex at the same time. Doesn't that happen to you? Things may have changed, though; our research on that topic was a few years back.
dgreco · 5mo ago
Did it happen when creating vertexes with the same custom-defined ID? This should work only if you insert vertexes with the same ID.
rngcntr · 5mo ago
Custom vertex IDs weren't a thing at the time, but it was the same issue when multiple edges were added to the same vertex simultaneously. Maybe @Florian Hockmann can tell us more.
dgreco · 5mo ago
Using the internal ID, this mechanism doesn't work: you don't have an idempotent upsert, so you would need to check for the existence of the vertex (by some property) before inserting. In a highly concurrent scenario with an eventually consistent backend like Cassandra, I think the risk of getting ghost nodes is very high. Moreover, keeping transactions on (bulkloading = false) potentially synchronizes all the writers to ensure consistency. A possible solution could be to implement a single-writer model, where equal nodes are always written by the same writer process. I think it's possible; we thought of a potential implementation based on Spark Streaming. The open question is how to scale the insertion. With the solution we implemented, we reached almost 500K vertex insertions per second on our cluster.
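A minimal sketch of that single-writer idea with the Spark Java API (the vertex_id column is a hypothetical stand-in for the hash-derived ID; this is not the poster's implementation):

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public final class SingleWriterSketch {

    static void writeSingleWriter(Dataset<Row> vertices) {
        // Hash-partitioning by the derived ID puts all rows with the same
        // vertex ID into the same partition, so exactly one writer task owns
        // each vertex and concurrent writers never race on the same row.
        Dataset<Row> partitioned = vertices.repartition(col("vertex_id"));

        // The cast disambiguates the Java lambda between Spark's Scala and
        // Java foreachPartition overloads.
        partitioned.foreachPartition((ForeachPartitionFunction<Row>) rows ->
            rows.forEachRemaining(row -> {
                // Write each vertex here, e.g. the blind upsert sketched
                // earlier in the thread, batched per partition for throughput.
            })
        );
    }
}
```

The shuffle introduced by repartition is the price of the single-writer guarantee; throughput then scales with the number of partitions, since partitions write independently.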
Florian Hockmann
You've mentioned that this also works for edges. So, you can first insert vertex A with an edge to vertex B, and some time later insert vertex A again (same properties and thus same vertex ID), now with an edge to vertex C, without deleting the edge between A and B? I'm just wondering because if the second insert simply overwrote the first, I would expect the edge between A and B to be deleted, or to end up as a "half-edge" which exists only from B to A but not in the opposite direction.
dgreco · 5mo ago
Yes, it seems to be working; we tested all the different scenarios, and I think that this is strictly related to how data are stored in the backend. We observed the same behavior using Cassandra.
Bo · 4mo ago
> Any comment on this?
Well done. I don't see a problem except for the extremely low chance of a hash collision; as long as you can tolerate that probability, you should be fine.
dgreco · 4mo ago
Thank you so much