Idempotent upsert: is that possible?
For our project, we need to insert vertices and edges at a very high rate using Spark Streaming. After many tests, we found an approach that seems very promising. In our context, vertex collisions can occur sporadically, so instead of checking whether a vertex exists before inserting it, we decided to use a custom ID. The ID is a hash generated from the vertex's property values, so two vertices with the same property values get the same hash and therefore the same ID. With this approach, we skip the existence check and simply insert the vertex again; what we observe is that the existing vertex is overwritten exactly. Looking at the HBase storage layout, this seems fine. We ran additional tests and could not find any counterexample.

We wonder whether this approach could be dangerous in some way. Performance-wise, we reach more than 600K vertex insertions per second once we remove the transactions that perform the existence checks. Any comments on this?
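
For illustration, the idea described above can be sketched roughly as follows. This is only a minimal sketch under stated assumptions: an in-memory dict stands in for the graph store, SHA-256 over a canonical JSON serialization stands in for whatever hash is actually used, and the names `vertex_id` and `upsert_vertex` are hypothetical. It is not the Spark job or the graph database API.

```python
import hashlib
import json


def vertex_id(label, properties):
    """Derive a deterministic ID from the label and property values.

    Assumption: SHA-256 over canonical JSON; the real hash may differ.
    """
    # Sort keys so that property order does not change the resulting hash.
    canonical = json.dumps(
        {"label": label, "properties": properties},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def upsert_vertex(store, label, properties):
    """Blind write with no existence check.

    `store` is a plain dict standing in for the storage backend (assumption).
    Writing the same ID again overwrites the entry with identical content.
    """
    vid = vertex_id(label, properties)
    store[vid] = {"label": label, "properties": dict(properties)}
    return vid


store = {}
a = upsert_vertex(store, "person", {"name": "alice", "age": 30})
b = upsert_vertex(store, "person", {"age": 30, "name": "alice"})  # same values, different order
c = upsert_vertex(store, "person", {"name": "bob", "age": 30})
print(a == b, a != c, len(store))
```

In this sketch the write is idempotent only because the ID is a pure function of the full property set: inserting the same values twice lands on the same key with the same content. A vertex whose properties change gets a different ID, and two different property sets that happened to hash to the same ID would silently overwrite each other, so the width of the hash matters.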