Sequential IDs in Neptune?

@neptune I'm attempting to implement sequential IDs for the vertices in our AWS Neptune graph. So far, we have added a new property called vertexNumber, which will store the numeric sequential ID for each vertex. Then, before saving a vertex to the database, I run a simple query to retrieve the current highest vertex number, increment it, and store the new vertexNumber on that vertex. Pseudo-code example below.
// Retrieve the current highest vertexNumber
id = g.V().hasLabel('my_vertex_label').has("vertexNumber").values("vertexNumber").max().next()

// Increment the result by 1; this will be the ID for the next vertex we save
vertexNumber = id + 1

// Add the new vertex with the incremented vertexNumber
g.addV('my_vertex_label').property(Cardinality.single, "vertexNumber", vertexNumber).property(...etc)
My question is: How will Neptune handle this at scale? For context, we have a distributed architecture in which tens or hundreds (in super rare cases, maybe over a thousand?) new vertices can be created per SECOND, meaning our db cluster probably sees a lot of concurrent transactions. We are looking for information on how Neptune will handle the initial read query with, for example, 10 or more concurrent transactions. Will all 10+ transactions return the same vertexNumber, or will Neptune be smart enough to isolate these queries? Thanks!
12 Replies
3x1 · 7mo ago
Hi, I suggest you have a look at this page: https://docs.aws.amazon.com/neptune/latest/userguide/transactions-neptune.html#transactions-neptune-read-only and, more broadly, at this section about transactions: https://docs.aws.amazon.com/neptune/latest/userguide/transactions.html
Neptune operates with transactions, so if you don't modify that part of your graph, repeated reads will return the same value, most likely served from its cache.
Just as a side note: if you have concurrent threads running the example you show, you will very likely face race conditions. To avoid them, either make sure only one thread updates a given part of the graph, or build a single query that achieves what you want (the read and the write in the same query).
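For illustration, a rough sketch of that last idea in the same pseudo-code style as the original post (it assumes at least one vertex already has a vertexNumber, and note that math() yields a Double, so cast if your schema needs an integer):

// Read the current max and write the new vertex in one traversal, so both
// happen inside a single Neptune transaction. Concurrent writers can still
// conflict, so be prepared to retry on ConcurrentModificationException.
g.V().hasLabel('my_vertex_label').has('vertexNumber')
 .values('vertexNumber').max()
 .math('_ + 1').as('next')
 .addV('my_vertex_label')
 .property(Cardinality.single, 'vertexNumber', select('next'))
 .next()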
triggan · 7mo ago
Neither the max() nor the min() step in Gremlin currently leverages Neptune's built-in indexes, so whenever you run one of these it effectively requires a full scan of all the values. As your data scales, these queries (even when served from the buffer pool cache) will keep getting slower, since there is more data for them to fetch. What is the purpose of the sequential-valued property?
Also note that each vertex and edge has its own ID. In Neptune, vertex and edge IDs must be unique, which can be valuable in the sense that attempting to create another vertex or edge with the same ID would throw an exception. (This is one of the few built-in constraints in Neptune.) If you don't supply an ID, Neptune uses a UUID in its place. However, it is a best practice to use a unique and deterministic value for vertex and edge IDs when possible. That can make simple lookup queries (g.V(<id>)) easy to express and also performant.
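For example (the ID value here is made up), supplying your own deterministic ID at creation time and looking it up later would look like this:

// Create a vertex with a user-supplied ID instead of the default UUID
g.addV('my_vertex_label').property(T.id, 'vertex-000123').next()

// Later: a direct lookup by ID, which is indexed and fast
g.V('vertex-000123').valueMap().next()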
Andys1814 · 7mo ago
Thanks for the information. I've read all the Neptune docs and didn't feel like they provided a lot of clarity as to whether or not my use case would work. And yes, we would absolutely have concurrent threads running the provided example. Like I said, we have a distributed architecture in which tens or hundreds of vertices can be ingested per second. Thanks for the comment. And yes, the performance of max() was absolutely a concern for us, which is why we're putting that query behind a Redis cache. In an ideal situation, the DB won't need to be hit very often, as the Redis cache should have the most up-to-date sequential ID.
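If Redis is already in the picture, one option (just a sketch, assuming a Redis client named redis; the key name is made up) is to skip the max() read entirely and let Redis hand out the numbers, since INCR is atomic:

// Redis INCR is atomic, so concurrent writers each receive a distinct
// number without ever querying Neptune for the current max
vertexNumber = redis.incr('my_vertex_label:vertexNumber')

// Then write the vertex with that number
g.addV('my_vertex_label').property(Cardinality.single, 'vertexNumber', vertexNumber).next()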
Andys1814 · 7mo ago
We're not replacing the UUID convention that Neptune uses at all. The UUID will still be the primary identifier for vertices and edges. We are simply adding an additional property called vertexNumber. The purpose of the sequential-valued property is user experience and human readability. The best example I can give you of what we're trying to achieve is how ServiceNow does it: https://i.imgur.com/zYOFnE5.png . However, they still use a UUID on the backend, probably for security and scaling purposes: https://i.imgur.com/tXNBNiT.png . This is exactly what we're going for. UUIDs are a nice, easy solution, but we need a way for our users to identify something without us needing to display a 32-character string of random, non-human-readable junk on the frontend.
triggan · 7mo ago
Makes sense. You may also consider using those values as the vertex IDs themselves instead of the generated UUIDs. That would ensure that your IDs are unique across the entire graph.
Dragos Ciupureanu
Another approach would be to generate "readable" IDs such as the ones from nanoid: https://github.com/ai/nanoid and https://zelark.github.io/nano-id-cc/ . The second link is a collision-probability calculator: at 1k IDs/s, with an alphabet of only digits and uppercase letters and an ID length of 16 characters, it takes ~13 years to reach a 1% chance of a collision. I guess what I'm trying to say is that if the UX is not greatly impacted by not having a sequential number, perhaps you could do without the headache of having to manage the ID yourself.
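For intuition, that ~13-year figure checks out against the standard birthday-bound approximation:

// Alphabet of 36 symbols (digits + uppercase), length 16:
//   N = 36^16 ≈ 7.96e24 possible IDs
// Birthday approximation: P(collision) ≈ n^2 / (2N), so a 1% chance needs
//   n ≈ sqrt(2 * N * 0.01) ≈ 4.0e11 IDs
// At 1,000 IDs per second:
//   4.0e11 / 1,000 / 31,557,600 seconds per year ≈ 12.6 years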
Andys1814 · 7mo ago
I have considered this, because I think it'd be great to have the built-in uniqueness constraint, but I'm a little worried that vertices with conflicting IDs will cause us to miss certain data being ingested. We would definitely need to handle whatever exception Neptune throws when attempting to add vertices with non-unique IDs.
Dragos Ciupureanu
Correct, in which case you retry the call with a different ID.
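A minimal sketch of that retry loop (the generateReadableId() and isDuplicateIdError() helpers are hypothetical, and how the duplicate-ID error surfaces depends on your driver; Neptune reports it as a ConstraintViolationException):

// Retry vertex creation with a fresh ID whenever Neptune rejects a duplicate
maxRetries = 3
for (attempt in 1..maxRetries) {
    try {
        g.addV('my_vertex_label').property(T.id, generateReadableId()).next()
        break  // success: the ID was unique
    } catch (e) {
        // hypothetical check; inspect your driver's wrapped error for
        // Neptune's ConstraintViolationException code
        if (!isDuplicateIdError(e) || attempt == maxRetries) throw e
    }
}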
Andys1814 · 7mo ago
This is a good note. Sixteen characters might be a little more than I was hoping for, but maybe I can find a balance between character count and an acceptably low chance of collision. Of course I'd still like to do uniqueness validation before saving to the database, but luckily we're still going to be using a UUID for the actual primary ID, so it's not the absolute end of the world if we get a very rare collision.
Dragos Ciupureanu
If you really want uniqueness, you can use this ID as the vertex/edge ID, and that's taken care of by Neptune for you.
Solution
Dave · 7mo ago
@andys1814 How much do you care about the IDs being truly sequential, or are some gaps acceptable as long as they are human-readable? I ask because this was a common request when I was working with Cassandra. A common practice was to allot a range of IDs to each client on connection rather than getting a new one each time. When a client exhausts its assigned range, it reaches out to get a new range.
This helps to minimize the single point of failure and the overhead of going to a single coordinator for an ID on every request. It does, however, mean that inserts will not be in sequential order and that you may have gaps in the numbering. This may or may not be an issue depending on your use case.
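A sketch of what that range (block) allocation could look like, reusing the hypothetical Redis counter from earlier in the thread (all names are illustrative):

// Each writer atomically claims a block of BLOCK_SIZE IDs with one INCRBY,
// then hands them out locally until the block is exhausted
BLOCK_SIZE = 1000
nextId = 1
blockEnd = 0      // no block claimed yet

def allocateId() {
    if (nextId > blockEnd) {   // block exhausted (or first call)
        blockEnd = redis.incrBy('my_vertex_label:vertexNumber', BLOCK_SIZE)
        nextId = blockEnd - BLOCK_SIZE + 1    // start of the new block
    }
    return nextId++
}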
Andys1814 · 7mo ago
This is a great idea. We're not really worried if there are some gaps. The range idea definitely mitigates the negative effects when collisions do hit and retries are necessary.