Analyzing samples of Gremlin Queries in Neptune Notebook
Hey everyone,
I’m working on a project where we give internal customers access to our Neptune graph through Neptune Notebook. There are already quite a few users, and we want to analyze the queries they run to see which parts of our ontology are used heavily and which are underused. This is not as straightforward as extracting all the labels from a query: our edge labels are not unique, so if people use .in or .out steps without specifying the entity name, it's almost impossible to tell which part of the ontology was visited. We also want to identify common query patterns, to understand what people usually query for and which connections in our ontology are used most often. For that we would filter out the parts common to all queries, like g.V(), and look instead at the combinations of multiple steps that were called.
We’ve figured out how to override the Gremlin magic in Neptune Notebook to add our own logic for handling each query. For my problem I’m considering two approaches:
- Running the Gremlin profiler on each query to get detailed info on the nodes visited and then applying custom language analysis algorithms.
- Collecting this data and feeding it into an LLM to summarize the queries and answer questions I'm interested in.
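To make the "combinations of multiple steps" idea concrete, here is a rough sketch of counting step n-grams directly from the query text. It is purely illustrative: the function names and the default ignore list are made up for this example, and the regex approach is naive (it flattens nested traversals such as repeat(...) into one sequence and knows nothing about labels or the ontology).

```python
import re
from collections import Counter

# Quoted string literals, so text inside them is not mistaken for step calls.
_STRING_LITERAL = re.compile(r"'(?:\\.|[^'\\])*'|\"(?:\\.|[^\"\\])*\"")
# An identifier immediately followed by "(", e.g. out( or hasLabel(.
_STEP_CALL = re.compile(r"([A-Za-z_]\w*)\s*\(")


def extract_steps(query, ignored=("V",)):
    """Return the step names of a query in textual order, minus ignored ones.

    `ignored` is a hypothetical list of steps common to nearly every query.
    """
    stripped = _STRING_LITERAL.sub("''", query)
    return [step for step in _STEP_CALL.findall(stripped) if step not in ignored]


def step_ngrams(steps, n=2):
    """Return all runs of n consecutive steps as tuples."""
    return [tuple(steps[i:i + n]) for i in range(len(steps) - n + 1)]


def count_patterns(queries, n=2):
    """Count how often each step n-gram occurs across many queries."""
    counts = Counter()
    for query in queries:
        counts.update(step_ngrams(extract_steps(query), n))
    return counts


sample_queries = [
    "g.V().hasLabel('person').out('knows').values('name')",
    "g.V().hasLabel('person').out('knows').count()",
]
pattern_counts = count_patterns(sample_queries, n=2)
print(pattern_counts.most_common(3))
```

This only captures which steps tend to follow each other; it does not solve the non-unique edge label problem, which is why the profiler-based approach is still on the table.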
Solution
I think this is going to depend on how granular you want to get. If the intent is to see which labeled vertices or edges are accessed, then just looking at the query in the audit log would be sufficient. But if your intent is to see every atomic component that is accessed in the database as part of query execution, that could be expensive. It is possible, though. You could run every query through the Neptune Gremlin Profiler: https://docs.aws.amazon.com/neptune/latest/userguide/gremlin-profile-api.html and set profile.indexOps to True, and you'll get a section at the bottom of the profile output listing every index operation that occurs. These will equate to some permutation of the S-P-O-G patterns used in the three built-in indexes (or the fourth index, if enabled). With the list of indexed lookup patterns, you could maintain an external counter (maybe in a sorted set in Redis/Valkey) with a key of the S-P-O-G combination and the value being the number of times it was accessed.
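As a rough illustration of the counter idea, here is a minimal in-memory sketch. It assumes you have already pulled the index lookup patterns out of the profile output as plain strings; that parsing step is not shown, and the class name and the sample pattern strings are hypothetical. In Redis/Valkey the equivalent of `record` would be one ZINCRBY per pattern against a sorted set.

```python
from collections import Counter


class AccessPatternCounter:
    """In-memory stand-in for an external counter of index lookup patterns."""

    def __init__(self):
        self._counts = Counter()

    def record(self, patterns):
        """Add one access for each pattern string seen in a single profile run."""
        self._counts.update(patterns)

    def top(self, n=10):
        """Return the n most frequently accessed patterns with their counts."""
        return self._counts.most_common(n)


counter = AccessPatternCounter()
# Hypothetical pattern strings; the real keys depend on how you parse the profile.
counter.record(["pattern-a", "pattern-b"])
counter.record(["pattern-a"])
print(counter.top(2))
```

Keeping the counter outside Neptune means the bookkeeping itself adds no load to the database; only the profile runs do.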
Just be aware that obtaining a Neptune Gremlin Profile output requires that you run the query again. So you may not be able to use this to capture writes (without rewriting the data), and re-running all of the read queries will consume additional database resources.
The Gremlin profile feature in Neptune runs a specified traversal, collects metrics about the run, and produces a profile report.
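A minimal sketch of what such a profile request could look like from Python, using only the standard library. The assumptions here: the cluster is reachable over HTTPS on port 8182 at the /gremlin/profile path, IAM authentication is not enabled (otherwise the request would need SigV4 signing, which is not shown), and the profile.indexOps flag is sent as a JSON boolean. The function names and the endpoint placeholder are illustrative only.

```python
import json
import urllib.request


def build_profile_payload(gremlin_query, index_ops=True):
    """Build the JSON body for a profile request with index operation details."""
    return {"gremlin": gremlin_query, "profile.indexOps": index_ops}


def fetch_profile(endpoint, gremlin_query, port=8182, timeout=30):
    """POST the query to the profile endpoint and return the report as text.

    Note that this executes the query again on the cluster.
    """
    url = f"https://{endpoint}:{port}/gremlin/profile"
    body = json.dumps(build_profile_payload(gremlin_query)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


payload = build_profile_payload("g.V().limit(1)")
print(json.dumps(payload))
# Example call (needs network access to the cluster; placeholder endpoint):
# report = fetch_profile("your-neptune-endpoint", "g.V().limit(1)")
```

Since you already override the Gremlin magic in the notebook, a call like `fetch_profile` could run from that hook, ideally sampled rather than on every query to limit the extra load.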