Apache TinkerPop

Apache TinkerPop is an open source graph computing framework and the home of the Gremlin graph query language.

Neo4j news

Why is there Neo4j news in #graph-news? Was Neo4j built on top of TinkerPop?
Solution:
The TinkerPop community has long maintained compatibility with neo4j, but recent releases of neo4j haven't been easy to stay compatible with for ongoing maintenance of that support. As a result, support for neo4j has been pinned to a quite old 3.x version. Recent discussions within the TinkerPop community are generally in favor of dropping support for neo4j for TinkerPop 4.x, which was an easier decision now that TinkerGraph supports basic transactions, giving us a way to test that functionality. As for your question about why we keep neo4j news in the #graph-news channel, I suppose i'm mostly responsible for that. As someone who has been working on TinkerPop since its earliest days, i've long thought of TinkerPop as a place to talk about graphs, not just TinkerPop-enabled graphs, but all graphs. Traditionally it's been that way, but in more recent times that general conversation seems to have drifted to other places. You're the second person to question the inclusion of neo4j here in graph-news, so perhaps there are more folks who find it confusing as to why it is present. it's also fairly noisy, as they post with great consistency, and if you follow graphs generally, you're probably getting that information elsewhere already. i've been thinking about removing it, and i'd be happy to hear if you or others agree with that happening....

Confusing behavior of `select()`.

The following traversal acts as a counter: ``` g.withSideEffect("map", [3: "foo", 4: "bar"]). inject("a", "b", "c", "d")....
Solution:
This is caused (if you are using TinkerGraph, for example) by Java's HashMap implementation and the fact that a Long key will not match an Integer key. ``` g.withSideEffect("map", [3L: "foo", 4L: "bar"]). inject("a", "b", "c", "d"). aggregate(local, "x")....

TinkerPop Server OOM

Hi Tinkerpop team, I'm trying to make sense of this OOMing that seems to consistently occur in my environment over the course of usually a couple hours. Attached is a screenshot of the JVM GC behavior metrics showing before & after a GC. It's almost like the underlying live memory continues to grow but I'm not sure why....
Solution:
Sorry for the delayed response. I'll try to take a look at this soon. But for now, I just wanted to point out that SingleTaskSession and the like are part of the UnifiedChannelizer. From what I remember, the UnifiedChannelizer isn't quite production ready, and in fact is being removed in the next major version of TinkerPop. We can certainly still make bug/performance fixes to this part of the code for 3.7.x though.

Good CLI REPL allowing unlabeled edges?

Is there another tool like Gremlin with a REPL but perhaps overall simpler? I'm mainly looking for the ability to make labeled nodes and unlabeled directed binary edges (arrows) between nodes. (On the other hand, I could use a generic label for every edge in my Gremlin graph, I guess.)...
Solution:
i think the recommendation would be to do as you suggested at the end of your question: just use default labels and ignore them in Gremlin, like g.V().out() as opposed to g.V().out('default'). speaking more to your question, i'm not sure what other graph frameworks you might use. i could be wrong, but i think NetworkX lets you create label-less graph elements: https://networkx.org/

Best practices for local development with Neptune.

I would like to use a local Gremlin server with TinkerGraph for local development, and then deploy changes to Neptune later. However, there are several differences between TinkerGraph and Neptune that impact the portability of the code. The most important one is probably the fact that in TinkerGraph vertex and edge ids are numeric, but they are strings in Neptune. Also, I think there are some differences in how properties are handled if the cardinality is a list. What is the recommended workflow to minimize discrepancies between my local environment and Neptune?...
Solution:
There's a blog post here that contains some of the details on what properties you can change in TinkerGraph to get close: https://aws.amazon.com/blogs/database/automated-testing-of-amazon-neptune-data-access-with-apache-tinkerpop-gremlin/ It's unlikely that you'll find anything that emulates things like the result cache, lookup cache, full-text-search, features, etc.
I would be curious to hear what the needs are for local dev....
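
As a rough sketch of what the blog post covers, TinkerGraph's properties file exposes ID-manager and cardinality settings that move it closer to Neptune's behavior (string IDs, set cardinality). The keys below are taken from TinkerGraph's configuration reference; double-check them against your TinkerPop version before relying on them:

```properties
# TinkerGraph configuration sketch to better mimic Neptune locally
gremlin.graph=org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph
# ANY allows arbitrary id types, so string ids like Neptune's will work
gremlin.tinkergraph.vertexIdManager=ANY
gremlin.tinkergraph.edgeIdManager=ANY
gremlin.tinkergraph.vertexPropertyIdManager=ANY
# Neptune defaults to set cardinality; TinkerGraph defaults to list
gremlin.tinkergraph.defaultVertexPropertyCardinality=set
```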

Sequential edge creation between streamed vertices

I would like to create an edge between vertices as they are streamed in sequence from a traversal. I want to connect each vertex to the next one in the stream, like a linear chain of vertices with a next edge. For example, this g.V().hasLabel("person").values("name") produces: ``` ==> josh...
Solution:
i think that's about as good as most approaches. we're missing a step that could simplify this code though. i've long wanted a partition() step so that you could change all that code to just: ``` g.V().hasLabel('person'). partition(2). addE('next').from.......
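
until something like that exists, the pairing logic itself is easy to see outside of Gremlin. here is a rough plain-Python sketch (names made up) of turning a streamed list into consecutive from/to pairs, each of which would back one addE('next'):

```python
def chain_pairs(items):
    """Return consecutive (current, next) pairs from a stream of items."""
    return list(zip(items, items[1:]))

names = ["josh", "marko", "peter", "vadas"]
# each pair would become one 'next' edge between the two vertices
edges = chain_pairs(names)
```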

[Bug?] gremlinpython hangs or does not recover connections after a connection error occurs

Hello, TinkerPop team. I am struggling to avoid problems after a connection error occurs, and now I suspect it might be caused by a bug in gremlinpython... ...
Solution:
What you're noticing here kind of boils down to how connection pooling works in gremlin-python. The pool is really just a queue that the connection adds itself back to after either an error or a success, but it's missing some handling for the scenarios you pointed out. One of the main issues is that the pool itself can't determine whether a connection is healthy or whether it is unhealthy and should be removed from the pool. I think you should go ahead and make a Jira for this. If it's easier for you, I can help you make one that references this post. I think the only workaround right now is to occasionally open a new Client to create a new pool of connections when you notice some of those exceptions....
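
A sketch of that workaround, hedged: this is not gremlinpython API, just a generic wrapper around any client factory (e.g. `lambda: client.Client(url, 'g')` with gremlinpython) that rebuilds the client, and with it the connection pool, whenever a submit fails:

```python
import time

def submit_with_retry(make_client, query, retries=3, delay=0.5):
    """Submit `query`; on error, close the client and rebuild it.

    make_client is any factory returning an object with submit() and
    close(); the names here are illustrative, not part of the driver.
    """
    cl = make_client()
    for attempt in range(retries):
        try:
            return cl.submit(query)
        except Exception:
            try:
                cl.close()  # drop the possibly-broken connection pool
            except Exception:
                pass
            if attempt == retries - 1:
                raise
            time.sleep(delay)
            cl = make_client()  # fresh client, fresh pool
```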

Vertex hashmaps

Hi, I'm looking to copy subgraphs; if there are better practices for this in general, please let me know. I'm currently looking at emitting a subtree, then creating new vertices, storing a mapping of the original to the copy, and reusing this mapping to build out the relationships for the copied vertices. I'm not sure how I should be doing this; currently I'm trying to use the aggregate step to store the original/copy pairs, but I'm not sure how to select nodes from this in future steps....
Solution:
since you tagged this question with javascript i think that aggregate() is probably your best approach. in java, you would probably prefer subgraph() because it gives you a Graph representation which you could in turn run Gremlin on and as a result is quite convenient. we hope to see better support for subgraph() in javascript (and other language variants) in future releases.
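
a rough sketch of the original-to-copy bookkeeping in plain Python (dicts stand in for the graph; ids and labels are made up):

```python
import itertools

new_id = itertools.count(100).__next__  # stand-in for graph-assigned ids

def copy_subgraph(vertices, edges):
    """vertices: {id: label}; edges: [(out_id, edge_label, in_id)].

    First copy every vertex while recording original -> copy ids,
    then replay the edges through that mapping.
    """
    mapping = {v: new_id() for v in vertices}
    copied_vertices = {mapping[v]: label for v, label in vertices.items()}
    copied_edges = [(mapping[o], lbl, mapping[i]) for o, lbl, i in edges]
    return copied_vertices, copied_edges, mapping
```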

Benchmarking

Hi everyone, how do you benchmark with Gremlin?
Solution:
that's a fairly broad question, so i'll give a broad answer. one of the nice things about TinkerPop is that it lets you connect to a lot of different graph databases with the same code, so it does allow you to compare performance of different graph databases. that said, doing a good benchmark is still a bit hard as it's not enough to just use Gremlin to generate a random graph and issue a few queries. among other things, a critical step is to gain a decent understanding of the workings of the gr...
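
as a minimal sketch of the client-side part only: a tiny Python harness that times any query callable with warmup runs (run_query is a stand-in for whatever driver call you measure; all names here are made up):

```python
import statistics
import time

def benchmark(run_query, warmup=5, repeats=20):
    """Time a callable, discarding warmup runs (JIT, caches, pools)."""
    for _ in range(warmup):
        run_query()
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - t0)
    return {
        "median_s": statistics.median(samples),
        "p95_s": sorted(samples)[max(int(0.95 * len(samples)) - 1, 0)],
    }
```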

How to improve Performance using MergeV and MergeE?

I made an implementation similar to this: g.mergeV([(id): 'vertex1']).option(onCreate, [(label): 'Person', 'property1': 'value1', 'updated_at': 'value2']).option(onMatch, ['updated_at': 'value2']).mergeV([(id): 'vertex2']).option(onCreate, [(label): 'Person', 'property1': 'value1', 'updated_at': 'value2']).option(onMatch, ['updated_at': 'value2']).mergeV([(id): 'vertex3']).option(onCreate, [(label): 'Person', 'property1': 'value1', 'updated_at': 'value2']).option(onMatch, ['updated_at': 'value2']) So I'm sending 2 requests to Neptune, the first one with 11 vertices and the second with 10 edges, and doing a performance test using Neptune. The duration of the process for this amount of content is around 200ms-500ms. Is there a way to improve this query to be faster? For the connection I'm using gremlin = client.Client(neptune_url, 'g', transport_factory=lambda: AiohttpTransport(call_from_event_loop=True), message_serializer=serializer.GraphSONMessageSerializer()) and I send the query via gremlin.submit(query)...
Solution:
In general, the way to get the best write performance/throughput on Neptune is to both batch multiple writes into a single request and then do multiple batched writes in parallel. Neptune stores each atomic component of the graph as a separate record (node, edge, and property). For example, if you have a node with 4 properties, that turns into 5 records in Neptune. A batched write query with around 100-200 records is a sweet spot that we've found in testing. So issuing queries with that many records and running those in parallel should provide better throughput. Conditional writes will slow things down, as additional locks are being taken to ensure data consistency. So writes that use straight addV(), addE(), property() steps will be faster than using mergeV() or mergeE(). The latter can also incur more deadlocks (exposed in Neptune as ConcurrentModificationExceptions). So it is also good practice to implement exponential backoff and retries whenever doing parallel writes into Neptune....
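
A hedged Python sketch of those two ideas, batching records and retrying with exponential backoff, with write_batch standing in for whatever function issues the actual Neptune request:

```python
import random
import time

def batches(records, size=200):
    """Split records into write batches near the 100-200 record sweet spot."""
    return [records[i:i + size] for i in range(0, len(records), size)]

def write_with_backoff(write_batch, batch, retries=5, base=0.1):
    """Retry a batched write with exponential backoff plus jitter, e.g. for
    ConcurrentModificationException-style failures under parallel writes."""
    for attempt in range(retries):
        try:
            return write_batch(batch)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base * (2 ** attempt) * (1 + random.random()))
```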

Why is T.label immutable and do we have to create a new node to change a label?

We cannot do g.V('some label').property(T.label, 'new label').iterate()? Is this correct? Thank you
Solution:
you have a few questions here @Julius Hamilton
Why is T.label immutable
i'm not sure there's a particular reason except to say that many graphs have not allowed that functionality so TinkerPop hasn't offered a way to do it.
and do we have to create a new node to change a label?...

Simple question about printing vertex labels

I am creating a graph in the gremlin cli by doing graph = TinkerGraph.open(), g = graph.traversal(), g.addV("somelabel"). i can confirm a vertex was created. i can do g.V().valueMap(true) and it shows ==>[id:0,label:documents]. But I so far do not know how to print information about a vertex via its id. I have tried g.V(0) but it doesn't print anything.
Solution:
By default, IDs are stored as longs. You likely need to use g.V(0L) in Gremlin Console to return the vertex that you created.

Defining Hypergraphs

I want to create a software system where a person can create labeled nodes, and then define labeled edges between the nodes. However, edges also count as nodes, which means you can have edges between edges and edges, edges and nodes, edges-between-edges-and-nodes and edges-between-nodes-and-nodes, and so on. This type of hypergraph is described well here in this Wikipedia article:
One possible generalization of a hypergraph is to allow edges to point at other edges. There are two variations of this generalization. In one, the edges consist not only of a set of vertices, but may also contain subsets of vertices, subsets of subsets of vertices and so on ad infinitum. In essence, every edge is just an internal node of a tree or directed acyclic graph, and vertices are the leaf nodes. A hypergraph is then just a collection of trees with common, shared nodes (that is, a given internal node or leaf may occur in several different trees). Conversely, every collection of trees can be understood as this generalized hypergraph....
Solution:
I always understood that back in the day Marko and crew decided that hypergraphs can be modeled with a property graph. You have to stick a vertex in the middle that represents the hyperedge. This leaves the query language without any first-class constructs for navigating hyperedges, but everything is reachable. Another problem would be performance: the more abstraction away from the implementation on disk, the slower the graph becomes....
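
The "vertex in the middle" encoding can be sketched in plain Python (dicts standing in for a property graph; all names are made up):

```python
def add_hyperedge(graph, he_id, label, members):
    """Model a hyperedge as an ordinary vertex with 'member' edges
    pointing at each participant. Members may themselves be hyperedge
    vertices, which gives edges-between-edges for free."""
    graph["vertices"][he_id] = label
    for m in members:
        graph["edges"].append((he_id, "member", m))

def hyperedge_members(graph, he_id):
    return {t for (s, lbl, t) in graph["edges"]
            if s == he_id and lbl == "member"}
```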

JanusGraph AdjacentVertex Optimization

Hiya, I'm wondering if anyone has any advice on how to inspect the provider-side optimizations being applied to my Gremlin code by JanusGraph. Currently when I call explain I get the following output. ``` Original Traversal [GraphStep(vertex,[]), HasStep([plabel.eq(Person)])@[a], VertexStep(OUT,vert...
Solution:
TinkerPop applies all optimization strategies to all queries (including JanusGraph's internal optimizations). However, JanusGraph skips some of the optimizations as it sees necessary. We don't currently store information about whether an optimization strategy modified any part of the query or was simply skipped (a potential feature request). Thus, the way I would test whether an optimization strategy actually makes any changes is to debug the query with a breakpoint placed in the relevant optimization strategy. In your case I would place a breakpoint here: https://github.com/JanusGraph/janusgraph/blob/c9576890b5e9dc48676ccc16a58552b8a665e5f0/janusgraph-core/src/main/java/org/janusgraph/graphdb/tinkerpop/optimize/strategy/AdjacentVertexOptimizerStrategy.java#L58C13-L58C28 If this part is triggered during your query execution, the optimization is being applied in this case....

Efficient degree computation for traversals of big graphs

Hello, We're trying to use Neptune with Gremlin for a fairly big (XX m nodes) graph, and our queries usually have to filter out low-degree vertices at some point in the query for both efficiency and product reasons. According to the profiler, this operation takes the brunt of our computation time. At the same time, the degree is something that could be pre-computed on the database (we don't even need it to be 100% accurate; a "recent" snapshot computation would be good enough), which would significantly speed up the query. Does anyone here have a trick up their sleeve for degree computation that would work well, either as an ad-hoc snippet for a query or as a nice way to precompute it on the graph? ...
Solution:
I would first mention that the profile() step in Gremlin is different than the Neptune Profile API. The latter is going to provide a great deal more info, including whether or not the entire query is being optimized by Neptune: https://docs.aws.amazon.com/neptune/latest/userguide/gremlin-profile-api.html If you have 10s of millions of nodes, you could use Neptune Analytics to do the degree calculations. Then extract the degree properties from NA, delete the NA graph, and bulk load those values back into NDB. We're working to make this round-trip process more seamless. But it isn't too hard to automate in the current form....
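
If you export an edge list, the snapshot itself is trivial to compute offline. A hedged Python sketch (the 'degree' property name and the data shapes are just examples), where the result would be bulk-loaded back as a plain vertex property so queries can filter on it instead of counting edges at read time:

```python
from collections import Counter

def degree_snapshot(edge_list):
    """Compute total (in + out) degree per vertex from an exported
    edge list of (out_vertex, in_vertex) pairs."""
    deg = Counter()
    for out_v, in_v in edge_list:
        deg[out_v] += 1
        deg[in_v] += 1
    return dict(deg)
```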

Gremlin query to order vertices with some locked to specific positions

I'm working with a product catalog in a graph database using Gremlin. The graph structure includes: 1. Product vertices...
Solution:
you can play tricks like this to move the 0 index to last place, but that still leaves the 7 one row off ``` gremlin> g.V().hasLabel("Category"). ......1> inE("belongsTo").as('a').outV(). ......2> path()....

Is it possible to configure SSL with PEM certificate types?

Hi all, I'm new to this group and currently working on getting an implementation of Gremlin (Aerospike Graph) to listen over SSL. The certificates we get from our provider's API are only served in PEM format. It appears, according to the documentation, that the keyStoreType and trustStoreType must be either JKS or PKCS12 format: https://tinkerpop.apache.org/javadocs/current/full/org/apache/tinkerpop/gremlin/server/Settings.SslSettings.html Is this true? Is there any way for us to configure SSL with PEM format certificates?...
Solution:
Hi @joshb, am I correct in assuming you are using the Java driver to connect to Aerospike? The Java driver uses the JSSE keyStore and trustStore, which as far as I understand do not support the PEM format. You may be able to use a third-party tool such as openssl to convert from PEM to PKCS12 (https://docs.openssl.org/1.1.1/man1/pkcs12/). Perhaps the @aerospike folks may have more direct recommendations for driver configuration....

Query works when executed in console but not in javascript

``` const combinedQuery = gremlin.V(profileId) .project('following', 'follows') .by( __.inE('FOLLOWS').outV().dedup().id().fold()...

Very slow regex query (AWS Neptune)

We have a query that searches a data set of about ~400,000 vertices, matching properties using a case insensitive TextP.regex() expression. We are observing very bad query performance; even after several other optimizations, it still takes 20-45 seconds, often timing out. Simplified query: ``` g.V()...
Solution:
When Neptune stores data, it stores it in 3 different indexed formats (https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-data-model.html#feature-overview-storage-indexing), each of which is optimized for a specific set of common graph patterns. Each of these indexes is optimized for exact-match lookups, so when running queries that require partial text matches, such as a regex query, all the matching property data needs to be scanned to see if it matches the provided expression.
To get a performant query for partial text matches, the suggestion is to use the Full Text Search integration (https://docs.aws.amazon.com/neptune/latest/userguide/full-text-search.html), which will integrate with OpenSearch to provide robust full-text searching capabilities within a Gremlin query...

How can we extract values only

"Latitude", { "@type": "g:Double", "@value": 45.2613104 },...
Solution:
If I understand this correctly, you are first trying to take the result of a Gremlin query that returns Latitude and Longitude (like in the initial post you made), and use those values in the math() step that calculates the Haversine formula (your latest post). If that is the case, you have two options. 1. You could combine this into one Gremlin query. You can save the results of the Latitude and Longitude to variables or use them in a by() modulator to the math() step. Assuming that those values are properties on a vertex called 'lat' and 'lon', it would look something like g.V().project('Latitude', 'Longitude').by('lat').by('lon').math(...). You would replace the ... in the math() step with the Haversine formula. 2. If you want to keep these as two separate queries, then you should use one of the Gremlin Language Variants (GLVs), which are essentially drivers that will automatically deserialize the result into the appropriate type so you don't have to deal with the GraphSON (which is what your initial post shows). Read triggan's answer above for more details about that....
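
For option 2, once the values are deserialized client-side, the Haversine computation is plain math. A Python sketch (assuming a 6371 km mean Earth radius):

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * r * asin(sqrt(a))
```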