Apache TinkerPop • 17mo ago
3 replies
Alex

How to improve Performance using MergeV and MergeE?

I made an implementation similar to this:
g.mergeV([(id): 'vertex1']).option(onCreate, [(label): 'Person', 'property1': 'value1', 'updated_at': 'value2']).option(onMatch, ['updated_at': 'value2']).
  mergeV([(id): 'vertex2']).option(onCreate, [(label): 'Person', 'property1': 'value1', 'updated_at': 'value2']).option(onMatch, ['updated_at': 'value2']).
  mergeV([(id): 'vertex3']).option(onCreate, [(label): 'Person', 'property1': 'value1', 'updated_at': 'value2']).option(onMatch, ['updated_at': 'value2'])


So I'm sending 2 requests to Neptune: the first with 11 vertices and the second with 10 edges, and I'm running a performance test against Neptune. The whole process for this amount of content takes around 200ms-500ms. Is there a way to make this query faster? For the connection I'm using
gremlin = client.Client(neptune_url, 'g', transport_factory=lambda: AiohttpTransport(call_from_event_loop=True), message_serializer=serializer.GraphSONMessageSerializer())
and I send the query with
gremlin.submit(query)
Solution
In general, the way to get the best write performance/throughput on Neptune is to both batch multiple writes into a single request and then run multiple batched writes in parallel.

Neptune stores each atomic component of the graph as separate records (node, edge, and property). For example, if you have a node with 4 properties, that turns into 5 records in Neptune. A batched write query with around 100-200 records is a sweet spot that we've found in testing. So issuing queries with that many records and running those in parallel should provide better throughput.
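
For example, here's a minimal sketch (not a drop-in implementation) of that batching-plus-parallelism pattern. It assumes the gremlin_python client from the question; neptune_url, vertex_ids, build_batch_query, and run_batch are illustrative placeholders rather than anything provided by Neptune or TinkerPop:

from concurrent.futures import ThreadPoolExecutor

from gremlin_python.driver import client, serializer

neptune_url = 'wss://your-neptune-endpoint:8182/gremlin'   # placeholder endpoint
vertex_ids = ['vertex%d' % i for i in range(1, 501)]        # placeholder data

# pool_size gives the client one pooled connection per parallel worker.
gremlin = client.Client(
    neptune_url, 'g',
    pool_size=4,
    message_serializer=serializer.GraphSONMessageSerializer())

def build_batch_query(ids):
    """Chain one mergeV() per vertex into a single batched traversal string."""
    steps = ["mergeV([(id): '%s'])"
             ".option(onCreate, [(label): 'Person', 'property1': 'value1', 'updated_at': 'value2'])"
             ".option(onMatch, ['updated_at': 'value2'])" % vid
             for vid in ids]
    return 'g.' + '.'.join(steps)

def run_batch(ids):
    # Submit one batched request and block until the server has processed it.
    return gremlin.submit(build_batch_query(ids)).all().result()

# Each vertex above is 1 node record + 2 property records = 3 records, so
# 50 vertices per request keeps each batch near the 100-200 record sweet spot.
batches = [vertex_ids[i:i + 50] for i in range(0, len(vertex_ids), 50)]

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(run_batch, batches))   # force evaluation so errors surface

gremlin.close()

Each worker blocks on .all().result(), so the thread pool naturally limits how many batched writes are in flight at once; keeping pool_size at least as large as max_workers means every worker gets its own connection.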

Conditional writes will slow things down, as additional locks are being taken to ensure data consistency. So writes that use straight addV(), addE(), and property() steps will be faster than using mergeV() or mergeE(). The latter can also incur more deadlocks (exposed in Neptune as ConcurrentModificationExceptions). So it is also good practice to implement exponential backoff and retries whenever doing parallel writes into Neptune.
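
As a rough illustration of that retry advice (a sketch only: submit_with_backoff and the timing constants are made up, and matching the ConcurrentModificationException by message text is an assumption about how the error surfaces through gremlin_python's GremlinServerError, so adapt it to whatever your driver actually raises):

import random
import time

from gremlin_python.driver.protocol import GremlinServerError

def submit_with_backoff(gremlin_client, query, max_retries=5):
    """Submit a write, retrying with exponential backoff plus jitter on conflicts."""
    for attempt in range(max_retries + 1):
        try:
            return gremlin_client.submit(query).all().result()
        except GremlinServerError as e:
            # Only retry the conflict case; re-raise anything else or the final failure.
            if 'ConcurrentModificationException' not in str(e) or attempt == max_retries:
                raise
            # Sleep 0.1s, 0.2s, 0.4s, ... plus jitter so parallel writers de-synchronize.
            time.sleep((2 ** attempt) * 0.1 + random.uniform(0, 0.05))

Each parallel worker would then call submit_with_backoff(gremlin, build_batch_query(batch)) instead of calling submit() directly.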