Elasticsearch mixed index performance
Hi all, we use JanusGraph 1.0 with a Cassandra storage backend and a 3-node Elasticsearch cluster as the index backend. We are trying to ingest 100,000 nodes. Our nodes are very simple for benchmarking purposes: all are person nodes with id, first name, and last name properties. 'id' has a composite index; first name and last name are supposed to have a mixed index. At this point we don't ingest any edges.
Ingesting 100,000 nodes without any mixed index (only the composite index on 'id') took us 2 minutes, while ingesting the same data with a mixed index on first name and last name took 50 minutes, which is significantly slower.
I am wondering whether we misconfigured something, or whether using a mixed index is expected to slow down ingestion so drastically.
Do you have any suggestions or ideas on how to speed up mixed indexing?
Thank you.
Peter
5 Replies
Hi, first of all, have you considered using custom IDs? That way, you don't need a separate "id" property and its composite index.
25x slower with mixed index sounds bad. My first impression is that you need to tune your Elasticsearch configurations. You may want to take a look at https://docs.janusgraph.org/index-backend/elasticsearch/#write-optimization
Also, do you mind sharing your configs and code, just in case we can spot anything suspicious?
Thanks a lot @Boxuan Li for your suggestions. We will have a look at the referenced resource and get back to you. Please note that under the "Write optimization" section in the JG docs, the "this blog post" external link is not reachable any more. Shall I open an issue on JG github regarding this? Thank you.
> Shall I open an issue on JG github regarding this? Thank you.

That would be great, thanks! You could also submit a PR if that's convenient for you. I would simply remove that paragraph as I cannot find any copy of this blog post.
Thanks Boxuan for the suggestion. I have opened the PR with id 4258. Thank you.
Hi @Boxuan Li , sorry for the late reply on this. It took some time to play around with various configs.
Thanks for the suggestions. For now, we kept the custom ID as a property, but this is probably not expected to have any effect on the mixed index performance.
We played around with some settings, but with no success so far.
We have our stack as follows:
* RedHat Linux virtual machine - 8 vcpus, 64 GiB memory
* the following components running as Docker containers
  * janusgraph 1.0
  * Cassandra 4.0.11
  * three node cluster Elasticsearch 8.0
We tried running node ingestion with the following configs:
* Scenario 1 - baseline:
  * ingesting person nodes with id, firstName, lastName
  * only composite index on id
* Scenario 2:
  * ingesting person nodes with id, firstName, lastName
  * composite index on id
  * mixed index on firstName
  * Default configs on JG and Elastic side.
* Scenario 3:
  * ingesting person nodes with id, firstName, lastName
  * composite index on id
  * mixed index on firstName
  * adjusted configs:
    * storage.batch-loading set to true on the dynamic graph
    * storage.buffer-size set to 6144 on the dynamic graph
    * ids.block-size set to 100000 on the dynamic graph
    * for the firstname index, refresh-interval set to 30s
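As an illustration of the refresh-interval adjustment in Scenario 3, the same setting can be changed through Elasticsearch's update index settings API (`PUT /<index>/_settings`). The Python sketch below only builds the request and does not send it. The URL `http://localhost:9200`, the index name `janusgraph_firstname`, and the helper name `build_refresh_interval_request` are assumptions for illustration, not values taken from this setup; a secured Elasticsearch 8 cluster would additionally need HTTPS and credentials.

```python
import json
import urllib.request


def build_refresh_interval_request(es_url, index_name, interval):
    """Build (but do not send) a PUT <index>/_settings request
    that changes the index refresh interval."""
    body = json.dumps({"index": {"refresh_interval": interval}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{es_url.rstrip('/')}/{index_name}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical endpoint and index name; adjust both to the actual cluster.
req = build_refresh_interval_request(
    "http://localhost:9200", "janusgraph_firstname", "30s"
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To actually apply the setting: urllib.request.urlopen(req)
```

Reading the setting back with `GET /<index>/_settings` is one way to check that the 30s value really reached the index before a benchmark run.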
We executed all three scenarios for 5,000 and 100,000 person nodes.
Below are the results:
* 5,000 nodes
  * Scenario 1: 16.49 sec
  * Scenario 2: 107.43 sec (1.79 min)
  * Scenario 3: 119.3 sec (1.99 min)
* 100,000 nodes
  * Scenario 1: 141.84 sec (2.36 min)
  * Scenario 2: 1883.28 sec (31.39 min)
  * Scenario 3: 1798.08 sec (29.97 min)
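To make the comparison easier to read, the timings above can be converted into throughput and slowdown relative to the baseline. This is a small Python sketch over the numbers listed above; the helper name `summarize` is only illustrative.

```python
# Timings in seconds, copied from the results above.
results = {
    5_000: {"Scenario 1": 16.49, "Scenario 2": 107.43, "Scenario 3": 119.3},
    100_000: {"Scenario 1": 141.84, "Scenario 2": 1883.28, "Scenario 3": 1798.08},
}


def summarize(node_count, timings, baseline="Scenario 1"):
    """Return {scenario: (nodes_per_second, slowdown_vs_baseline)}."""
    base = timings[baseline]
    return {
        name: (node_count / seconds, seconds / base)
        for name, seconds in timings.items()
    }


for node_count, timings in results.items():
    for name, (rate, slowdown) in summarize(node_count, timings).items():
        print(
            f"{node_count:>7} nodes, {name}: "
            f"{rate:7.1f} nodes/s, {slowdown:4.1f}x baseline time"
        )
```

With these numbers, both mixed-index scenarios come out at roughly 42 to 56 nodes per second, which is about 6.5 to 7.2 times the baseline time at 5,000 nodes and about 12.7 to 13.3 times at 100,000 nodes.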
The strange thing is that the results with the default config and the adjusted configs are not very different, and compared to the baseline, ingestion is still much slower.
Could you please have a look at our configs, as we might have misconfigured something? We welcome any suggestions on how we could improve our ingestion speed with a mixed index. Thank you.