Performance issues with bulk loading
We've a JanusGraph cluster with Cassandra as the storage backend. Our cluster is deployed into a AWS EKS cluster. We've 32 JanusGraph pods with 2 cores and 16 GB and 9 Cassandra pods with 8 cores and 36 GB. We're using AWS Lambda and gremlin python library to load data in parallel. We've 20 concurrent Lambda invocations.
We're loading data anywhere between 1 million to 20 million nodes (and at least that many edges if not more, hard to predict) in each run. Each Lambda invocation adds 500 nodes and associated relationships. The time it takes to load 500 nodes goes up as the load progresses especially  when the number of nodes is closer to 20 million. A AWS Lambda can run for a maximum of 15 minutes. Several AWS Lambda invocations timeout especially towards the end of load.
What can we do to improve the performance of the data loading. We've to load about 1 billions nodes and associated relationships.
Our JanusGraph and Cassandra config is mostly default.
Thanks.
0 Replies