Performance Problems/Config review
Hello,
I am reaching out because I believe our Scylla + JanusGraph deployment should be performing better than it currently does, and I am looking for insights to optimize it. We have tested versions 0.6.3 and 1.0.0-20230918-091019.c39a12a.
We currently run a compact setup on Kubernetes consisting of three nodes. Each node has 10 CPUs reserved exclusively for our tests and 60 GB of memory. Scylla and JanusGraph are deployed on the same nodes, and a fourth node runs a benchmark program to evaluate the configuration.
The benchmark program repeatedly executes a specific query in a loop, 100,000 times in total, with the workload handled by an executor service running 128 threads. When every vertex is inserted we reach approximately 4k ops/s; with updates only, it is around 10k ops/s.
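For reference, the harness has essentially the following shape (a minimal sketch; `runQuery` is a placeholder for the real JanusGraph traversal, which is not shown here):

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class BenchmarkSketch {

    // Placeholder for the real JanusGraph query, e.g. a g.V().has("uid", ...) traversal.
    static void runQuery() {
    }

    // Submits totalOps query executions to a fixed-size pool and returns the number completed.
    static int runBenchmark(int totalOps, int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < totalOps; i++) {
            pool.submit(() -> {
                runQuery();
                completed.incrementAndGet();
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);
        return completed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();
        int done = runBenchmark(100_000, 128);
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d ops in %.2fs (%.0f ops/s)%n", done, seconds, done / seconds);
    }
}
```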
I am attaching our most recent configuration for your review and would greatly appreciate any insights or suggestions. We have conducted extensive trial-and-error testing, and the ops/s were not significantly lower even in a single Scylla + single JanusGraph configuration. This leads me to believe there may be a misconfiguration somewhere that is throttling the overall system.
Any recommendations, hints, or tips on how to optimize this deployment would be greatly appreciated.
4 Replies
I strongly recommend turning off query.smart-limit, even though it won't affect your write QPS.
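In the JanusGraph properties file, that would look like this (a config fragment; the rest of the file is unchanged):

```properties
# Disable JanusGraph's automatic "smart" adjustment of query limits,
# which can cause extra round trips for limited queries.
query.smart-limit=false
```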
If uid is unique, I recommend using the custom vertex ID feature (https://docs.janusgraph.org/master/advanced-topics/custom-vertex-id/). This reduces storage and ID allocation overhead.
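A sketch of the relevant settings, assuming the options described in the linked docs (graph.allow-custom-vid-types is a JanusGraph 1.0+ option):

```properties
# Let the application supply vertex IDs instead of having JanusGraph allocate them.
graph.set-vertex-id=true
# (JanusGraph 1.0+) Allow custom ID types such as strings, not just longs.
graph.allow-custom-vid-types=true
```

Vertices are then inserted with an explicit ID, e.g. `graph.addVertex(T.id, uid)`, which skips the ID allocation round trip entirely.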
> We have conducted extensive trial-and-error testing, and the ops/s were not significantly lower even in a single Scylla + single JanusGraph configuration. This leads me to believe there may be a misconfiguration somewhere that is throttling the overall system.

That's a very interesting observation and I don't have an answer to it, but I am curious too. You mentioned "Scylla and JanusGraph are deployed on the same nodes" - I presume that means you have a JanusGraph (Gremlin) server running on each node. How does your benchmark program communicate with these servers? Do you have a proxy server that load-balances the requests from the client and forwards them to the different JanusGraph servers?
Hello @Boxuan Li, thanks for the feedback and sorry for the late reply - I was on vacation for a national holiday. I will give your recommendations a try on Friday.
Regarding your question: I have a Kubernetes setup with a Scylla StatefulSet using pod anti-affinity so that three different nodes are used, and a JanusGraph StatefulSet with a headless service, pod affinity to Scylla, and anti-affinity to other JanusGraph pods.
The benchmark test uses the Java client with 3 contact points.
During the benchmark, JanusGraph CPU load is 2 to 4 times higher than Scylla's.
As you can see, I didn't manage to configure the JanusGraph CQL driver so that it uses the Scylla instance with the lowest ping (that would be smart, I think).
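For context, the client-side setup is roughly the following Gremlin driver configuration (a sketch; the host names are hypothetical and depend on your headless service DNS - note the driver load-balances requests across all listed hosts by default):

```yaml
# Hypothetical pod DNS names from the JanusGraph headless service
hosts: [janusgraph-0.janusgraph, janusgraph-1.janusgraph, janusgraph-2.janusgraph]
port: 8182
connectionPool:
  # Sized for the 128-thread benchmark workload; tune as needed
  maxSize: 32
  maxInProcessPerConnection: 16
```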
> As you can see, I didn't manage to configure the JanusGraph CQL driver so that it uses the Scylla instance with the lowest ping (that would be smart, I think).

That's something the CQL driver itself handles. Maybe the DataStax CQL driver doesn't know enough about the Scylla cluster's topology, so it cannot be smart enough. There's an ongoing attempt to add Scylla's native driver, but it won't be in the 1.0.0 release.
By default the DataStax CQL driver is used, which doesn't have shard awareness. However, by explicitly excluding the CQL driver and explicitly enabling the Scylla driver, it is possible to use the Scylla driver instead, which will have lower latency than the CQL driver due to shard awareness (i.e. instead of routing every request through a single coordinating shard, requests are processed immediately by the shard that owns the data).
However, it's not possible to use the Scylla driver with the default JanusGraph Docker image or a default distribution release at this moment. Users must build that JanusGraph distribution release manually to enable it.
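The swap looks roughly like this in a custom build's pom.xml (a sketch, not the official build recipe; it relies on the Scylla Java driver being a drop-in, shard-aware fork of the DataStax driver, and the exact artifact versions are left as placeholders you must fill in):

```xml
<dependency>
  <groupId>org.janusgraph</groupId>
  <artifactId>janusgraph-cql</artifactId>
  <version>${janusgraph.version}</version>
  <exclusions>
    <!-- Exclude the default DataStax driver core -->
    <exclusion>
      <groupId>com.datastax.oss</groupId>
      <artifactId>java-driver-core</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<!-- Scylla's shard-aware fork of the same driver -->
<dependency>
  <groupId>com.scylladb</groupId>
  <artifactId>java-driver-core</artifactId>
  <version>${scylla.driver.version}</version>
</dependency>
```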
I was thinking of changing the distribution workflow to include the Scylla driver in both the default distribution release and the default Docker image, but I just don't have enough motivation for this task because we are currently using AstraDB and not Scylla. If we switch to Scylla, I will definitely handle that task quickly.