[parameterized queries] Increased query evaluation time when Gremlin Server starts/restarts

Hi folks, I'm seeing high latency in query evaluation whenever the JanusGraph server starts or restarts, and the degradation persists for at least 5 minutes. * I'm using parameterized queries for the JanusGraph server requests, so I expect some increased latency whenever the server starts/restarts, but the issue is that the degradation does not go away for at least 5 minutes, and evaluation latency goes from around 300 ms to 5,000 ms. * JanusGraph is deployed in a Kubernetes cluster with 20 pods, so every time I redeploy the JanusGraph cluster this issue arises, which results in timeouts on the client side. I wanted to know whether there is some way to pre-load all the parameterized queries into the cache, so that by the time a started/restarted JanusGraph pod is ready to serve requests, all the parameterized queries are already cached.
19 Replies
shivam.choudhary · 12mo ago
The time for query evaluation goes up to 30k ms on pod starts/restarts and remains high for around 5 minutes before settling back to previous levels.
[screenshot attached]
shivam.choudhary · 12mo ago
The above panel shows response times for a JanusGraph cluster with 20 JanusGraph instances. @KelvinL is there a way this can be avoided? TIA.
spmallette · 12mo ago
i'm not sure that i can think of any other way to do this than the obvious one: your application has to send the queries you wish to cache to the server when it starts up. as there is no shared cache, you will need to be sure that the queries are sent to each node in the cluster. the only other thing i can think of that you could try would be to write this pre-caching function into the init script for the server (which would execute on each node) and do it from there. the one catch is that the init script executes before the server is fully ready to accept requests, so you'd have to spawn a thread (or maybe start a separate application??) that tests for when the server is fully started and then sends all your pre-caching requests.
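[Editor's note: as a rough illustration of the "send the queries yourself" approach above, here is a minimal warm-up sketch using the TinkerPop Java driver. The pod addresses, port, and warm-up script are hypothetical placeholders; the point is only that each node must receive each parameterized script once so it lands in that node's script cache.]

```java
import java.util.List;
import java.util.Map;

import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;

public class ScriptCacheWarmer {
    public static void main(String[] args) throws Exception {
        // Hypothetical per-pod addresses, e.g. via a Kubernetes headless service,
        // so every node gets warmed (there is no shared script cache).
        List<String> pods = List.of("janusgraph-0.janusgraph", "janusgraph-1.janusgraph");

        // Every parameterized script the application uses; the parameter values
        // are throwaway and exist only to get the script compiled and cached.
        List<String> scripts = List.of("g.V().has('name', nameParam).limit(1).valueMap()");

        for (String pod : pods) {
            Cluster cluster = Cluster.build(pod).port(8182).create();
            Client client = cluster.connect();
            try {
                for (String script : scripts) {
                    // The first submission compiles the script and caches it on this node.
                    client.submit(script, Map.of("nameParam", "warmup")).all().get();
                }
            } finally {
                client.close();
                cluster.close();
            }
        }
    }
}
```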
shivam.choudhary · 12mo ago
@spmallette Thanks for the detailed approach. We thought of doing this at startup too, but as the approach looks a bit hacky I was checking whether there is any standard way of registering all the parameterized queries before using them. Also, I observed that whenever a new pod starts up, the number of current threads increases rapidly from roughly 28 to 1000. Can that be the reason for the increased evaluation time? The latency we are observing goes up to 5,000 ms from a mere 50 ms.
shivam.choudhary · 12mo ago
[screenshot attached]
spmallette · 12mo ago
The number of threads shouldn't exceed the size of the gremlinPool setting, unless you're using sessions, which technically don't have an upper bound at this point. i don't know if the thread increase is extending your evaluation time. i sense that it's more just query compilation time stacking up on you. switching from scripts to bytecode requests would resolve this caching issue if that is actually the problem. there is no cache for bytecode. it's simply faster. and if you're very adventurous you could test out the GremlinLangScriptEngine instead of the groovy one. tests are showing it to be the fastest way to process Gremlin. unfortunately it does not yet handle parameters (there hasn't been a ton of need for that because there is no caching needed as there is for Groovy).
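[Editor's note: for reference, a minimal sketch of the bytecode route with the Java driver; the host and traversal source name are placeholders. Because the traversal is built client-side and sent as bytecode, there is no server-side Groovy compilation and therefore no script cache to warm.]

```java
import static org.apache.tinkerpop.gremlin.process.traversal.AnonymousTraversalSource.traversal;

import org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteConnection;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;

public class BytecodeRequestExample {
    public static void main(String[] args) throws Exception {
        // "g" must match a traversal source bound in the server's configuration.
        GraphTraversalSource g = traversal().withRemote(
                DriverRemoteConnection.using("janusgraph-service", 8182, "g"));

        // "Parameters" here are just ordinary Java arguments; nothing is
        // registered or cached on the server side.
        Object name = g.V().has("name", "someValue").values("name").next();
        System.out.println(name);

        g.close();
    }
}
```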
shivam.choudhary · 12mo ago
> unfortunately it does not yet handle parameters
We have designed the architecture around the usage of parameters in the queries, so it won't be possible for now to switch away from them. Btw, I profiled the JVM and it turns out that these ~1000 threads that are getting created belong to StandardIDPool.
spmallette · 12mo ago
> We have designed the architecture around the usage of parameters in the queries, so it won't be possible for now to switch away from them.
that functionality isn't available yet, but will be soon. there will still be limitations, as it seemed important to restrict Gremlin a bit more than Groovy, but for standard uses like has('name',x) or property('name',x), which i assume is the kind of thing you're doing, it should work. hoping it lands in 3.7.0, though i'm not sure we'll make it the default in Gremlin Server for a long time.
> Btw, I profiled the JVM and it turns out that these ~1000 threads that are getting created belong to StandardIDPool.
i don't know if that is normal offhand as that is not a Gremlin Server thread pool. i believe it is a JanusGraph one. does anyone at @janusgraph know if it is expected to generate that many threads under these conditions?
Bo · 12mo ago
No, that does not sound normal. @shivam.choudhary Did you change the ids.num-partitions config to a large number?
shivam.choudhary · 12mo ago
@boxuanli I checked and the value is set to 1024. Since the config is fixed for the lifetime of the graph, we wanted the graph sufficiently partitioned so that we could create partitioned vertex labels for supernodes. But as of now we haven't had the requirement to create a partitioned label. Are there issues we might face due to this down the line? Also, we use Bigtable as our storage backend; not sure how it helps here, as we mostly have 1 or 2 Bigtable nodes depending on the traffic.
Bo · 12mo ago
Setting that number to 1024 means you have 1024 threads for StandardIDPool
shivam.choudhary · 12mo ago
@spmallette I checked the longRunCompilationCount metric, which gives the count of events where the script compilation time was more than the expectedCompilationTime (I configured it as 100 milliseconds), and it came out to be 0. (Actually it was 1, due to a script which gets evaluated on startup by default and took around 2403 ms.) This means that query compilation is not taking that much time, yet the latency we are observing is still on the order of ~500 ms for a few minutes after startup.
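[Editor's note: for context, expectedCompilationTime is a GroovyCompilerGremlinPlugin setting in gremlin-server.yaml. A minimal sketch of the relevant fragment, assuming a standard Gremlin Server configuration; the 100 ms value is the one the poster mentions.]

```yaml
scriptEngines:
  gremlin-groovy:
    plugins:
      org.apache.tinkerpop.gremlin.groovy.jsr223.GroovyCompilerGremlinPlugin:
        # compilations slower than this threshold (in ms) are logged and
        # counted in the longRunCompilationCount metric
        expectedCompilationTime: 100
```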
spmallette · 11mo ago
@shivam.choudhary did the size of the StandardIDPool have anything to do with this problem?
shivam.choudhary · 11mo ago
I'm still figuring out a way to determine whether StandardIDPool has anything to do with this problem. I have tried several approaches but nothing so far. Since the config which sets the size of the StandardIDPool cannot be changed, it is getting a bit challenging.
Flynnt · 11mo ago
There is a difference between cluster.max-partitions and ids.num-partitions. On startup, each JanusGraph server instance tries to retrieve a block of ids (of the size defined in ids.block-size) for each of its ids.num-partitions partitions, so 1024 blocks per instance in your case (the parameter name is quite confusing in my opinion). Each block allocation results in a lock on the "id allocation table" in Janus. So when you restart your instances, the table takes 20 x 1024 = 20,480 allocation requests, and for each block it must lock and verify that the block has not been allocated to another instance. What's your value for cluster.max-partitions?
shivam.choudhary · 11mo ago
Sorry, I had a mix-up last time: ids.num-partitions is set to 10 and cluster.max-partitions is set to 1024.
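[Editor's note: pulling these corrected numbers together with the block size from the config shared below, the id-allocation settings in play are, as reported in this thread:]

```
ids.block-size: 1000000       # ids claimed per allocated block
ids.num-partitions: 10        # partitions each instance claims id blocks for
cluster.max-partitions: 1024  # virtual partition count, fixed for the graph's lifetime
```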
Flynnt · 11mo ago
Have you tried starting your graph in read-only mode? It would disable the idPool stuff. Can you give us your janusgraph.properties file? It really looks like you have a bottleneck on startup from the id-block retrieval. Did you change your connection settings to your backend (number of connections, timeout...)?
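[Editor's note: assuming the standard JanusGraph storage.read-only flag is what's meant here, a read-only instance would be opened with something like the line below. Whether this actually avoids the id-pool work is questioned later in the thread.]

```
storage.read-only: true  # open the graph read-only; all mutations are rejected
```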
shivam.choudhary · 11mo ago
No, we haven't changed anything related to the backend connection, but we did change ids.block-size when we were initially ingesting the data into the graph, as the data was huge. Please find the janusgraph.properties below:
properties:
storage.backend: hbase
storage.directory: null
storage.hbase.ext.google.bigtable.instance.id: ##########
storage.hbase.ext.google.bigtable.project.id: ##########
storage.hbase.ext.google.bigtable.app_profile.id: ############
storage.hbase.ext.hbase.client.connection.impl: com.google.cloud.bigtable.hbase2_x.BigtableConnection
storage.hbase.short-cf-names: true
storage.hbase.table: ###########
cache.db-cache: false
cache.db-cache-clean-wait: 20
cache.db-cache-time: 180000
cache.db-cache-size: 0.5
cluster.max-partitions: 1024
graph.replace-instance-if-exists: true
metrics.enabled: true
metrics.jmx.enabled: true
ids.block-size: "1000000"
query.batch: true
query.limit-batch-size: true
schema.constraints: true
schema.default: none
storage.batch-loading: false
storage.hbase.scan.parallelism: 10
Currently I'm working on setting up the read-only JanusGraph instances; I will be able to test it out soon with the load which we have on the current JanusGraph instances.
Bo · 11mo ago
I cannot remember the difference between ids.num-partitions and cluster.max-partitions, so I might have made a mistake. Sorry about that. Btw, I don't think setting read-only JanusGraph instances per se will make a difference. If you don't create any new data, the JanusGraphID threads won't be created in the first place. So technically it doesn't matter whether you set read-only or not: as long as you don't attempt to write data, the JanusGraphID threads won't show up.