3.6.2 gremlin-server possible memory leak

Have there been any memory leak reports in versions >= 3.6.2? I just tracked down a memory leak that manifests as slow memory growth while queries execute, and found that it was introduced when we upgraded from 3.6.1 to 3.6.2. I switched back to 3.6.1 and the leak is gone. I also tried running the graph outside of gremlin-server and the leak is gone. Before I dive deep into figuring out exactly where this is coming from, I was wondering if this has been reported by anyone else?
29 Replies
Lyndon (12mo ago)
To reproduce the issue, all I do is make a timed task that calls the garbage collector multiple times and reports the total memory used every few seconds, then I boot our graph in gremlin-server and start querying it at a high rate. In 3.6.1 this slowly grows to about 1 GB of memory and then stabilizes. In 3.6.2/3.6.3 this grows fast and keeps growing until it OOMEs (several GB).
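Roughly what such a monitor looks like; this is a minimal sketch (the class name, GC-call count, and reporting interval are illustrative, not Lyndon's actual code):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HeapMonitor {
    public static void main(String[] args) {
        final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            // Request collection a few times so the reading reflects live objects
            // rather than garbage that simply has not been collected yet.
            for (int i = 0; i < 3; i++) {
                System.gc();
            }
            final Runtime rt = Runtime.getRuntime();
            final long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
            System.out.println("used heap: " + usedMb + " MB");
        }, 0, 5, TimeUnit.SECONDS);
    }
}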
spmallette (12mo ago)
https://github.com/apache/tinkerpop/compare/3.6.1..3.6.2 not sure what that could be. i do see a groovy version change and some small changes to the GremlinGroovyScriptEngine, so if you're submitting scripts then that might be a place to look. for the server, the user-agent stuff was added and a fix around timeout/closing. do you think you could get a profile going and see what items are being held in memory? cc/ @colegreer @kennh
Lyndon (12mo ago)
Yeah, I am going to look into this. I just need to set up code to bootstrap the server without the shell script so I can profile it properly. Will be a bit, but I will report findings here.
Lyndon (12mo ago)
So my test boots up gremlin-server in Java and I attached a profiler. The first screenshot is 3.6.1.
[screenshot]
Lyndon (12mo ago)
This next screenshot is 3.6.3 with the exact same test conditions and the snapshot taken at the same point.
[screenshot]
Lyndon (12mo ago)
I don't really understand this part of the code base, but it looks like for some reason something between netty/gremlin-server is holding onto a bunch of data.
spmallette (12mo ago)
what the heck 54 NioEventLoops?!
Lyndon (12mo ago)
The only change in my code was changing 3.6.1 -> 3.6.3 in my pom.xml and then adjusting my MergeVertex step so it does not use a function that only exists in 3.6.3 when building against 3.6.1. I checked that this is not being called in my test. I will take another trace of 3.6.3 running twice as long and post that to see if that number grows. That was a 2 min trace, so I should have a result in 5-ish mins.
spmallette (12mo ago)
oh...i misread. hmm - something weird there nonetheless. bah
Lyndon (12mo ago)
i will try to go through the GC roots and see if I can figure out what is holding onto all this. I am surprised no other graph providers have encountered this, unless it's a weird interaction between our implementation and 3.6.3.
spmallette (12mo ago)
thanks for digging in. 3.6.x is still pretty early i guess. it's been out for a while but full adoption is still sorta in the early stages.
Lyndon (12mo ago)
this is the same test running for 2x the length
[screenshot]
Lyndon (12mo ago)
(note the client was closed and GC was forced many times at this point)
spmallette (12mo ago)
you wrote 3.6.2 earlier but now you're talking about 3.6.3. is it pinned to 3.6.2? that would help reduce the search space.
Lyndon (12mo ago)
I went through our commits and found that the leak came up when we went to 3.6.2, but our dev is at 3.6.3 and has the same problem
spmallette (12mo ago)
ok, so we could just focus on the 3.6.1 to 3.6.2 commits
Lyndon (12mo ago)
perhaps I should go through the server change PRs, compile the server at each one, inject that into my local maven, and then check whether the leak exists until I find the PR that introduced it?
spmallette (12mo ago)
yeah, that would be great if you could do that. there weren't many based on what i'm looking at. from those screenies it does seem like it is something gremlin-server specific and didn't come from core or groovy.
Lyndon (12mo ago)
okay I can try that, will take me a bit to get set up but I will start working on that
spmallette (12mo ago)
so focusing on the changes just in gremlin-server might be best
Yang Xia (12mo ago)
We bumped netty version from 4.1.77 to 4.1.86 in 3.6.2, would that potentially affect it? I mean it is a very minor version bump...
Lyndon (12mo ago)
it's not impossible, i don't see any leak reports on google for that version. it could also be our usage of netty manifesting in a way that makes it look like netty is causing the leak. I will give that PR a try once i get set up.
Yang Xia (12mo ago)
Thanks Lyndon!
Lyndon (12mo ago)
Update: Checked the commit right before that change and it has the leak. The user agent stuff came after that, so both the user agent and the netty bump are ruled out. Will continue with some older stuff.
spmallette (12mo ago)
that timeout thingy was the other one i noticed
Lyndon (12mo ago)
do you know what pr that is? update: Aug 1st, commit id 4c7dbb0 (https://github.com/apache/tinkerpop/commit/4c7dbb070303dff306ac30aee261da7086770693) has no leak. Aug 5, commit id 33c1d37 (https://github.com/apache/tinkerpop/commit/33c1d37030ed29b11cc6e58ea94910779a701772) has the leak. I am narrowing in on it and should have the exact commit id very soon.
Lyndon (12mo ago)
It appears the leak starts in this commit: https://github.com/apache/tinkerpop/commit/33c1d37030ed29b11cc6e58ea94910779a701772. I am going through the code now to see if there's anything obvious.
spmallette (12mo ago)
timeout stuff
Lyndon (12mo ago)
yeah, I think I might have an idea of what is causing it, let me make sure though. If I remove this block of code:
final Future<?> executionFuture = context.getGremlinExecutor().getExecutorService().submit(evalFuture);
if (seto > 0) {
    // Schedule a timeout in the thread pool for future execution
    context.getScheduledExecutorService().schedule(() -> {
        executionFuture.cancel(true);
        if (!context.getStartedResponse()) {
            context.sendTimeoutResponse();
        }
    }, seto, TimeUnit.MILLISECONDS);
}
From the TraversalOpProcessor, it goes away. I think it's something to do with the executionFuture objects being held in that lambda, but I am not sure why this is a problem since it should be released when it executes. I will try making my evaluationTimeout small and see if it stops being a problem.

Seems like it does. So it looks like what happens is: if you have a large evaluation timeout, you will continually accumulate these objects. In my case this grew around 1 GB per minute, so it appeared as a leak. Technically it isn't a leak, but it can consume gross amounts of memory and maybe we can do something about that.

I made a fix that does the following:
1. Add private final AtomicReference<ScheduledFuture<?>> scheduledTimeoutFuture = new AtomicReference<>(null); to Context with a getter/setter
2. Set scheduledTimeoutFuture to the value of the scheduled timeout task when it is created
3. When the query completes, get the value of scheduledTimeoutFuture and, if it is not null, cancel() it

This prevents us from leaving a ton of dangling scheduled timeout tasks. I will get this in a PR back to TinkerPop on Monday since I am running low on time, but I tested the fix and it works. https://github.com/apache/tinkerpop/pull/2090/files