3.6.2 gremlin-server possible memory leak
Have there been any memory leak reports in versions >=3.6.2?
I just tracked down a memory leak that manifests itself as slow memory growth while queries execute, and found that it appeared when we moved from 3.6.1 to 3.6.2.
I switched back to 3.6.1 and the leak is gone. I also tried running the graph outside of gremlin-server and the leak is gone.
Before I dive deep into trying to figure out exactly where this is coming from, I was wondering if this has been reported by anyone else?
29 Replies
To reproduce the issue, all I do is make a timed task that calls the garbage collector multiple times and reports the total memory used every few seconds, then I boot our graph in gremlin-server and start querying it at a high speed.
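Roughly something like this (a sketch of the kind of monitor I mean, not my exact code):

```java
// Rough sketch of the memory monitor: force a few GCs and log the heap in use
// every few seconds to watch for steady growth while the server is under load.
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class MemoryMonitor {
    public static void main(String[] args) {
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            for (int i = 0; i < 3; i++) System.gc();   // encourage a full collection
            Runtime rt = Runtime.getRuntime();
            long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
            System.out.println("heap in use: " + usedMb + " MB");
        }, 0, 5, TimeUnit.SECONDS);
    }
}
```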
In 3.6.1 this slowly grows to about 1 GB of memory and then stabilizes
In 3.6.2/3.6.3 this grows quickly and keeps growing until it hits an OutOfMemoryError (several GB)
https://github.com/apache/tinkerpop/compare/3.6.1..3.6.2
not sure what that could be. i do see a groovy version change and some small changes to the GremlinGroovyScriptEngine, so if you're submitting scripts then that might be a place to look. for the server, the user-agent stuff was added and a fix around timeout/closing. do you think you could get a profile going and see what items are being held in memory? cc/ @colegreer @kennh
Yeah I am going to look into this
I just need to set up code to bootstrap the server without the shell script so I can profile it properly. Will be a bit, but I will report findings here
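A minimal sketch of that kind of harness (my assumption, not the exact test code; the yaml path is a placeholder):

```java
// Boot Gremlin Server embedded in the current JVM so a profiler can attach to it directly.
import org.apache.tinkerpop.gremlin.server.GremlinServer;
import org.apache.tinkerpop.gremlin.server.Settings;

public class EmbeddedGremlinServer {
    public static void main(String[] args) throws Exception {
        // path to the server config is a placeholder; point it at your own yaml
        Settings settings = Settings.read("conf/gremlin-server.yaml");
        GremlinServer server = new GremlinServer(settings);
        server.start().join();   // netty's non-daemon threads keep the JVM alive afterwards
        // ... attach the profiler, drive query load, take heap snapshots ...
    }
}
```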
So my test boots up gremlin-server in java and I attached a profile. The first screenshot is 3.6.1
This next screenshot is 3.6.3 with the exact same test conditions and snapshot taken at same point
I don't really understand this part of the code base, but it looks like for some reason something between netty/gremlin-server is holding onto a bunch of data
what the heck
54 NioEventLoops?!
The only change in my code was switching 3.6.1 -> 3.6.3 in my pom.xml and adjusting my MergeVertex step so that, when on 3.6.1, it does not use a function that only exists in 3.6.3. I checked that this is not being called in my test.
I will take another trace of 3.6.3 running twice as long and post that to see if that number grows
That was a 2 min trace, so should have a result in 5ish mins
oh...i misread. hmm - something weird there nonetheless. bah
i will try to go through the GC roots and see if I can figure out what is holding onto all this. I am surprised no other graph providers have encountered this, unless it's a weird interaction between our implementation and 3.6.3
thanks for digging in
3.6.x is still pretty early i guess
it's been out for a while but full adoption is still sorta in the early stages
this is the same test running for 2x length
(note the client was closed and GC was forced many times at this point)
you wrote 3.6.2 earlier but now you're talking about 3.6.3
is it pinned to 3.6.2?
that would help reduce the search space
I went through our commits and found that the leak came up when we went to 3.6.2, but our dev is at 3.6.3 and has the same problem
ok, so we could just focus on the 3.6.1 to 3.6.2 commits
perhaps I should go through the server change PRs, compile the server, install it into my local maven repo, and then check whether the leak exists, repeating until I find the PR that introduced it?
yeah, that would be great if you could do that
there weren't many
based on what i'm looking at there in those screenies, it does seem like it is something gremlin server specific and didn't come from core or groovy
okay I can try that, will take me a bit to get set up but I will start working on that
so focusing on the changes just in gremlin-server might be best
We bumped netty version from 4.1.77 to 4.1.86 in 3.6.2, would that potentially affect it? I mean it is a very minor version bump...
it's not impossible, but i don't see any leak reports on google for that version
it could also be our usage of netty manifesting in a way that makes it look like netty is causing the leak
I will give that pr a try once i get set up
Thanks Lyndon!
Update:
Checked the commit right before that change and it already has the leak.
The user agent stuff came after that, so both the user agent change and the netty bump are ruled out
will continue with some older stuff
that timeout thingy was the other one i noticed
do you know what pr that is?
update:
Aug 1st, commit id 4c7dbb0 (https://github.com/apache/tinkerpop/commit/4c7dbb070303dff306ac30aee261da7086770693) has no leak
Aug 5, commit id 33c1d37 (https://github.com/apache/tinkerpop/commit/33c1d37030ed29b11cc6e58ea94910779a701772) has the leak
I am narrowing in on it and should have the exact commit id very soon
It appears the leak starts in this commit https://github.com/apache/tinkerpop/commit/33c1d37030ed29b11cc6e58ea94910779a701772
I am going through the code now to see if there's anything obvious
timeout stuff
yeah, I think I might have an idea what is causing it, let me make sure though
If I remove this block of code from the TraversalOpProcessor, the leak goes away
I think it's something to do with the executionFuture objects being held in that lambda, but I am not sure why this is a problem since it should be released when it executes
I will try making my evaluationTimeout small and see if it stops being a problem
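For reference, one way to shrink the timeout for a test like this (a hedged sketch; the connection properties path is an assumption) is to override the evaluation timeout per request from the driver:

```java
// Hedged sketch: override the evaluation timeout per request so the scheduled
// timeout task expires quickly instead of being held for the full server default.
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import static org.apache.tinkerpop.gremlin.process.traversal.AnonymousTraversalSource.traversal;

public class SmallTimeoutTest {
    public static void main(String[] args) throws Exception {
        // remote connection config path is an assumption; point it at your own properties file
        GraphTraversalSource g = traversal().withRemote("conf/remote-graph.properties");

        // "evaluationTimeout" is the per-request key; 500 ms for this traversal only
        long count = g.with("evaluationTimeout", 500L).V().count().next();
        System.out.println(count);

        g.close();
    }
}
```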
seems like it. It looks like what happens is that if you have a large evaluation timeout, you continually accumulate these objects until the timeout task fires; in my case this grew around 1 GB per minute, so it appeared as a leak. Technically it isn't a leak, since the memory is released once the timeout elapses, but it can consume gross amounts of memory, and maybe we can do something about that...
I made a fix that does the following:
1. Add a private final AtomicReference<ScheduledFuture<?>> scheduledTimeoutFuture = new AtomicReference<>(null) field to Context, with a getter/setter
2. Set scheduledTimeoutFuture to the scheduled timeout task when it is created
3. When the query completes, get the value of scheduledTimeoutFuture and, if it is not null, cancel() it
This prevents us from leaving a ton of dangling scheduled timeout tasks around
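A rough sketch of that pattern, outside of the actual TinkerPop classes (the class and method names here are illustrative, not the real Context API):

```java
// Illustrative sketch of the fix described above, not the actual TinkerPop patch:
// remember the scheduled timeout task and cancel it as soon as the query finishes,
// so it does not sit in the scheduler until the full timeout elapses.
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicReference;

class RequestContext {                       // stand-in for gremlin-server's Context
    private final AtomicReference<ScheduledFuture<?>> scheduledTimeoutFuture = new AtomicReference<>(null);

    void setScheduledTimeoutFuture(final ScheduledFuture<?> f) { scheduledTimeoutFuture.set(f); }

    // cancel the pending timeout task, if any, once the request has completed
    void cancelTimeout() {
        final ScheduledFuture<?> f = scheduledTimeoutFuture.get();
        if (f != null) f.cancel(false);
    }
}

class TimeoutCancelDemo {
    public static void main(String[] args) {
        final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        final RequestContext ctx = new RequestContext();

        // schedule the evaluation-timeout task and record it on the context
        final long evaluationTimeoutMs = 600_000;   // a large timeout makes the build-up visible
        ctx.setScheduledTimeoutFuture(
                scheduler.schedule(() -> System.out.println("timed out"), evaluationTimeoutMs, TimeUnit.MILLISECONDS));

        // ... the query executes and completes quickly ...
        ctx.cancelTimeout();   // without this, the task (and whatever it captures) lingers for 10 minutes

        scheduler.shutdown();
    }
}
```

(The real fix lives in gremlin-server's Context and op processors; this just demonstrates why cancelling the scheduled future releases the held references early.)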
I will get this in a PR back to TinkerPop on Monday since I am running low on time, but I tested the fix and it works
https://github.com/apache/tinkerpop/pull/2090/files