3.6.2 gremlin-server possible memory leak

Have there been any memory leak reports in versions >= 3.6.2? I just tracked down a memory leak that manifests as slow memory growth while queries execute, and found that it was introduced when we upgraded from 3.6.1 to 3.6.2. I switched back to 3.6.1 and the leak is gone. I also tried running the graph outside of gremlin-server and the leak is gone. Before I dive deep into figuring out exactly where this is coming from, I was wondering if this has been reported by anyone else?
29 Replies
Lyndon (12mo ago)
To reproduce the issue, all I do is make a timed task that calls the garbage collector multiple times and reports the total memory used every few seconds, then I boot our graph in gremlin-server and start querying it at a high rate. In 3.6.1 this slowly grows to about 1 GB of memory and then stabilizes. In 3.6.2/3.6.3 this grows fast and keeps growing until it OOMEs (several GB).
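Roughly what such a monitor looks like; this is a minimal sketch (the class name, GC-call count, and reporting interval are illustrative, not Lyndon's actual code):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HeapMonitor {
    public static void main(String[] args) {
        final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            // Request collection a few times so the reading reflects live objects
            // rather than garbage that simply has not been collected yet.
            for (int i = 0; i < 3; i++) {
                System.gc();
            }
            final Runtime rt = Runtime.getRuntime();
            final long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
            System.out.println("used heap: " + usedMb + " MB");
        }, 0, 5, TimeUnit.SECONDS);
    }
}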
spmallette (12mo ago)
https://github.com/apache/tinkerpop/compare/3.6.1..3.6.2 not sure what that could be. i do see a groovy version change and some small changes to the GremlinGroovyScriptEngine, so if you're submitting scripts then that might be a place to look. for the server, the user-agent stuff was added and a fix around timeout/closing. do you think you could get a profile going and see what items are being held in memory? cc/ @colegreer @kennh
Lyndon (12mo ago)
Yeah, I am going to look into this. I just need to set up code to bootstrap the server without the shell script so I can profile it properly. Will be a bit, but I will report findings here.
Lyndon (12mo ago)
So my test boots up gremlin-server in Java and I attached a profiler. The first screenshot is 3.6.1.
[screenshot]
Lyndon (12mo ago)
This next screenshot is 3.6.3 with the exact same test conditions and the snapshot taken at the same point.
[screenshot]
Lyndon (12mo ago)
I don't really understand this part of the code base, but it looks like for some reason something between netty/gremlin-server is holding onto a bunch of data.
spmallette (12mo ago)
what the heck 54 NioEventLoops?!
Lyndon (12mo ago)
The only change in my code was changing 3.6.1 -> 3.6.3 in my pom.xml and then adjusting my MergeVertex step so it does not use a function that only exists in 3.6.3 when building against 3.6.1. I checked that this is not being called in my test. I will take another trace of 3.6.3 running twice as long and post that to see if that number grows. That was a 2 min trace, so I should have a result in 5-ish mins.
spmallette (12mo ago)
oh...i misread. hmm - something weird there nonetheless. bah
Lyndon (12mo ago)
i will try to go through the GC roots and see if I can figure out what is holding onto all this. I am surprised no other graph providers have encountered this, unless it's a weird interaction between our implementation and 3.6.3.
spmallette (12mo ago)
thanks for digging in. 3.6.x is still pretty early i guess. it's been out for a while but full adoption is still sorta in the early stages.
Lyndon (12mo ago)
this is the same test running for 2x the length
[screenshot]
Lyndon (12mo ago)
(note the client was closed and GC was forced many times at this point)
spmallette (12mo ago)
you wrote 3.6.2 earlier but now you're talking about 3.6.3. is it pinned to 3.6.2? that would help reduce the search space.
Lyndon (12mo ago)
I went through our commits and found that the leak came up when we went to 3.6.2, but our dev is at 3.6.3 and has the same problem
spmallette (12mo ago)
ok, so we could just focus on the 3.6.1 to 3.6.2 commits
Lyndon (12mo ago)
perhaps I should go through the server change PRs, compile the server at each one, inject that into my local maven, and then check whether the leak exists until I find the PR that introduced it?
spmallette (12mo ago)
yeah, that would be great if you could do that. there weren't many based on what i'm looking at. from those screenies it does seem like it is something gremlin-server specific and didn't come from core or groovy.
Lyndon (12mo ago)
okay I can try that, will take me a bit to get set up but I will start working on that
spmallette (12mo ago)
so focusing on the changes just in gremlin-server might be best
Yang Xia (12mo ago)
We bumped netty version from 4.1.77 to 4.1.86 in 3.6.2, would that potentially affect it? I mean it is a very minor version bump...
Lyndon (12mo ago)
it's not impossible, i don't see any leak reports on google for that version. it could also be our usage of netty manifesting in a way that makes it look like netty is causing the leak. I will give that PR a try once i get set up.
Yang Xia (12mo ago)
Thanks Lyndon!
Lyndon (12mo ago)
Update: Checked the commit right before that change and it has the leak. The user agent stuff came after that, so both the user agent and the netty bump are ruled out. Will continue with some older stuff.
spmallette (12mo ago)
that timeout thingy was the other one i noticed
Lyndon (12mo ago)
do you know what pr that is? update: Aug 1st, commit id 4c7dbb0 (https://github.com/apache/tinkerpop/commit/4c7dbb070303dff306ac30aee261da7086770693) has no leak. Aug 5, commit id 33c1d37 (https://github.com/apache/tinkerpop/commit/33c1d37030ed29b11cc6e58ea94910779a701772) has the leak. I am narrowing in on it and should have the exact commit id very soon.
Lyndon (12mo ago)
It appears the leak starts in this commit: https://github.com/apache/tinkerpop/commit/33c1d37030ed29b11cc6e58ea94910779a701772. I am going through the code now to see if there's anything obvious.
spmallette (12mo ago)
timeout stuff
Lyndon (12mo ago)
yeah, I think I might have an idea of what is causing it, let me make sure though. If I remove this block of code:
final Future<?> executionFuture = context.getGremlinExecutor().getExecutorService().submit(evalFuture);
if (seto > 0) {
    // Schedule a timeout in the thread pool for future execution
    context.getScheduledExecutorService().schedule(() -> {
        executionFuture.cancel(true);
        if (!context.getStartedResponse()) {
            context.sendTimeoutResponse();
        }
    }, seto, TimeUnit.MILLISECONDS);
}
From the TraversalOpProcessor, it goes away. I think it's something to do with the executionFuture objects being held in that lambda, but I am not sure why this is a problem since it should be released when it executes. I will try making my evaluationTimeout small and see if it stops being a problem.

Seems like it does. So it looks like what happens is: if you have a large evaluation timeout, you will continually accumulate these objects. In my case this grew around 1 GB per minute, so it appeared as a leak. Technically it isn't a leak, but it can consume gross amounts of memory and maybe we can do something about that.

I made a fix that does the following:
1. Add private final AtomicReference<ScheduledFuture<?>> scheduledTimeoutFuture = new AtomicReference<>(null); to Context with a getter/setter
2. Set scheduledTimeoutFuture to the value of the scheduled timeout task when it is created
3. When the query completes, get the value of scheduledTimeoutFuture and, if it is not null, cancel() it

This prevents us from leaving a ton of dangling scheduled timeout tasks. I will get this in a PR back to TinkerPop on Monday since I am running low on time, but I tested the fix and it works. https://github.com/apache/tinkerpop/pull/2090/files