A real-life example of finding memory leaks in Java

by perfstories

The theory of finding memory leaks is pretty simple. I would formulate it as follows:

  1. Make sure it is indeed a leak;
  2. Identify the leaking objects;
  3. Identify the code lines that cause the leak;
  4. Fix the code.

There are a lot of articles about this. I remember one really great article: http://olex.openlogic.com/wazi/2009/how-to-fix-memory-leaks-in-java. I recommend it to everyone and thank its author.

If you hit a memory leak during performance testing, you are lucky. Usually it’s a simple case, even if you use J2EE technologies, where, say, 80% of the code running on the JRE isn’t your team’s code (the container, various libraries, etc.). I say simple because you can reproduce it. Of course, there are harder cases (JDK native code, lack of sources, etc.), but the ability to reproduce is a serious bid for victory.

First of all, let’s clarify

Under what circumstances could we see such a difference between tests running on a test environment and real user load on the production environment?

  1. There are differences between the test and production load profiles. That’s very easy to imagine: even if you pay a lot of attention to how representative your tests are, once a system is complex enough, there will be differences.

    As Yogi Berra said: “In theory there is no difference between theory and practice. In practice there is.” :);

  2. There are differences between the test and production environments. Such differences are a very common thing per se. And even if we think we know exactly and completely what the difference is, and are 100% sure it cannot affect the memory leak, it still may. Please reread what Yogi Berra said above; it applies very well here too.

So it makes little sense to try to reproduce a memory leak on a test environment as soon as you hit it in production. The probability of reproducing it is about 1/2: either you get lucky and reproduce it, or you end up chasing a mare’s nest :).

Instead of this

Try to get additional information from production. Based on this data you can try to understand (a) which objects leak and (b) which scenarios and code lines cause the problem. And then, once we have some ideas about the scenarios and code lines, we can try to reproduce the memory leak on a test environment. In fact, we must rather than can, to make sure our assumption and fix are correct.

I mentioned additional information; let me clarify what I mean. Mainly two things:

  1. Memory dumps;
  2. “Side” information: what changed in your system between the two states “memory didn’t leak” and “memory started to leak”. There could be a lot of changes, but we should picture them at least in broad strokes; it’s better to end up with a consistent picture.

I’m going to illustrate everything mentioned above with an example. My memory leak happened in the JMSJCA library (http://java.net/projects/jmsjca/), so perhaps this is not just an example but also some pretty useful information.

So, once upon a time…

The symptoms were very common: all available Java heap memory leaked away in, say, N days after a new release was installed. It was difficult to link the leak to user activity, i.e. to single out one or two user scenarios.

OK, let’s order dumps from production. Ordered. Got them. And suddenly an interesting effect was discovered: the memory level dropped during dump generation, or at least climbed much more slowly afterwards, so memory leaked away in M days rather than N.

Fortunately, this effect didn’t occur every time, so I could still get dumps while heap consumption was growing.

So, as input data I had the memory dumps and the observation about “memory falls” after heap dump generation.

So, what can the dumps tell us?

Let’s analyze three dumps generated at equal intervals (24 hours apart, so each interval covers a full period of user activity) without a system restart, and taken at night, when there was almost no user activity at all. Let’s call the dumps D1, D2 and D3 (in chronological order). We will use YourKit for the analysis.

By the way, there was a “memory fall” during D2 generation.

If we compare D1 and D2, we see that the biggest increase in Retained Size comes from Object[] (D1: 423825232 bytes; D2: 989956712 bytes). Moreover, the difference (989956712 − 423825232 ≈ 540 MB) roughly equals the memory leaked over this time (~500 MB):

D1: [YourKit screenshot]

D2: [YourKit screenshot]

We also see that the memory leaks in threads of the class com.sun.corba.ee.impl.orbutil.threadpool.ThreadPoolImpl$WorkerThread.

Next, let’s look at the “Object Explorer” view for com.sun.corba.ee.impl.orbutil.threadpool.ThreadPoolImpl$WorkerThread. It’s pretty clear that the Retained size of some of the threads increases:

D1: [YourKit screenshot]

D2: [YourKit screenshot]

For example, look at the thread p: thread-pool-1; w: 20: it was 17 MB and became 32 MB.

Let’s identify what leaks in these threads:

It looks like the culprit is the inheritableThreadLocals field (java.lang.ThreadLocal$ThreadLocalMap). More exactly, it’s com.sun.enterprise.util.InvocationManagerImpl$InvocationArray.
com.sun.enterprise.util.InvocationManagerImpl puts a ComponentInvocation there (in the preInvoke method) before each call and removes it after each call. I figured this out by looking at the code (I just searched for where InvocationManagerImpl$InvocationArray is used).
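
As a mental model, the pattern looks roughly like this (my simplified sketch with a plain ThreadLocal, not the actual GlassFish source):

import java.util.ArrayList;
import java.util.List;

class InvocationManagerSketch {
    // One list per thread; pool worker threads live as long as the pool keeps them.
    private final ThreadLocal<List<Object>> invocations =
            ThreadLocal.withInitial(ArrayList::new);

    void preInvoke(Object componentInvocation) {
        invocations.get().add(componentInvocation); // push before each call
    }

    void postInvoke() {
        List<Object> list = invocations.get();
        list.remove(list.size() - 1);               // must pop after each call
    }
}

If some code path calls preInvoke but skips postInvoke, every such call leaves one more entry pinned by the worker thread, and long-lived pool threads turn that into a leak.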

Let’s look at the code

In general, the main idea of this part of my story is to find the place in the code the memory leaks from. Below I’ll show that, as usual, an object was put into a collection and nobody remembered to remove it. So you can actually skip this part. 🙂

We use the JMSJCA adapter to work with JMS messages, and configure redelivery so that the second delivery attempt is delayed by 50 seconds (50000 ms), and on the third attempt the message is moved to another queue:

<activation-config-property>
    <activation-config-property-name>
        RedeliveryHandling
    </activation-config-property-name>

    <activation-config-property-value>
        2:50000;3:move(queue:lookup://yota.jms.aaa_sessionterm.dmq)
    </activation-config-property-value>
</activation-config-property>

Actually, the thread doesn’t sleep for 50 seconds. It goes through several loop steps, and on each step the thread sleeps for at most 5 seconds. Also, if JMSJCA sees that a delay is really long (>5 seconds), it handles it in a special branch of the code with a special handler, LongDelayExecutable (see com.stc.jmsjca.core.RedeliveryHandler.java).
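
Schematically, one delay step could look like this (my illustrative sketch with a made-up method name, not the actual JMSJCA source):

// Serves one step of the redelivery delay; returns true when onMessage()
// must be bypassed for this attempt because more of the delay remains.
boolean serveDelayStep(long remainingDelayMs) throws InterruptedException {
    Thread.sleep(Math.min(remainingDelayMs, 5000)); // at most 5 seconds at once
    return remainingDelayMs > 5000;                 // long delay: special branch
}
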
This handler sets a special flag (mOnMessageWasBypassed) for messages with long delays (delays spanning more than one sleep step, see above). And there is the following check (com.stc.jmsjca.core.Delivery):

public void afterDelivery(final DeliveryResults result,
                          ConnectionForMove connectionForMove,
                          XMessageEndpoint target,
                          Delivery.MDB mdb,
                          boolean rethrowSystemException)
                          throws Exception {
    if (!result.getOnMessageWasBypassed()) {
        …
        // Normal path: the container's afterDelivery() runs, and with it
        // InvocationManagerImpl.postInvoke(), which cleans up the ThreadLocal.
        target.getEndpoint().afterDelivery();
        …
    } else {
        // Bypassed path (long redelivery delay): afterDelivery() is NOT called,
        // so the ComponentInvocation stays in the thread's ThreadLocal map.
        …
    }
}

I.e. target.getEndpoint().afterDelivery() is not called on the bypassed path. BUT: com.sun.enterprise.util.InvocationManagerImpl.postInvoke, which would remove the ComponentInvocation from the ThreadLocals, is called from target.getEndpoint().afterDelivery(). So the ComponentInvocation isn’t removed when there are long delays (which is exactly our case).

On the next loop step (where the thread sleeps for 5 seconds), JMSJCA puts a new ComponentInvocation into the ThreadLocal and remembers nothing about the previous step. Moreover, the next loop step may be executed by another thread, which has no idea about the previous thread’s ThreadLocal…
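
To make the mechanism concrete, here is a self-contained toy demonstration of this class of leak (my own code, not JMSJCA): every “delivery step” pushes an entry into a thread-local list on whatever pool thread happens to run it, and the pop never happens.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadLocalLeakDemo {
    // Per-thread list, analogous to InvocationManagerImpl's InvocationArray.
    static final ThreadLocal<List<byte[]>> INVOCATIONS =
            ThreadLocal.withInitial(ArrayList::new);

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Callable<Void>> steps = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            steps.add(() -> {
                INVOCATIONS.get().add(new byte[10_000]); // "preInvoke": push
                return null;                             // the pop never happens
            });
        }
        pool.invokeAll(steps); // all steps done, worker threads still alive
        // At this point ~10,000 x 10 KB ~ 100 MB is pinned by the 4 worker
        // threads' ThreadLocalMaps; a heap dump taken now would show it
        // hanging off the threads, just like in D1/D2.
        pool.shutdown(); // once the threads die, that memory becomes garbage
                         // (the same effect as the "memory fall" explained below)
    }
}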

After that it makes sense to reproduce the problem, fix the bug and try to reproduce it again. I did exactly that: reproduced it, fixed it, and couldn’t reproduce it anymore. And this proves my assumption was correct. By the way, we call this a proof-of-concept patch.
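
The patch itself isn’t shown in this post; just to illustrate the idea, a fix of this kind could make sure the cleanup always runs, whichever branch is taken (my hypothetical sketch, not the actual JMSJCA change):

public void afterDelivery(final DeliveryResults result,
                          ConnectionForMove connectionForMove,
                          XMessageEndpoint target,
                          Delivery.MDB mdb,
                          boolean rethrowSystemException)
                          throws Exception {
    try {
        if (!result.getOnMessageWasBypassed()) {
            … // normal post-delivery work
        } else {
            … // bypassed-path work
        }
    } finally {
        // Always balance the container callbacks, so that
        // InvocationManagerImpl.postInvoke() cleans the ThreadLocal entry.
        target.getEndpoint().afterDelivery();
    }
}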

Is that all?

No. As you remember, we have that strange “memory fall” effect. We need to explain it, otherwise our theory has a hole. From what I said above, there are N threads, and their inheritableThreadLocals fields grow gradually. How can we get “memory falls”?

Actually, we need to look at the threads more carefully. In particular, as you remember (see above), we looked at the Object Explorer view for our threads. Let’s pay attention to the details: D1 contains 9 threads with a Retained size of more than 16 MB, while D2 contains only 4 such threads. That’s really strange, right? There should be more than 9 in D2 (or at least 9, as in D1), since we have a leak…

Let’s take a magnifying glass and look at the threads and how their sizes change. More exactly, let’s compare thread names and their Retained sizes (I exported to CSV and then parsed it with Linux utilities).
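
The same comparison can be sketched in a few lines of Java (assuming a hypothetical two-column “name,retainedBytes” CSV export; the actual YourKit export layout may differ):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class CompareDumps {
    // Loads "thread name,retained size in bytes" pairs from a CSV export.
    static Map<String, Long> load(Path csv) throws IOException {
        Map<String, Long> sizes = new HashMap<>();
        for (String line : Files.readAllLines(csv)) {
            String[] parts = line.split(",");
            if (parts.length == 2) {
                sizes.put(parts[0].trim(), Long.parseLong(parts[1].trim()));
            }
        }
        return sizes;
    }

    public static void main(String[] args) throws IOException {
        Map<String, Long> d1 = load(Path.of("d1_threads.csv"));
        Map<String, Long> d2 = load(Path.of("d2_threads.csv"));
        // Threads present in D1 but absent from D2 died between the dumps.
        d1.keySet().stream()
          .filter(name -> !d2.containsKey(name))
          .forEach(name -> System.out.println("dead: " + name));
        // For the surviving threads, print how much their retained size grew.
        d2.forEach((name, size) -> {
            Long before = d1.get(name);
            if (before != null && size > before) {
                System.out.println(name + " grew by " + (size - before) + " bytes");
            }
        });
    }
}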

We see that there are dead threads, i.e. threads that can be found only in D1, or only in D1 and D2.

Since there is a pool, new threads get created; that’s quite normal. The question is: why did they die (for us, no mention in a later memory dump means dead)?

The reason is very simple (com.sun.corba.ee.impl.orbutil.threadpool.WorkQueueImpl.java):

synchronized Work requestWork(long waitTime)
                  throws WorkerThreadNotNeededException {
    …
    long timeOutTime = System.currentTimeMillis() + waitTime;
    wait(waitTime);
    …
    if (System.currentTimeMillis() >= timeOutTime) { // <= true after a dump,
                                                     //    because dump generation
                                                     //    takes more than 2 minutes
        …
        throw new WorkerThreadNotNeededException();  // <= here the thread
                                                     //    will be killed
    }
}

OK, fine! Heap dump generation pauses the application for more than two minutes, so an idle worker’s wait() overshoots its timeout, the pool throws WorkerThreadNotNeededException, the thread dies, and its ThreadLocal map with all the accumulated ComponentInvocations becomes garbage. That’s our “memory fall”.

Is it a bug or not?

On the one hand, yes, code that leaks memory isn’t good, so it’s a bug. On the other hand:

Of course, we can interpret it a bit differently: threads aren’t supposed to sleep for more than 5 seconds (and that’s true, we see it in the code). Then again, in the tutorials/readmes mentioned above you will not find examples with delays of more than 5 seconds…

I mean, it’s not obvious that we were using this library properly.

And finally: my fix was installed on the production environment, and the memory stopped leaking.
