Why does high CPU utilization happen sometimes?

by perfstories

I’m going to tell you a story about one more emergency in our production environment. More precisely, CPU utilization rose to 90% on some days, while on normal days it was about 20%.

Of course, the first thing to check is whether there is some extra load. If the number of requests sent to this server had increased, that could explain the high CPU utilization. I verified this and saw that the server had no extra work.

So where did those 70% = 90% – 20% come from? The server had no extra work. What was it doing?

In general, this is a pretty trivial task in a test environment, when you know how to reproduce the problem: just take a profiler and go ahead.

The difficulty in my case was that (a) running a profiler in a production environment is, perhaps, not the best idea; (b) there was no known way to reproduce the problem. It happened randomly, and I had no idea how to provoke it.

Yes, but we could gather thread dumps. So I had per-minute thread dumps: 24*60 = 1440 files, with roughly 600 thread stacks per file. That’s about 1440*600 = 864 000 thread stacks per day. Well, at least there would be time to think while the parsing was going on. 🙂
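To give an idea of the kind of parsing involved, here is a minimal sketch of aggregating such dumps (the jstack-style file format, the directory-as-argument layout and the top-20 report are my assumptions, not the exact script I used):

import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Counts how often each distinct stack trace appears across a directory
// of jstack-style thread dumps (one dump file per minute).
public class StackFrequency {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        // args[0] is the directory containing the dump files
        try (DirectoryStream<Path> files = Files.newDirectoryStream(Paths.get(args[0]))) {
            for (Path file : files) {
                StringBuilder stack = new StringBuilder();
                for (String line : Files.readAllLines(file)) {
                    String t = line.trim();
                    if (t.startsWith("at ")) {
                        stack.append(t).append('\n');          // stack frame line
                    } else if (t.startsWith("- ")) {
                        // lock/monitor info line: keep the current stack going
                    } else if (stack.length() > 0) {
                        counts.merge(stack.toString(), 1, Integer::sum);
                        stack.setLength(0);                    // thread header or blank line ends a stack
                    }
                }
                if (stack.length() > 0) {
                    counts.merge(stack.toString(), 1, Integer::sum);
                }
            }
        }
        // print the 20 most frequent stacks
        counts.entrySet().stream()
              .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
              .limit(20)
              .forEach(e -> System.out.println(e.getValue() + " occurrences of:\n" + e.getKey()));
    }
}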

Ok, let’s start…

First, I assumed there was a thread spinning in an endless loop; over time the number of such threads grows (the server had about 8 logical processors) and CPU utilization saturates. If so, there should be some thread(s) in RUNNABLE state hanging on the same stack for a long time. That assumption turned out to be wrong.

Then I tried to evaluate, averaged over a day, which thread stacks were used most frequently across all threads. That approach gave an interesting result. The difference between days with high CPU utilization and “normal” days (in terms of thread stacks) came down to a single thread stack. I had never seen this thread stack on “normal” days, while it showed up quite frequently on emergency days. One more important thing: as CPU utilization increased, the frequency of this stack increased too. Note that, as I said, no thread was hanging on this stack; lots of threads were caught on this trace, each maybe once per day. Here it is:

at: java.util.concurrent.ConcurrentLinkedQueue.remove(ConcurrentLinkedQueue.java:346)
at: com.sun.enterprise.web.connector.grizzly.SelectorThread.returnTask(SelectorThread.java:1611)
at: com.sun.enterprise.web.connector.grizzly.DefaultReadTask.detachProcessor(DefaultReadTask.java:192)
at: com.sun.enterprise.web.connector.grizzly.ssl.SSLReadTask.detachProcessor(SSLReadTask.java:420)
at: com.sun.enterprise.web.connector.grizzly.DefaultReadTask.executeProcessorTask(DefaultReadTask.java:346)
at: com.sun.enterprise.web.connector.grizzly.ssl.SSLReadTask.process(SSLReadTask.java:440)
at: com.sun.enterprise.web.connector.grizzly.ssl.SSLReadTask.doTask(SSLReadTask.java:228)
at: com.sun.enterprise.web.connector.grizzly.TaskBase.run(TaskBase.java:265)
at: com.sun.enterprise.web.connector.grizzly.ssl.SSLWorkerThread.run(SSLWorkerThread.java:106)

Looking at the Glassfish code, I understood that this collection is used for monitoring functionality that we have never actually used:

if (isMonitoringEnabled()) {
     this.activeProcessorTasks.offer(processorTask);
}

So, hopefully the fix is ready 🙂

But the question is still open: where does high cpu utilization come from? 😉

As was already established, there is no endless loop. So perhaps the collection (a ConcurrentLinkedQueue) just grows bigger and bigger. Why? Do we put something into this queue and forget to remove it? The usage pattern is:

  1. Queue.offer();  <= add to the queue
  2. the request is handled
  3. Queue.remove(); <= remove from the queue
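To make the failure mode concrete, here is a minimal sketch of that pattern (the class and method names are mine, not Glassfish’s): each worker offers its task into a shared queue, handles the request, and removes the task afterwards. If step (2) blocks, step (3) is simply never reached and the task stays in the queue.

import java.util.concurrent.ConcurrentLinkedQueue;

// A simplified model of the offer / handle / remove pattern described above.
// Names are illustrative; this is not the actual Glassfish code.
public class ActiveTasksPattern {
    private static final ConcurrentLinkedQueue<Runnable> activeTasks =
            new ConcurrentLinkedQueue<>();

    static void handleRequest(Runnable task) {
        activeTasks.offer(task);          // (1) register the task as "active"
        try {
            task.run();                   // (2) handle the request; may block for a long time
        } finally {
            activeTasks.remove(task);     // (3) unregister; not reached while (2) is blocked
        }
    }

    public static void main(String[] args) {
        handleRequest(() -> System.out.println("request handled"));
        System.out.println("active tasks left: " + activeTasks.size());
    }
}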

Could it be that an exception in (2) makes us skip (3), so the remove is simply forgotten? Probably not: judging by the code, the try-catch blocks are pretty careful. So no leads there.

I found one more interesting fact about the days with high CPU utilization. On normal days, the number of threads handling HTTP requests sometimes dropped to zero (during night hours), i.e. all threads returned to the pool and waited there for new work. On days with high CPU utilization there were always some threads (from 1 to 6) stuck tightly in java.net.SocketInputStream.socketRead0 (no timeout; as I found out later, this went hand in hand with connectivity issues).

Well, hanging threads. This means they hang in (2), see above, and never reach (3), so they don’t remove the corresponding objects from the collection for a long time. Fine, they don’t remove them. But on regular days we have about 600 objects in that collection anyway. How could 600 + 6 objects slow down the remove operation of a ConcurrentLinkedQueue and increase CPU so much (compared to 600 objects, I mean)? It’s simply absurd!

As there were no other ideas left…

I dug into the code of ConcurrentLinkedQueue. First, I wrote a very simple test, adding logging of the loop iterations inside ConcurrentLinkedQueue’s remove operation (so I could see how many iterations remove performs internally).

In this test we add 3 elements, remove the first and the second, then add some more elements (4–9999) and remove them again (4–9999). Now we have 1 element left in the queue (the 3rd). Then we add one more element (the 10000th) and remove it. The question is: how many loop iterations will ConcurrentLinkedQueue perform for this last remove?

import java.util.concurrent.ConcurrentLinkedQueue;

public class Test {
    public static void main(String[] args) {
        ConcurrentLinkedQueue<Integer> queue = new ConcurrentLinkedQueue<Integer>();
        queue.offer(1);
        queue.offer(2);
        queue.offer(3);
        queue.remove(1);
        queue.remove(2);

        // add elements 4..9999, then remove them all;
        // only the 3rd element stays in the queue
        for (int i = 4; i < 10000; i++) {
            queue.offer(i);
        }
        for (int i = 4; i < 10000; i++) {
            queue.remove(i);
        }

        // add one more element and remove it:
        // how many iterations does this remove take?
        queue.offer(10000);
        queue.remove(10000);
    }
}

Your bet? Mine was 3 at most, since we have two elements in the queue plus its head (some node).

No. 9998!

Let me explain. There is a wrapper (Node) for each inserted object, and the queue actually consists of these wrappers. The remove operation doesn’t remove the wrapper; it only sets its content to null. So at the end of our test the queue’s internal list looks roughly like this: the live Node holding 3 near the head, then thousands of dead Nodes with null content (left over from the removed elements 4–9999), and finally the Node holding 10000 at the tail.

If these so-called null-Nodes sit at the head of the queue, there is an optimization that collapses them, but for null-Nodes further down the queue there is no such cleanup. And that’s exactly our case: the 3rd element is still alive near the head, while behind it sit the dead Nodes left from the removed elements 4–9999.
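To illustrate the mechanism, here is a toy single-threaded model (not the real JDK source): a remove that only nulls out the item and never unlinks the Node leaves the dead Node in the chain, so a later remove has to walk over all of them.

// A toy model of the old behavior: remove() only nulls the item and never
// unlinks the Node, so dead Nodes pile up behind a live one.
// This is an illustration, not the real ConcurrentLinkedQueue source.
public class ToyQueue<E> {
    private static final class Node<E> {
        E item;
        Node<E> next;
        Node(E item) { this.item = item; }
    }

    private final Node<E> head = new Node<>(null); // dummy head
    private Node<E> tail = head;

    public void offer(E e) {
        Node<E> n = new Node<>(e);
        tail.next = n;
        tail = n;
    }

    // Returns the number of nodes visited, mimicking the loop count
    // we would see inside the real remove().
    public int remove(E e) {
        int visited = 0;
        for (Node<E> p = head.next; p != null; p = p.next) {
            visited++;
            if (e.equals(p.item)) {
                p.item = null;   // "remove": null the item, keep the Node linked
                return visited;
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        ToyQueue<Integer> q = new ToyQueue<>();
        q.offer(3);                           // the element that never goes away
        for (int i = 4; i < 10000; i++) {     // offer and "remove" 4..9999
            q.offer(i);
            q.remove(i);
        }
        q.offer(10000);
        System.out.println("nodes visited: " + q.remove(10000)); // prints 9998
    }
}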

That was on JDK 1.6 u14.

JDK 1.6 u20 is much better: my test needs only 3 iterations. However, a multithreaded test still shows problems.
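By a multithreaded test I mean something along these lines (a sketch, not the exact test I ran; the thread count and iteration counts are arbitrary): one element is parked in the queue and never removed, while several threads keep offering and removing their own elements, and we watch how long the whole run takes.

import java.util.concurrent.ConcurrentLinkedQueue;

// Several threads offer and remove their own elements while one element
// stays parked in the queue. On JDK builds where remove() does not unlink
// dead Nodes, each remove has to walk past more and more garbage.
public class ConcurrentRemoveTest {
    public static void main(String[] args) throws InterruptedException {
        final ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
        queue.offer("stuck");                      // never removed, like a hung request

        Thread[] workers = new Thread[4];
        for (int t = 0; t < workers.length; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < 10000; i++) {
                    String item = id + "-" + i;
                    queue.offer(item);
                    queue.remove(item);
                }
            });
            workers[t].start();
        }
        long start = System.nanoTime();
        for (Thread w : workers) {
            w.join();
        }
        System.out.printf("4 workers finished in %d ms, queue size = %d%n",
                (System.nanoTime() - start) / 1_000_000, queue.size());
    }
}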

Ok. That’s still just a theory. Let’s verify that!

Now I understood how to reproduce it. I would run the test, then hang one of the threads (while it is in (2), see above, i.e. just before it removes the corresponding item from the queue) and keep the test running. According to my theory, CPU utilization should grow. And then I could flip that flag via JConsole (SelectorThread.isMonitoringEnabled, see above; I’m talking about the Glassfish code) to bring CPU utilization back down.

Was it reproduced? Yes!

What’s more, a few days later we hit the same situation in production, and this flag helped us resolve it…
