I’m going to tell you a story about yet another emergency in our production environment. More precisely, CPU utilization rose to 90% on some days, while it stayed at about 20% on normal days.
Of course, the first thing to check is whether there is any extra load. If the number of requests sent to this server had increased, that could explain the high CPU utilization. I verified that and found that the server had no extra work.
Where do those extra 70 percentage points (90% - 20%) come from? The server had no extra tasks. So what was it doing?
In general, this is a pretty trivial task in a test environment, when you know how to reproduce the problem. We could just take our profiler and go ahead.
The difficulty in my case was that (a) running a profiler in a production environment is perhaps not the best idea, and (b) there was no known way to reproduce the problem. It happened randomly, and I had no idea how to provoke it.
Yes, but we could gather thread dumps. So I had by-the-minute thread dumps: 24*60=1440 files per day, with approximately 600 thread stacks per file. That’ll be about 1440*600=864 000 thread stacks per day. Well, at least there will be time to think while the parsing is going on. 🙂
Ok, let’s start…