Friday, January 22, 2010

Fixing performance issues in your application

I was lucky enough to attend a presentation given by Kirk Pepperdine last Wednesday. I won't introduce Kirk (you can check his résumé here); suffice it to say that he is well known as a Java performance guru.

The presentation had two parts: the first was a Q & A session, and the second was a live debugging session on an application that had been deliberately slowed down by introducing bugs into it.

Preamble
Let me talk about the application first: the people who invited Kirk had prepared the application (the well-known and useless Pet Clinic) by adding some of the anti-patterns they have met while consulting for many of their clients. Kirk had no clue about the bugs that had been injected. (AFAIK, the session was an 'extra', provided as a bonus following an internal presentation they paid for. Thanks to Xebia for sharing this presentation with external people.)


Q&A
We were asked to submit questions when we registered, and Kirk answered them extensively. Here are some of the questions and answers I remember:

Q : Which GC should we use ?
A : The one that works. Usually, just focus on your application; you won't need to pick a specific GC.

Q : What do you think about other languages like Groovy or Scala, with respect to performance ?
A : It's irrelevant. Picking a language to develop your application should not be a matter of performance only. 'Whatever works' is the way to go. If you want to build an application fast, and it's not expected to be heavily loaded, then even PHP is a good choice.

Q : What tools do you use to check for performance bottlenecks ?
A : A few : a system monitor, HP-JMeter, and VisualVM.

Q : How can you best write an application which depends heavily on concurrent code ?
A : Don't use any synchronization. There are ways to avoid synchronization, based on state machine theory. (pointers needed here ...)

Q : What is the ratio of GC problems you have to deal with when working for a client ?
A : Around 40%. Given that I'm the last hope for many of my clients, that may not be a representative number. Usually, people successfully fix the easier issues themselves.

Q : Do you read the code when you start tracking a performance issue ?
A : Never. I'm not a coder, and I don't have time to go through thousands of lines of code. I just spot the place in the code that has the problem.

Q : Managers don't let me add traces to the application in production… What should I tell them ?
A : Managers know the difference between a slow application and a dead application. Do what you have to do, or find another client. (in other words : you don't cure cancer with aspirin...)

Q : Which profiler do you use, or prefer ?
A : YourKit: it's simple and efficient.

It was the most interesting presentation I have seen in years. I actually learned things in an area where I thought I was competent…

Live demo

The crux of this part was the process Kirk followed to pinpoint the problems in the code.

Step 1

First, he asked for a baseline to work from. That is, you should have a scenario that reproduces the kind of performance issues a real client actually perceives. Improving an application that is already perceived as working is a waste of time, energy and money. Without a baseline, you also have no way to check that you have improved the application. Last but not least, define your expectations; otherwise, you won't meet them! So here, the team had defined a JMeter test and the expected response time for each page.
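To make "define your expectations" a bit more concrete, here is a minimal sketch in Java of checking measured response times against a target once the load test has produced timings. Everything in it is an assumption for illustration: the class name, the choice of the 90th percentile, the sample timings and the 2000 ms target are invented, and none of them come from the presentation.

```java
import java.util.Arrays;

/** Hypothetical helper: compares measured response times with an expected target. */
final class BaselineCheck {

    /** Arithmetic mean of the samples, in milliseconds. */
    static double mean(long[] samplesMs) {
        return Arrays.stream(samplesMs).average().orElse(Double.NaN);
    }

    /** Nearest-rank percentile (p in 1..100). Assumes at least one sample. */
    static long percentile(long[] samplesMs, int p) {
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** True when the 90th percentile is within the expected response time. */
    static boolean meetsExpectation(long[] samplesMs, long expectedMs) {
        return percentile(samplesMs, 90) <= expectedMs;
    }

    public static void main(String[] args) {
        long[] pageTimingsMs = {9500, 10200, 9800, 11000, 10100}; // made-up timings
        long expectedMs = 2000;                                    // made-up target
        System.out.printf("mean=%.0f ms, p90=%d ms, meets target=%b%n",
                mean(pageTimingsMs), percentile(pageTimingsMs, 90),
                meetsExpectation(pageTimingsMs, expectedMs));
    }
}
```

Running the same check before and after each fix is what turns the baseline into a yardstick rather than a one-off measurement.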

Step 2

Second, run the baseline scenario and measure the response time, plus a few other counters:
- CPU (user and system)

That's it, nothing more. At this point, the code had not even been looked at. The only thing Kirk did was remove all the JVM tuning, such as the minimum and maximum memory sizes and every other premature setting.

The rationale is that at this point you have no idea whether those parameters help, but they certainly have an impact, and are probably polluting the results.

Looking at the CPU consumption and response time (90% CPU, around 5% system), with an average of 10s per page, it was clear the application had a performance issue, but there was no clue yet about what was going on.

Step 3

Then he checked how the GC was behaving. He added some JVM options to generate GC traces, ran the application for a few minutes, then checked the logs ("You have to be patient! Memory leaks may take a while to be noticed.").

A quick look at the metrics showed that the GC was eating 13% of the whole CPU. Way too much.
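I did not note which options or tool produced that 13% figure. As a rough stand-in, the sketch below uses the standard java.lang.management API to estimate how much of the JVM's uptime has been spent in garbage collection. This is an approximation under stated assumptions: it measures accumulated collection time against wall-clock uptime, not CPU share, so it will not match a figure derived from GC logs exactly. The class and method names are mine.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: rough GC overhead as a share of JVM uptime. */
final class GcOverhead {

    /** Percentage of uptime spent collecting; 0 when uptime is not positive. */
    static double overheadPercent(long gcTimeMs, long uptimeMs) {
        if (uptimeMs <= 0) {
            return 0.0;
        }
        return 100.0 * gcTimeMs / uptimeMs;
    }

    /** Sums the accumulated collection time reported by every collector in this JVM. */
    static long totalGcTimeMs() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("GC overhead: %.1f%% of uptime%n",
                overheadPercent(totalGcTimeMs(), uptimeMs));
    }
}
```

In a real application this would be sampled periodically under load; a single reading right after startup says very little.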

Step 4

Kirk then decided to connect to the running application using VisualVM. The idea was to check how objects were being allocated. After a few minutes of testing, the allocated objects graph showed a linear increase over time, which means a memory leak.

Finding the memory leak was a matter of minutes: find an application object (no need to check a Java object like byte[] or String: "Java collection objects don't leak…"). The key for Kirk is the number of generations an object has survived: the higher this number, the more likely the object is leaking. This was very new to me.

As a side note, he also said that many existing tools don't provide this generation count. They base leak detection on the delta between snapshots, which is not convenient.

Then you can find where the object was allocated by checking the stack trace, and now, at last, look at the code.

At this point, the important lesson is: only look at the code once you know which method has the problem.

(The application had another memory leak, which he also found using the very same approach.)

Another lesson: he asked for the caches to be removed from the code, instead of blindly guessing what was wrong with them (they were leaking). His motto was: "Why would you optimize your code by adding a cache when you have no idea what's going wrong in your code?"

Step 5

Once this initial problem was fixed, he re-ran the test and saw that the CPU would not go above 50%. That is very wrong when the response time is still awful. In this case, the system CPU was high (the split between system and user should be around 5-10% / 90-95%).

What does it mean? Contention. How do you find where the contention is? Easy: generate a thread dump.

No fancy profiler, no long source reading, just a thread dump.

It immediately showed that only two threads were handling the 50 concurrent clients hitting the application.
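A thread dump is normally taken from outside the process, with jstack or kill -3 on the JVM's pid. To show the kind of information Kirk was reading from it, here is a small in-process sketch in Java that counts live threads by state. The class name is mine, and the "http-" prefix in main is only a guess at how the container names its request threads; adjust it to whatever your own dump shows.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical helper: summarises live threads by state, like skimming a thread dump. */
final class ThreadStateSummary {

    /** Counts live threads whose name starts with the given prefix, grouped by state. */
    static Map<Thread.State, Integer> countByState(String namePrefix) {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.getName().startsWith(namePrefix)) {
                counts.merge(t.getState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // "http-" is an assumed worker-thread prefix, not something from the demo.
        System.out.println("request threads: " + countByState("http-"));
        System.out.println("all threads:     " + countByState(""));
    }
}
```

If the request-thread total is tiny compared with the number of concurrent clients, as it was here, the bottleneck is the pool size rather than the code.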

A quick tuning of Tomcat (the number of threads accepting requests), and we moved on to the next step.

Step 6

One last measurement now showed much better performance, but with very high system CPU usage: around 20%.

Same action here: take a thread dump, look at the blocked threads, and go to the portion of code where the thread was waiting. A bad Thread.sleep(100) was found in the code.
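In a thread dump, a stray sleep shows up as a thread in TIMED_WAITING with Thread.sleep near the top of its stack. The sketch below does the same search programmatically, reporting for each sleeping thread the first frame outside java.lang.Thread, that is, the code that called sleep. It is only an illustration: the class name, method name and demo thread are mine, not something shown during the session.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: lists the call sites of threads currently inside Thread.sleep. */
final class SleepFinder {

    private static boolean inThreadClass(StackTraceElement frame) {
        return frame.getClassName().equals("java.lang.Thread");
    }

    /** One entry per sleeping thread: "threadName <- calling frame". */
    static List<String> sleepCallSites() {
        List<String> sites = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            StackTraceElement[] stack = e.getValue();
            for (int i = 0; i < stack.length; i++) {
                if (inThreadClass(stack[i]) && stack[i].getMethodName().equals("sleep")) {
                    // Skip any further java.lang.Thread frames to reach the real caller.
                    int caller = i + 1;
                    while (caller < stack.length && inThreadClass(stack[caller])) {
                        caller++;
                    }
                    if (caller < stack.length) {
                        sites.add(e.getKey().getName() + " <- " + stack[caller]);
                    }
                    break;
                }
            }
        }
        return sites;
    }

    public static void main(String[] args) {
        Thread sleeper = new Thread(() -> {
            try {
                Thread.sleep(2000);
            } catch (InterruptedException ignored) {
                // demo thread: nothing to clean up
            }
        }, "demo-sleeper");
        sleeper.setDaemon(true);
        sleeper.start();
        // Wait until the demo thread is actually sleeping before taking the snapshot.
        while (sleeper.isAlive() && sleeper.getState() != Thread.State.TIMED_WAITING) {
            Thread.yield();
        }
        sleepCallSites().forEach(System.out::println);
    }
}
```

Not every sleep is a bug (schedulers and pollers sleep legitimately), so the list is a set of places to look at, not a verdict.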

And that was the end of the demonstration: 2 hours to fix bugs that would have taken most of us days and days!

Conclusion

In two hours, he made the application run way faster, simply by using a couple of tools and without reading the code.

Impressive.

Thanks to Kirk Pepperdine, Cyrille Le Clerc and Xebia !

Follow up
I have forgotten a few things:
  • at some point after step 4, GC went up to 65%. Kirk suspected that some part of the code was calling the GC explicitly. You bet!
  • after the presentation, Kirk said that the very first step really is to catch all the GC problems, as they will probably hide other problems.

3 comments:

Unknown said…

Thanks for sharing this! Very interesting - good to learn from an individual with substantial experience debugging application performance problems.

Ashish said…

Very interesting.. would be great if you add more stuff related to the State Machine.. and code without synchronization

Cyrille Le Clerc said…

Thank you Emmanuel,

I have been really impressed by Kirk's methodology to troubleshoot performance issues. He actually didn't know the issues we added in the application.

Being the 'moderator', I was a bit stressed by the duration of the live troubleshooting session but Kirk found the issues one after the other giving clear explanations of his methodology and making jokes.
The audience was very pleased, even though we finished after 11 pm.

We will try our best to write a detailed blog post (in French) about this live troubleshooting session.

Thank you again,

Cyrille (Xebia)