Assume you have a complex Java application, tens of thousands of lines of code, with all kinds of complexity: multi-threading, JDBC connection pools, stateless session beans, heavy-duty XML processing, HTTP request/response processing, huge HashMaps (tens of megabytes) being accessed, a fork-join threading model and several third-party interfaces.
Now assume that your customer is asking for a response time of 10-100 msecs per transaction, depending on the size of the (XML) response returned, and a throughput of 250 TPS on an 8-core x86 blade running Linux with a large memory, say 64 GB.
How do we achieve this in Java? What are the things that need to be kept in mind while doing the performance tuning of such an application? That is what I am going to talk about below.
Remember that the application I am talking about is not a typical e-commerce application with human interfaces and response times in seconds. I am talking about a real low-latency expectation, in milliseconds.
There is no easy way. A common-sense first step is to profile your code with any of the profilers available in the market and identify bottlenecks, such as heavy loops that are consuming time.
While this helps in the first pass, the bigger challenge you will face with Java is that functions like garbage collection (GC) play a big role in deciding the performance of your app. Unlike C or C++, Java lets programmers allocate objects with the 'new' operator and then relies on its GC mechanism to free the allocated objects, without the programmer needing to free them explicitly.
All allocated objects are maintained in a heap space and the GC operates on that. There are broadly two categories of GC:
- Single generation
- Multi-generation (typically two generations)
Single generation means one contiguous heap space; generational implies multiple spaces for different purposes. For example, a two-generation heap could have one space for temporary, short-lived objects (called the nursery or young generation) and another for objects with a long lifecycle (the tenured or old generation).
GC in this case operates only on the nursery space, until it becomes difficult to free anything there (meaning the supposedly short-lived young-generation objects are living longer) and it has to act by moving those objects into the tenured space.
Over a period of time, if the tenured space becomes full, the JVM may have to run a full GC cycle across the entire (larger) old-generation heap, resulting in a long pause while the application is running.
Apart from how the spaces are maintained, there is also the question of whether the GC pauses the application during its run or runs alongside it in a concurrent mode that does not completely pause the application. This has a big effect when your application's response time is expected to be really low: choosing a GC option that stops your application is not a good thing, especially if you are in the low-latency business.
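To make this concrete, here is a minimal sketch of how these choices can be expressed as launch options on a HotSpot-style JVM. The heap sizes, the choice of the concurrent (CMS) collector and the main class name (com.example.TransactionServer) are assumptions for illustration; your own JVM and workload may need different flags entirely.

    # Generational heap: fixed 4 GB total, 1 GB young generation (nursery),
    # old generation collected concurrently with the application (CMS).
    java -Xms4g -Xmx4g -Xmn1g \
         -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
         com.example.TransactionServer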
Another aspect of current GC algorithms is the mark-and-sweep approach. In one phase the GC marks every object in the heap that is still referenced by live threads (from their stacks) or reachable from statically allocated objects; anything left unmarked is garbage. In a second phase it sweeps the unmarked objects and frees their space. It is basically a two-step process, designed mainly to help keep application pause times short.
While all this sounds cool, it comes loaded with trouble if your program is transaction oriented and memory intensive (heavy XML processing, for example) and has a very scattered pattern of memory allocation and freeing. If, in such a situation, response time is a hard requirement (low latency, near-real-time responses) and you need to handle a large TPS on a single box (because you had to size the hardware during the bid and only later worry about how to meet it, which is typically the case :-), then the broad outlines below might help you tune your app. Again, I emphasize broad, because they cannot be taken as something that will fit your situation exactly. You may have to take the clues here, study how your system behaves and adjust the various parameters. I am only listing some of the key things you may want to look at while on the task.
1) Know how frequent your GC cycles are and, therefore, how often and for how long the application pauses.
You can try both a single-generation and a multi-generation heap option. Typically, for low-latency applications, a single generation collected concurrently alongside the application works better.
2) Check how quickly your entire heap space (determined by the -Xmx and -Xms options) is eaten up under heavy load, forcing a GC run to reclaim it. You can do this by turning on verbose GC mode and observing how frequently GC runs, how long each run takes, and the heap occupancy before and after each run. All of this tells you whether your application consumes a lot of memory per transaction, leading to more GC runs, which in turn cause higher latencies (the GC-logging sketch after this list shows the relevant options).
3) Do longer runs at higher TPS (until the CPU maxes out) to find out whether the heap fragments over time, leaving unusable space and eventually making the application unresponsive. This is really bad and should be avoided at all costs. If fragmentation is the problem, there are JVM options that make compaction run more frequently to prevent it.
4) If your CPU usage (as seen in the Linux top command) is misbehaving, i.e. erratic, showing say 50%, then 90%, then back to 60%, then 90% again, oscillating widely, it could be caused by your GC runs. This has to be brought under control, because otherwise you will not get consistent results or a clear trend.
5) As you move the TPS from low to high, you should see proportional CPU consumption, a stable heap and consistent, periodic GC runs. That is the point at which you have controlled the monster!
6) There are several things you can do in the application itself, such as not allocating huge objects and holding on to them for the entire duration of the transaction. Holding objects for the whole transaction lifetime works against high TPS and low latency, because it accumulates lots of objects in the heap for a long time, causing frequent GC runs.
7) See if you can use a pooling approach, where you allocate objects once and keep reusing them (a pooling sketch follows this list).
8) There are also several situations where you will want to monitor your threads via periodic thread dumps. These reveal quite a lot, especially under load. There could be threads blocked or spinning on locks (fat locks), which you can examine to see whether your code has bottlenecks, such as too many threads contending on a synchronized block, a Hashtable or a Vector. If the synchronization is not actually intended, it is safe to remove it and allow the parallelism (see the concurrent-map sketch after this list). There can also be contention on memory if your allocations are eating up the heap with long-lived objects. You can use memory-leak detection tools to check for lock contention during a GC cycle when Java code is trying to allocate. This can be induced indirectly by modules on the transaction-processing path yet surface in a completely different module when it tries to allocate. A third-party library misbehaving under contention can also result in poor performance.
9) Try to keep the code your threads run short and quick in a multi-threaded scenario. Remember that anything on the transaction path that runs for a long time will become a bottleneck under load and stop you from scaling (a small executor-based sketch follows this list).
10) There are several Java-language-specific good practices you can follow while coding. One good pointer is ftp://ftp.glenmccl.com/pub/free/jperf.pdf
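A few sketches to go with the points above. For point 2, this is a minimal sketch of the verbose-GC options on a HotSpot-style JVM; flag names differ between JVMs and versions, so treat them as an assumption to verify against your own JVM.

    # Log every GC run with timestamps, durations and before/after heap sizes.
    java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
         -Xloggc:gc.log \
         com.example.TransactionServer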
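For point 7, a minimal sketch of one-time allocation and reuse. The pooled object here is just a 64 KB byte buffer and the blocking acquire doubles as a throttle; both are assumptions to adapt to what your transactions actually allocate.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Pre-allocates a fixed set of 64 KB buffers so that steady-state
    // transaction processing reuses memory instead of allocating per request.
    public class BufferPool {
        private final BlockingQueue<byte[]> pool;

        public BufferPool(int size) {
            pool = new ArrayBlockingQueue<byte[]>(size);
            for (int i = 0; i < size; i++) {
                pool.offer(new byte[64 * 1024]);   // one-time allocation
            }
        }

        // Blocks if all buffers are in use, which also throttles the callers.
        public byte[] acquire() throws InterruptedException {
            return pool.take();
        }

        // Return the buffer after the transaction; its contents are simply
        // overwritten on the next use, so nothing is cleared here.
        public void release(byte[] buffer) {
            pool.offer(buffer);
        }
    }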
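For point 8, thread dumps can be taken under load with jstack <pid> or kill -3 <pid>. If they show many threads blocked on a single Hashtable or Vector and the coarse lock is not actually needed, one common change is to switch to a concurrent collection. A minimal sketch, with SessionCache as a hypothetical example class:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class SessionCache {
        // Previously: new Hashtable<String, Object>() -- every get/put took the
        // same lock, which showed up in thread dumps as many BLOCKED threads.
        // ConcurrentHashMap lets reads and most writes proceed in parallel.
        private final Map<String, Object> cache =
                new ConcurrentHashMap<String, Object>();

        public Object lookup(String key) {
            return cache.get(key);
        }

        public void store(String key, Object value) {
            cache.put(key, value);
        }
    }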
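For point 9, a minimal sketch of fanning a transaction out into short tasks on a bounded pool, with a per-piece time budget so that a slow step surfaces as a timeout instead of silently stretching the latency. The pool size of 8 and the 50 ms budget are assumptions taken from the example figures earlier in this post.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;

    public class ParallelTransactionStep {
        // One pool for the whole process, sized to the 8 cores in this example.
        private static final ExecutorService POOL = Executors.newFixedThreadPool(8);

        // Fan out short, independent pieces of one transaction and wait for
        // each with a budget, so a slow piece fails fast instead of quietly
        // inflating the overall response time.
        public List<String> process(List<Callable<String>> shortTasks) throws Exception {
            List<Future<String>> futures = new ArrayList<Future<String>>();
            for (Callable<String> task : shortTasks) {
                futures.add(POOL.submit(task));
            }
            List<String> results = new ArrayList<String>();
            for (Future<String> f : futures) {
                results.add(f.get(50, TimeUnit.MILLISECONDS)); // per-piece budget
            }
            return results;
        }
    }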