Saturday, September 1, 2012

Troubleshooting Enterprise App Performance

Have you ever deployed an application, only to find that its real-world performance is less than you expected? Maybe you didn't have time to do proper load testing, or maybe the production environment has different performance characteristics. Fear not: for some common issues we can still improve performance even when time is running out. In this blog post I will try to create a conceptual framework for troubleshooting enterprise application performance. An enterprise application in this context is essentially synonymous with an information system (as Martin Fowler puts it in his Patterns of Enterprise Application Architecture). I will use three applications that I have engineered as examples. The first, application P, is primarily used for viewing the results of complex business logic calculations; it has transactional features, but the transactional functions see about 1:10 the usage of the view/report functions. The second, application C, is primarily used for transactions. The third, application S, is connected to three external systems and three database systems. I will explain the steps I took to improve the performance of these three apps. All three are in production: application P has about 20000 potential users, application C has about 20000 users, and application S has only about 50 users. Daily data volume is extremely small for P, small for C, and medium for S.

Performance Hypotheses

There is a rule of thumb for improving performance, namely the Pareto rule: 80% of application time is spent in 20% of the system. That means our first objective is to identify the system bottlenecks. Note the plural: there might be more than one bottleneck, and removing all of them might not be feasible in terms of time or cost.
For simple systems, such as an app that only connects to a single database, we might form one hypothesis that the bottleneck is in the database, and another that the bottleneck is in the application server.
For other systems it is not that simple, such as application S, which has more than one external system interface and more than one connected database.
So, as the first step in the framework, we enumerate all systems involved in the operation of the application: all databases, all services, pools, etc. Each system is one bottleneck candidate.

Step 1. Enumerate all systems involved; for each system, create a hypothesis that it is the one causing the performance bottleneck

I have three applications as examples. Let me write the hypotheses for each.
Application P is connected only to an Oracle database and an LDAP directory, so the hypotheses are:
P1. Bottleneck in the PHP Web server
P2. Bottleneck in the Oracle DB
P3. Bottleneck in the LDAP directory
P4. Bottleneck in interconnection between systems (networking)
Application C is connected only to a MySQL database that is replicated back to back. The hypotheses are:
C1. Bottleneck in the PHP web server
C2. Bottleneck in the MySQL DB
C3. Bottleneck in the LDAP directory
C4. Bottleneck in interconnection between systems (networking)
Application S is connected to an SAP ERP system, the Joget workflow service, two MySQL databases, and one Oracle database. The hypotheses are:
S1. Bottleneck in the PHP web server
S2. Bottleneck in the SAP ERP system
S3. Bottleneck in Joget workflow service
S4. Bottleneck in MySQL A
S5. Bottleneck in MySQL B
S6. Bottleneck in Oracle DB
S7. Bottleneck in interconnection between systems (networking)
Each host/system has its own performance characteristics that might contribute to an application bottleneck. In general, for each host we have:
  1. CPU usage. A CPU bottleneck might occur if CPU usage is above 70 or 80%.
  2. Memory usage. Memory can become the bottleneck if swap usage is > 0 AND swap activity per second is high.
  3. Disk usage -> this is much less related to performance. If free space reaches 0, the host stops working in some respect.
  4. Disk IO activity -> this is more likely related to performance. 20 to 40% IO wait already indicates some kind of disk IO bottleneck.
  5. Network usage -> in some conditions this can impact performance.
For database hosts, in addition to these five characteristics, we have:
  • Query execution bottlenecks. High disk IO is an indication of such a bottleneck.
  • Query parsing bottlenecks. High CPU usage is an indication of such a bottleneck.
Each database system might have tools, built in or add-on, that help us detect bottlenecks.
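The host-level thresholds above can be checked mechanically. Here is a rough sketch in Python (not the tooling I used) that applies these rules of thumb to one captured vmstat line; the sample numbers are made up for illustration:

```python
# Rough sketch: apply the rule-of-thumb thresholds to one vmstat data line.
# Field order follows vmstat's default output; the sample values are invented.
SAMPLE = "2 0 10240 81920 12288 409600 5 12 120 300 500 700 85 5 5 5 0"

def suspect_bottlenecks(vmstat_line):
    f = [int(x) for x in vmstat_line.split()]
    si, so = f[6], f[7]               # swap-in / swap-out per second
    us, sy, wa = f[12], f[13], f[15]  # user CPU, system CPU, IO wait
    suspects = []
    if us + sy > 70:                  # CPU bottleneck: usage above 70-80%
        suspects.append("cpu")
    if si + so > 0:                   # memory bottleneck: active swapping
        suspects.append("memory")
    if wa >= 20:                      # disk IO bottleneck: 20-40% IO wait
        suspects.append("disk io")
    return suspects

print(suspect_bottlenecks(SAMPLE))  # -> ['cpu', 'memory']
```

In practice you would feed this from `vmstat 1` output on each candidate host during a load test, and treat the result only as a hint to investigate further.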

Diagnosis

From each hypothesis we can derive tests or checks that strengthen or weaken it. For example: run load tests repeatedly while watching SQL queries from a DBA session. If there is a query bottleneck, it will show up as the SQL text appearing most often in database session monitoring. If the web server CPU is high, the bottleneck is more likely in application script parsing.

Step 2. List more specific performance bottlenecks and how we can test or check for each one.

I could not check every hypothesis, because of limits on what I was allowed to do in each system. Let me list some diagnosis steps I have done:
P1 check: run 'top' on the Linux web server to check for a CPU bottleneck while load testing the application with Apache Bench ('ab'). Watch for high CPU (%us and %sy), memory use, and IO wait (%wa). I had to change the application to skip authentication to make the test easier.
P2 check: monitor database sessions while the load test runs repeatedly. The SQL query shown most often is identified as the query bottleneck.
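The counting part of this check can be sketched in a few lines of Python. The captured queries below are hypothetical stand-ins for what repeated session-monitoring snapshots would return; the trick is to normalize literal values so that the same query shape counts as one:

```python
import re
from collections import Counter

# Hypothetical queries captured from repeated session-monitoring snapshots.
snapshots = [
    "SELECT * FROM report WHERE period_id = 7",
    "SELECT * FROM report WHERE period_id = 12",
    "UPDATE audit SET seen = 1 WHERE id = 4",
    "SELECT * FROM report WHERE period_id = 99",
]

def normalize(sql):
    # Replace literals so queries differing only in values count as one shape.
    sql = re.sub(r"\b\d+\b", "?", sql)
    return re.sub(r"'[^']*'", "?", sql)

counts = Counter(normalize(q) for q in snapshots)
top_query, top_count = counts.most_common(1)[0]
print(top_query, top_count)  # the most frequently seen query shape
```

The query shape at the top of the list is the query-bottleneck candidate to examine first.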
For application C, it is similar:
C1 check: use the 'top' command. Because the web server and database are on the same system, watch for high CPU in the php-cgi and mysqld processes. In other configurations you might watch the apache processes instead.
C2 check a: connect as MySQL's root user and run SHOW FULL PROCESSLIST while the application is being used by more than 100 users. The queries appearing most often were identified, but could not be acted upon: there were too many of them, and no query-specific improvement could be found just by examining them.
C2 check b: enumerate the pages being used, and configure the PHP framework to display timing results and the SQL queries executed on each page. Thus the bottleneck was identified: the pages were not written in an optimal way, inefficiently issuing many queries for simple displays. One part of a page executes a query that fetches a lot of data that is never used at all. Another part fetches a lot of data just to determine a simple fact.
For application S, it is a lot more complicated. Because there are so few users, performance indicators will be weak (users will perceive terrible performance, but by the time we go hunting for bottlenecks the indicators will already have returned to normal), so we must simulate a larger load to get better readings. In the past we used testing tools like The Grinder or JMeter to simulate large loads, but I must confess that I simply refreshed my Firefox page repeatedly to simulate load during this application's troubleshooting.
S1 check: use the top command on the web server. With APC activated in the PHP web server, CPU usage is low enough, so this bottleneck hypothesis is weakened.
S2 check: on the SAP server, OS-level measurements are either off limits to me or not informative enough, so I measured the time of each SAP call instead. This was done with PHP profiling functions marking the start and end of each SAP RFC execution; the profiling functions are provided by the PHP framework we use (Yii::beginProfile and Yii::endProfile). Activating the profile component in the PHP application's configuration shows the profiling results at the bottom of each application page.
S3 check: Joget workflow is a Tomcat-based Java application. Running 'top' on the Joget server and on Joget's database server shows a lot of CPU activity from both the MySQL and Java processes, so the hypothesis is strengthened. Further bottleneck identification was done with jvisualvm. It was a hassle to set up (it has a bug that prevented me from monitoring remote apps); in short, I used its thread dump functionality repeatedly to identify which part of the Joget workflow service was the bottleneck.
For S4 and S6, no checks were done, but conceptually the checks I did in P2 and C2 could be applied.
For S5, I ran SHOW FULL PROCESSLIST repeatedly and found that some queries indeed were query bottlenecks, appearing often in the list.
For P4, C4, and S7, I haven't done any checks yet. Conceptually we could use iperf to test end-to-end network bandwidth and packet loss. We could also plot network traffic with Munin on each host to determine whether traffic saturation occurs. Better yet, the network traffic graphs of the switches involved could help strengthen or weaken these network-related hypotheses.
I summarize these actions as:

Step 3. Check each bottleneck hypothesis. If the check confirms (strengthens) the hypothesis, break the bottleneck down into smaller hypotheses about smaller parts of the system. Check each smaller hypothesis; if it is found to be true, break it down further into smaller parts if possible.

Step 3-A. The check-and-breakdown step should be applied recursively (repeatedly, for you non-programmers) if possible.

Step 3-B. A database bottleneck can be broken down into:

  • SQL query (or queries). A frequently appearing SQL query is a clue that it is a bottleneck.

  • SQL parse or SQL execution. An SQL parse problem can be fixed by changing the application to use query binding.

  • Database capability problem. If almost all SQL queries have a problem, the database capability itself is the bottleneck. This is further broken down into: A) database CPU bottleneck, B) database IO bottleneck, C) other database problems.
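To illustrate the query-binding point, here is a minimal sketch using Python's built-in sqlite3 (schema and values are made up; the same idea applies to Oracle or MySQL drivers): with binding, the database sees one statement shape it can parse once, instead of a fresh SQL text for every value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "ana"), (2, "budi"), (3, "cita")])

# Without binding: a new SQL text per value forces a fresh parse each time.
row = conn.execute("SELECT name FROM users WHERE id = 2").fetchone()

# With binding: one statement shape, values bound at execution time.
row = conn.execute("SELECT name FROM users WHERE id = ?", (2,)).fetchone()
print(row[0])  # -> budi
```

Binding also protects against SQL injection, which is a nice side effect of fixing the parse bottleneck.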

Step 3-C. For bottlenecks in the application script, identify which part of the application is the bigger bottleneck using sampling or profiling techniques (choose one of the two approaches). Run repeated load tests in parallel with the execution sampling/profiling. If using manual profiling, examine the timing results and break the larger part down further.

Sampling essentially means finding out which part of the application logic is running at a given moment. The part that shows up most often in the stack traces is identified as a bottleneck.
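The sampling idea can be demonstrated with a toy Python sketch (a self-contained demo, not production tooling): sample a worker thread's stack every few milliseconds and count which function is on top most often.

```python
import sys
import threading
import time
from collections import Counter

def hot_path():
    # Deliberately CPU-heavy: the "bottleneck" we want sampling to find.
    total = 0
    for i in range(10_000_000):
        total += i
    return total

samples = Counter()
worker = threading.Thread(target=hot_path)
worker.start()
while worker.is_alive():
    frame = sys._current_frames().get(worker.ident)
    if frame is not None:
        samples[frame.f_code.co_name] += 1  # function on top of the stack
    time.sleep(0.005)
worker.join()

print(samples.most_common(1))  # hot_path should dominate the samples
```

Real tools (GDB, jvisualvm thread dumps, the PHP stack-dump trick below) do exactly this at the process level, just with less convenience.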
Profiling essentially means measuring the time taken by some part of the application logic. If we find, for example, that one page is slow, we apply profiling functions (changing the application) to measure the time spent in accesses to external systems as well as in parts of our own application logic. If the result is that, of application logic parts A, B, and C, part C takes 90% of the time, then we should profile the parts of C, breaking it into C1, C2, and C3. Repeatedly checking the results and applying further profiling to the largest part will identify the smallest part of the application responsible for the largest share of execution time.
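The narrowing loop just described can be written out explicitly (all timing numbers below are made up for illustration): measure the parts, pick the largest, then add profiling calls inside that part only and repeat.

```python
def largest_part(timings):
    # Pick the part responsible for the biggest share of execution time.
    return max(timings, key=timings.get)

# First pass: page time split across logic parts A, B, C (made-up seconds).
page = {"A": 0.3, "B": 0.4, "C": 6.3}
worst = largest_part(page)          # 'C' takes roughly 90% of the time

# Second pass: profiling calls added inside C only, splitting it further.
part_c = {"C1": 0.2, "C2": 5.8, "C3": 0.3}
print(worst, largest_part(part_c))  # -> C C2
```

Stopping when the largest part is small enough to read and fix keeps the instrumentation effort proportional to the payoff.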
For applications in C/C++, execution sampling can be done with GDB (the GNU debugger). The application should be compiled with debug information enabled. Take 3 to 5 consecutive thread dumps, 2 seconds apart, using the 'thread apply all backtrace' command. In a single-threaded multi-process configuration, execute the backtrace command in each and every process.
For applications in Java, execution sampling can be done using the thread dump functionality of the jvisualvm tool. Automated Java profiling tools could also be used, but because of the performance degradation associated with automated profiling techniques I seldom use them.
For applications in PHP, one of my previous blog posts shows how to take stack dumps of running PHP pages. Profiling can be done by modifying the application to call profiling functions. This will degrade performance, but selective implementation keeps the impact negligible.
The reader is encouraged to find out what kind of method works for other languages.

Performance Improvement Techniques

After specific bottlenecks are found, we need to try to improve the situation by various means. Because of the many factors involved, not every bottleneck can be removed.

Step 4. For each positively identified bottleneck, try to improve performance by some means

For application P, I improved performance by adding indexes to the tables in use, after examining which columns appear in the bottleneck queries. Note that adding unnecessary indexes will slow down data manipulation (inserts, deletes, updates).
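The indexing fix can be illustrated with Python's built-in sqlite3 (the table and column names are invented, and application P actually used Oracle, where the same principle applies): find the column the bottleneck query filters on, index it, and confirm that the plan changes from a full scan to an index search.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calc_result (unit_id INTEGER, period TEXT, value REAL)")
conn.executemany("INSERT INTO calc_result VALUES (?, ?, ?)",
                 [(i % 100, "2012-08", i * 1.5) for i in range(1000)])

query = "SELECT value FROM calc_result WHERE unit_id = 42"

def plan(sql):
    # Ask the engine how it will run the query (full scan vs. index search).
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(str(r) for r in rows)

before = plan(query)   # full table scan: no index on unit_id yet

# The bottleneck query filters on unit_id, so index that column.
conn.execute("CREATE INDEX idx_calc_unit ON calc_result (unit_id)")
after = plan(query)    # now searches via idx_calc_unit

print(before)
print(after)
```

On Oracle the equivalent check is EXPLAIN PLAN; the point is to verify the plan actually changed rather than trusting that the index helped.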
For application C, the improvement was rewriting the PHP logic in the slow pages found during testing. Because we use Yii's ActiveRecord classes, adding a few with() calls to some find() calls in the controllers produces automatic joins that reduce the number of queries sent to the database. To some other find() calls we add more condition criteria so that only the rows we need are retrieved.
In another part of the application, a flag column was added to one of the tables as a substitute for querying another table. This is performance-improving denormalization: we add another place to store data that is normally obtained by aggregating another table, change every activity that modifies the source table so that it also updates the flag, and replace the aggregation query with a simple query on the table holding the flag. The logic is that we trade time saved on the aggregation query against slightly increased transaction time on the source table.
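A minimal sqlite3 sketch of this denormalization (the posts/comments schema is hypothetical, not application C's actual tables): a comment_count column replaces a COUNT(*) aggregation, and every write to the source table now also maintains the flag.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, comment_count INTEGER DEFAULT 0)")
conn.execute("CREATE TABLE comments (post_id INTEGER, body TEXT)")
conn.execute("INSERT INTO posts (id) VALUES (1)")

def add_comment(post_id, body):
    # Every write to the source table must also maintain the denormalized flag.
    conn.execute("INSERT INTO comments VALUES (?, ?)", (post_id, body))
    conn.execute("UPDATE posts SET comment_count = comment_count + 1 WHERE id = ?",
                 (post_id,))

add_comment(1, "first")
add_comment(1, "second")

# Before: aggregate over the source table on every page view.
slow = conn.execute("SELECT COUNT(*) FROM comments WHERE post_id = 1").fetchone()[0]
# After: read the precomputed flag with a simple lookup.
fast = conn.execute("SELECT comment_count FROM posts WHERE id = 1").fetchone()[0]
print(slow, fast)  # -> 2 2
```

The risk is drift: if any write path forgets to update the flag, the two values disagree, which is why every modifying activity must go through the maintained path (or a trigger).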
For application S, it is much more complex. Examining the S2 profiling results, we found that excessive SAP RFC calls were being made for data that changes slowly. The step taken was to add a caching layer for such RFC calls. Yii's cache component turned out to be a good solution for this problem.
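The caching idea can be sketched like this in Python (Yii's cache component is not reproduced; the function names and TTL are hypothetical): cache the slow external call keyed by its arguments, with a time-to-live that slowly changing data can tolerate.

```python
import time

_cache = {}
TTL_SECONDS = 300          # slowly changing data tolerates a 5-minute TTL
calls_made = 0

def fetch_material(material_id):
    # Stand-in for a slow SAP RFC call.
    global calls_made
    calls_made += 1
    return {"id": material_id, "name": f"material-{material_id}"}

def fetch_material_cached(material_id):
    key = ("material", material_id)
    hit = _cache.get(key)
    if hit is not None and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]                      # fresh enough: skip the RFC call
    value = fetch_material(material_id)
    _cache[key] = (time.time(), value)
    return value

fetch_material_cached(10)
fetch_material_cached(10)  # served from cache: no second RFC call
print(calls_made)  # -> 1
```

The TTL is the knob: it trades staleness against the number of external calls, so it should match how often the upstream data actually changes.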
The S3 bottleneck was shown to exist, and the thread dumps showed that SLA (service level agreement) calculation was being done by Joget when we only needed general workflow activity information. Our solution was to modify the Joget workflow service to accept an optional parameter called nosla that skips the SLA calculation. The PHP application was also modified to pass nosla=1 on specific calls to Joget.
While the S4 bottleneck hypothesis was found to be false, the S5 hypothesis was shown to be strong.
Analyzing the query bottleneck from the S5 checks, it seems MySQL 5.1 has trouble optimizing an IN query that has only one element in the IN clause. The solution was to replace MySQL 5.1 with MariaDB 5.3.

These explanations are summarized as follows:

Step 4-A. Consider upgrading your database server if the bottleneck is shown to be fixed in a newer version

Step 4-B. Consider using caches. Data caches can reduce calls to external systems for operations that don't manipulate data. PHP opcode caches speed up PHP parsing.

Step 4-C. Consider changing your data schema; denormalize data if necessary.

Step 4-D. Consider changing the application logic implementation to access less data. Fetching fewer rows has more impact than fetching fewer columns. Avoid unnecessary queries on BLOB columns when we only need data from other columns.

Step 4-E. Consider creating database indexes to improve select query performance

For completeness, I include other ideas I have applied when optimizing other applications not discussed in this post:

Step 4-F. Consider implementing part of the application logic in a faster language, such as Java or PL/SQL, when a CPU bottleneck in the web app is the issue. But try to improve the algorithm first before porting logic to another language.

Step 4-G. Consider increasing the memory pool configuration in the database server if the buffer hit ratio is small (for example, less than 95%).

Step 4-H. Consider increasing the number of parallel processes in the web server if there are a large number of concurrent users and memory use is below 70%. This should only be done if we are certain there is no database bottleneck and no CPU bottleneck in the web server. Extrapolate memory usage correctly (take the maximum memory used by each php-cgi or httpd process and multiply by the maximum number of processes expected). Remember that the OS automatically uses free memory as disk cache, and too little free memory will degrade IO performance.
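The extrapolation in Step 4-H can be written out explicitly (all numbers below are made up): from the per-process peak memory, a target memory ceiling, and a reserve left for the OS and its disk cache, compute how many worker processes are safe.

```python
def max_workers(total_mb, per_process_peak_mb, reserve_mb, ceiling=0.7):
    # Keep worker memory under `ceiling` of RAM, after the OS/disk-cache reserve.
    budget = total_mb * ceiling - reserve_mb
    return int(budget // per_process_peak_mb)

# Example: 8 GB host, php-cgi peaks at 64 MB, reserve 1 GB for OS + disk cache.
print(max_workers(8192, 64, 1024))  # -> 73
```

Using the observed peak (not average) per-process memory is what keeps the host out of swap at the worst moment.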

And some ideas that haven't been tested yet:

Step 4-I. Consider adding more disks and enabling striping with large blocks if there is an IO bottleneck. Consider adding more CPUs if the CPU bottleneck cannot be reduced in other ways.


Summary

In this blog post I described the steps I took to identify performance bottlenecks and the steps I took to improve performance. I also tried to distill a troubleshooting framework as a four-step process:

Step 1. Enumerate all systems involved; for each system, create a hypothesis that it is the one causing the performance bottleneck

Step 2. List more specific performance bottlenecks and how we can test or check for each one.

Step 3. Check each bottleneck hypothesis. If the check confirms (strengthens) the hypothesis, break the bottleneck down into smaller hypotheses about smaller parts of the system. Check each smaller hypothesis; if it is found to be true, break it down further into smaller parts if possible.

Step 4. For each positively identified bottleneck, try to improve performance by some means

The practical implementation of each step was explained for each of the three example applications chosen for this blog post. I hope these can serve as guidelines for others (and myself) when the need to troubleshoot application performance arises.
