How to Diagnose a Slow or Freezing Web Application

The freezing app problem

When dealing with a freezing application, most managers (even those with an IT background) will say it is a capacity problem. Capacity is usually equated with the number of processor cores, the gigabytes of memory available to the system, and the amount of bandwidth available. The reality is not that simple.
When an application freezes or slows down, it is in the best interest of the application team, whether operations, development, DevOps, or the service team (as it is called in ITIL), to find out how to make it accessible again and, in management terms, to increase the system's capacity.
The first step is to determine the actual bottleneck, which is traditionally done by checking memory usage, CPU usage, and network link utilization. But I am not trying to make this a traditional post, so here I will describe other things that can cause a slowdown or a freezing application.

The APM way of things

Today we have APM (Application Performance Management) tools, which should be able to pinpoint the most common causes of application slowdown, namely:
1) database slowdown caused by certain non-optimal queries
2) external service timeout or slowdown
3) inefficient loops in the application flow

So, in order to diagnose these three kinds of causes quickly, install your trusted commercial APM agent. Or not, because sometimes the budget must be preserved at the expense of man-hours and customer experience :D

Another alternative is to use an open-source APM (such as Elastic APM), but it may take a week or two to set up, in contrast to about a day for a commercial APM. Or, if you use something like PHP-FPM, there is a slow log that can be configured to dump the stack trace of any request that runs longer than a threshold, say 10 seconds.
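For example, the PHP-FPM slow log is enabled per pool; a minimal sketch (the log path and the 10-second threshold are illustrative):

; in /etc/php/7.4/fpm/pool.d/www.conf or /etc/php-fpm.d/www.conf
slowlog = /var/log/php-fpm/www-slow.log
; any request running longer than this gets its stack trace dumped to the slow log:
request_slowlog_timeout = 10s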
In this post I would like to focus on some common causes of freezes or slowdowns that might go undetected by APMs (or otherwise).

Cause No 1: Missing indexes or a freezing database
To diagnose this, I usually connect to the database (and if I cannot connect, we have already found the first cause), then check the running SQL sessions for inefficient queries. A simple SHOW PROCESSLIST in MySQL should be adequate for lightly loaded apps. In Oracle I would query V$SESSION or GV$SESSION.

Problem sessions are sessions in the Query state with a large Time value. For such a session, get the full SQL text using SHOW FULL PROCESSLIST, then run EXPLAIN on the query.
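A sketch of that flow in MySQL (the table, column, and index names are made up for illustration):

mysql> SHOW FULL PROCESSLIST;
-- suppose a session has been in the Query state for 120 seconds running this statement:
mysql> EXPLAIN SELECT * FROM orders WHERE customer_email = 'x@example.com';
-- type = ALL with a large rows estimate means a full table scan, i.e. a likely missing index:
mysql> CREATE INDEX idx_orders_customer_email ON orders (customer_email);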

Cause No 2: External service availability problem
Sometimes your application needs to call an external API service, and DNS resolution works but the API service is not accepting connections (or the network path to the service is too congested, or something else is wrong). How to detect this? Simply run netstat -an | grep SYN a few times, 3-4 seconds apart; a destination whose connections stay in the SYN_SENT state across samples is the culprit. Sometimes a quick-and-dirty remedy is to disable the code calling the external service when the load is too high. If that is unacceptable, then the external service must somehow be brought back to availability.
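A minimal version of that check (the state name assumes Linux netstat output):

netstat -an | grep SYN_SENT
sleep 4
netstat -an | grep SYN_SENT
# a remote address stuck in SYN_SENT in both samples is not completing its TCP handshake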

Cause No 3: Parallel process limits too small
When using PHP-FPM (FastCGI) or the Apache PHP module, there are concurrency limits configured in certain configuration files, such as pm.max_children in /etc/php/7.4/fpm/pool.d/www.conf or /etc/php-fpm.d/www.conf, and MaxClients / ThreadsPerChild for Apache (see https://serverfault.com/questions/775855/how-to-configure-apache-workers-for-maximum-concurrency). How to diagnose? Enable server-status for Apache (see https://mediatemple.net/community/products/dv/204404734/how-do-i-enable-the-server-status-page-on-my-server) and the PHP-FPM status page (see https://www.tecmint.com/enable-monitor-php-fpm-status-in-nginx/), then view the Apache status and PHP-FPM status URLs periodically. In a container-based environment, the quick-and-dirty fix is to increase the number of pods, which gets you more concurrent users. In an OpenShift environment especially, you can also increase each pod's capacity by giving the pods more RAM.
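For the PHP-FPM side, the relevant pool directives look like this (the values are illustrative, not recommendations):

; in /etc/php/7.4/fpm/pool.d/www.conf
pm = dynamic
; hard cap on concurrent PHP worker processes in this pool:
pm.max_children = 50
; expose the status page used for the diagnosis above:
pm.status_path = /fpm-status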

But be aware that increasing these limits increases the memory burden on the application VM, and there is some other limit in Apache that somehow caps it below 1000 connections per Apache instance no matter what you set. This limit can be worked around by putting nginx or HAProxy in front of the Apache server, or, as some people say, by using the event MPM (I haven't tried it yet).
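If you do experiment with the event MPM, the knobs involved look roughly like this (Apache 2.4; the numbers are illustrative):

<IfModule mpm_event_module>
    ServerLimit            16
    ThreadsPerChild        64
    # must not exceed ServerLimit x ThreadsPerChild (16 x 64 = 1024 here):
    MaxRequestWorkers    1000
</IfModule>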

Cause No 4: Intermittent packet loss between the application (VM or host) and other critical services (especially the database)

This is one of the pitfalls of virtualization, and it is still a mystery why migrating a VM to another host (and back) can sometimes resolve such a problem. How to detect this? Look for:

a) Connections visible on the sending host but not on the receiving host. Because of how TCP/IP works, even when the application is freezing, a connection from IP A to IP B should be visible in netstat on VM A and also on VM B. So if this symptom shows up, the problem must be either in the network or in the virtualization layer (because VMs go through virtual routers).

b) Ping from IP A to B in the same network resulting in packet loss > 1%. In my opinion it really should be < 0.1%. This indicates a network problem (or a network virtualization problem). A quick check for both symptoms is sketched below.
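Run from VM A, assuming 10.0.0.1 / 10.0.0.2 are the two hosts (both addresses are placeholders):

ping -c 100 -i 0.2 10.0.0.2
# the summary line should report 0% packet loss; over 1% on a local network is a red flag
netstat -an | grep 10.0.0.2    # on VM A
netstat -an | grep 10.0.0.1    # on VM B; a connection visible on only one side is suspicious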

Remediation options: VMotion (or otherwise migrate) the VM to another host, and check the warnings (in the VMware console, for example) about NICs in the host that might need replacement. Soft problems occur more often than hard problems (meaning hardware replacement is rarely needed).

Cause No 5: Out of sockets in the web server / load balancer
When multiple application servers are being hit by many, many users, connection capacity sometimes becomes the problem. To diagnose this:
netstat -an | grep WAIT | wc -l
netstat -an | grep ESTA | wc -l
Counts on the order of more than 10 thousand sometimes indicate a problem, because a spike in incoming connection requests can leave some requests unserviceable. Another symptom is that requests take a long time to complete, with no CPU or database activity visible.
The default Linux ephemeral port range is 32768-61000, which yields roughly 28,000 usable source ports. The remedy is to tune for high connection counts:
net.ipv4.ip_local_port_range="1024 65535"
net.ipv4.tcp_tw_reuse=1
And while you are at it, also tune these parameters:
net.netfilter.nf_conntrack_max=1048576
net.ipv4.conf.all.arp_announce=2
net.ipv4.neigh.default.gc_thresh1=8192
net.ipv4.neigh.default.gc_thresh2=32768
net.ipv4.neigh.default.gc_thresh3=65536
net.ipv6.neigh.default.gc_thresh1=8192
net.ipv6.neigh.default.gc_thresh2=32768
net.ipv6.neigh.default.gc_thresh3=65536
vm.max_map_count=262144
This is especially important for the load balancer host / VM. 
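To apply and persist the settings, something like this (the file name under /etc/sysctl.d is just a convention):

sysctl -w net.ipv4.ip_local_port_range="1024 65535"
sysctl -w net.ipv4.tcp_tw_reuse=1
# persist across reboots:
cat >> /etc/sysctl.d/99-webapp-tuning.conf <<'EOF'
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_tw_reuse = 1
EOF
sysctl --system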

Cause No 6: DNS resolution problem

This problem has a better chance of manifesting itself when the operating system is configured to use some unusual DNS service (because if the primary DNS is down, the network team will normally detect it fairly quickly). How to detect this? I sometimes run strace on a php-cgi or apache process; alternatively, run tcpdump on port 53 and look for DNS requests that are not being replied to. One remedy is to reroute DNS requests to servers capable of handling them (such as Google's public DNS).
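A minimal way to spot unanswered queries (the interface name and hostname are placeholders):

tcpdump -ni eth0 udp port 53
# healthy traffic shows each query followed by a reply; repeated identical
# queries with no response mean the resolver is not answering
dig api.example.com            # then compare against a known-good resolver:
dig @8.8.8.8 api.example.com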

Cause No 7: Too much session data

This problem occurs when too much data is stored in the session storage, which lives outside the app server (in memcached, Redis, or even the database). The symptom is heavy network traffic to the session storage server even when there are fewer than 100 concurrent users. To detect this, run iftop or something similar on the session storage server, and run tcpdump to find out what data is being stored in the session (assuming the connection is not encrypted). The remedy is to keep the amount of session data minimal and to put cache data and session data in different places (or at least use different keys, and only retrieve the cache data you actually need).
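If the session store happens to be Redis, a quick sketch of the check (the key prefix depends on your session handler):

iftop -i eth0          # on the session storage server: watch traffic per app server
redis-cli --bigkeys    # samples the keyspace and reports the largest keys
# size in bytes of a single session entry (phpredis-style key shown):
redis-cli strlen "PHPREDIS_SESSION:<session-id>"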

Cause No 8: Other processes are loading the application host

This might occur when another app or another website is running on the same application host. To diagnose this, just run top on the application host and see if something strange is hogging the CPU or RAM. Sometimes disk I/O becomes the problem; you can check it using iostat, but pinpointing which process is responsible for the I/O requires a special tool such as iotop. And please remember that disk capacity is a different beast from disk I/O capacity; the former only causes errors or a standstill when a disk partition is 100% full.
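For reference, the commands in question (iotop usually needs root; the 5 is a refresh interval in seconds):

top              # look for unexpected CPU or RAM hogs
iostat -x 5      # per-device stats; %util near 100 means the disk is saturated
iotop -o         # show only the processes actually doing I/O right now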

Conclusion

More often than not, the cause of a slow or freezing web application is unrelated to the CPU core count, memory capacity, or storage capacity of the infrastructure it is running on. Only on some occasions can we fix a slowdown by adding CPU or memory capacity; in other cases we need to identify which of the causes above is actually behind the slowdown or freeze.

References
  • Openshift Origin 3.6 tuning profile at https://github.com/openshift/origin/blob/v3.6.1/contrib/tuned/origin-node-guest/tuned.conf
  • Openshift Origin 4.x Tuning Operator Profile at https://github.com/openshift/cluster-node-tuning-operator/blob/e67bc9dc5e691c6991787470f73835bca15cf9b3/assets/tuned/daemon/profiles/openshift/tuned.conf 
  • EXPLAIN tutorial at https://www.sitepoint.com/using-explain-to-write-better-mysql-queries/ 
  • New Relic APM installation for PHP at https://docs.newrelic.com/docs/apm/agents/php-agent/installation/php-agent-installation-overview/
  • Elastic APM installation for PHP at https://www.elastic.co/guide/en/apm/agent/php/current/setup.html
  • PHP FPM Slow log tutorial at https://easyengine.io/tutorials/php/fpm-slow-log/
