Case Study: 502 Bad Gateway

One of the applications hosted on our platform started returning a 502 Bad Gateway out of nowhere. A 502 Bad Gateway response code usually means that either the PHP-FPM pool has crashed and cannot be connected to, or all the workers are occupied by long-running code, deadlocks, or blocked I/O.

Within minutes our team was investigating the service outage via SSH. We usually allot up to 15 minutes of live investigation time before “turning it off and on again”, unless we can reach the project owner and ask for more time. So 15 minutes is all we have.

First things first, ps aux | grep php to check the state of our pool.
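With more than a handful of workers, a tally of process states is quicker to read than the raw list. A rough sketch, assuming the usual ps aux column layout (STAT in field 8, command starting at field 11) and processes whose command starts with php-fpm; the function name is ours, for illustration only.

```shell
# Tally php-fpm process states from `ps aux` output read on stdin.
# Assumes the standard `ps aux` layout: STAT is field 8, COMMAND starts at field 11.
# Only the first letter of STAT is kept (S = sleeping, R = running, D = uninterruptible).
tally_fpm_states() {
    awk '$11 ~ /^php-fpm/ { print substr($8, 1, 1) }' | sort | uniq -c
}

# Usage: ps aux | tally_fpm_states
```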

PHP-FPM is up and running, and all the processes are sleeping. Must be an I/O block then. To make sure, we tail /var/log/php-fpm.log to confirm that the pool is full. Yep:

…server reached pm.max_children setting (8), consider raising it
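Counting how many times that warning appears shows how often the pool has hit its limit. A minimal sketch, assuming the log lives at the path we tailed above; the function name is hypothetical, and a different log file can be passed as the first argument.

```shell
# Count how many times a PHP-FPM log reports a saturated pool.
# The default path is the one used in this post; yours may differ.
count_max_children_warnings() {
    log_file="${1:-/var/log/php-fpm.log}"
    # grep -c prints the number of matching lines (0 when none match).
    grep -c 'reached pm.max_children setting' "$log_file"
}

# Usage: count_max_children_warnings /var/log/php-fpm.log
```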

Not good. Let’s check MySQL connections. Clear…

File locks?

pgrep php-fpm | xargs -n1 lsof -p | grep -v mem
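The same output can be boiled down to a count of established connections per remote endpoint, which makes a pile-up on a single host hard to miss. A rough sketch, assuming lsof's default column layout (NODE in field 8, NAME in field 9, as is typical for TCP sockets); the function name is ours, and the -n and -P flags in the usage line are added only to keep addresses and ports numeric.

```shell
# Read lsof output on stdin and count established TCP connections per remote endpoint.
# Assumes default lsof columns: field 8 is NODE ("TCP"), field 9 is NAME ("local->remote").
summarize_tcp_peers() {
    awk '$8 == "TCP" && $NF == "(ESTABLISHED)" {
             split($9, ends, "->")   # keep only the remote side of the connection
             print ends[2]
         }' | sort | uniq -c | sort -rn
}

# Usage: pgrep php-fpm | xargs -n1 lsof -n -P -p | summarize_tcp_peers
```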

Uh-oh. All of the workers have a TCP socket open to an Instagram server. And it hangs forever. Hitting limits? Firewall? Routing issues? Why is the code allowing this?

Okay. Five minutes of downtime in, and we know the issue: a hanging remote connection. We can safely restart the pool and look for the offending code.

The output of lsof already hinted at the instashow plugin. And its source code is obfuscated.

Bad! We look for calls to wp_remote_… none… typical! curl, fsockopen: present. cURL, okay. We can patch in a CURLOPT_TIMEOUT of 5 seconds, then send a bug report to the author asking them to use wp_remote_post, which already handles shimming and has a default 5-second timeout.
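The plugin itself is PHP, but the effect of such a patch is easy to reproduce from the shell: curl's --max-time and --connect-timeout flags correspond to libcurl's CURLOPT_TIMEOUT and CURLOPT_CONNECTTIMEOUT. This is a minimal sketch, not the actual hotfix; the function name, the CURL_BIN override, and the 5-second default are our own choices for illustration.

```shell
# Fetch a URL, giving up after a fixed number of seconds instead of hanging.
# CURL_BIN is a hypothetical override for the curl binary, handy for dry runs.
fetch_with_timeout() {
    url="$1"
    limit="${2:-5}"   # seconds; the default mirrors the 5-second hotfix
    "${CURL_BIN:-curl}" --silent --show-error --fail \
        --connect-timeout "$limit" --max-time "$limit" "$url"
}

# Usage: fetch_with_timeout "https://example.com/" || echo "request failed or timed out"
```

When the limit is hit, curl exits with status 28, so a caller can tell a timeout apart from other failures and free up the worker instead of blocking indefinitely.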

A fix was pushed by the instashow developers within 12 hours and deployed over our hotfix. The issue was considered resolved.

And that, folks, is the Pressjitsu experience! :)