Case Study: 502 Bad Gateway
One of the applications hosted on our platform started returning 502 Bad Gateway out of nowhere. On a PHP-FPM stack like ours, a 502 Bad Gateway response usually means that either the PHP-FPM pool has crashed and cannot be connected to, or all the workers are occupied by long-running code, deadlocks or blocked I/O.
Within minutes our team was investigating the service outage via SSH. We usually allot up to 15 minutes of live investigation time before “turning it off and on again”, unless we can reach the project owner and ask for more time. So 15 minutes is all we have.
First things first: ps aux | grep php to check the state of our pool.
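With more than a handful of workers, the same check is easier to read as a per-state count. This is a minimal sketch rather than what we ran during the incident: summarize_states is a hypothetical helper, and the usage line assumes the worker processes are named php-fpm.

```shell
#!/bin/sh
# Hypothetical helper: count processes per state letter
# (S = sleeping, R = running, D = uninterruptible sleep, usually I/O).
# Reads one ps STAT value per line on stdin and prints "LETTER=COUNT" lines.
summarize_states() {
    awk 'NF { count[substr($1, 1, 1)]++ }
         END { for (s in count) print s "=" count[s] }' | sort
}

# Usage on a live host (assumes the process name php-fpm):
#   ps -o stat= -p "$(pgrep -d, php-fpm)" | summarize_states
```

Only the first character of each STAT value is counted, so modifiers such as the "s" in "Ss" are ignored.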
PHP-FPM is up and running, and all the processes are Sleeping. Must be an I/O block then. But to make sure, tail /var/log/php-fpm.log confirms that the pool is filled up. Yep:
…server reached pm.max_children setting (8), consider raising it
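To see whether the pool hit its limit once or keeps hitting it, the warning can be counted instead of eyeballed. A small sketch, assuming the log wording shown above; count_pool_exhaustion is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: count how often the pool reported hitting its
# worker limit. Reads PHP-FPM log lines on stdin and prints the count.
count_pool_exhaustion() {
    # grep -c exits non-zero when the count is 0; "|| true" keeps the
    # helper usable in scripts that run under "set -e".
    grep -cF 'reached pm.max_children setting' || true
}

# Usage on a live host (log path as used above):
#   count_pool_exhaustion < /var/log/php-fpm.log
```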
Not good. Let’s check MySQL connections. Clear…
pgrep php-fpm | xargs -n1 lsof -p | grep -v mem
Uh-oh. All of the workers have a TCP socket open to an Instagram server, and it hangs forever. Hitting limits? Firewall? Routing issues? Why is the code allowing this?
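When every worker is stuck on the same remote host, tallying the lsof output by remote endpoint makes that obvious at a glance. The sketch below is illustrative only: count_remote_endpoints is a hypothetical helper, and it assumes lsof's default column layout, with the protocol in field 8 and local->remote in field 9.

```shell
#!/bin/sh
# Hypothetical helper: tally TCP connections per remote endpoint.
# Reads `lsof -nP` output on stdin. Assumes the default column layout,
# where field 8 (NODE) is "TCP" and field 9 (NAME) is "local->remote".
count_remote_endpoints() {
    awk '$8 == "TCP" && $9 ~ /->/ {
             split($9, ends, "->")   # ends[2] is the remote host:port
             count[ends[2]]++
         }
         END { for (r in count) print count[r], r }' | sort -rn
}

# Usage on a live host (assumes the process name php-fpm):
#   lsof -nP -a -i TCP -p "$(pgrep -d, php-fpm)" | count_remote_endpoints
```

The -n and -P flags stop lsof from resolving host names and port names, which matters when the network itself may be the problem.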
Okay. Five minutes into the downtime, and we know the issue – a hanging remote connection. We can safely restart the pool and look for the offending code.
The output of lsof already hinted at the instashow plugin. And its source code is obfuscated.
Bad! We look for calls to wp_remote_… none… typical! fsockopen – present. cURL, okay. We can probably patch in a CURLOPT_TIMEOUT of 5 seconds and send a bug report to the author asking them to use wp_remote_post, which already handles shimming and has a default 5-second timeout.
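The search for network calls can be done in one pass over the plugin directory. This is a rough sketch, not the exact commands we used: find_http_calls is a hypothetical helper, the pattern list is illustrative, and it assumes a grep that supports -r and --include (GNU and BSD grep do). Heavily obfuscated code can hide function names from a plain text search, so an empty result proves nothing.

```shell
#!/bin/sh
# Hypothetical helper: list lines under a plugin directory that make
# outbound HTTP calls or set a cURL timeout. Prints file:line:match.
find_http_calls() {
    grep -rnE --include='*.php' \
        'wp_remote_|fsockopen|curl_(init|exec|setopt)|CURLOPT_TIMEOUT' "$1"
}

# Usage:
#   find_http_calls /path/to/plugin
```

A plugin that shows curl_setopt hits but no CURLOPT_TIMEOUT hit is a good candidate for exactly the kind of hang described here.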
A fix was pushed by the instashow developers within 12 hours and deployed over our hotfix. The issue was considered resolved.
And that, folks, is the Pressjitsu experience! :)