Case Study: Elegant Themes Core Race Conditions

In one of our earlier case studies we looked at a random 502 Bad Gateway error that happened because a remote endpoint was not responding. That was pretty easy. Ready for something more interesting?

Today, we’ll look at a seemingly mind-boggling case that was dismissed by Elegant Themes support as an unreproducible server issue or plugin conflict. To add insult to injury, the issue could not initially be reproduced on exact development and staging copies of the client’s website.

The incident

The error in question randomly started appearing on a client’s live multisite on sites that use the latest version of the Divi or Extra theme.

Uncaught Error: Call to undefined function et_core_is_gutenberg_active() in /home/redacted/www/wp-content/themes/Divi/includes/builder/feature/BlockEditorIntegration.php:579

An emergency team meeting was scheduled, as this was affecting a couple of important sites, and we started digging into why et_core_is_gutenberg_active was sometimes undefined. A lot of theories were thrown onto the table, ranging from plugin load order and filesystem corruption to bytecode caching and a PHP race condition bug.

A designated live debugger was chosen from the team members to perform responsible debugging on the production server with hundreds of live requests per second coming in. She was sweating buckets as we rapidly threw together test scenarios for her to perform.

We quickly ruled out plugin load order and PHP race conditions by isolating a single process and making it load one of the sites that was randomly failing. The loaded classes and their order were the same. As we dissected the loading lifecycle of the file that contains the et_core_is_gutenberg_active function, we quickly learned that it was loading a different Elegant Themes Core (ET_CORE).

Suspect number one

Here’s how it basically works:

  • core/init.php, which contains the ET loader, is loaded
  • the loader can’t be loaded multiple times, so one loader is picked
  • the loader is responsible for finding and subsequently loading the latest available version of ET core among active plugins and themes
  • the found version is stored in a site transient and used to skip the find step on later requests
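The find-and-cache steps above can be sketched roughly as follows. This is a minimal Python model, not Elegant Themes' actual PHP code: the names parse_version, find_latest_core and CachedLoader are hypothetical, the version strings in the usage note are made up, and the 24-hour lifetime is the expiry mentioned later in this case study. The check that a cached path still belongs to an active component is an assumption inferred from the behavior we observed.

```python
import time
from typing import Callable, Dict, Optional, Tuple

DAY_IN_SECONDS = 24 * 60 * 60  # cache lifetime reported in the case study


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '4.10.2' into (4, 10, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def find_latest_core(candidates: Dict[str, str]) -> str:
    """candidates maps a core path to its version; return the newest path."""
    return max(candidates, key=lambda path: parse_version(candidates[path]))


class CachedLoader:
    """Hypothetical loader: scan once, then reuse the cached answer."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._cached: Optional[Tuple[str, float]] = None  # (path, expires_at)
        self.scans = 0  # how many times the expensive find step ran

    def resolve(self, candidates: Dict[str, str]) -> str:
        now = self._clock()
        if self._cached is not None:
            path, expires_at = self._cached
            # Assumption: a cached path is reused only while it is unexpired
            # and still belongs to an active plugin or theme.
            if now < expires_at and path in candidates:
                return path
        self.scans += 1
        path = find_latest_core(candidates)
        self._cached = (path, now + DAY_IN_SECONDS)
        return path
```

Calling resolve() twice with {"themes/Divi/core": "4.10.2", "plugins/bloom/core": "4.9.0"} scans once and returns the Divi path both times. Note that the versions are compared numerically, so 4.10.2 correctly beats 4.9.0.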

Smart, but maybe overengineered, which led to one tiny oversight on the part of ET: site transients (despite the name) are shared across all sites in a multisite network.
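To make the scoping difference concrete, here is a small Python model of the two WordPress transient flavors. In WordPress, set_transient() writes to the current site's storage, while set_site_transient() writes to network-wide storage. The class MultisiteOptions and the explicit blog_id argument are hypothetical modeling choices (WordPress uses the current site implicitly), and the key name is made up.

```python
from typing import Any, Dict, Optional, Tuple


class MultisiteOptions:
    """Toy model of transient scoping in a WordPress multisite network."""

    def __init__(self) -> None:
        self._per_site: Dict[Tuple[int, str], Any] = {}  # keyed by (blog_id, key)
        self._network: Dict[str, Any] = {}               # one entry per key

    # Models set_transient() / get_transient(): scoped to a single site.
    def set_transient(self, blog_id: int, key: str, value: Any) -> None:
        self._per_site[(blog_id, key)] = value

    def get_transient(self, blog_id: int, key: str) -> Optional[Any]:
        return self._per_site.get((blog_id, key))

    # Models set_site_transient() / get_site_transient(): network-wide.
    def set_site_transient(self, key: str, value: Any) -> None:
        self._network[key] = value

    def get_site_transient(self, key: str) -> Optional[Any]:
        return self._network.get(key)
```

In this model, a value written with set_transient(1, ...) is invisible to site 2, while a value written with set_site_transient(...) is returned to every site that asks for it, no matter which site wrote it.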

This particular multisite has around 100 different sites in it, all of which are quite active. Some sites were not using the Divi or Extra themes, but a non-Elegant Themes theme. On those sites the loader ended up using a plugin (the latest version of bloom) that ships an old ET core. This is fine. The sites worked. What was not fine is that the old ET core path was stored for all sites in the network and subsequently loaded by each and every one of them, until it expired 24 hours later.

The loader is “smart” though, and on sites where bloom was not active the transient was overwritten again by something that was active. So it fixed itself until a bloom-only site was hit again. So it was a constant game of cat and mouse between the available ET cores on a site. Add page caching into the whole mix and you get confused developers all around.

We managed to reproduce this on the staging site by first hitting any site URL that used the bloom plugin but did not use an Elegant Themes theme. It loaded fine. Then we hit one of the sites that uses Divi, and the fatal error was reproduced. Mystery solved. Case closed.
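The reproduction sequence can be replayed with a toy Python simulation. Everything here is a hypothetical model rather than ET's code: the site names, the version tuples, and the assumption that the failing Divi site also has bloom active (which is why it accepts the cached bloom path instead of rescanning). The only details taken from the incident are the function name and the two core locations.

```python
from typing import Dict, List

FUNCTION = "et_core_is_gutenberg_active"

# Hypothetical cores: the old one bundled with bloom lacks the function.
CORES: Dict[str, dict] = {
    "themes/Divi/core": {"version": (2, 0), "functions": {FUNCTION}},
    "plugins/bloom/core": {"version": (1, 0), "functions": set()},
}

# Hypothetical sites and the ET cores that are active on each of them.
SITES: Dict[str, List[str]] = {
    "bloom_only": ["plugins/bloom/core"],
    "divi_with_bloom": ["themes/Divi/core", "plugins/bloom/core"],
    "divi_only": ["themes/Divi/core"],
}

network_cache: Dict[str, str] = {}  # stands in for the network-wide site transient


def resolve_core(site: str) -> str:
    """Reuse the cached path if it is active on this site, else rescan."""
    active = SITES[site]
    cached = network_cache.get("et_core_path")
    if cached in active:
        return cached
    latest = max(active, key=lambda path: CORES[path]["version"])
    network_cache["et_core_path"] = latest  # overwrites the value for ALL sites
    return latest


def render(site: str) -> str:
    """Load a site; Divi needs FUNCTION to exist in whichever core was loaded."""
    core = resolve_core(site)
    uses_divi = "themes/Divi/core" in SITES[site]
    if uses_divi and FUNCTION not in CORES[core]["functions"]:
        raise RuntimeError(f"Call to undefined function {FUNCTION}()")
    return core


def demo() -> List[str]:
    """Replay the request order from the reproduction and record outcomes."""
    outcomes = []
    for site in ["bloom_only", "divi_with_bloom", "divi_only", "divi_with_bloom"]:
        try:
            outcomes.append(f"{site}: ok ({render(site)})")
        except RuntimeError as error:
            outcomes.append(f"{site}: FATAL ({error})")
    return outcomes
```

Starting from an empty cache, demo() reports the bloom-only site as fine, the next Divi request as fatal, and then a self-heal once a site without bloom overwrites the cached path. Under the stated assumptions, that is the same cat-and-mouse pattern we saw in production.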

The aftermath

An upstream update to the ET core bundled with bloom should be imminent. An update to the ET core loader will probably also happen. Meanwhile, copying the ET core from Divi/core to bloom/core will most certainly solve the issue.

A report was sent to the client, relayed over to Elegant Themes, and the team was able to continue enjoying their well-deserved holiday.

If you’d like on-demand access to our platform and team for a predictable monthly membership price, contact us today and find out how we can help you keep your WordPress project humming along like never before.