php-fpm generated a lot of load when resetting opcache, rendering server unresponsive

mark asked:

In recent weeks we experienced a negative phenomenon when deploying code: the server becomes unresponsive sometimes for a few minutes.

Here’s an example of the load on the server when this happened:
[screenshot: server load graph during such an incident]

The only related log I can find is /var/log/php7.2-fpm.log, where sometimes (but not always) I see entries like this (note: these are from a different event than the image shown above, but the same thing happens):

[22-Mar-2019 15:33:50] WARNING: [pool api] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 231 total children
[22-Mar-2019 15:33:52] WARNING: [pool api] server reached pm.max_children setting (250), consider raising it
[22-Mar-2019 15:34:05] WARNING: [pool app] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 47 idle, and 104 total children

What happened was that we made a deployment to this server:

  • git status --porcelain to check for modifications
  • git pull origin master to update the files
  • resetting the opcache, i.e. we call an endpoint which performs opcache_reset()
  • clearing local cache files
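For reference, the opcache-reset step above can be sketched as a minimal endpoint. This is an assumption about what such an endpoint might look like (the function name and the restart-in-progress guard are mine, not the asker's actual code):

```php
<?php
// Hypothetical deploy endpoint body: resets the opcache, but skips the
// call if a restart is already in progress, to avoid stacking restarts.
function resetOpcacheSafely(): string
{
    // opcache_* functions only exist when the Zend OPcache extension is loaded.
    if (!function_exists('opcache_reset')) {
        return 'opcache-not-loaded';
    }
    $status = opcache_get_status(false);
    if ($status !== false && $status['restart_in_progress']) {
        return 'restart-already-in-progress';
    }
    // opcache_reset() returns false when the cache is disabled.
    return opcache_reset() ? 'reset-ok' : 'reset-disabled';
}

echo resetOpcacheSafely(), PHP_EOL;
```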

After some experimenting I could narrow the load problem down to just this: opcache_reset()

Once I perform this call (independently of any prior or later deployment steps; it also happens in isolation when I just call this endpoint), there’s a chance that the system load suddenly spikes.

If that happens and the load gets “too high” (from experience, above 200 or so), the system becomes unresponsive for seconds or minutes, depending on how long it takes for everything to calm down.

Specs:

PHP-FPM configuration:

  • We’re using 6 pools with different virtual hosts
  • the start_servers directives total 320 php-fpm processes (also confirmed via ps auxw|grep -i fpm|grep -v grep |wc -l)
  • the max_children values of all pools add up to around 870

Maybe these totals are too high; the intention was to cope with the spikes that individual virtual hosts sometimes experience.

Using htop, the system usually looks like this:
[screenshot: htop output under normal conditions]

And usually the load is low, except for these spikes, which (as I only recently found out) are all related to opcache resetting:
[screenshot: load graph showing a spike after an opcache reset]

I understand that resetting the cache is CPU-intensive, since all processes then have to re-fill it.

But what I don’t understand:

  • this only started happening recently, maybe 1–2 months ago, but only in the last two weeks has the unresponsiveness been noticeable
  • it does not always happen; sometimes nothing happens when resetting the cache

Here’s the output of opcache_get_status(false) right before the deployment:

{
  "opcache_enabled": true,
  "cache_full": false,
  "restart_pending": false,
  "restart_in_progress": false,
  "memory_usage": {
    "used_memory": 67353640,
    "free_memory": 66864088,
    "wasted_memory": 0,
    "current_wasted_percentage": 0
  },
  "interned_strings_usage": {
    "buffer_size": 8388608,
    "used_memory": 5215176,
    "free_memory": 3173432,
    "number_of_strings": 89109
  },
  "opcache_statistics": {
    "num_cached_scripts": 2873,
    "num_cached_keys": 5063,
    "max_cached_keys": 7963,
    "hits": 633581523,
    "start_time": 1553172771,
    "last_restart_time": 1553248200,
    "oom_restarts": 0,
    "hash_restarts": 0,
    "manual_restarts": 6,
    "misses": 9512,
    "blacklist_misses": 0,
    "blacklist_miss_ratio": 0,
    "opcache_hit_rate": 99.9984987161316
  }
}

and here afterwards:

{
  "opcache_enabled": true,
  "cache_full": false,
  "restart_pending": false,
  "restart_in_progress": false,
  "memory_usage": {
    "used_memory": 57745856,
    "free_memory": 76471872,
    "wasted_memory": 0,
    "current_wasted_percentage": 0
  },
  "interned_strings_usage": {
    "buffer_size": 8388608,
    "used_memory": 4337168,
    "free_memory": 4051440,
    "number_of_strings": 75163
  },
  "opcache_statistics": {
    "num_cached_scripts": 2244,
    "num_cached_keys": 3925,
    "max_cached_keys": 7963,
    "hits": 5893926,
    "start_time": 1553172771,
    "last_restart_time": 1553265235,
    "oom_restarts": 0,
    "hash_restarts": 0,
    "manual_restarts": 7,
    "misses": 4962,
    "blacklist_misses": 0,
    "blacklist_miss_ratio": 0,
    "opcache_hit_rate": 99.91588245106536
  }
}
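Comparing the two snapshots programmatically makes the effect visible: manual_restarts went from 6 to 7, the hit counter was reset, and the cache is being re-filled from scratch. A small helper to pull out the relevant fields (a sketch for diagnosis, not part of the original deployment):

```php
<?php
// Extract the fields relevant to diagnosing manual opcache restarts
// from the array returned by opcache_get_status(false).
function summarizeOpcacheStatus(array $status): array
{
    $stats = $status['opcache_statistics'];
    return [
        'scripts'         => $stats['num_cached_scripts'],
        'hits'            => $stats['hits'],
        'manual_restarts' => $stats['manual_restarts'],
        'hit_rate'        => round($stats['opcache_hit_rate'], 2),
    ];
}

// Values taken from the "before" snapshot in the question:
$before = ['opcache_statistics' => [
    'num_cached_scripts' => 2873,
    'hits'               => 633581523,
    'manual_restarts'    => 6,
    'opcache_hit_rate'   => 99.9984987161316,
]];

print_r(summarizeOpcacheStatus($before));
```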

Other things I observed:

  • php-fpm stops responding fairly quickly
  • nginx keeps working unless the load gets REALLY high; I confirmed this because when php-fpm is essentially unreachable, nginx delivers the configured 500 page

What really causes these load spikes? How can I avoid them?

My answer:


By default PHP checks the file timestamp to invalidate an opcache entry. This check can be turned off, and that is the only scenario I can think of in which opcache_reset() would be needed. Of course, it also causes the problem you are having: opcache_reset() throws away the entire shared-memory cache at once, so every one of your several hundred php-fpm workers has to recompile every script it serves on the next request, which is exactly the kind of load spike you are seeing.

I recommend going back to the defaults:

opcache.validate_timestamps = 1
opcache.revalidate_freq = 2
opcache.revalidate_path = 0
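If you do want to keep validate_timestamps disabled, a gentler alternative to a full opcache_reset() is to invalidate only the files the deployment actually changed, so the rest of the cache stays warm. This is a sketch; the helper name and the git command in the comment are my assumptions, not an established deployment recipe:

```php
<?php
// Invalidate only the scripts touched by the deployment instead of
// flushing the whole cache. Expects paths relative to the repo root.
function invalidateChangedScripts(array $changedFiles, string $repoRoot): array
{
    $invalidated = [];
    foreach ($changedFiles as $file) {
        // Only PHP sources live in the opcache.
        if (substr($file, -4) !== '.php') {
            continue;
        }
        $path = $repoRoot . '/' . $file;
        // Force recompilation on the next request; a no-op without opcache.
        if (function_exists('opcache_invalidate')) {
            opcache_invalidate($path, true);
        }
        $invalidated[] = $path;
    }
    return $invalidated;
}

// In a deploy step, the changed-file list could come from e.g.:
//   git diff --name-only HEAD@{1} HEAD
$changed = ['src/Api/Controller.php', 'assets/app.css', 'src/bootstrap.php'];
print_r(invalidateChangedScripts($changed, '/var/www/app'));
```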

View the full question and any other answers on Server Fault.

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.