gzip soft-locks every midnight during nginx log rotation, causing kernel panic

Érik Desjardins asked:

Some of our servers run Ubuntu 18.10. They host a few Node.js services that act as TCP upstreams for nginx, as well as a WebSocket implementation in .NET Core. It’s a fairly simple setup sysadmin-wise.

One of the machines started behaving strangely after upgrading from Ubuntu 18.04 to 18.10: a kernel panic occurs almost every midnight. The machine normally handles 10,000+ concurrent connections without any problem. It’s a pretty powerful setup.

From syslog (ryk is my sudoer user):

Jun 29 00:00:02 localhost CRON[23415]: (ryk) CMD (/home/ryk/dev/downloadadscript.sh)
Jun 29 00:00:02 localhost CRON[23419]: (ryk) CMD (cd /home/ryk/ampcache && node narcity.index.js 200)
Jun 29 00:00:02 localhost CRON[23420]: (ryk) CMD (cd /home/ryk/ampcache && node mtlblog.index.js 200)
Jun 29 00:00:02 localhost CRON[23421]: (root) CMD (test -x /usr/bin/certbot -a \! -d /run/systemd/system && perl -e 'sleep int(rand(43200))' && certbot -q renew)
Jun 29 00:00:02 localhost CRON[23426]: (ryk) CMD (/home/ryk/dev/uploadsbackup.sh)
Jun 29 00:00:02 localhost CRON[23412]: (CRON) info (No MTA installed, discarding output)
Jun 29 00:00:04 localhost CRON[23409]: (CRON) info (No MTA installed, discarding output)
Jun 29 00:00:04 localhost CRON[23408]: (CRON) info (No MTA installed, discarding output)
Jun 29 00:00:05 localhost logrotate[23418]: error: destination /var/log/nginx/access.log.1.gz already exists, renaming to /var/log/nginx/access.log.1.gz-2019062900.backup
Jun 29 00:00:11 localhost CRON[23413]: (CRON) info (No MTA installed, discarding output)
Jun 29 00:00:58 localhost kernel: [65787.092653] watchdog: BUG: soft lockup - CPU#11 stuck for 26s! [node:8764]
Jun 29 00:00:58 localhost kernel: [65787.092695] watchdog: BUG: soft lockup - CPU#3 stuck for 30s! [node:8715]
Jun 29 00:00:58 localhost kernel: [65787.092735] watchdog: BUG: soft lockup - CPU#8 stuck for 32s! [gzip:23478]
Jun 29 00:00:58 localhost kernel: [65787.092739] Modules linked in:
Jun 29 00:00:58 localhost kernel: [65787.092746] watchdog: BUG: soft lockup - CPU#12 stuck for 27s! [node:23489]
Jun 29 00:00:58 localhost kernel: [65787.092752]  joydev
Jun 29 00:00:58 localhost kernel: [65787.092758] watchdog: BUG: soft lockup - CPU#1 stuck for 37s! [nginx:29255]
Jun 29 00:00:58 localhost kernel: [65787.092763] watchdog: BUG: soft lockup - CPU#10 stuck for 41s! [nginx:29256]
Jun 29 00:00:58 localhost kernel: [65787.092765] Modules linked in:
Jun 29 00:00:58 localhost kernel: [65787.092768] watchdog: BUG: soft lockup - CPU#13 stuck for 37s! [node:3327]
Jun 29 00:00:58 localhost kernel: [65787.092770]  input_leds
Jun 29 00:00:58 localhost kernel: [65787.092773] Modules linked in:
Jun 29 00:00:58 localhost kernel: [65787.092775]  joydev

After that, there’s an unreadable kernel dump until the reboot 19 minutes later.

[syslog kernel panic dump screenshot]

Jun 29 00:19:53 localhost kernel: [    0.000000] Linux version 4.18.0-25-generic ([email protected]) (gcc version 8.3.0 (Ubuntu 8.3.0-6ubuntu1~18.10.1)) #26-Ubuntu SMP Mon Jun 24 09:32:08 UTC 2019 (Ubuntu 4.18.0-25.26-generic 4.18.20)
Jun 29 00:19:53 localhost kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.18.0-25-generic root=/dev/sda ro console=ttyS0,19200n8 net.ifnames=0

We use the default logrotate routine to keep logs at a manageable size and compress them for archiving. For nginx, it runs daily. Our nginx config does not store access logs except for one route, which is not busy at all; those logs never exceed 20 MB before compression.

From /etc/logrotate.d/nginx:

/var/log/nginx/*.log {
    daily
    missingok
    rotate 14
    compress
    delaycompress
    notifempty
    create 0640 www-data adm
    sharedscripts
    prerotate
        if [ -d /etc/logrotate.d/httpd-prerotate ]; then \
            run-parts /etc/logrotate.d/httpd-prerotate; \
        fi \
    endscript
    postrotate
        invoke-rc.d nginx rotate >/dev/null 2>&1
    endscript
}
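One detail worth noting from the syslog above: with `delaycompress`, `access.log.1` should stay uncompressed for one full cycle, so the `access.log.1.gz already exists` error suggests a stale compressed file was left behind by an earlier, possibly interrupted, rotation. A minimal sketch for spotting such leftovers (the directory is the one from the config above):

```shell
#!/bin/sh
# Sketch: report leftover *.log.1.gz files in a directory. With
# "delaycompress" in effect, *.log.1 stays uncompressed for one cycle,
# so a *.log.1.gz here is stale and can trigger logrotate's
# "destination already exists" error.
find_stale_gz() {
    dir=$1
    found=0
    for f in "$dir"/*.log.1.gz; do
        [ -e "$f" ] || continue   # glob matched nothing
        echo "stale: $f"
        found=1
    done
    [ "$found" -eq 0 ] && echo "no stale .1.gz files in $dir"
}

# e.g. find_stale_gz /var/log/nginx
```

Running `logrotate -d /etc/logrotate.d/nginx` is also useful here: it dry-runs the rotation and prints what logrotate would do without touching any files.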

There are a few things happening at the same time at midnight, but 16 CPUs should be more than enough for all of these tasks:

  • 200 requests are sent to AMP caches to clear latest pages
  • Certbot will renew HTTPS certs if needed
  • logrotate will, well, rotate logs
  • rsync will save that day’s uploaded images and files to a backup server
  • A Node.js service will invalidate old cache entries in RAM
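Staggering jobs like these usually means a per-job random delay, which is the same trick the stock certbot cron entry in the syslog above uses with `perl -e 'sleep int(rand(43200))'`. A minimal sketch of the same idea in plain sh (the 300-second window is arbitrary, tune per job):

```shell
#!/bin/sh
# Sketch: print a pseudo-random number of seconds in [0, $1), suitable
# for prefixing a cron job with a jittered sleep so midnight tasks
# don't all start in the same second.
rand_jitter() {
    echo "$(( $(od -An -N2 -tu2 /dev/urandom) % $1 ))"
}

# crontab usage would inline the delay, e.g. (hypothetical 300s window):
# 0 0 * * * perl -e 'sleep int(rand(300))' && /home/ryk/dev/uploadsbackup.sh
```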

At first, I tried delaying those tasks by a few minutes, but logrotate or gzip would still manage to crash the entire server at some point.
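Since gzip is one of the processes the watchdog flags, it may be worth trying to reproduce the problem outside the midnight window by compressing a copy of a rotated log by hand. A sketch (working on a copy so the real file is untouched; the example path is the one from the logrotate config):

```shell
#!/bin/sh
# Sketch: compress a copy of a log file by hand to see whether gzip
# alone can trigger the soft lockup, independent of the cron timing.
repro_gzip() {
    cp "$1" /tmp/gzip-repro.log && gzip -f /tmp/gzip-repro.log
}

# e.g. time repro_gzip /var/log/nginx/access.log.1
```

If a manual gzip run during the day completes without incident, the compression itself is unlikely to be the culprit and the midnight timing matters.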

Some information about the server:

  • Hosted with Linode
  • Handles millions of requests every day
  • 16 CPU, 64GB RAM
  • Main processes: Node.js, .NET Core, nginx, a few C++ routines
  • There’s an additional disk attached to it for backups
  • There are no lag spikes or any abnormality during the day
  • We don’t run an express server, mongoose, or other overused libraries
  • Most of the requests don’t even reach any of the services; nginx will serve cached / static files
  • CPU is low, RAM is usually below 30% usage before cache invalidation

uname -a

Linux localhost 4.18.0-25-generic #26-Ubuntu SMP Mon Jun 24 09:32:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Any idea why logrotate or gzip would soft-lock the entire machine? Am I missing something?

Thanks in advance.

My answer:


Your virtual machine may be on a physical host that is having hardware problems. Shut down the VM completely and then boot it again, and Linode should start it on a different host.
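One way to gather evidence for a host-side problem from inside the guest is to watch CPU steal time, the time the hypervisor withheld from the VM. A minimal sketch reading it straight from /proc/stat:

```shell
#!/bin/sh
# Sketch: print the aggregate CPU steal counter (field 9 of the "cpu"
# line in /proc/stat, in jiffies). A counter that climbs quickly, or
# spikes around midnight, points at the physical host, not the guest.
cpu_steal() {
    awk '/^cpu /{ print $9 }' /proc/stat
}

cpu_steal
```

Sampling this once a minute around midnight, before and after the shutdown/boot cycle, should make it clear whether the new host behaves better.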


View the full question and any other answers on Server Fault.

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.