Can't sync filesystem without reboot

Fabio asked:

I’m having an issue with a Linux server. Once a week the running MySQL instance hangs and there is no way to fully stop it. If I kill it, it remains in zombie status and init does not reap its PID.

The server is used for staging deployments and some internal tools, so it’s not under heavy load. The only process in constant use is mysql, which I think is why it’s the only process that suffers from this issue.

I’ve searched system logs for errors and the only thing I found is this error (repeated a couple of times) in dmesg output:

[706560.640085] INFO: task mysqld:31965 blocked for more than 120 seconds.
[706560.640198] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[706560.640312] mysqld          D ffff88032fd93f40     0 31965      1 0x00000000
[706560.640317]  ffff880242a27d18 0000000000000086 ffff88031a50dd00 ffff880242a27fd8
[706560.640321]  ffff880242a27fd8 ffff880242a27fd8 ffff88031e549740 ffff88031a50dd00
[706560.640325]  ffff88031a50dd00 ffff88032fd947f8 0000000000000002 ffffffff8112f250
[706560.640328] Call Trace:
[706560.640338]  [<ffffffff8112f250>] ? __lock_page+0x70/0x70
[706560.640344]  [<ffffffff816cb1b9>] schedule+0x29/0x70
[706560.640347]  [<ffffffff816cb28f>] io_schedule+0x8f/0xd0
[706560.640350]  [<ffffffff8112f25e>] sleep_on_page+0xe/0x20
[706560.640353]  [<ffffffff816c9900>] __wait_on_bit+0x60/0x90
[706560.640356]  [<ffffffff8112f390>] wait_on_page_bit+0x80/0x90
[706560.640360]  [<ffffffff8107dce0>] ? autoremove_wake_function+0x40/0x40
[706560.640363]  [<ffffffff8112f891>] filemap_fdatawait_range+0x101/0x190
[706560.640366]  [<ffffffff81130975>] filemap_write_and_wait_range+0x65/0x70
[706560.640371]  [<ffffffff8122e441>] ext4_sync_file+0x71/0x320
[706560.640376]  [<ffffffff811c3e6d>] do_fsync+0x5d/0x90
[706560.640379]  [<ffffffff811c40d0>] sys_fsync+0x10/0x20
[706560.640383]  [<ffffffff816d495d>] system_call_fastpath+0x1a/0x1f

When this happens, the only way to get everything working again is a full reboot, but to do that I’m forced to use this command after manually stopping all running processes:

echo b > /proc/sysrq-trigger

otherwise the normal reboot process hangs forever. I’ve traced the reboot scripts and found that the reboot process also hangs on a sync call, this one in /etc/init.d/sendsigs (I’m on Ubuntu):

# Flush the kernel I/O buffer before we start to kill
# processes, to make sure the IO of already stopped services to
# not slow down the remaining processes to a point where they
# are accidentily killed with SIGKILL because they did not
# manage to shut down in time.

I’m almost sure that the cause of this is a hardware issue (the RAID controller?), partly because I have two other machines with the same hardware and software configuration and they don’t suffer from this, but I can’t find any hint in syslog or dmesg. I’ve also installed the smartmontools and mcelog packages, but neither of them reported any issue.

What can I do to track the cause of this issue?

Today it happened again. Here is the status of the system after triggering a reboot:


And here is the status of sync process

# ps aux | grep sync
root      3637  0.1  0.0   4352   372 ?        D    05:53   0:00 sync

i.e. Uninterruptible sleep…

Hardware specs as reported by lshw

I think the RAID controller is a fake RAID. I don’t usually deal with hardware (and for the record, I don’t have physical access to it).

description: Computer
product: X7DBP ()
vendor: Supermicro
version: 0123456789
serial: 0123456789
width: 64 bits
capabilities: smbios-2.4 dmi-2.4 vsyscall32
configuration: administrator_password=disabled boot=normal frontpanel_password=unknown keyboard_password=unknown power-on_password=disabled uuid=53D19F64-D663-A017-8922-0030487C1FEE
   description: Motherboard
   product: X7DBP
   vendor: Supermicro
   physical id: 0
   version: PCB Version
   serial: 0123456789
      description: BIOS
      vendor: Phoenix Technologies LTD
      physical id: 0
      version: 6.00
      date: 05/29/2007
      size: 106KiB
      capacity: 960KiB
      capabilities: pci pnp upgrade shadowing escd cdboot bootselect edd int13floppy2880 acpi usb ls120boot zipboot biosbootspecification

         description: RAID bus controller
         product: 631xESB/632xESB SATA RAID Controller
         vendor: Intel Corporation
         physical id: 1f.2
         bus info: pci@0000:00:1f.2
         version: 09
         width: 32 bits
         clock: 66MHz
         capabilities: storage pm bus_master cap_list
         configuration: driver=ahci latency=0
         resources: irq:19 ioport:18a0(size=8) ioport:1874(size=4) ioport:1878(size=8) ioport:1870(size=4) ioport:1880(size=32) memory:d8500400-d85007ff

My answer:

Your process state is D, which technically means uninterruptible sleep. As I always say, though, D means Disk: processes in this state are waiting for a disk I/O operation to complete, and they cannot act on any signal, including SIGKILL, until it does.
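To see every process currently stuck in this state, not just sync, you can filter the ps output on the STAT column. This is only a minimal sketch: the function name d_state_procs is my own, and the wchan column (the kernel function the task is sleeping in) may show just "-" on some kernels.

```shell
# Print the ps line of every process in uninterruptible sleep,
# i.e. whose STAT column starts with "D" (D, D+, Ds, ...).
# Reads ps output on stdin, so it also works on a saved capture.
d_state_procs() {
    awk 'NR > 1 && $2 ~ /^D/'
}

# Typical use (left as a comment so sourcing this file runs nothing):
#   ps -eo pid,stat,wchan:32,comm | d_state_procs
```

If the same processes are still listed after a minute or two, they are not merely busy; the I/O they are waiting on is not completing.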

We can see from your call trace that mysqld itself called fsync() (sys_fsync, then ext4_sync_file) and got stuck for more than 120 seconds waiting for the data to be written out to disk.
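A hung-task trace reads from the top (where the task is sleeping) down to the bottom (the system call it came in through). As a rough reading aid, the sketch below reduces a saved trace to its function names and drops the frames marked with "?", which the kernel prints for addresses it found on the stack but could not confirm as part of the call chain. The function name trace_symbols is made up for this example.

```shell
# Reduce a kernel call trace (on stdin) to one function name per line.
# Frames flagged with "?" are unreliable and are skipped.
trace_symbols() {
    grep -v '\] ? ' |
        grep -oE '[A-Za-z_][A-Za-z0-9_.]*\+0x[0-9a-f]+/0x[0-9a-f]+' |
        sed 's/+0x.*//'
}

# Typical use, on a file holding the trace copied from dmesg:
#   trace_symbols < trace.txt
```

Run on the trace in the question, this leaves schedule at the top and sys_fsync and system_call_fastpath at the bottom, with ext4_sync_file in between: an fsync on an ext4 file that never returned.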

This indicates that something is wrong with your storage subsystem. You should look at your hard disks and disk controller (if local disks) or network and SAN (if remote storage).
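For local disks, a first pass is to search the kernel log for messages from the disk and controller drivers. The filter below is a sketch under my own assumptions: the function name storage_errors is hypothetical, and the patterns cover only a few common libata and block-layer messages, so an empty result does not prove the hardware is healthy. Because the dmesg ring buffer can wrap, it is worth running it against the persisted logs as well.

```shell
# Keep only kernel log lines (on stdin) that look like disk or
# controller trouble. The pattern list is illustrative, not exhaustive.
storage_errors() {
    grep -iE 'I/O error|ata[0-9.]+: .*(exception|failed command|resetting link|timeout)'
}

# Typical use:
#   dmesg | storage_errors
#   storage_errors < /var/log/kern.log
```

Anything this turns up around the time of a hang points at a specific disk or port, which is a much better starting point than the generic hung-task message.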

View the full question and any other answers on Server Fault.

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.