How do I get notified of ECC errors in Linux?

Jens Erat asked:

How do I get notified when a Linux machine equipped with ECC memory detects a memory failure? I’m interested in both correctable and uncorrectable errors.

  • if a message is written to dmesg/the syslog, this is already fine, but I’d love to know what to look for
  • installing additional daemons (like smartmontools for hard drives) is acceptable
  • Nagios/Icinga monitoring would be another way to go
  • not all machines to be monitored have IPMI

Systems of interest have Supermicro boards (X9SCM-F). I’m also curious about an HP N54L Microserver, but don’t care too much about it. All systems run Debian or Ubuntu Linux.

My answer:

mcelog monitors the memory controller and reports memory error events to syslog, and in some configurations it can offline bad memory pages. This is in addition to its usual role of monitoring machine check exceptions and a variety of other hardware errors.

Most Linux distributions have a service set up to run it as a daemon. For example, on EL 6 (RHEL/CentOS 6):

chkconfig mcelog on
service mcelog start
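
Since the question mentions Nagios/Icinga, here is a minimal sketch of a check that could sit on top of the daemon. It is not part of mcelog: the function name check_mcelog is hypothetical, and it assumes that mcelog's syslog lines contain the string "mcelog" and that you pass in the right log file for your system (for instance /var/log/syslog on Debian or Ubuntu). The pattern is deliberately crude and will also match unrelated lines such as service start messages, so tune it against what your machines actually log.

```shell
#!/bin/sh
# Hypothetical Nagios/Icinga-style check (a sketch, not shipped with mcelog).
# Assumption: lines written by the mcelog daemon contain the string "mcelog".
# Exit codes follow the usual plugin convention: 0 OK, 1 WARNING, 3 UNKNOWN.

check_mcelog() {
    logfile=$1

    if [ ! -r "$logfile" ]; then
        echo "UNKNOWN: cannot read $logfile"
        return 3
    fi

    # grep -c prints the number of matching lines; it exits non-zero when
    # there are none, so "|| true" keeps the script alive under "set -e".
    count=$(grep -c 'mcelog' "$logfile" || true)

    if [ "$count" -gt 0 ]; then
        echo "WARNING: $count mcelog line(s) in $logfile"
        return 1
    fi

    echo "OK: no mcelog lines in $logfile"
    return 0
}
```

To use it as a plugin, call the function with the log path and exit with its return code, e.g. `check_mcelog /var/log/syslog; exit $?`. You will likely want to combine it with log rotation or a time window so that an old event does not keep the check in WARNING forever.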

View the full question and any other answers on Server Fault.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.