ECC errors in L3 cache – critical or not?

L3error asked:

On a Linux server (8x Quad-Core AMD 8378), I’m getting the following errors:

[Hardware Error]: MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c294c00001d018b
[Hardware Error]: Northbridge Error (node 4): ECC error in L3 cache tag.
[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: SNP
[Hardware Error]: Machine check events logged

This has happened three times during the last month, but never before (server running for 3 years).

From a quick Google search, this appears to be a serious matter.

However, the vendor support technician said:

I have seen these errors MANY times, and unless you are overclocking your CPU – or have had a fan failure or similar – it is VERY unlikely to be a processor problem. It is more likely that the kernel is misreporting the error.

So, is this a critical error for which I should order new parts (replace the CPU?), or can I ignore it?

Many thanks.

My answer:

Best practice: Keep your own spare parts, when possible.

As for machine check exceptions, these are reported by the hardware; the kernel is just passing the message on to you, so that you can take action before the hardware problem gets out of hand and results in a real disaster.
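To illustrate that the bracketed flags in the log are just a readout of a hardware register, here is a minimal Python sketch that decodes the MC4_STATUS value from the question. The generic bit positions (VAL, OVER, UC, EN, MISCV, ADDRV, PCC) and the memory-hierarchy error code layout follow the x86 machine check architecture; the CECC/UECC bits and the 5-bit extended error code field are my assumptions about this AMD family, and the function and table names are purely illustrative.

```python
# Bit positions in an x86 machine-check status register (MCi_STATUS).
# CECC/UECC (bits 46/45) are assumed to be AMD-specific ECC indicators.
FLAGS = {
    "VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
    "MISCV": 59, "ADDRV": 58, "PCC": 57,
    "CECC": 46, "UECC": 45,
}

# Field encodings of a memory-hierarchy error code: 0000 0001 RRRR TTLL
RRRR = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PREFETCH", "EVICT", "SNOOP"]
TT = ["INSTR", "DATA", "GEN", "RESV"]
LL = ["L0", "L1", "L2", "LG"]  # LG = generic level (the kernel prints "L3/GEN")


def decode_mc_status(status):
    """Split a 64-bit MCi_STATUS value into named flags and code fields."""
    result = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}
    code = status & 0xFFFF
    result["error_code"] = code
    # Assumption: bits 20:16 hold the model-specific extended error code.
    result["ext_error_code"] = (status >> 16) & 0x1F
    if (code & 0xFF00) == 0x0100:  # memory-hierarchy error pattern
        r = (code >> 4) & 0xF
        result["mem_tx"] = RRRR[r] if r < len(RRRR) else "RESV"
        result["tx"] = TT[(code >> 2) & 0x3]
        result["level"] = LL[code & 0x3]
    return result


if __name__ == "__main__":
    info = decode_mc_status(0x9C294C00001D018B)
    for key, value in info.items():
        print(f"{key}: {value}")
```

For the value in the question, UC comes out false and CECC true (a corrected ECC error), with MISCV and ADDRV set and a snoop transaction at the generic cache level, which agrees with the flags and the "mem-tx: SNP" field the kernel printed. The kernel is reporting what the register says, not inventing it.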

The only instance I was able to find of a kernel “misreporting” a machine check exception is the following, and even there the cause was a flaw in the processor, not the kernel:

Intel Xeon processor E7 family processors have an issue in which some c-state transitions can cause false correctable Machine Check Exception (MCE) errors to be reported from MCE bank 6 to the user. On some E7 processor family systems, this resulted in “floods” of MCE errors. This patch disables MCE error reporting for bank 6.

Bottom line: It sounds to me like the vendor is trying to avoid replacing your defective hardware.

View the full question and any other answers on Server Fault.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.