Why does yum index get corrupted?

TomOnTime asked:

Occasionally yum’s cache gets corrupted and we see errors like this:

error: db3 error(-30974) from dbenv->failchk: DB_RUNRECOVERY: Fatal error, run database recovery
error: cannot open Packages index using db3 -  (-30974)
error: cannot open Packages database in /var/lib/rpm

The workaround is rm -f /var/lib/rpm/__db* and then the next “yum” command regenerates the data.
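The workaround above can be wrapped in a small helper. This is only a sketch: the function name clear_rpm_env is made up for illustration, the default path is taken from the error message, and it assumes no rpm or yum process is running at the time.

```shell
#!/bin/sh
# Sketch of the workaround described above.
# clear_rpm_env is a hypothetical helper name, not part of rpm or yum.
clear_rpm_env() {
    dbpath="${1:-/var/lib/rpm}"

    # Refuse to continue if the directory does not exist.
    if [ ! -d "$dbpath" ]; then
        echo "no such directory: $dbpath" >&2
        return 1
    fi

    # Remove only the __db* files; everything else in the directory is left alone.
    rm -f "$dbpath"/__db*
}

# Possible use on a real host, as root, optionally followed by an index rebuild:
#   clear_rpm_env /var/lib/rpm && rpm --rebuilddb
```

Taking the directory as a parameter makes it possible to try the helper against a scratch copy before pointing it at a live system.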

My question is: what is likely to be causing this? Is there some common task that ignores locks or has some other problem that causes this?

We have hundreds of CentOS machines and there is no pattern to which ones see this problem. It could be a “one in a million” issue, which at large scale is seen often.
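Across a fleet of that size, it may help to record which hosts hit the error and when, so that a pattern has a chance to show up. A minimal sketch, assuming the error text shown above; has_runrecovery is a hypothetical helper name, and how the check is scheduled and reported is left open.

```shell
#!/bin/sh
# Sketch: succeed if the text on stdin contains the fatal Berkeley DB error.
# has_runrecovery is a hypothetical helper name.
has_runrecovery() {
    grep -q 'DB_RUNRECOVERY'
}

# Possible use on a host: run a harmless query and inspect only its stderr.
#   rpm -q rpm 2>&1 >/dev/null | has_runrecovery \
#       && echo "$(hostname): rpmdb needs recovery"
```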

NOTE: I realize this is a very “open ended” question, but if an answer finds the cause, I will go back and turn the question into something more canonical that directly relates to the specific issue.

My answer:

In the general case, this happens when rpm (or yum) crashes while updating the rpmdb, which is a Berkeley DB key-value store and very sensitive to interruption. When such a crash happens, the rpmdb is left in an inconsistent state and this error occurs. The __db* files hold only Berkeley DB’s environment state (locks and shared-memory regions); the package data itself lives in the other files in /var/lib/rpm, so the removed files are easily regenerated.

Two notable bugs you may have seen on older CentOS systems can cause this. The big one, a “nasty and subtle race in shared mmap’ed page writeback” as it appears in the changelog, was quietly fixed in a kernel update in 2007. This one presented itself slightly differently than your report, though.

The other one, from 2009, happened when PackageKit killed yum at an inopportune time; it was also fixed. It would be more likely to affect desktop systems or servers with a GUI, though.

All of these bugs predate EL 6, and you should almost never see this occur on EL 6 or 7, nor should you see it if your EL 5 systems are up to date. (I have no idea about EL 4. If you have one, kill it before it spreads.) That said, anything that causes yum or rpm to die while working with the rpmdb could cause it. That includes the causes you’re most likely to see these days: random cosmic rays flipping bits, or someone getting overzealous with kill -9.

In RHEL 7, yum traps more signals during the actual transaction run, and you’ll see the message (shutdown inhibited). This should help prevent most situations in which someone or something interrupts the transaction and causes this problem.
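The general idea of shielding a critical section from signals can be illustrated in shell. This is not how yum implements it, only a sketch of the technique; run_shielded is a hypothetical helper name. Note that SIGKILL (kill -9) cannot be trapped or ignored, which is why it remains dangerous.

```shell
#!/bin/sh
# Sketch: run a command with INT, TERM and HUP ignored.
# run_shielded is a hypothetical helper name; this is not yum's own code.
run_shielded() {
    (
        # Ignored signals stay ignored in child processes, so the command
        # itself is protected, not just this subshell.
        trap '' INT TERM HUP
        "$@"
    )
    # The subshell exits with the command's status, and the caller's own
    # signal handling is untouched because the trap was set in the subshell.
}

# Possible use: the child sends itself SIGTERM and keeps running.
#   run_shielded sh -c 'kill -TERM $$; echo "still running"'
```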

View the full question and any other answers on Server Fault.

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.