Machine Check Exception

So I was upgrading Samus to a new version of gcc/glibc, as provided by portage, and Chris stopped by before we went to lunch. We were chatting about some things, when suddenly we hear this mysterious BEEP! from behind us. In an amazingly surreal moment, we turn around and look, and Chris notices that the drive lights on Samus are blinking in an unusual way. “I think it rebooted,” he says. No way, I’m thinking. Then I recognize the pattern of lights as the same pattern caused by my SCSI BIOS probing and resetting the bus. Yup, Samus had spontaneously rebooted.

Checking the logs after the box came back up revealed the following terrible-looking message:

Jun 16 12:03:59 samus CPU 0: Machine Check Exception: 0000000000000004
Jun 16 12:03:59 samus Bank 2: f60020000000017a at 000000001a2a8080
Jun 16 12:03:59 samus Kernel panic: CPU context corrupt

A little help from Google revealed this informative thread, with some amusing posts by Alan Cox, and a pointer to a utility that parses the machine check exception. This utility decoded the above junk into the junk below.

CPU 0
Status: (4) Machine Check in progress.
Restart IP invalid.
parsebank(2): f60020000000017a @ 1a2a8080
External tag parity error
Uncorrectable ECC error
CPU state corrupt. Restart not possible
Address in addr register valid
Error enabled in control register
Error not corrected.
Error overflow
Memory hierarchy error
Request: Generic error
Transaction type : Generic
Memory/IO : I/O


The utility might have been more useful to me if it printed:

Status: (4) Machine Check in progress.
Something bad happened.
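
For what it's worth, several of the decoded lines come straight from individual bits in the two hex values in the log. Below is a minimal Python sketch of that mapping. It assumes the standard x86 machine-check register layout (a global status register and a per-bank status register), and the names `decode_flags`, `MCG_STATUS_BITS`, and `BANK_STATUS_BITS` are made up for illustration. It is not the utility mentioned above, and it ignores the model-specific bits that produce lines like "External tag parity error".

```python
# Sketch: decode the architectural flag bits of the two hex values
# from the log. Bit positions assume the standard x86 machine-check
# register layout; all names here are hypothetical.

# Global machine-check status (the value after "Machine Check Exception:").
MCG_STATUS_BITS = {
    0: "RIPV: restart IP valid",
    1: "EIPV: error IP valid",
    2: "MCIP: machine check in progress",
}

# Per-bank status (the value after "Bank 2:"), top flag bits only.
BANK_STATUS_BITS = {
    63: "VAL: status register valid",
    62: "OVER: error overflow",
    61: "UC: error not corrected",
    60: "EN: error enabled in control register",
    59: "MISCV: misc register valid",
    58: "ADDRV: address register valid",
    57: "PCC: processor context corrupt",
}


def decode_flags(value, bit_names):
    """Return the names of all flag bits set in value, highest bit first."""
    return [
        name
        for bit, name in sorted(bit_names.items(), reverse=True)
        if (value >> bit) & 1
    ]


if __name__ == "__main__":
    mcg_status = 0x0000000000000004   # from "Machine Check Exception:"
    bank_status = 0xF60020000000017A  # from "Bank 2:"

    print("Global status:")
    for flag in decode_flags(mcg_status, MCG_STATUS_BITS):
        print("  " + flag)

    print("Bank 2 status:")
    for flag in decode_flags(bank_status, BANK_STATUS_BITS):
        print("  " + flag)
```

For the values in the log, the global status has only the "machine check in progress" bit set. The "restart IP valid" bit is clear, which is where "Restart IP invalid" comes from. The bank status has the overflow, uncorrected, enabled, address-valid, and context-corrupt bits set, which lines up with the corresponding lines in the utility's output.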

I’ve done some more experimentation, and it seems that the machine only dies like this when I’ve nohup’d the emerge, redirected the output, and backgrounded the process, like this (shell is tcsh):

nohup emerge -U glibc >& emerge.out &

When I just let it run and dump to a shell, nothing bad happens.

Apparently, the Machine Check Exception reports the contents of a register in the CPU that gets set when something bad happens inside the CPU. “Something bad” is defined as things such as cosmic rays causing bits to randomly flip (less likely), or the CPU overheating and causing electrons to drift. But why would this happen only when I dump to a file instead of a shell? That doesn’t make any sense to me.