Machine Check Exception

Technology

So I was upgrading Samus to a new version of gcc/glibc, as provided by portage, and Chris stopped by before we went to lunch. We were chatting about some things, when suddenly we hear this mysterious BEEP! from behind us. In an amazingly surreal moment, we turn around and look behind us, and Chris notices that the drive lights on Samus are blinking in an unusual way. "I think it rebooted," he says. No way, I'm thinking. Then I recognize the pattern of lights as the same pattern caused by my SCSI BIOS probing and resetting the bus. Yup, Samus had spontaneously rebooted.

Checking the logs after the box came back up revealed the following terrible-looking message:

Jun 16 12:03:59 samus CPU 0: Machine Check Exception: 0000000000000004
Jun 16 12:03:59 samus Bank 2: f60020000000017a at 000000001a2a8080
Jun 16 12:03:59 samus Kernel panic: CPU context corrupt
A little help from Google turned up an informative thread, with some amusing posts by Alan Cox and a pointer to a utility that parses machine check exceptions. This utility decoded the above junk into the junk below.
CPU 0
Status: (4) Machine Check in progress.
Restart IP invalid.
parsebank(2): f60020000000017a @ 1a2a8080
        External tag parity error
        Uncorrectable ECC error
        CPU state corrupt. Restart not possible
        Address in addr register valid
        Error enabled in control register
        Error not corrected.
        Error overflow
        Memory hierarchy error
        Request: Generic error
        Transaction type : Generic
        Memory/IO : I/O
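
Several of the lines above are just individual bits of the two logged values spelled out. Here is a minimal sketch of that part of the decoding. It is not the utility's actual code, and it assumes the CPU follows the architectural machine-check register layout (global status flags in bits 0 to 2, bank status flags in bits 57 to 63). The names `MCG_STATUS_FLAGS`, `BANK_STATUS_FLAGS` and `decode_flags` are made up for this example, and the model-specific parts of the bank value (the "External tag parity error" sort of detail) are deliberately left undecoded.

```python
# Sketch only: decodes the architectural flag bits of a machine check.
# Assumption: the standard register layout applies to this CPU;
# model-specific fields are not interpreted here.

# Global status register (the value after "Machine Check Exception:").
MCG_STATUS_FLAGS = {
    0: "RIPV",   # restart instruction pointer is valid
    1: "EIPV",   # error instruction pointer is valid
    2: "MCIP",   # machine check in progress
}

# Per-bank status register (the value after "Bank 2:").
BANK_STATUS_FLAGS = {
    63: "VAL",    # this register holds a valid error
    62: "OVER",   # error overflow: an earlier error was overwritten
    61: "UC",     # error was not corrected
    60: "EN",     # error reporting enabled in the control register
    59: "MISCV",  # the misc register holds extra information
    58: "ADDRV",  # the address register holds a valid address
    57: "PCC",    # processor context corrupt, restart not possible
}


def decode_flags(value, table):
    """Return the names of the flags set in value, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if (value >> bit) & 1]


mcg_flags = decode_flags(0x0000000000000004, MCG_STATUS_FLAGS)
bank_flags = decode_flags(0xF60020000000017A, BANK_STATUS_FLAGS)
error_code = 0xF60020000000017A & 0xFFFF  # low 16 bits: the error code field

print("global:", mcg_flags)
print("bank 2:", bank_flags)
print("error code: 0x%04x" % error_code)
```

With the values from my log, MCIP set and RIPV clear line up with "Machine Check in progress" and "Restart IP invalid", and the bank flags line up with the overflow, not corrected, enabled, address valid and state corrupt lines.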

The utility might have been more useful to me if it printed:

CPU 0
Status: (4) Machine Check in progress.
Something bad happened.

I've done some more experimentation, and it seems that the machine only dies like this when I've run the emerge under nohup, redirected the output, and backgrounded the process, like this (shell is tcsh):

nohup emerge -U glibc >& emerge.out &
When I just let it run in the foreground and print to the terminal, nothing bad happens.

Apparently, a Machine Check Exception is what you get when the CPU detects that something bad has happened inside it, and the details end up in registers in the CPU. "Something bad" is defined as things such as cosmic rays causing bits to randomly flip (less likely), or the CPU overheating. But why would this happen only when I redirect the output to a file instead of letting it print to the terminal? That doesn't make any sense to me.
