Life and code.
  • Machine Check Exception

    Posted on June 16th, 2004 by Brian

    So I was upgrading Samus to a new version of gcc/glibc, as provided by portage, and Chris stopped by before we went to lunch. We were chatting about some things, when suddenly we hear this mysterious BEEP! from behind us. In an amazingly surreal moment, we turn around and look behind us, and Chris notices that the drive lights are blinking on Samus in an unusual way. “I think it rebooted,” he says. No way, I’m thinking. Then I recognize the pattern of lights as the same pattern caused by my SCSI BIOS probing and resetting the bus. Yup, Samus had spontaneously rebooted.

    Checking the logs after the box came back up revealed the following terrible-looking message:

    Jun 16 12:03:59 samus CPU 0: Machine Check Exception: 0000000000000004
    Jun 16 12:03:59 samus Bank 2: f60020000000017a at 000000001a2a8080
    Jun 16 12:03:59 samus Kernel panic: CPU context corrupt

    A little help from Google revealed this informative thread, with some amusing posts by Alan Cox, and a link to a utility that parses the machine check exception. This utility decoded the above junk into the junk below.

    CPU 0
    Status: (4) Machine Check in progress.
    Restart IP invalid.
    parsebank(2): f60020000000017a @ 1a2a8080
            External tag parity error
            Uncorrectable ECC error
            CPU state corrupt. Restart not possible
            Address in addr register valid
            Error enabled in control register
            Error not corrected.
            Error overflow
            Memory hierarchy error
            Request: Generic error
            Transaction type : Generic
            Memory/IO : I/O

    The utility might have been more useful to me if it printed:

    CPU 0
    Status: (4) Machine Check in progress.
    Something bad happened.
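
    To be fair to the utility, several of its lines are just a bit-by-bit reading of the two logged values. Here is a minimal sketch in Python, assuming the standard x86 machine-check layout for the global status word and the top flag bits of a bank status word. The table names and descriptions are my own, not the utility's, and I make no attempt at the model-specific bits or the low 16-bit error code.

```python
# Assumed bit positions from the x86 machine-check architecture
# (global status word and per-bank status word). Names are my own.
MCG_BITS = {
    0: "restart IP valid",
    1: "error IP valid",
    2: "machine check in progress",
}
BANK_BITS = {
    63: "status valid",
    62: "error overflow",
    61: "uncorrected error",
    60: "error enabled in control register",
    59: "misc register valid",
    58: "address register valid",
    57: "processor context corrupt",
}


def decode(value, table):
    """Return the description of every bit in `table` that is set in `value`."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if (value >> bit) & 1]


# The two values from the log lines above.
mcg = decode(0x0000000000000004, MCG_BITS)
bank = decode(0xF60020000000017A, BANK_BITS)

print("Status:", mcg)
print("Bank 2:", bank)
```

    Under those assumptions, the status word 4 has only the "in progress" bit set, with the "restart IP valid" bit clear, which lines up with the utility's "Restart IP invalid." The bank word sets the overflow, uncorrected, enabled, address-valid and context-corrupt flags, which matches the corresponding lines in the utility's output.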

    I’ve done some more experimentation, and it seems that the machine only dies like this when I’ve nohuped the emerge, redirected the output, and backgrounded the process, like this (shell is tcsh):

    nohup emerge -U glibc >& emerge.out &

    When I just let it run and dump to a shell, nothing bad happens.

    Apparently, a Machine Check Exception is what you get when something bad happens inside the CPU: the CPU notices a hardware error, records it in a set of status registers, and raises the exception. “Something bad” is defined as things such as cosmic rays causing bits to randomly flip (less likely), or the CPU overheating causing electrons to drift. But why would this happen only when I dump to a file instead of a shell? That doesn’t make any sense to me.
