The internal error (IERR) signal on Intel microprocessors (CPUs) can be triggered by a number of external events in addition to an internal CPU error. An algorithm and methodology are disclosed here that filter IERR events so that unnecessary replacement of CPUs can be avoided and the actual causes of the events can be isolated.
Intel CPU IERR Filtering Algorithm and Fault Isolation Methodology
CPUs manufactured by Intel Corporation have an internal error (IERR) signal that is usually tied to an interrupt line or to an external monitor. The original purpose of this signal was to indicate an internal, unrecoverable CPU error, and the normal procedure is to replace any CPU that signals an IERR. More recently, the IERR signal has also been triggered by non-CPU faults. Bus time-outs, forward progress stalls in multi-processor configurations, evaluation versions of Windows operating systems, and other triggers that are not hardware faults have been identified as causes of an active IERR signal. On occasion, two or more CPUs signal an IERR simultaneously. Laboratory experience has shown that the CPU(s) can be restarted and the IERR often does not recur. In the field, the result is that non-faulty CPUs are replaced first, and the real source of the IERR is discovered only after multiple IERR events and one or more CPU replacements.
The IERR Filtering Algorithm and Fault Isolation Methodology differentiates IERR events that result from hardware faults from IERR events due to other causes, and it provides a way to isolate the causes of the events that trigger the IERR signal. IERR events are filtered based on known causes, indicators, and most probable events, and the system is restarted automatically in most cases. The algorithm and methodology use supplemental information, fault probabilities, and prior knowledge about IERR events to determine when a likely false CPU internal error has been signaled.
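The filtering idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the disclosed implementation: the `IerrEvent` fields, the `Verdict` names, and the rule that a simultaneous IERR on several CPUs is treated as improbable for a genuine CPU fault are all hypothetical, chosen to mirror the non-CPU triggers named in the text.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    # Hypothetical outcome names, not taken from the disclosure.
    LIKELY_FALSE_IERR = "restart the system, keep the CPU"
    NEEDS_RECURRENCE_CHECK = "restart and watch the same CPU"


@dataclass(frozen=True)
class IerrEvent:
    # Indices of the CPUs asserting IERR in this event.
    signaling_cpus: frozenset
    # Hypothetical indicator fields for the known non-CPU triggers.
    bus_timeout: bool = False
    forward_progress_stall: bool = False
    evaluation_os: bool = False


def filter_ierr(event: IerrEvent) -> Verdict:
    """Classify one IERR event from its indicators (assumed rules)."""
    known_non_cpu_cause = (
        event.bus_timeout
        or event.forward_progress_stall
        or event.evaluation_os
    )
    if known_non_cpu_cause:
        return Verdict.LIKELY_FALSE_IERR
    # Assumption: several CPUs failing internally at the same instant is
    # far less probable than a shared external cause.
    if len(event.signaling_cpus) > 1:
        return Verdict.LIKELY_FALSE_IERR
    # A lone CPU with no known indicator cannot be judged from one event.
    return Verdict.NEEDS_RECURRENCE_CHECK
```

For example, `filter_ierr(IerrEvent(frozenset({0}), bus_timeout=True))` yields `Verdict.LIKELY_FALSE_IERR`, while a lone CPU with no indicator yields `Verdict.NEEDS_RECURRENCE_CHECK`.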
The IERR Filtering Algorithm can be implemented as shown in the following flow diagram. In this flow, the H8 is an external processor that monitors the server system.
[Flow diagram: IERR filtering algorithm. The layout of the figure did not survive extraction; the box labels are listed below, grouped by type. The connections between boxes are not recoverable from the text.]

Entry:
Start (from H8 reset)
Reset all IERR flags
Waiting for I-Error

Decisions (Yes / No):
I-Error detected?
Multiple CPU IERR?
Same CPU IERR flag set?
Single CPU active?
H8 Rec. Ack from BIOS? (with Time Out branch)
3 Strike Bit set?

Restart actions:
Restart all CPUs
Restart all but faulted CPU
Power cycle system

Reset sequence:
1. H8: Reset/reboot system (with no additional CPUs held off)
2. BIOS: Logs MC Regs to MM
3. BIOS: If 3-strike set, send msg to MM
4. BIOS: Clears MC Regs
BIOS: Sends status to H8

Logging and indicator actions:
Log error msg #1; clear IERR flags; turn failed CPU LED on; turn on System Error LED
Log error msg #2
Log error msg #3
Log error msg #4; clear IERR flags; turn on System Information LED
Log error msg #5; turn on System Error LED; turn CPU LED on; clear IERR flags; power down
Set faulted CPU IERR flag; clear other CPU IERR flags; turn on System Information LED
Clear CPU IERR flags; turn on System Information LED
clp 10-16-2003 rev. wes 11-12-2003
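One possible reading of the per-CPU IERR flag handling in the flow diagram is sketched below in Python. The edges of the diagram are not recoverable from the text, so the branch order here is an assumption: a multiple-CPU IERR clears the flags and restarts everything, a first IERR on a single CPU only flags that CPU, and a repeat IERR on an already flagged CPU marks it as failed and restarts the system without it. The class name `H8Monitor`, the action strings, and the omission of the numbered error messages, the BIOS acknowledgement, and the 3-strike check are all simplifications for illustration.

```python
class H8Monitor:
    """Tracks per-CPU IERR flags across warm restarts (hypothetical model)."""

    def __init__(self):
        # CPUs flagged by an earlier IERR; empty after "Reset all IERR flags".
        self.ierr_flags = set()

    def on_ierr(self, signaling_cpus):
        """Return the assumed list of actions for one detected IERR event."""
        cpus = set(signaling_cpus)
        if not cpus:
            raise ValueError("an IERR event needs at least one signaling CPU")

        if len(cpus) > 1:
            # Multiple CPU IERR: treated as a likely non-CPU cause.
            self.ierr_flags.clear()
            return [
                "clear_ierr_flags",
                "system_information_led_on",
                "restart_all_cpus",
            ]

        (cpu,) = cpus
        if cpu in self.ierr_flags:
            # Same CPU signaled again: treat it as the faulted CPU.
            self.ierr_flags.clear()
            return [
                f"failed_cpu_led_on:{cpu}",
                "system_error_led_on",
                "clear_ierr_flags",
                f"restart_all_but:{cpu}",
            ]

        # First IERR on this CPU: remember it and give it another chance.
        self.ierr_flags = {cpu}
        return [
            f"set_ierr_flag:{cpu}",
            "clear_other_ierr_flags",
            "system_information_led_on",
            "restart_all_cpus",
        ]
```

Calling `on_ierr({2})` twice in a row on a fresh `H8Monitor` first flags CPU 2 and restarts all CPUs, then, on the repeat, lights the failed-CPU indicator and restarts all but CPU 2. Keeping the flag in the external monitor rather than in the host is what lets the decision survive a warm restart.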
In this implementation, the system runs until an IERR signal is detected by the H8. The H8 warm restarts the system. On reboot, system BIOS retrieves the contents of the machine check registers loca...