Software Design Notes Crash Handling SECTION 18 Crash Handling 18.1 ILLEGAL OPERATION When the microcode detects an irrecoverable or "can't happen" error, it will crash the system. This process is known as ILLOP _____ after the Illegal Instruction Operation microcode routine that _____________________________ performs it. ILLOP causes the machine to halt. Further operation in the presence of the irrecoverable error may only worsen the situation or complicate it beyond analysis. Instead, ILLOP will make notes about the error in a crash record stored in the nonvolatile NVRAM on the System Interface Board, and then halt the machine. ILLOP is a very low level routine. It assumes very little about the state of the machine and it will just halt if it detects that any of its assumptions are wrong. It does not require that any of interrupts, device support, virtual memory, storage allocation, garbage collection, Lisp object support, function calling, or instruction execution be intact. It does assume that the processor is functioning properly, that the registers A-Zero, M-Zero, A-Ones, and M-Ones are set up to contain 0 or -1 as appropriate; that the NuBus is available; and the the NVRAM can be read and written. After the crash RAM has been written, the crash is indicated by complementing the video sense of the screen. The effect is dramatic. This may fail if the memory interface or the screen interface is not functioning properly but a failure of this operation will not affect the proper recording of the crash in the crash record. The analysis of crash records is performed by the Lisp crash _____ analyzer. The crash analyzer functions are documented fully in ________ the Explorer Tools and Utilities manual, as are many of the ____________________________ higher level details of crash reporting. This section is intended to be a supplement to that discussion directed at systems programmers. For the final word, consult the Lisp code in SYS:NVRAM; CRASH and the other files in the NVRAM system. 18-1 Crash Handling Software Design Notes NOTE Unless otherwise indicated, all symbols and functions named here are in the SYSTEM package. 18.2 CRASH RECORDING ILLOP stores some of the machine state in the NVRAM so that the next successful startup can explain the cause of the crash. If the system cannot be successfully started, field service can read the crash reason from diagnostic hardware and/or software. This data is called a crash record. ____________ In order that an unsuccessful attempt to restart the system will not lose the original crash data, crash records for the last few system shutdowns are kept in a circular buffer in NVRAM. Each time the system is started a record is allocated from the buffer. When the system halts, the reason is recorded in the crash record. NOTE In order for proper recording of crash information to take place, the structure of NVRAM must first be initialized using the SETUP-NVRAM function. Only then can valid crash records be recorded. Since NVRAM is non-volatile memory, the SETUP-NVRAM function should only need to be run when the system is first installed or after service maintenance has been performed on the System Interface Board. 18.2.1 Crash Record Allocation. Allocation of a crash record for the current system startup is performed by the microcode during the boot process. It occurs as early in the boot as possible so that useful information can be recorded by ILLOP if the machine crashes during the boot. 18-2 Software Design Notes Crash Handling Allocation of the crash record is controlled by four 16-bit numbers that are stored at a known place within the NVRAM. These are shown in Table 18-1. All of the allocation registers contain 16-bit byte offsets into the NVRAM, which is accessed using physical memory references. Offsets are expressed in hexadecimal. NVRAM is actually 8-bit memory mapped into the NuBus address space one-byte-per-32-bit-NuBus-word; hence only multiple-of-four byte addresses are used (eg, 0, 4, 8, #xC, #x10, ...). A 16-bit quantity (such as the crash record allocation registers) is stored with its low order bits in the lowest address and its high order bits in the address 4 higher. For example, #xF4B2 would be stored in NVRAM-CRASH-BUFF-POINTER with #xB2 in NVRAM-BASE + #x90 and #xF4 in NVRAM-BASE + #x94. NVRAM-CRASH-BUFF-POINTER is the offset into the NVRAM (in bytes) of the beginning of the currently selected crash record. The currently selected crash record describes the current system startup. When the system is running, it points to the record that will be filled in when the system next halts. When the system is not running, it points to the crash record for the last system shutdown. NVRAM-CRASH-BUFF-REC-LEN is the size of a crash record. It is the amount by which to increase NVRAM-CRASH-BUFF-POINTER to reach the next record. NVRAM-CRASH-BUFF-LAST is the offset to the beginning of the last crash record in the buffer. This is used by allocation and also when scanning the buffer backward to see the history of shutdowns. NVRAM-CRASH-BUFF-BASE is the offset to the beginning of the first crash record in the buffer. This is used by allocation to "wrap around" the buffer. The algorithm to allocate a new crash record, then, is: Add NVRAM-CRASH-BUFF-REC-LEN to NVRAM-CRASH-BUFF-POINTER. The result is the new NVRAM-CRASH-BUFF-POINTER. If the new NVRAM-CRASH- BUFF-POINTER is greater than NVRAM-CRASH-BUFF-LAST then it should be reset to NVRAM-CRASH-BUFF-BASE. The algorithm for finding the previous crash record is: Subtract NVRAM-CRASH-BUFF-REC-LEN from NVRAM-CRASH-BUFF-POINTER. If this is less than NVRAM-CRASH-BUFF-BASE, it should be set to NVRAM- CRASH-BUFF-LAST. 18-3 Crash Handling Software Design Notes Table 18-1 Crash Record Allocation Registers NVRAM-BASE plus (hex) Allocation Register ----------- ---------------------------------- 80 NVRAM-CRASH-BUFF-FORMAT-PROCESSOR 88 NVRAM-CRASH-BUFF-FORMAT-REV 90 NVRAM-CRASH-BUFF-POINTER 98 NVRAM-CRASH-BUFF-REC-LEN A0 NVRAM-CRASH-BUFF-LAST A8 NVRAM-CRASH-BUFF-BASE 18.2.2 Crash Record Contents. The crash record format is shown in Table 18-2. The templates for the crash table, for the rest of NVRAM, and for other information used by crash record support can be found in the file SYS:UCODE;LROY-QDEV. The list CRASH-RECORD-OFFSETS contains a list of symbolic names whose values are these offsets. 18-4 Software Design Notes Crash Handling Table 18-2 Crash Record Format NVRAM-CRASH-BUFF-POINTER Contents plus (hex) General information -------------------- ------------------------------ 0 Progress field. Indicates how far into the boot process the system progressed before halting. Load information -------------------- ------------------------------ 4 Disk controller slot number 8 Disk device number for microload C Disk device number for Load Band 10 Microload name (4) 20 Load Band name (4) 30 Microload version (2) 38 Load Band version (2) 40 Load Band revision (2) Date and time information -------------------- ------------------------------ 48 Month of boot 4C Day of boot 50 Year of boot 54 Hour of boot 58 Minute of boot 5C Month of crash 60 Day of crash 64 Year of crash 68 Hour of crash 6C Minute of crash Flags field -------------------- ------------------------------ 70 Report Flags Shutdown information -------------------- ------------------------------ 74 Halt Location (2) 7C Halt Kind Saved Registers -------------------- ------------------------------ 80 contents of M-1 (4) 90 contents of M-2 (4) A0 contents of MD (4) B0 contents of VMA (4) C0 contents of M-FEF (4) D0 contents of UPC-1 (2) 18-5 Crash Handling Software Design Notes D8 contents of UPC-2 (2) E0 Location counter (4) F0 contents of M-T (4) 100 Length of crash record The progress field indicates how far into the boot process the system progressed before crashing. Currently supported values for this field are listed in the variable CRASH-RECORD-PROGRESS- CODES. The load information fields are initialized during crash record allocation by the microcode. They reflect the load information saved by the boot process. The boot time is written to the crash record by a function on the WARM-INITIALIZATIONS-LIST after the system time base has been initialized. The crash time fields are updated about every five minutes from Lisp by the TM-Update process so that they will accurately reflect the time if the machine halts. The report flags are used to store information about which crash records have been logged to a crash-log file by the crash analyzer (see section on crash analyzer below). It is also contains a flag that is set if this boot was a warm boot so that information can be displayed in the crash record. Crashs that occur after a warm boot or warm boot attempt are often caused by problems in the warm booted environment, and hence are not as interesting as crashes that occur in a cold booted environment. The halt kind field indicates the type of the last shutdown. The field is initialized to the System Boot state by the crash record allocation microcode, then updated during ILLOP to reflect the crash kind. Possible values for halt kind are: * System Boot. The last shutdown was caused by a cold boot sequence or by a warm boot sequence during normal operation *that is, not a warm boot initiated after the machine crashed). Note that since the crash record allocation routine is called from warm-boot, the system can be warm booted after an abnormal shutdown and still retain all the crash record information from that shutdown. However, if the machine was in a hung state when warm or cold booted, ILLOP processing will not be done, and the shutdown will be reported in this category. META-CONTROL-META-CONTROL-C can be used to force a crash from a hard run, and therefore produces a crash record. 18-6 Software Design Notes Crash Handling * Microcode Halt. This indicates that the last crash was called from the microcode when it detected an unrecoverable error condition. In this case, ILLOP stores the micro-pc address from which ILLOP was called in the halt address field. This micro-pc is later looked up in the crash table database by the crash analyzer in order to provide the user with a text description of the microcode crash reason. See the section on the crash analyzer, below. * Hardware Halt. This halt kind is currently unsupported. All detectable hardware halt conditions (such as memroy parity errors, illegal page faults, and so forth) currently fall into the microcode halt category. * Lisp Halt. The last halt was called by Lisp through the si:%crash function. In this case, a Lisp crash code (which is one of the arguments to %CRASH) is stored in the halt address field. The second argument to %CRASH (an object) is stored in the M-1 field. Currently, the only valid Lisp crash code is 0 which indicates a normal system shutdown called from Lisp. This code is seen when the system was halted by a user-initiated call to either SHUTDOWN or SYSTEM-SHUTDOWN. The saved registers fields contain the values of the indicated processor registers when ILLOP was called. Their values can appear in the microcode crash descriptor text reported by the crash analyzer, and in any case their values are labeled and displayed in the crash analysis report. 18.2.3 The Crash Table. The crash analysis database consists of the table of crash codes and crash descriptions produced by the micro assembler. This table is kept in the file SYS:UBIT;.CRASH#nnn, where nnn is the microcode revision number and ucode-name is obtained ___ __________ by looking up the microcode name associated with the current value of MICROCODE-TYPE-CODE in the *MICROCODE-NAME-ALIST*. The crash analyzer loads the appropriate version of this file into memory when a microcode crash is being reported. The crash table file contains a complete list of all current crash descriptor textx. Note that there may be some microcode crashes that will not have a description in the crash database. This fact will be reported by the crash analyzer. The crash micro-pc for such crashes is valid, however, and can be used to determine the path taken to ILLOP. 18-7 Crash Handling Software Design Notes The crash table file is generated by the microassembler. At points in the microcode where ILLOP may be called a CRASH-TABLE pseudo-op in generated. From this a Crash Table Entry (CTE) is _________________ generated and added to the list of CTEs that make up the crash table file. The format of a CRASH-TABLE pseudo-op and the resulting CTE are as follows: Ucode CRASH-TABLE pseudo-op: ---------------------------- General form: (CRASH-TABLE ) Example: (CRASH-TABLE "Nasty data type ~a read from address ~x" (Q-DATA-TYPES (LDB %%Q-Data-Type MD)) VMA) Crash Table Entry (in .CRASH file): ----------------------------------- General form: ( ) Example: (34174 "Nasty data type ~a read from address ~x" (Q-DATA-TYPES (LDB %%Q-Data-Type MD)) VMA) The first element of the CTE is the microcode PC address from which the call to ILLOP was made. The second element is usually a string. In this case we essentially do the following to format the crash description. (APPLY #'FORMAT (CDR