[NMI watchdog is available for x86 and x86-64 architectures]

Is your system locking up unpredictably? No keyboard activity, just
a frustrating complete hard lockup? Do you want to help us debug
such lockups? If the answer to all of these is yes, then this document
is definitely for you.

On many x86/x86-64 systems there is a hardware feature that enables
us to generate 'watchdog NMI interrupts'.  (NMI: Non Maskable Interrupt,
which gets executed even if the system is otherwise locked up hard.)
This can be used to debug hard kernel lockups.  By executing periodic
NMI interrupts, the kernel can monitor whether any CPU has locked up,
and print out debugging messages if so.

In order to use the NMI watchdog, you need to have APIC support in your
kernel. For SMP kernels, APIC support gets compiled in automatically. For
UP, enable either CONFIG_X86_UP_APIC (Processor type and features -> Local
APIC support on uniprocessors) or CONFIG_X86_UP_IOAPIC (Processor type and
features -> IO-APIC support on uniprocessors) in your kernel config.
CONFIG_X86_UP_APIC is for uniprocessor machines without an IO-APIC.
CONFIG_X86_UP_IOAPIC is for uniprocessor machines with an IO-APIC. [Note:
certain kernel debugging options, such as Kernel Stack Meter or Kernel
Tracer, may implicitly disable the NMI watchdog.]
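
One way to check whether a uniprocessor kernel was built with the
required APIC support is to scan its config file for the two options
named above.  The Python sketch below is only an illustration: the
helper name has_up_apic_support and the sample fragment are made up for
this example, and where the config file lives depends on your setup.

```python
def has_up_apic_support(config_text):
    """Return True if either UP APIC option is set to 'y' in config_text."""
    wanted = {"CONFIG_X86_UP_APIC", "CONFIG_X86_UP_IOAPIC"}
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set".
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and name in wanted and value == "y":
            return True
    return False


# Hypothetical config fragment, for illustration only.
sample_config = """\
# CONFIG_SMP is not set
CONFIG_X86_UP_APIC=y
# CONFIG_X86_UP_IOAPIC is not set
"""

print(has_up_apic_support(sample_config))
```

SMP kernels do not need this check, since APIC support is compiled in
automatically there.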

For x86-64, the needed APIC is always compiled in, and the NMI watchdog is
always enabled with IO-APIC mode (nmi_watchdog=1). Currently, local APIC
mode (nmi_watchdog=2) does not work on x86-64.

Using local APIC mode (nmi_watchdog=2) needs the first performance
register, so you can't use that register for other purposes (such as
high precision performance profiling). However, at least oprofile and
the perfctr driver disable the local APIC NMI watchdog automatically.

To actually enable the NMI watchdog, use the 'nmi_watchdog=N' boot
parameter.  E.g. the relevant lilo.conf entry:

        append="nmi_watchdog=1"

For SMP machines and UP machines with an IO-APIC, use nmi_watchdog=1.
For UP machines without an IO-APIC, use nmi_watchdog=2; this only works
for some processor types.  If in doubt, boot with nmi_watchdog=1 and
check the NMI count in /proc/interrupts; if the count is zero, then
reboot with nmi_watchdog=2 and check the NMI count again.  If it is
still zero, then report a problem; you probably have a processor that
needs to be added to the NMI code.
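
The NMI count check described above can be scripted.  The following
Python sketch assumes that /proc/interrupts contains a row labelled
"NMI:" followed by one counter per CPU, optionally followed by a
description; the function names and the sample text are hypothetical
and exist only for this example.

```python
def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counts from /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for field in fields[1:]:
                if not field.isdigit():
                    break  # trailing description text, if any
                counts.append(int(field))
            return counts
    return []  # no NMI row found


def watchdog_is_ticking(interrupts_text):
    """True if at least one CPU shows a non-zero NMI count."""
    return any(count > 0 for count in nmi_counts(interrupts_text))


# Made-up sample in the assumed layout, for illustration only.
sample_interrupts = """\
           CPU0       CPU1
  0:        123        456   IO-APIC-edge  timer
NMI:         42         40   Non-maskable interrupts
"""

print(nmi_counts(sample_interrupts))
```

On a live system, pass the contents of /proc/interrupts to
watchdog_is_ticking(); a False result corresponds to the "count is
zero" case above.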

A 'lockup' is the following scenario: if any CPU in the system does not
execute the periodic local timer interrupt for more than 5 seconds, then
the NMI handler generates an oops and kills the process. This
'controlled crash' (and the resulting kernel messages) can be used to
debug the lockup. Thus whenever the lockup happens, wait 5 seconds and
the oops will show up automatically. If the kernel produces no messages,
then the system has crashed so hard (e.g. hardware-wise) that either it
cannot even accept NMI interrupts, or the crash has made the kernel
unable to print messages.
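
The detection rule above can be illustrated with a small model.  This
Python sketch is not the kernel's code: it is a simplified simulation
of one CPU, and the class name, the nmi_hz parameter (watchdog NMIs per
second) and the way the timer count is passed in are all assumptions
made for the example.  Only the 5 second threshold comes from the text.

```python
LOCKUP_SECONDS = 5  # threshold stated in the text


class WatchdogModel:
    """Toy model of the per-CPU check done on each watchdog NMI."""

    def __init__(self, nmi_hz=1):
        self.nmi_hz = nmi_hz          # assumed NMIs per second
        self.last_timer_count = 0     # timer count seen at the last NMI
        self.stalled_nmis = 0         # consecutive NMIs without progress

    def on_nmi(self, timer_count):
        """Handle one NMI; return True if a lockup would be reported."""
        if timer_count != self.last_timer_count:
            # The local timer interrupt ran since the last NMI.
            self.last_timer_count = timer_count
            self.stalled_nmis = 0
            return False
        # No timer progress: count this NMI towards the threshold.
        self.stalled_nmis += 1
        return self.stalled_nmis >= LOCKUP_SECONDS * self.nmi_hz


wd = WatchdogModel(nmi_hz=1)
healthy = [wd.on_nmi(count) for count in range(1, 4)]   # timer advancing
stalled = [wd.on_nmi(3) for _ in range(5)]              # timer stuck at 3
print(healthy, stalled)
```

In this model, a CPU whose timer count keeps advancing never triggers
the report, while a count that stays unchanged across roughly 5 seconds
of NMIs does.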

NOTE: starting with 2.4.2-ac18 the NMI-oopser is disabled by default;
you have to enable it with a boot time parameter.  Prior to 2.4.2-ac18
the NMI-oopser was enabled unconditionally on x86 SMP boxes.

[ feel free to send bug reports, suggestions and patches to
  Ingo Molnar <mingo@redhat.com> or the Linux SMP mailing
  list at <linux-smp@vger.kernel.org> ]