Windows System Software -- Consulting, Training, Development -- Engineering Excellent, Every Time.

Unexpected Case of Bugcheck IRQL_UNEXPECTED_VALUE (C8)

Yet another interesting case lands on our doorstep thanks to NTDEV (original post here).

I firmly believe that you have zero chance of diagnosing a non-trivial crash if you don’t understand the bugcheck code. The bugcheck code is, in fact, THE definitive reason for the crash. Of course, understanding the bugcheck code alone is hardly ever sufficient to diagnose the problem, but it’s fundamental to how you approach the particular crash.

The OP’s crash was fun for me because I’d never seen the bugcheck code before, and I’m always happy to meet a new type of system crash (especially if I wasn’t the one that caused it):

IRQL_UNEXPECTED_VALUE (c8)
The processor's IRQL is not what it should be at this time. This is
usually caused by a lower level routine changing IRQL for some period
and not restoring IRQL at the end of that period (eg acquires spinlock
but doesn't release it).
 if UniqueValue is 0 or 1
     2 = APC->KernelRoutine
     3 = APC
     4 = APC->NormalRoutine
Arguments:
Arg1: 0000000000000000, (Current IRQL << 16) | (Expected IRQL << 8) | UniqueValue
Arg2: 0000000000000002
Arg3: 0000000000000000
Arg4: 0000000000000000

Any time I'm presented with a crash, I try to dwell on the crash code and its arguments for a while before looking at anything else. In this case I was on board and feeling pretty good about the crash as I read the description. It seems reasonable to have a crash that results from someone changing the IRQL without ever restoring it.

Once I got to the arguments though I went right off a cliff. As I followed the description and decoded the arguments I learned that:

  1. The Current IRQL is 0
  2. The Expected IRQL is 0
  3. UniqueValue is 0, so:
    1. Arg2 is an APC's Kernel Routine and it's 2
    2. Arg3 is an APC and it's NULL
    3. Arg4 is an APC's Normal Routine and it's 0
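
The decoding above can be sketched in a few lines of C. This is a minimal illustration that only follows the formula printed in the bugcheck description; the struct and function names are hypothetical, and the 8-bit field width is an assumption inferred from the shift amounts.

```c
#include <stdint.h>

/* Fields packed into Arg1 per the description:
 * (Current IRQL << 16) | (Expected IRQL << 8) | UniqueValue
 * Assumption: each field is 8 bits wide, as the shifts suggest. */
typedef struct {
    uint8_t current_irql;
    uint8_t expected_irql;
    uint8_t unique_value;
} c8_arg1_fields; /* hypothetical name, not a Windows type */

/* Split a raw Arg1 value into its three documented fields. */
static c8_arg1_fields decode_c8_arg1(uint64_t arg1)
{
    c8_arg1_fields f;
    f.current_irql  = (uint8_t)((arg1 >> 16) & 0xFF);
    f.expected_irql = (uint8_t)((arg1 >> 8) & 0xFF);
    f.unique_value  = (uint8_t)(arg1 & 0xFF);
    return f;
}
```

Calling decode_c8_arg1(0) with the Arg1 from this crash yields zero for all three fields, which is exactly the nonsensical result listed above.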

Presumably this crash only happens if the "Current IRQL" doesn't match the "Expected IRQL", but what I just decoded doesn't support that. The other arguments don't make sense to me either, because I'd expect some other kind of crash if someone queued a NULL APC or an APC with a Kernel Routine set to 2.

This got me curious as to what the crash code actually meant, so I broke out WinDbg and started poking. The call stack indicated that ndis!ndisExpandStack called some function exported by NT, which then ended up in some optimized code area and crashed the machine:

nt!KeBugCheckEx
nt! ?? ::FNODOBFM::`string'+0x18d14
ndis!ndisExpandStack+0x19

Based on the name of the NDIS function I had a guess as to what function this was, but to confirm I disassembled ndis!ndisExpandStack:

uf ndis!ndisexpandstack
    sub     rsp,38h
    and     qword ptr [rsp+20h],0
    xor     r9d,r9d
    mov     r8d,4CCCh
    call    qword ptr [ndis!_imp_KeExpandKernelStackAndCalloutEx]
    add     rsp,38h
    ret

The theory at this point is that KeExpandKernelStackAndCalloutEx is the one generating the bugcheck code.

Looking at that function I see the source of the 0xC8 bugcheck and the mystery of the arguments is solved:

 mov rsi, cr8             ; Get the current IRQL
 ...
 call qword ptr [rsp+60h] ; Call the stack expand callback
 ...
 mov rax, cr8             ; Get the current IRQL
 cmp al, sil              ; Are they the same?
 jz short loc_14007040C   ; If yes just return, otherwise crash with C8
 movzx r8d, al            ; BugcheckArg2 = Current IRQL
 movzx edx, sil           ; BugcheckArg1 = Previous IRQL
 ...
 mov ecx, 0C8h            ; IRQL_UNEXPECTED_VALUE
 call KeBugCheckEx        ; Grand closing...

The bugcheck makes much more sense now. In the x64 calling convention ecx carries the bugcheck code, so edx holds the first bugcheck argument and r8d the second. Someone’s stack expansion callback was called at PASSIVE_LEVEL (Arg1 == 0) and returned at DISPATCH_LEVEL (Arg2 == 2). That’s against the rules, thus you get a system crash.
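
The check in that disassembly can be modeled in user-mode C. This is only a sketch of the save, call, compare pattern and not the actual kernel implementation: sim_irql stands in for CR8, and every name here is hypothetical. The misbehaving callout follows the bugcheck description's own example of raising the IRQL (as a spinlock acquire would) and never restoring it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simulated IRQL values; the numbers match PASSIVE_LEVEL and DISPATCH_LEVEL. */
enum { SIM_PASSIVE_LEVEL = 0, SIM_DISPATCH_LEVEL = 2 };

/* Stand-in for CR8, which holds the current IRQL on x64. */
static unsigned char sim_irql = SIM_PASSIVE_LEVEL;

typedef void (*sim_callout)(void *parameter);

typedef struct {
    bool would_bugcheck;          /* true where the real code raises 0xC8 */
    unsigned char irql_on_entry;
    unsigned char irql_on_return;
} sim_callout_result;

/* Mirrors the pattern in the disassembly: save IRQL, call, compare. */
static sim_callout_result sim_expand_stack_and_callout(sim_callout callout,
                                                       void *parameter)
{
    sim_callout_result r;
    r.irql_on_entry = sim_irql;     /* mov rsi, cr8 */
    callout(parameter);             /* call qword ptr [rsp+60h] */
    r.irql_on_return = sim_irql;    /* mov rax, cr8 */
    r.would_bugcheck = (r.irql_on_return != r.irql_on_entry); /* cmp al, sil */
    return r;
}

/* Raises the IRQL and restores it before returning. */
static void well_behaved_callout(void *parameter)
{
    unsigned char old_irql = sim_irql;
    (void)parameter;
    sim_irql = SIM_DISPATCH_LEVEL;  /* e.g. spinlock acquired */
    sim_irql = old_irql;            /* spinlock released */
}

/* Raises the IRQL and forgets to restore it. */
static void leaky_callout(void *parameter)
{
    (void)parameter;
    sim_irql = SIM_DISPATCH_LEVEL;  /* acquired, never released */
}
```

Passing well_behaved_callout leaves would_bugcheck false, while leaky_callout sets it, because the IRQL observed after the call no longer matches the one saved before it.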

Personally, I would call this a bug in KeExpandKernelStackAndCalloutEx, seeing as how it generates an IRQL_UNEXPECTED_VALUE using invalid (unexpected?) arguments. At a minimum, though, the documentation is currently wrong, and I have filed a bug to try to get that addressed.