Windows System Software -- Consulting, Training, Development -- Engineering Excellence, Every Time.

Debugging Race Conditions

One of the situations in which post-mortem debugging is useful is in identifying race conditions – timing windows in which the state of the machine changes between two points in the code’s execution, so that decisions made later in the code are incorrect because the state of the system has changed in some fundamental way.

In today’s case, we have what looks like a potential candidate for such a race, although we can’t really prove it.  Then again, in my experience we never really do “prove” race conditions – we detect them, theorize about them, change the code to eliminate them, and hope they never come back.

Then again, I’ve seen a printf in code make the bug go away as well, so the frustrating part of these cases is that the solution is never really definitive.

In today’s case, we have a fairly mundane Windows 7 box.  I recently added 4GB of memory to this system, and shortly thereafter it crashed (bug check 0x3B).  Naturally, having this happen right after a memory upgrade made me slightly suspicious, so I took the opportunity to explore this particular crash in a bit more detail.

One nice change (since Windows Vista) is that machines are configured by default to create kernel summary dumps, so there is more we can do than a merely superficial analysis of the problem.

Naturally, I started with the usual !analyze -v and looked at its results:

2: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

SYSTEM_SERVICE_EXCEPTION (3b)
An exception happened while executing a system service routine.
Arguments:
Arg1: 00000000c0000005, Exception code that caused the bugcheck
Arg2: fffff8800106f141, Address of the exception record for the exception that caused the bugcheck
Arg3: fffff88005f45960, Address of the context record for the exception that caused the bugcheck
Arg4: 0000000000000000, zero.

Debugging Details:
------------------

PEB is paged out (Peb.Ldr = 00000000`7efdf018).  Type ".hh dbgerr001" for details
PEB is paged out (Peb.Ldr = 00000000`7efdf018).  Type ".hh dbgerr001" for details

EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx.
 The memory could not be %s.

FAULTING_IP: 
fltmgr!TreeUnlinkMulti+51
fffff880`0106f141 488b4620        mov     rax,qword ptr [rsi+20h]

CONTEXT:  fffff88005f45960 -- (.cxr 0xfffff88005f45960)
rax=fffffaf7072f96b0 rbx=0000000000000000 rcx=fffffa8008e13318
rdx=fffffa8007e5b550 rsi=0000009d00000000 rdi=0000000000000000
rip=fffff8800106f141 rsp=fffff88005f46330 rbp=fffffa8008e13318
 r8=ffffffffffffffff  r9=ffffffffffffffff r10=fffffffffffffe4a
r11=0000000000000001 r12=fffffa8007e5b550 r13=fffffa8007e34684
r14=0000000000004000 r15=0000000000000000
iopl=0         nv up ei pl nz na po nc
cs=0010  ss=0018  ds=002b  es=002b  fs=0053  gs=002b             efl=00010206
fltmgr!TreeUnlinkMulti+0x51:
fffff880`0106f141 488b4620        mov     rax,qword ptr [rsi+20h] ds:002b:0000009d`00000020=????????????????
Resetting default scope

DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT

BUGCHECK_STR:  0x3B

PROCESS_NAME:  CarboniteServi

CURRENT_IRQL:  0

LAST_CONTROL_TRANSFER:  from fffff8800106c460 to fffff8800106f141

STACK_TEXT:  
fffff880`05f46330 fffff880`0106c460 : fffffa80`07a96920 fffffa80`07e5b550 fffffa80`07a96920 00000000`00000000 : fltmgr!TreeUnlinkMulti+0x51
fffff880`05f46380 fffff880`0106cbe9 : fffff880`05f48000 00000000`00000002 00000000`00000000 00000000`00000000 : fltmgr!FltpPerformPreCallbacks+0x730
fffff880`05f46480 fffff880`0106b6c7 : fffffa80`08b93c10 fffffa80`07ca8de0 fffffa80`07b402c0 00000000`00000000 : fltmgr!FltpPassThrough+0x2d9
fffff880`05f46500 fffff800`02da278e : fffffa80`07e5b550 fffffa80`07dfa8e0 fffffa80`07e5b550 fffffa80`07ca8de0 : fltmgr!FltpDispatch+0xb7
fffff880`05f46560 fffff800`02a918b4 : fffffa80`07e34010 fffff800`02d8f260 fffffa80`06d17c90 00000000`ff060001 : nt!IopDeleteFile+0x11e
fffff880`05f465f0 fffff800`02d900e6 : fffff800`02d8f260 00000000`00000000 fffff880`05f469e0 fffffa80`08b93c10 : nt!ObfDereferenceObject+0xd4
fffff880`05f46650 fffff800`02d85e84 : fffffa80`07c3fcd0 00000000`00000000 fffffa80`07a17b10 fffffa80`0a31e701 : nt!IopParseDevice+0xe86
fffff880`05f467e0 fffff800`02d8ae4d : fffffa80`07a17b10 fffff880`05f46940 0067006e`00000040 fffffa80`06d17c90 : nt!ObpLookupObjectName+0x585
fffff880`05f468e0 fffff800`02d1ee3c : fffffa80`08cf07e0 00000000`00000007 fffffa80`00001f01 00001f80`00f40200 : nt!ObOpenObjectByName+0x1cd
fffff880`05f46990 fffff800`02a8b993 : fffffa80`0a31e7e0 00000000`00000000 fffffa80`0a31e7e0 00000000`7ef95000 : nt!NtQueryFullAttributesFile+0x14f
fffff880`05f46c20 00000000`77320eba : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x13
00000000`0121e778 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x77320eba


FOLLOWUP_IP: 
fltmgr!TreeUnlinkMulti+51
fffff880`0106f141 488b4620        mov     rax,qword ptr [rsi+20h]

SYMBOL_STACK_INDEX:  0

SYMBOL_NAME:  fltmgr!TreeUnlinkMulti+51

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: fltmgr

IMAGE_NAME:  fltmgr.sys

DEBUG_FLR_IMAGE_TIMESTAMP:  4a5bc11f

STACK_COMMAND:  .cxr 0xfffff88005f45960 ; kb

FAILURE_BUCKET_ID:  X64_0x3B_fltmgr!TreeUnlinkMulti+51

BUCKET_ID:  X64_0x3B_fltmgr!TreeUnlinkMulti+51

Followup: MachineOwner
---------

Oddly, the first thing that jumped out at me is that I had seen a similar crash recently (we use crash dump examples in our kernel debugging seminar).  This reminded me that it is not uncommon to see “patterns” like this as you look at crash dumps, particularly random crash dumps like this one.  The two crashes are not exactly the same, but they are intriguingly similar.  Here is the stack trace from that earlier crash:

3: kd> k
Child-SP          RetAddr           Call Site
fffff880`06678208 fffff800`02afcb19 nt!KeBugCheckEx
fffff880`06678210 fffff800`02a7cfee nt! ?? ::FNODOBFM::`string'+0x40edb
fffff880`06678370 fffff800`02a624ae nt!KiPageFault+0x16e
fffff880`06678500 fffff880`010f6373 nt!RtlDeleteNoSplay+0x2a
fffff880`06678530 fffff880`010f2238 fltmgr!TreeUnlinkNoBalance+0x13
fffff880`06678560 fffff880`0111046f fltmgr!TreeUnlinkMulti+0x148
fffff880`066785b0 fffff880`01110dfe fltmgr!DeleteNameCacheNodes+0x9f
fffff880`066785f0 fffff880`011202af fltmgr!PurgeStreamNameCache+0x8e
fffff880`06678630 fffff880`01117a30 fltmgr!FltpPurgeVolumeNameCache+0x7f
fffff880`06678670 fffff880`01110d4b fltmgr! ?? ::NNGAKEGL::`string'+0x1a04
fffff880`066786b0 fffff880`010f306a fltmgr!FltpReinstateNameCachingAllFrames+0x4b
fffff880`066786e0 fffff800`02a81516 fltmgr!FltpPassThroughCompletion+0x8a
fffff880`06678720 fffff880`01250f30 nt!IopfCompleteRequest+0x3a6
fffff880`06678800 fffff880`012dadfa Ntfs!NtfsExtendedCompleteRequestInternal+0x110
fffff880`06678840 fffff880`01249e0c Ntfs!NtfsCommonSetInformation+0xef1
fffff880`06678920 fffff880`010f023f Ntfs!NtfsFsdSetInformation+0x11c
fffff880`066789a0 fffff880`010ee6df fltmgr!FltpLegacyProcessingAfterPreCallbacksCompleted+0x24f
fffff880`06678a30 fffff800`02d5f49d fltmgr!FltpDispatch+0xcf
fffff880`06678a90 fffff800`02a7e153 nt!NtSetInformationFile+0x909
fffff880`06678bb0 00000000`775f012a nt!KiSystemServiceCopyEnd+0x13

While not identical, what stuck out in my head is that they are both (essentially) in the same piece of logic, suggesting to me that there might be a data corruption or race condition issue going on here (although our original theory for the crash that we use in the seminar is that it is actually a single bit error).

Patterns of this type can bring (and have brought) insight in the past.  So let’s see where this analysis takes me, and whether or not it matches anything from that previous analysis.  One advantage here is that I have some experience in walking the relevant data structures.

One other note about this particular machine: it’s not used for development; it’s used for video editing.  As such, it isn’t running anything other than “stock” software that one might find on a typical production computer system.

In this particular case we seem to have what at first glance looks like a garden variety NULL pointer dereference:

fffff880`0106f141 488b4620        mov     rax,qword ptr [rsi+20h] ds:002b:0000009d`00000020=????????????????

Note that RSI is not actually NULL here: its low 32 bits are zero, but its upper 32 bits contain what looks like garbage (0000009d), so the resulting user mode address is simply not valid.  As is typical in a case like this, we’ll walk backwards in the code stream to try and figure out where this value originated, since it is often the case that we can glean some understanding of what this code is trying to do.  See below for the code leading up to this point (via the u command):

fltmgr!TreeUnlinkMulti:
fffff880`0106f0f0 fff3            push    rbx
fffff880`0106f0f2 55              push    rbp
fffff880`0106f0f3 57              push    rdi
fffff880`0106f0f4 4883ec30        sub     rsp,30h
fffff880`0106f0f8 33ff            xor     edi,edi
fffff880`0106f0fa 488be9          mov     rbp,rcx
fffff880`0106f0fd 4883faff        cmp     rdx,0FFFFFFFFFFFFFFFFh
fffff880`0106f101 0f840c010000    je      fltmgr!TreeUnlinkMulti+0x123 (fffff880`0106f213)
fffff880`0106f107 4c89642458      mov     qword ptr [rsp+58h],r12
fffff880`0106f10c 4c8be2          mov     r12,rdx
fffff880`0106f10f 4983f8ff        cmp     r8,0FFFFFFFFFFFFFFFFh
fffff880`0106f113 0f85eb450000    jne     fltmgr! ?? ::FNODOBFM::`string'+0x504 (fffff880`01073704)
fffff880`0106f119 4889742450      mov     qword ptr [rsp+50h],rsi
fffff880`0106f11e 488b31          mov     rsi,qword ptr [rcx]
fffff880`0106f121 4885f6          test    rsi,rsi
fffff880`0106f124 7518            jne     fltmgr!TreeUnlinkMulti+0x4e (fffff880`0106f13e)
fffff880`0106f126 488bdf          mov     rbx,rdi
fffff880`0106f129 488b742450      mov     rsi,qword ptr [rsp+50h]
fffff880`0106f12e 488bc3          mov     rax,rbx
fffff880`0106f131 4c8b642458      mov     r12,qword ptr [rsp+58h]
fffff880`0106f136 4883c430        add     rsp,30h
fffff880`0106f13a 5f              pop     rdi
fffff880`0106f13b 5d              pop     rbp
fffff880`0106f13c 5b              pop     rbx
fffff880`0106f13d c3              ret
fffff880`0106f13e 488bdf          mov     rbx,rdi

Given that there is a ret instruction two lines earlier, we scan backwards for a jump or branch to that location.  This type of code sequence is typical of an if/else style statement (where there is a return in one of the two code blocks):

fffff880`0106f124 7518            jne     fltmgr!TreeUnlinkMulti+0x4e (fffff880`0106f13e)

Let’s look at this test more closely.  The test instruction performs a bitwise AND of the RSI register with itself; the result is discarded, but the flags (notably the zero flag) are set accordingly:

fffff880`0106f121 4885f6          test    rsi,rsi

…and then jne conditionally jumps based upon that result – it jumps if the zero flag is clear, that is, if RSI is non-zero:

fffff880`0106f124 7518            jne     fltmgr!TreeUnlinkMulti+0x4e (fffff880`0106f13e)

Thus, this tests the value in RSI: if it is zero, we fall through to the early-return path we identified above; if it is non-zero, we jump down to fffff880`0106f13e and proceed to use it.  That logic is perfectly sensible for a tree walk – an empty tree means there is nothing to unlink.  When I first read this sequence it made me question whether I was interpreting it correctly, because the puzzle is that RSI passed the non-zero test and yet held a value that is not a valid pointer.  A sanity check here would be, “Well, if the value had been good when tested and mutated afterwards within the CPU, that would suggest a hardware problem of some sort.”

The encouraging insight here is that this certainly doesn’t look like it is related to the new memory added to the system.

Thus, the next logical step was to look and see, “So, where did the value in RSI originate?”  That’s also present in this instruction stream:

fffff880`0106f11e 488b31          mov     rsi,qword ptr [rcx]

So if we look at the value in RCX:

2: kd> dq @rcx l1
fffffa80`08e13318  fffff8a0`0d106d58

It is peculiar – this does not match the value we found in RSI.  One might wonder whether the context record simply failed to capture some registers, since not all registers are necessarily saved; however, values that are not saved on the x64 platform show as zero in the trap frame and context record, and these registers are non-zero, which suggests they were captured.

Thus, this suggests a few possibilities:

  1. The value captured in the context record is not correct;
  2. The value loaded into the RSI register was incorrect (ergo, a CPU problem of some sort);
  3. The value to which RCX points has changed since it was loaded into RSI.

The hardware error theory seems rather unlikely, so let’s focus on the other two for a bit.

If we look at the data type of RCX we can see some useful information:

2: kd> !pool @rcx
Pool page fffffa8008e13318 region is Nonpaged pool
 fffffa8008e13000 size:  150 previous size:    0  (Allocated)  File (Protected)
 fffffa8008e13150 size:   30 previous size:  150  (Allocated)  Io  
 fffffa8008e13180 size:   50 previous size:   30  (Allocated)  VadS
 fffffa8008e131d0 size:   80 previous size:   50  (Allocated)  MmSd
 fffffa8008e13250 size:   20 previous size:   80  (Free)       Io  
*fffffa8008e13270 size:   c0 previous size:   20  (Allocated) *FMsl
		Pooltag FMsl : STREAM_LIST_CTRL structure, Binary : fltmgr.sys
 fffffa8008e13330 size:  150 previous size:   c0  (Allocated)  File (Protected)
 fffffa8008e13480 size:   e0 previous size:  150  (Allocated)  NV  
 fffffa8008e13560 size:  160 previous size:   e0  (Allocated)  Ntfx
 fffffa8008e136c0 size:  160 previous size:  160  (Allocated)  Ntfx
 fffffa8008e13820 size:   20 previous size:  160  (Free)       FIPc
 fffffa8008e13840 size:   50 previous size:   20  (Allocated)  VadS
 fffffa8008e13890 size:  160 previous size:   50  (Allocated)  Ntfx
 fffffa8008e139f0 size:   c0 previous size:  160  (Allocated)  FMsl
 fffffa8008e13ab0 size:  160 previous size:   c0  (Allocated)  Ntfx
 fffffa8008e13c10 size:   c0 previous size:  160  (Allocated)  FMsl
 fffffa8008e13cd0 size:   10 previous size:   c0  (Free)       FIPc
 fffffa8008e13ce0 size:   60 previous size:   10  (Allocated)  Io  
 fffffa8008e13d40 size:  160 previous size:   60  (Allocated)  Ntfx
 fffffa8008e13ea0 size:  160 previous size:  160  (Allocated)  Ntfx

So RCX points into this FMsl pool block, which the pool tag identifies as holding a STREAM_LIST_CTRL structure.  On x64 systems the pool header is 16 bytes long:

2: kd> dt _POOL_HEADER
nt!_POOL_HEADER
   +0x000 PreviousSize     : Pos 0, 8 Bits
   +0x000 PoolIndex        : Pos 8, 8 Bits
   +0x000 BlockSize        : Pos 16, 8 Bits
   +0x000 PoolType         : Pos 24, 8 Bits
   +0x000 Ulong1           : Uint4B
   +0x004 PoolTag          : Uint4B
   +0x008 ProcessBilled    : Ptr64 _EPROCESS
   +0x008 AllocatorBackTraceIndex : Uint2B
   +0x00a PoolTagHash      : Uint2B

To properly compute the start of the structure we need to take the start address of the pool block and add 0x10 to it (to account for the pool header):

2: kd> dt fltmgr!_STREAM_LIST_CTRL fffffa8008e13270+10
   +0x000 Type             : _FLT_TYPE
   +0x008 ContextCtrl      : _FSRTL_PER_STREAM_CONTEXT
   +0x030 VolumeLink       : _LIST_ENTRY [ 0xfffffa80`07043c60 - 0xfffffa80`08e03390 ]
   +0x040 Flags            : 0x211 (No matching name)
   +0x044 UseCount         : 4
   +0x048 ContextLock      : _EX_PUSH_LOCK
   +0x050 StreamContexts   : _CONTEXT_LIST_CTRL
   +0x058 StreamHandleContexts : _CONTEXT_LIST_CTRL
   +0x060 NameCacheLock    : _EX_PUSH_LOCK
   +0x068 LastRenameCompleted : _LARGE_INTEGER 0x0
   +0x070 NormalizedNameCache : _NAME_CACHE_LIST_CTRL
   +0x080 ShortNameCache   : _NAME_CACHE_LIST_CTRL
   +0x090 OpenedNameCache  : _NAME_CACHE_LIST_CTRL
   +0x0a0 AllNameContextsTemporary : 0

And the value in RCX is an offset into this structure:

2: kd> ? @rcx-fffffa8008e13280
Evaluate expression: 152 = 00000000`00000098

Thus, offset 0x98 corresponds to something within the OpenedNameCache field (which begins at offset 0x90).  By dumping the _NAME_CACHE_LIST_CTRL, we can see what exactly that is:

2: kd> dt _NAME_CACHE_LIST_CTRL
fltmgr!_NAME_CACHE_LIST_CTRL
   +0x000 NameFormat       : Uint4B
   +0x008 List             : _TREE_ROOT

This in turn makes a bit of sense given the name of the function in which the fault occurs (ergo, we are manipulating a tree structure of some sort).

In expanding the _TREE_ROOT structure we see:

2: kd> dt fltmgr!_TREE_ROOT
   +0x000 Tree             : Ptr64 _RTL_SPLAY_LINKS

Thus, I suspect that this is embedded in some larger data structure (and that structure is in turn linked into this tree).  So let’s see what displaying the contents of the tree root tells us:

2: kd> dt _NAME_CACHE_LIST_CTRL @rcx-8 /b
fltmgr!_NAME_CACHE_LIST_CTRL
   +0x000 NameFormat       : 2
   +0x008 List             : _TREE_ROOT
      +0x000 Tree             : 0xfffff8a0`0d106d58

We can then use this address (which, oddly enough, bears no resemblance to the bogus value we found in RSI – after all, this is the very location whose contents were loaded into RSI, which makes this dump a bit unusual) to inspect this structure:

2: kd> !pool 0xfffff8a0`0d106d58 
Pool page fffff8a00d106d58 region is Paged pool
 fffff8a00d106000 size:  130 previous size:    0  (Allocated)  Ntfo
 fffff8a00d106130 size:  140 previous size:  130  (Allocated)  MPsc
 fffff8a00d106270 size:   10 previous size:  140  (Free)       .tFs
 fffff8a00d106280 size:   40 previous size:   10  (Allocated)  NtFs
 fffff8a00d1062c0 size:  110 previous size:   40  (Allocated)  ;oNm
 fffff8a00d1063d0 size:  190 previous size:  110  (Allocated)  .Mfn
 fffff8a00d106560 size:  4d0 previous size:  190  (Allocated)  .tff
 fffff8a00d106a30 size:   40 previous size:  4d0  (Allocated)  NtFs
 fffff8a00d106a70 size:  180 previous size:   40  (Allocated)  {Mfn
 fffff8a00d106bf0 size:   c0 previous size:  180  (Allocated)  oIcs
 fffff8a00d106cb0 size:   40 previous size:   c0  (Allocated)  MmSm
 fffff8a00d106cf0 size:   40 previous size:   40  (Allocated)  .tFs
*fffff8a00d106d30 size:  190 previous size:   40  (Allocated) *FMfn
		Pooltag FMfn : NAME_CACHE_NODE structure, Binary : fltmgr.sys
 fffff8a00d106ec0 size:  140 previous size:  190  (Allocated)  APsc

Now let’s take this containing structure and dump it, remembering to skip the 0x10-byte pool header (so the NAME_CACHE_NODE starts at fffff8a0`0d106d40):

2: kd> dt fltmgr!_NAME_CACHE_NODE fffff8a00d106d40
   +0x000 Type             : _FLT_TYPE
   +0x008 ProvidingInstance : 0x00000081`00000000 _FLT_INSTANCE
   +0x010 CreationTime     : _LARGE_INTEGER 0x3f`00067e77
   +0x018 TreeLink         : _TREE_NODE
   +0x050 NameInfo         : _FLT_FILE_NAME_INFORMATION
   +0x0c8 UseCount         : 1

To be honest, this structure looks a bit suspicious to me – the “creation time” doesn’t seem plausible, and that _FLT_INSTANCE address definitely does not look valid.  So I decided to poke at the structure in a bit more detail:

2: kd> dt fltmgr!_NAME_CACHE_NODE fffff8a00d106d40 /b
   +0x000 Type             : _FLT_TYPE
      +0x000 Signature        : 0xf204
      +0x002 Size             : 0x176
   +0x008 ProvidingInstance : 0x00000081`00000000 
   +0x010 CreationTime     : _LARGE_INTEGER 0x3f`00067e77
      +0x000 LowPart          : 0x67e77
      +0x004 HighPart         : 63
      +0x000 u                : 
         +0x000 LowPart          : 0x67e77
         +0x004 HighPart         : 63
      +0x000 QuadPart         : 270583365239
   +0x018 TreeLink         : _TREE_NODE
      +0x000 Link             : _RTL_SPLAY_LINKS
         +0x000 Parent           : 0xfffff889`0d106d58 
         +0x008 LeftChild        : 0x0000001d`00000000 
         +0x010 RightChild       : 0xfffff835`0d17eb28 
      +0x018 TreeRoot         : 0xfffffaff`08e13318 
      +0x020 Key1             : 0xfffffaf7`072f96b0 
      +0x028 Key2             : (null) 
      +0x030 Flags            : 0x14000
   +0x050 NameInfo         : _FLT_FILE_NAME_INFORMATION
      +0x000 Size             : 0x78
      +0x002 NamesParsed      : 0
      +0x004 Format           : 2
      +0x008 Name             : _UNICODE_STRING "\Device\HarddiskVolume2\UsIrs\???"
         +0x000 Length           : 0xa4
         +0x002 MaximumLength    : 0xa6
         +0x008 Buffer           : 0xfffff8a0`0d106e10  "\Device\HarddiskVolume2\UsIrs\???"
      +0x018 Volume           : _UNICODE_STRING "\Device\HarddiskVolume2"
         +0x000 Length           : 0x2e
         +0x002 MaximumLength    : 0x2e
         +0x008 Buffer           : 0xfffff8a0`0d106e10  "\Device\HarddiskVolume2"
      +0x028 Share            : _UNICODE_STRING "--- memory read error at address 0x000000e8`00000000 ---"
         +0x000 Length           : 0
         +0x002 MaximumLength    : 0
         +0x008 Buffer           : 0x000000e8`00000000  "--- memory read error at address 0x000000e8`00000000 ---"
      +0x038 Extension        : _UNICODE_STRING "--- memory read error at address 0x000000f3`00000000 ---"
         +0x000 Length           : 0
         +0x002 MaximumLength    : 0
         +0x008 Buffer           : 0x000000f3`00000000  "--- memory read error at address 0x000000f3`00000000 ---"
      +0x048 Stream           : _UNICODE_STRING "--- memory read error at address 0x00000041`00000000 ---"
         +0x000 Length           : 0
         +0x002 MaximumLength    : 0
         +0x008 Buffer           : 0x00000041`00000000  "--- memory read error at address 0x00000041`00000000 ---"
      +0x058 FinalComponent   : _UNICODE_STRING "--- memory read error at address 0x000000d9`00000000 ---"
         +0x000 Length           : 0
         +0x002 MaximumLength    : 0
         +0x008 Buffer           : 0x000000d9`00000000  "--- memory read error at address 0x000000d9`00000000 ---"
      +0x068 ParentDir        : _UNICODE_STRING ""
         +0x000 Length           : 0
         +0x002 MaximumLength    : 0
         +0x008 Buffer           : (null) 
   +0x0c8 UseCount         : 1

This looks somewhat valid to me, actually.  The name for the volume looks to be properly set up, although several of the name components seem to be suspect – almost as if this structure is being initialized or torn down.

At this point I put together my working hypothesis: that there is a race condition present in this code somewhere, most likely in the initialization or teardown path.  I’m still left with more questions than answers – but having two crashes in the same general area, on two different machines, with two radically different usage profiles, does suggest there is something interesting going on here.

Why do I propose a race condition here?  Because I see information in the crash that is inconsistent – the bogus value in the RSI register bears no resemblance to the contents of the memory location from which it was loaded.  I cannot tell exactly when that data structure was linked into this tree, but the choice seems to be “this is a CPU bug” or “this is a logic bug in this driver.”  Of the two, in my experience the latter is far more likely than the former.

Hopefully, we’ll see more of these crashes so that we can find a pattern in what is happening and broaden our analysis.  It is also distinctly possible that, if we’re seeing this issue in “the real world,” the filter manager team in Redmond has seen many more cases of it and has already resolved the issue.  We’ll be watching for more of these – perhaps you have a crash like this one you’d like to share with us.
