Windows System Software -- Consulting, Training, Development -- Engineering Excellence, Every Time.

Win7 Crash Redux

In the last issue of The NT Insider we published a write-up describing the analysis of a crash dump involving Filter Manager.  A couple of readers commented on the analysis and raised some solid points that deserved consideration.

This underscores an important aspect of crash analysis: a second opinion is valuable precisely because it is easy to miss a detail that changes the conclusions.

Now on to the specific issues raised by the readers:

  • The value of the RSI register is in fact not null (0000009d00000000).
  • There is a backwards jump in the instruction stream that affects the code-flow analysis.
  • There is useful information for determining the type of trap frame, which in turn tells us which registers are valid.

x64 Trap Frame Observations

On the x64 platform, a kernel trap frame does not capture all register state:

  • For a processor exception or trap, the OS only captures the volatile registers (RAX, RCX, RDX, R8-R11 and XMM0-XMM5) and the RBP register.
  • For a system call entry, the OS only captures RBP, RSI and RDI.  No other registers are preserved.

By examining the ExceptionActive field, we can determine which type of trap frame this is: a value of 0 or 1 indicates an exception or trap, meaning the volatile registers plus RBP were stored.

Further, the reader observed:

You can very often get reliable nonvolatile registers on x64 from “.frame /r (frame number)”. There is also the newly-documented (but long-present) “.frame /c (frame number)” command that sets your effective context to the values obtained from .frame /r. This works using the unwind metadata generated by the x64 compiler that the debugger stackwalker uses (it also works for Itanium, if you should be debugging that, but not x86). It should _always_ give you correct nonvolatile registers if you start from the context obtained by .cxr, .thread, or .exptr.

This was useful information in general, and hopefully will help our readers further hone their debugging skills in the future.

Code flow analysis

This reader had some valid observations here:

The first constructive point is that if your debugging takes you into the game of back-tracing, you need to study whole functions. This means unassembling not just at addresses before the faulting instruction, nor after, but at all places that fragments of the function have got scattered by optimisation. Basically, you need to raise your debugging to the foothills of reverse engineering. The reverse engineer will see that the faulting instruction, “mov rax,qword ptr [rsi+20h]” at …F141 is picked up for TreeUnlinkMulti by inlining TreeUnlinkMultiDoWalk, which in turn inlines TreeLookup, which in turn inlines TreeFindNodeOrParent. The loop that the analyst has missed is actually from the start of this last subroutine. The code’s overall intention is to walk a given tree, remove the nodes that match a given pair of keys, and return these nodes as a list (linked through the RightChild members only).

The reality is that on x64 we far more often need to back-track through the code flow in order to find local variables and reconstruct the stack.  When doing a thorough analysis it is indeed important to look at the entire function (the uf command is good for this), but doing so is more time-consuming (and certainly more daunting to those just approaching kernel debugging).

But these are valid points.

So with this said, let’s go back and revisit our analysis. The context record shows us:

And of course the registers are valid here (they are all captured in the context record), as our reader noted.

This gives us the following stack:

Then let’s look at the invalid address:

This decodes the address, finds the relevant page table entries and decodes each of them.  From this, we can tell there is nothing within this 512GB memory region (since each PXE entry corresponds to a 512GB region of the address space).

Thus, while not the null pointer indicated previously, this is still an invalid address – within a large, undefined region of the address space.

If we look at the first function from the stack in its entirety, we see:

The current block starts at:

The mistake in the earlier analysis was missing the jump backwards several instructions later:

Thus, we really do have a small block of code under analysis, as shown below:

The reader who pointed out the loop also explained the intent of this code fragment:

The code’s overall intention is to walk a given tree, remove the nodes that match a given pair of keys, and return these nodes as a list (linked through the RightChild members only).

The analyst has identified the given tree and has in Figure 7 dumped for us the root node, as the TreeLink member of a _NAME_CACHE_NODE.  See there that the LeftChild member is corrupt but not with the value of RSI at the time of the fault. Execution will have worked some distance into the RightChild subtree until reaching a node that has the faulting RSI as either its LeftChild or RightChild member. Most plausibly, this tree was already corrupt when TreeUnlinkMulti was entered. A race condition, whether inside TreeUnlinkMulti or out, is just one of many ways that links in a tree might get corrupted.

Thus, at this point we reach much the same conclusion as before: we have data corruption.  It does not seem likely that the corruption occurred here, but it is clearly present.

As noted previously, we’ve seen similar data corruption – on a different computer system, but also on Windows 7 x64.  In the first dump, we observed what appeared to be a single-bit error in memory.  By itself, that led us to suspect the machine.  Seeing this on a different computer in similar circumstances makes us suspect there is some source of data corruption in the code.  While a race condition is a potential source of such corruption, it’s not the only possibility.

Data corruption issues are often the most difficult to track down.  Frequently the source of the corruption shows up from a pattern that materializes after reviewing a number of crash dumps, not a single crash dump.  While we still do not know the actual issue here, we’ll be on the look-out for it in the future and invite our readers to share their own observations if they see it as well.
