Windows System Software -- Consulting, Training, Development -- Engineering Excellent, Every Time.

Debugging Race Conditions

One of the situations in which post-mortem debugging is useful is for identifying race conditions – timing windows, in which the state of the machine changes between two points in the code execution, so that decisions made later in the code are incorrect because the state of the system has changed in some fundamental way.

In today’s case, we have what looks like a potential candidate for this, although we can’t really prove it – but then again, in my experience we never really do “prove” race conditions – we detect them, theorize them, change the code to eliminate them and hope they never come back.

Then again, I’ve seen a printf in code make the bug go away as well, so the frustrating part of these cases is that the solution is never really definitive.

In today’s case, we have a fairly mundane Windows 7 box.  I recently added 4GB of memory to this Windows 7 system, and shortly thereafter it crashed (bug check 0x3B.)  Naturally, having this happen right after a memory upgrade did make me slightly suspicious so I took the opportunity to explore this particular crash in a bit more detail.

One nice change (since Windows Vista) has been that the machines are configured by default to create kernel summary dumps, so that there is more that we can do than simply a superficial analysis of the problem.

Naturally, I started with the usual !analyze –v and looked at its results:

Oddly, the first thing that jumped out at me is that I had seen a similar crash recently (we use crash dump examples in our kernel debugging seminar).  This reminded me that it is not uncommon to see “patterns” like this as you look at crash dumps (particularly random crash dumps like this one).  It isn’t exactly the same, however, but it is intriguingly similar.  See the stack trace from that crash:

While not identical, what stuck out in my head is that they are both (essentially) in the same piece of logic, suggesting to me that there might be a data corruption or race condition issue going on here (although our original theory for the crash that we use in the seminar is that it is actually a single bit error).

Patterns of this type can (and have) brought insight in the past.  So let’s see where this analysis takes me, and whether or not it matches anything from that previous analysis.  One good point here is that I have some experience in walking the relevant data structures.

One other note about this particular machine: it’s not used for development; it’s used for video editing.  As such, it isn’t running anything other than “stock” software that one might find on a typical production computer system.

In this particular case we seem to have a garden variety NULL pointer dereference:

Thus, RSI is null in this case and the resulting user mode address is not valid.  As typical in a case like this, we’ll walk backwards in the code stream to try and figure out where this value originated, since it is often the case we can glean some understanding of what this code is trying to do.  See below for the code leading up to this point (via the u command):

Given that there is a ret instruction two lines earlier, we scan backwards for a jump or branch to that location.  This type of code sequence is typical of an if/else style statement (where there is a return in one of the two code blocks):

At this point it certainly looks like there might be a logic bug here:  the test instruction does a bitwise AND of the RSI register with itself:

…and then conditionally jumps based upon whether or not that value is zero:

Thus, this tests the value in RSI to see if it is zero.  If it is, we jump down and use it, a condition that certainly does not seem to make much sense here.  When I saw this it made me question if I was interpreting the instruction sequence correctly.  Of course, a sanity check would be, “Well, if it weren’t zero, having it change to zero within the CPU would suggest a hardware problem of some sort.”

The encouraging insight here is that this certainly doesn’t look like it is related to the new memory added to the system.

Thus, the next logical step was to look and see, “So, where did the value in RSI originate?”  That’s also present in this instruction stream:

So if we look at the value in RCX:

Lt is peculiar – this is not a zero value.  This could be due to the fact that not all registers are saved, although the fact we see RCX is non-zero suggests that it is saved (values that are not saved on the x64 platform show as zero in the trap frame and context record).

Thus, this suggests a few possibilities:

  1. The value captured here is not correct;
  2. The value loaded in the RSI register is incorrect (ergo, a CPU problem of some sort);
  3. The value to which RCX points has changed since we captured it in RSI.

The hardware error theory seems rather unlikely, so let’s focus on the other two for a bit.

If we look at the data type of RCX we can see some useful information:

So this block of pool seems to include this structure.  On x64 systems the pool header is 16 bytes long:

To properly compute the start of the structure we need to take the start address of the pool block and add 0x10 to it (to account for the pool header):

And the value in RCX is an offset into this structure:

Thus, this corresponds to something in the OpenedNameCache field.  By dumping the _NAME_ CACHE_LIST_CTRL, we can see what exactly that is:

This in turn makes a bit of sense given the name of the function in which the fault occurs (ergo, we are manipulating a tree structure of some sort).

In expanding the _TREE_ROOT structure we see:

Thus, I suspect that this is embedded in some larger data structure (and that structure is in turn linked into this tree).  So let’s see what displaying the contents of the tree root tells us:

We can then use this address (which, oddly enough is not zero, making this dump a bit unusual – after all, we just loaded this value into the RSI register and it was zero) to inspect this structure:

Now let’s take this containing structure and dump it:

To be honest, this structure looks a bit suspicious to me – the “creation time” doesn’t seem plausible, and that _FLT_INSTANCE address definitely does not look valid.  So I decided to poke at the structure in a bit more detail:

This looks somewhat valid to me, actually.  The name for the volume looks to be properly set up, although several of the name components seem to be suspect – almost as if this structure is being initialized or torn down.

At this point I put together my working hypothesis: that there is a race condition present in this code somewhere, most likely in the initialization code.  That still doesn’t explain what seems to be a logic issue here.  I’m still left with more questions than answers – but having two crashes in the same general area, on two different machines, with two radically different usage profiles does suggest there is something interesting going on here.

Why do I propose a race condition here?  Because I see information in the crash that is inconsistent – the contents of a register are NULL, but the memory location from which it was loaded indicates it should be non-zero.  I cannot tell exactly when that data structure was added to this structure, but the choice seems to be “this is a CPU bug” or “this is a logic bug in this driver.”  Of the two, in my experience the latter is far more likely than the former.

Hopefully, we’ll see more of these crashes so that we can find a pattern as to what is happening and broaden our analysis.   It is also distinctly possible if we’re seeing this issue in “the real world” the filter manager team in Redmond has seen many more cases of this and have resolved the issue.   We’ll be watching for more of these – perhaps you have a crash like this one you’d like to share with us.

Article Name
Debugging Race Conditions
Windows crash analysis is a learned skill, and learning how to debug race conditions is one of those skills that will serve you well down the road.