Windows System Software -- Consulting, Training, Development -- Engineering Excellent, Every Time.

Bug in New Function ExAllocatePoolZero Results in Security Vulnerability and Crashes

tl;dr

Last week (the week of 5 July 2020) OSR found and reported a bug to Microsoft that has both security and reliability implications for driver developers. New functions introduced in the Windows 2004 WDK, which are designed to zero pool allocations before they are returned to the driver, do not zero those allocations when running on Windows 1909 systems (only). The functions work as intended on both older and newer systems. The functions affected are ExAllocatePoolZero, ExAllocatePoolQuotaZero, and ExAllocatePoolPriorityZero, when used in combination with the newly introduced POOL_ZERO_DOWN_LEVEL_SUPPORT flag and ExInitializeDriverRuntime.

More details, including the underlying cause of the bug and a suggested work-around, are described below.

The Problem and Its Cause

Windows 2004 (20H1, build 19041) introduces several interesting new security features. Among the most interesting is automatic zeroing of pool allocations. This is supported with the new functions ExAllocatePoolZero, ExAllocatePoolQuotaZero, and ExAllocatePoolPriorityZero.

These new functions are actually implemented as macros that call the old, traditional, pool allocation functions such as ExAllocatePoolWithTag, passing the desired pool type OR’ed with the new flag “POOL_ZERO_ALLOCATION”. The first version of Windows that understands this new flag is Windows 2004 (20H1, build 19041).

Better still, following in the pattern established by our old friend POOL_NX_OPTIN, the auto-zeroing allocation functions can be made to do the right thing down-level as well. All you need to do is:

  • Define the symbol POOL_ZERO_DOWN_LEVEL_SUPPORT when you build your driver
  • Call ExInitializeDriverRuntime right at the start of DriverEntry

You can see how this all works in the code for ExAllocatePoolZero from WDM.H (from the Windows 2004 WDK) shown below:

PVOID
NTAPI
ExAllocatePoolZero (
    _In_ __drv_strictTypeMatch(__drv_typeExpr) POOL_TYPE PoolType,
    _In_ SIZE_T NumberOfBytes,
    _In_ ULONG Tag
    )
{
    PVOID Allocation;

    Allocation = ExAllocatePoolWithTag((POOL_TYPE) (PoolType | POOL_ZERO_ALLOCATION),
                                       NumberOfBytes,
                                       Tag);

#if defined(POOL_ZERO_DOWN_LEVEL_SUPPORT)

    if ((!ExPoolZeroingNativelySupported) && (Allocation != NULL)) {
        RtlZeroMemory(Allocation, NumberOfBytes);
    }

#endif

    return Allocation;
}

As you can see from the code above, ExAllocatePoolZero just calls ExAllocatePoolWithTag, passing in the POOL_ZERO_ALLOCATION flag. When you define POOL_ZERO_DOWN_LEVEL_SUPPORT, no additional code is required on systems where there’s support for native zeroing of pool allocations (that is, Windows 2004 and later). But when your driver is run on down-level systems, RtlZeroMemory is called for you. In either case, you’re guaranteed to get back a zeroed allocation.

Or, at least, that’s how it’s supposed to work. Except: On Windows 1909 (build number 18363) there’s a bug that causes the allocation not to be zeroed.
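To make the failure mode concrete, here is a small user-mode model of the mechanism. This is only a sketch, not kernel code: every name starting with model_ is made up for illustration, malloc stands in for the pool allocator, and the 0xCC fill pattern is an arbitrary stand-in for stale pool contents. The point is that the wrapper returns dirty memory only in one combination: the flag claims native zeroing, but the allocator underneath doesn't actually do it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for ExPoolZeroingNativelySupported (hypothetical model name). */
static bool model_native_zeroing_claimed = false;

/* Whether the simulated OS allocator really honors the zeroing request. */
static bool model_os_really_zeroes = false;

/* Simulated pool allocator: returns "dirty" memory unless the OS zeroes it. */
static void *model_pool_alloc(size_t size)
{
    unsigned char *p = (unsigned char *)malloc(size);
    if (p != NULL) {
        memset(p, model_os_really_zeroes ? 0x00 : 0xCC, size);
    }
    return p;
}

/* Same shape as ExAllocatePoolZero built with down-level support enabled. */
static void *model_alloc_zero(size_t size)
{
    void *p = model_pool_alloc(size);
    if (!model_native_zeroing_claimed && p != NULL) {
        memset(p, 0, size);   /* down-level fallback */
    }
    return p;
}

/* Returns true if every byte in the buffer is zero. */
static bool model_is_all_zero(const void *buf, size_t size)
{
    const unsigned char *b = (const unsigned char *)buf;
    for (size_t i = 0; i < size; i++) {
        if (b[i] != 0) {
            return false;
        }
    }
    return true;
}

/* Allocates under the given conditions; reports whether the result is zeroed. */
bool model_result_is_zeroed(bool claimed, bool really_zeroes)
{
    model_native_zeroing_claimed = claimed;
    model_os_really_zeroes = really_zeroes;

    void *p = model_alloc_zero(64);
    assert(p != NULL);

    bool zeroed = model_is_all_zero(p, 64);
    free(p);
    return zeroed;
}
```

In this model, model_result_is_zeroed(false, false) corresponds to a genuinely down-level system where the fallback runs, model_result_is_zeroed(true, true) corresponds to Windows 2004, and model_result_is_zeroed(true, false) is the Windows 1909 situation.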

To understand the cause of this bug, note in the code above (the if statement at line 17 of the listing) that the call to RtlZeroMemory is conditioned on the state of the variable ExPoolZeroingNativelySupported. This is a global that is set by ExInitializeDriverRuntime, the code for which is shown below (again, from the Windows 2004 WDK):

FORCEINLINE
VOID
ExInitializeDriverRuntime(
    _In_ ULONG RuntimeFlags
    )

{
#if defined(POOL_ZERO_DOWN_LEVEL_SUPPORT) || (POOL_NX_OPTIN && !POOL_NX_OPTOUT)
    ULONG MajorVersion;
    ULONG MinorVersion;
    ULONG BuildNumber;
    NTSTATUS Status;
    RTL_OSVERSIONINFOW VersionInfo;

    VersionInfo.dwOSVersionInfoSize = sizeof (VersionInfo);

    Status = RtlGetVersion (&VersionInfo);

    if (!NT_VERIFY (NT_SUCCESS (Status))) {
        MajorVersion = 5;
        MinorVersion = 0;
        BuildNumber = 0;
    } else {
        MajorVersion = VersionInfo.dwMajorVersion;
        MinorVersion = VersionInfo.dwMinorVersion;
        BuildNumber = VersionInfo.dwBuildNumber;
    }
#endif

#ifdef POOL_ZERO_DOWN_LEVEL_SUPPORT
    //
    // If the version is 20H1 or later, the pool allocator supports zeroing
    // natively. BuildNumber 18362 corresponds to 19H2.
    //

    if ((MajorVersion > 10) ||
        (MajorVersion == 10 &&
         (MinorVersion > 0 ||
          BuildNumber > 18362))) {

        ExPoolZeroingNativelySupported = TRUE;
    }
#endif

#if POOL_NX_OPTIN && !POOL_NX_OPTOUT
    if ((RuntimeFlags & DrvRtPoolNxOptIn) != 0) {

        //
        // Discover whether NX pool support is available on this platform, and,
        // if so, initialize the default non-paged pool type.
        //

        if ((MajorVersion > 6) ||
            (MajorVersion == 6 &&
             MinorVersion >= 2)) {

            ExDefaultNonPagedPoolType = NonPagedPoolNx;
            ExDefaultMdlProtection = MdlMappingNoExecute;
        }
    }
#else
    UNREFERENCED_PARAMETER (RuntimeFlags);
#endif
}

Looking at the code above for ExInitializeDriverRuntime, you can see it calls RtlGetVersion (at line 17) and, based on the result (lines 36 through 39), sets the global variable ExPoolZeroingNativelySupported to TRUE if it is running on a system where native pool zeroing is supported. Native support for zeroing pool allocations appeared in Windows builds after Windows 1909.

But looking at the code (at line 39) you can see that the check is made for build numbers later than 18362. The problem with that is that Windows 1909 isn’t build 18362, it’s 18363 (18362 is Windows 1903, despite what the comment in the code says). Because 18363 is greater than 18362, the check passes, and drivers running on Windows 1909 assume that native support for pool zeroing is present when it is not.
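The off-by-one is easy to see if you lift the comparison out into a plain C function and feed it build numbers. The sketch below copies the comparison from the listing above into shipped_check. The second function, corrected_check, is purely our illustration of a check against the first build known to zero natively (19041); it is not anything Microsoft has published. The function and enum names are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Build numbers: 18363 and 19041 are from the text above; 18362 is 1903. */
enum {
    BUILD_1903 = 18362,
    BUILD_1909 = 18363,
    BUILD_2004 = 19041
};

/* Compile-time view of the bug: 1909 slips past a "> 18362" comparison. */
static_assert(BUILD_1909 > 18362, "1909 passes the shipped comparison");

/* Same comparison as the WDK 2004 ExInitializeDriverRuntime listing. */
bool shipped_check(unsigned long major, unsigned long minor, unsigned long build)
{
    return (major > 10) ||
           (major == 10 && (minor > 0 || build > 18362));
}

/* Hypothetical corrected comparison: require the 2004 build or later. */
bool corrected_check(unsigned long major, unsigned long minor, unsigned long build)
{
    return (major > 10) ||
           (major == 10 && (minor > 0 || build >= BUILD_2004));
}
```

Calling shipped_check(10, 0, BUILD_1909) returns true, which is the bug. corrected_check(10, 0, BUILD_1909) returns false, and both functions agree on 1903 (false) and 2004 (true).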

The Impact

To say “this is not good” doesn’t convey the seriousness of this issue.

First, there’s the chance of an information disclosure security vulnerability because pool allocations are returned without being zeroed (and you don’t have to zero them yourself anymore, right… because they’re automagically zeroed for you). Also, it is very likely that if you assume your pool allocations are pre-zeroed you’re going to get a big surprise (and a BSOD) when you discover those fields that you expect to be zero aren’t. But only on Windows 1909.

Here at OSR, one of our senior team members discovered this problem about a week ago as we were updating one of our products to use the latest version of Visual Studio (VS 2019) and the latest WDK (2004). Always eager to adhere to best practices, he changed all our calls to ExAllocatePoolWithTag that were followed by a call to RtlZeroMemory to ExAllocatePoolZero. He defined POOL_ZERO_DOWN_LEVEL_SUPPORT (right next to POOL_NX_OPTIN) in the Preprocessor Definitions section of the Project’s property pages. He already calls ExInitializeDriverRuntime in DriverEntry, so that was all set.

It tested fine on Windows 2004. It tested fine on Win 7. As for testing on 1909, our teammate wrote “my modified drivers started BSODing spectacularly as they walked off into non-zeroed-ness.”

We Report — Microsoft “Investigates”

We reported this problem (and what we judged to be the impact) to our friends at Microsoft on 9 July. The problem was confirmed to us on 10 July, but we were asked to please hold off notifying the community while MSFT evaluated the potential security implications of the bug. We were happy to oblige. On 14 July MSFT gave us clearance to inform the community. Microsoft also posted a brief note acknowledging the problem in their Dev Kits forum.

Microsoft are “investigating a fix for this issue.”

Aside: Peter Complains

I need to take a detour here to complain about how very sloppy this bug is. First of all, if you look at the code for ExInitializeDriverRuntime, you’ll see it’s all “not very good.” It doesn’t even meet Microsoft internal coding standards (or, at least, not any that I am aware of). Local variables start with Upper Case letters? Since when is that “Cutler Normal Form?” The code copies info from the stack-based RTL_OSVERSIONINFOW structure into some other stack-based local variables, for reasons I can’t fathom. And then there’s that check for the build number: I mean… did support for native pool zeroing really get added into the build immediately after the build for Windows 1909? I guess that’s possible (it was in a branch that got merged immediately after 1909 shipped). But you’d think there would be a specific version that this code was verified to be in… not just “after 1909.” Why not check for the actual build number in which the new code appears?

Of course, getting the 1909 build number wrong is sloppy and silly, but we all do make mistakes. I, personally, don’t find that to be particularly egregious or annoying.

But… does nobody review this code? Does nobody test anything on down-level builds of the OS? Answer: obviously not. What I find to be particularly vexing — bordering on the unfathomable — is the fact not that this bug could get introduced by some intern (or whatever), but that it could get introduced, integrated up into MAIN, and nobody would find it until we found it after the OS shipped. I’ve heard of “test in production” but this borders on the ridiculous, folks.

How To Fix This: Our Work-Around

So, what do we do about this? The easiest, laziest, thing to do is just not bother with these new functions and continue doing whatever you were doing before Windows 2004, such as calling ExAllocatePoolWithTag followed by RtlZeroMemory (or not). Truly? That’s probably the best course for most devs and most drivers.

But, given the fact that the code for the functions involved in this problem is public, we can see other ways that we could fix this problem. Here at OSR, the guy who found this bug decided to work around this problem by writing a function named OsrFixExPoolZeroingNativelySupported. He calls this function immediately after returning from ExInitializeDriverRuntime. That function looks like this:

VOID OsrFixExPoolZeroingNativelySupported()
{
    RTL_OSVERSIONINFOW versionInfo;
    if (ExPoolZeroingNativelySupported) {
        //
        // test again for 1909 - which is
        // 10.0.18363. But rather than testing for 1909
        // test against not being 2004 (19041)
        //
        RtlZeroMemory(&versionInfo, sizeof(versionInfo));
        versionInfo.dwOSVersionInfoSize = sizeof(versionInfo);

        if (!NT_SUCCESS(RtlGetVersion(&versionInfo))) {
            ExPoolZeroingNativelySupported = FALSE;
        } else if ( (versionInfo.dwMajorVersion == 10) &&
               (versionInfo.dwMinorVersion == 0) &&
               (versionInfo.dwBuildNumber < 19041) ) {
            ExPoolZeroingNativelySupported = FALSE;
        }
    }
}

This code has been tested and is known to work (assuming I didn’t break it by cutting/pasting it here). In this code, note that we check to see if the global ExPoolZeroingNativelySupported is set to TRUE. If it is, we check to see if we’re running on an OS version earlier than 20H1, and if so, we force the global ExPoolZeroingNativelySupported to FALSE. We also force it to FALSE if RtlGetVersion fails. You can see we’re trying to be as careful and conservative as possible.
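If you want to convince yourself of the decision logic without booting a test machine, the same rules can be written as a pure function and exercised in user mode. This is only a model of the work-around above, under our own assumptions: model_fix_flag, struct model_version, and the query_ok parameter (standing in for whether RtlGetVersion succeeded) are names invented for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the fields we read from RTL_OSVERSIONINFOW. */
struct model_version {
    unsigned long major;
    unsigned long minor;
    unsigned long build;
};

/*
 * Pure-function model of the work-around: takes the flag as set by
 * ExInitializeDriverRuntime and returns the value it should have.
 */
bool model_fix_flag(bool flag, bool query_ok, struct model_version v)
{
    if (!flag) {
        return false;          /* never promote FALSE to TRUE */
    }
    if (!query_ok) {
        return false;          /* version unknown: take the safe path */
    }
    if (v.major == 10 && v.minor == 0 && v.build < 19041) {
        return false;          /* Windows 10 before 2004, including 1909 */
    }
    return true;
}

/* Usage example: the cases we care about, checked with assert. */
void model_fix_flag_examples(void)
{
    struct model_version win1909 = { 10, 0, 18363 };
    struct model_version win2004 = { 10, 0, 19041 };

    assert(!model_fix_flag(true, true, win1909));   /* the bug case */
    assert(model_fix_flag(true, true, win2004));    /* left alone */
    assert(!model_fix_flag(true, false, win2004));  /* query failed */
    assert(!model_fix_flag(false, true, win2004));  /* stays FALSE */
}
```

Note the asymmetry: the function can only ever turn the flag off. The worst it can do is cause a redundant RtlZeroMemory on a system that already zeroed the allocation, which costs a little time but is always safe.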

We do one other thing as well, that’s not shown in the code above. We include a C_ASSERT that checks the WDK version that we’re building with, and if it’s not the Windows 2004 WDK (10.0.19041.0) it throws an error. This will remind us to take a look at this code when the next version of the WDK is released, and (hopefully) remove it. Again, we’re trying to be as careful as we can.
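The idea behind that compile-time guard can be sketched in portable C11 as shown below. This is an illustration of the technique only, not the code from our product. In a driver you would use the WDK’s C_ASSERT and whatever version macro your kit’s headers provide. Here, MODEL_WDK_BUILD and MODEL_WORKAROUND_WDK_BUILD are hypothetical macros we define by hand so the example stands alone.

```c
#include <assert.h>   /* static_assert (C11) */

/*
 * ASSUMPTION: in a real build this value would come from a WDK header or
 * from the build system. It is hard-coded here so the sketch compiles.
 */
#ifndef MODEL_WDK_BUILD
#define MODEL_WDK_BUILD 19041
#endif

/* The kit build the work-around was written and tested against. */
#define MODEL_WORKAROUND_WDK_BUILD 19041

/* Fails the build when the kit changes, forcing someone to revisit the fix. */
static_assert(MODEL_WDK_BUILD == MODEL_WORKAROUND_WDK_BUILD,
              "WDK version changed: re-check the pool zeroing work-around");

/* Run-time view of the same condition, handy for logging or tests. */
int model_workaround_still_applies(void)
{
    return MODEL_WDK_BUILD == MODEL_WORKAROUND_WDK_BUILD;
}
```

Compiling with a different value (for example, -DMODEL_WDK_BUILD=19042) makes the build stop at the static_assert, which is exactly the reminder we want.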

So… When Does This Get Fixed For Real?

Now you know what we know. You know how we plan to fix it in the code that goes into our products. This does raise the question, however: When will this get fixed “for real”? To date, the practice has been that WDKs are only released when a new version of Windows is released.

This worked great when the WDK was more solid, and perhaps had less complex functionality. I think we will all agree, however, that having the WDK integrated into Visual Studio has very much been a double-edged sword. Each release of the WDK seems to come with additional bugs, some of them serious. For example, nobody here at OSR can even run SDV using VS 2019 and the 2004 WDK. I mean, we can’t run it at all. We know there’s an acknowledged bug in which Driver Verification Logs can’t be generated with VS 2019 and WDK 2004. That will certainly be handy when it comes to trying to get WHQL certification, won’t it? And now we add this native pool zeroing bug to the pile.

Again, readily acknowledging that “shit happens” and bugs make it out into the real world… I don’t find the bugs particularly disturbing. At least not today, as I’m writing this. What really, truly, rankles is that nobody cares enough about these problems to devise a way to release WDK updates between OS releases. So now, unless something drastic changes, we face a period of at least five more months of using this version of the WDK with serious known bugs. And that’s assuming the ever-changing versions of Visual Studio don’t introduce other, new, and exciting problems for our pleasure.

We get a WDK that isn’t properly tested on down-level platforms. We get a WDK with key features that just don’t work. And we have to wait until the next version of Windows is released to get these WDK issues fixed? Ugh. I don’t understand why IHVs and OEMs don’t rate a better set of tools and a better experience than what we’re getting.