By Malcolm Smith
As covered in a previous issue (see here), file systems including NTFS record three sizes per stream: allocation size, file size, and valid data length. This article explores the third of these in more detail. Documentation on this topic is limited, and the best available sample for developers is FastFat, which does not implement valid data length fully and does not behave the same way as NTFS or ReFS.
Valid data length (VDL) sounds deceptively straightforward. Conceptually, data prior to this location in the file is considered ‘valid’, meaning it may contain user data. Data after this point in the file is ‘invalid’, meaning it should be zero. The complexity stems from the interactions between valid data at different layers in the system. Data may be valid in the cache before it is valid on disk, and data may be valid due to a memory mapped write of which neither the cache nor the file system is aware.
So rather than considering valid data length as one piece of data, it can be thought of as three separate values:
- Valid data on disk;
- Valid data in cache, which is greater than or equal to valid data on disk;
- Valid data in a memory mapping, which is greater than or equal to valid data in cache.
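The ordering in the list above can be written down as a small model. This is only a sketch: `vdl_state` and its helpers are hypothetical names invented for illustration, not kernel structures, and a real file system has no field for the memory mapped value because it cannot observe it directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the three views of valid data for one stream. */
typedef struct {
    int64_t on_disk;     /* valid data on disk */
    int64_t in_cache;    /* valid data in cache */
    int64_t in_mapping;  /* valid data in a memory mapping; modeled here only
                            for illustration, the file system cannot see it */
} vdl_state;

/* The ordering from the list above: disk <= cache <= mapping. */
static bool vdl_state_is_consistent(const vdl_state *s)
{
    return 0 <= s->on_disk &&
           s->on_disk <= s->in_cache &&
           s->in_cache <= s->in_mapping;
}

/* A cached write ending at write_end raises the cached value if needed.
   Valid data is a high water mark, so it never moves backwards here. */
static void vdl_note_cached_write(vdl_state *s, int64_t write_end)
{
    if (write_end > s->in_cache) {
        s->in_cache = write_end;
    }
    if (s->in_mapping < s->in_cache) {
        s->in_mapping = s->in_cache;
    }
    assert(vdl_state_is_consistent(s));
}
```

Note that the on-disk value is untouched by a cached write; it only catches up later, once the data has actually been written.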
Basic Writes – File System and Cache Manager Cooperating
The file system must update its on-disk value in response to user non-cached writes, and must update the cached value (with CcSetFileSizes) in response to user cached writes. For paging writes, however, the process can be more complicated.
The Cache Manager aims to ensure writes are as sequential as possible and are of an appropriate individual size and priority. As part of doing this, the Cache Manager can also track valid data and update the on-disk value in an aggregated way. For this support to be available, the file system must use CcZeroData to notify the cache of any regions being implicitly converted from invalid to valid. Having done so, the file system will be notified that a series of paging writes has completed, which allows the on-disk notion of valid data to advance. This notification arrives as a FileEndOfFileInformation call with its IO_STACK_LOCATION’s Parameters.SetFile.AdvanceOnly member set to TRUE.
For example, if a 3MB file currently has 1MB of valid data, and the user writes 1MB at offset 2MB into the cache, the file system must call CcZeroData on the range between 1MB and 2MB, then CcCopyWrite at 2MB to 3MB, then CcSetFileSizes. Since the cache now has a sequential range of data to write, it can write the data as it chooses and later tell the file system to advance on-disk valid data to 3MB.
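The arithmetic in this example can be sketched as a planning function. This is a model only: `cached_write_plan` and `plan_cached_write` are hypothetical names, and the comments mark where a real file system would call CcZeroData, CcCopyWrite and CcSetFileSizes. No kernel routine is invoked here.

```c
#include <assert.h>
#include <stdint.h>

#define MB (1024 * 1024LL)

/* Hypothetical description of the work a cached write implies. */
typedef struct {
    int64_t zero_start;   /* start of the gap to zero (CcZeroData in a real FS) */
    int64_t zero_length;  /* 0 when the write starts at or below cached VDL */
    int64_t new_vdl;      /* cached VDL afterwards (CcSetFileSizes in a real FS) */
} cached_write_plan;

static cached_write_plan plan_cached_write(int64_t cached_vdl,
                                           int64_t offset,
                                           int64_t length)
{
    cached_write_plan plan = { 0, 0, cached_vdl };

    assert(cached_vdl >= 0 && offset >= 0 && length >= 0);
    if (length == 0) {
        return plan;  /* nothing written, nothing becomes valid */
    }

    /* A write beyond VDL implicitly converts the gap from invalid to valid,
       so the gap must be zeroed through the cache first. */
    if (offset > cached_vdl) {
        plan.zero_start = cached_vdl;
        plan.zero_length = offset - cached_vdl;
    }

    /* The user data itself would be copied here (CcCopyWrite in a real FS). */

    if (offset + length > cached_vdl) {
        plan.new_vdl = offset + length;
    }
    return plan;
}
```

For the 3MB file above, `plan_cached_write(1 * MB, 2 * MB, 1 * MB)` yields a zeroing range starting at 1MB with a length of 1MB, and a new cached VDL of 3MB.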
The consequence of using this support is that the file system must consider the origin of paging writes. If a paging write occurs as part of the Cache Manager’s lazy writer, valid data does not need to be updated as part of the write but will instead be updated in the AdvanceOnly callback. Conversely, if the write originates from the Memory Manager’s mapped page writer or a cache flush, no such callback will be made, and the file system must perform any required zeroing and update both on-disk and in-cache valid data at the time of the write. These cases can be distinguished by their preacquire callbacks. By convention, file systems should call IoSetTopLevelIrp with FSRTL_CACHE_TOP_LEVEL_IRP for lazy write operations, and having done so, IoGetTopLevelIrp can be used to test for this condition.
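The marker convention described here can be modeled in plain C. This is a sketch under stated assumptions: a single static variable stands in for the per-thread top-level IRP slot, and every function name below is hypothetical. In a real file system the set and test operations would be IoSetTopLevelIrp and IoGetTopLevelIrp, with FSRTL_CACHE_TOP_LEVEL_IRP as the marker value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the per-thread top-level IRP slot (assumption: one thread). */
static const void *top_level_marker = NULL;

/* Stand-in for FSRTL_CACHE_TOP_LEVEL_IRP; only its address matters. */
static const char lazy_write_marker = 0;

/* Modeled on the lazy writer's preacquire callback: tag the thread. */
static void acquire_for_lazy_write(void)
{
    top_level_marker = &lazy_write_marker;
}

/* Modeled on the matching release callback: clear the tag. */
static void release_from_lazy_write(void)
{
    top_level_marker = NULL;
}

/* At paging write time, test the tag to learn where the write came from. */
static bool paging_write_is_lazy_write(void)
{
    return top_level_marker == &lazy_write_marker;
}

/* Lazy writes defer the valid data update to the AdvanceOnly callback.
   Mapped page writer and cache flush writes get no such callback, so
   valid data must be updated at the time of the write. */
static bool must_update_vdl_at_write_time(void)
{
    return !paging_write_is_lazy_write();
}
```

The point of the design is that the paging write path itself carries no origin information, so the origin has to be recorded earlier, when the resources are acquired.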
Read and Write Interactions due to the Cache Manager
The situation with reads is also complicated somewhat by this optimization. Naively, an attempt to read data from disk beyond valid data on disk should result in zeroes. But this optimization introduces a race condition, because pages can be written to disk, become clean, and leave memory before valid data on disk has been updated. In normal operation, data between valid data on disk and valid data in cache would always be dirty in the cache and the file system would not need to consider this case. Due to this race condition, however, reads for this region can occur, and the region must therefore be treated as valid. So, counterintuitively, a paging read from disk should be compared to valid data in the cache to determine how to zero it.
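The comparison described above can be sketched as follows. The function names are hypothetical and the buffer stands in for the pages being filled by a paging read; the only point illustrated is that the boundary used for zeroing is the cached value, not the on-disk one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Number of bytes at the start of the read that must come from disk.
   The boundary is valid data in the cache, not valid data on disk. */
static size_t valid_bytes_in_read(int64_t offset, size_t length,
                                  int64_t cached_vdl)
{
    assert(offset >= 0 && cached_vdl >= 0);
    if (offset >= cached_vdl) {
        return 0;  /* the read lies entirely beyond valid data */
    }
    int64_t remaining = cached_vdl - offset;
    return (uint64_t)remaining < length ? (size_t)remaining : length;
}

/* After the disk read completes, zero the part of the buffer that lies
   beyond cached valid data. */
static void zero_beyond_cached_vdl(uint8_t *buffer, int64_t offset,
                                   size_t length, int64_t cached_vdl)
{
    size_t keep = valid_bytes_in_read(offset, length, cached_vdl);
    memset(buffer + keep, 0, length - keep);
}
```

Using the on-disk value as the boundary instead would zero data that has already reached the disk but has not yet been covered by an AdvanceOnly callback.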
Cached valid data must be updated at the time of a paging write because of the third type of valid data: writes made via memory mapped views. In this case a page can be modified outside of the file system or Cache Manager’s control or awareness, resulting in the otherwise paradoxical valid data beyond valid data length. Because of this, when memory mapping is in use it is not safe to assume that pages beyond valid data length must be zero. For the same reason, a call to CcZeroData is not guaranteed to result in zero data, because pages may have been modified that the file system does not yet know about, and those pages must be considered valid. CcZeroData will silently decline to zero pages in that case.
Although dirty data is tracked at the page level, valid data is a byte level concept. A mapped view may also modify data on the same page as a cached write. When this occurs, the Cache Manager will attempt to write the page, but is not aware that it contains data which is beyond the cache’s current valid data length, and would not be included in any AdvanceOnly callback. It is important that the file system allow the cache to write the page (as the cache will not allow the mapped page writer to write it), but it follows that the file system must be willing to update cached VDL via CcSetFileSizes during the paging write to ensure a subsequent AdvanceOnly callback covers the entire range that was written. Because valid data is a high water mark, this issue is only problematic for the final page of valid data, where valid data length falls within that page.
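The final-page adjustment can be sketched as a pure function. The name is hypothetical, and two things here are assumptions rather than statements from the text: that the written range is clamped to file size, and that the write starts at or below cached VDL (the final-page case). A real file system would publish the returned value with CcSetFileSizes.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the cached VDL to use after a paging write issued by the cache.
   Assumption: the write begins at or below cached VDL, so only the tail
   of the final page can extend past it. */
static int64_t cached_vdl_after_paging_write(int64_t cached_vdl,
                                             int64_t file_size,
                                             int64_t write_offset,
                                             int64_t write_length)
{
    assert(0 <= cached_vdl && cached_vdl <= file_size);
    assert(0 <= write_offset && write_offset <= cached_vdl);
    assert(write_length >= 0);

    /* Assumption: bytes past file size are not part of the stream. */
    int64_t write_end = write_offset + write_length;
    if (write_end > file_size) {
        write_end = file_size;
    }

    /* Valid data is a high water mark: raise it, never lower it. */
    return write_end > cached_vdl ? write_end : cached_vdl;
}
```

With 4096-byte pages, a cached VDL of 5000 and a file size of 7000, a write of the second page (offset 4096, length 4096) raises cached VDL to 7000, while a write of the first page leaves it at 5000.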
Cache Manager Cooperation, or Not
The Cache Manager optimization is used by NTFS but is optional for other file systems. A file system may indicate that the Cache Manager should not attempt to track valid data by using the special value of MAXLONGLONG as the valid data length in a CcSetFileSizes call. Having done so, the file system must take care to update valid data on disk in response to all paging writes. This is the approach that ReFS takes as of Windows 8.1 and Server 2012 R2, leaving all data potentially valid in the cache and zeroing to the disk as needed when paging writes occur.
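The opt-out can be illustrated with a small model of the three sizes handed to the cache. The struct and function names are hypothetical stand-ins and do not reproduce the real kernel definitions. The only fact relied on is that MAXLONGLONG is the largest signed 64-bit value, which is `INT64_MAX` in standard C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the sizes a file system reports to the cache. */
typedef struct {
    int64_t allocation_size;
    int64_t file_size;
    int64_t valid_data_length;
} cache_file_sizes;

/* Same numeric value as MAXLONGLONG. */
#define VDL_NOT_TRACKED_BY_CACHE INT64_MAX

/* Build the sizes to report. When the file system tracks valid data
   itself, it reports the sentinel instead of its real value. */
static cache_file_sizes sizes_for_cache(int64_t allocation_size,
                                        int64_t file_size,
                                        int64_t valid_data_length,
                                        bool cache_tracks_vdl)
{
    cache_file_sizes sizes;
    sizes.allocation_size = allocation_size;
    sizes.file_size = file_size;
    sizes.valid_data_length =
        cache_tracks_vdl ? valid_data_length : VDL_NOT_TRACKED_BY_CACHE;
    return sizes;
}

/* True when the reported sizes ask the cache to track valid data. */
static bool cache_is_tracking_vdl(const cache_file_sizes *sizes)
{
    return sizes->valid_data_length != VDL_NOT_TRACKED_BY_CACHE;
}
```

In the opted-out mode the file system still keeps its own on-disk value; it simply stops sharing it with the cache.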
Regardless of the approach chosen, some form of synchronization is required when implicitly zeroing to ensure that user data is not destroyed inadvertently by racing writes to different ranges. The change between NTFS behavior and ReFS behavior in 8.1 is where this synchronization is performed. In NTFS, because data is zeroed through the cache with CcZeroData, synchronization must be provided with extending cached writes. In ReFS, where data is zeroed on disk and not through the cache, synchronization is used at noncached or paging write time, eliminating the need to serialize cache extensions.
This difference surfaces via differing implementations of FSCTL_QUERY_FILE_REGIONS. Because NTFS only remembers and enforces against cached VDL, it can only answer queries for cached VDL (FILE_REGION_USAGE_VALID_CACHED_DATA). Because ReFS only remembers and enforces against on-disk VDL, it can only answer queries for noncached VDL (FILE_REGION_USAGE_VALID_NONCACHED_DATA). In both cases, note that this API has limited applicability: because of the memory mapped writing case discussed above, there is no guarantee that data outside of a valid range contains zeroes, and in the case of NTFS, there is also no guarantee that ranges within the valid range are stored on disk.
The definition of FSCTL_QUERY_FILE_REGIONS allows for a file system to implement validity tracking on a per-range basis as opposed to a high water mark, which is what classic valid data length describes. Future versions of ReFS exercise this capability by recording at a much finer granularity which regions contain valid data on disk, eliminating the need for implicit zeroing and the synchronization required to implement it.
Malcolm has been working on File Systems and Filters at Microsoft for over a decade. Despite eschewing UI, he occasionally contributes to fsutil, and is the author of the hugely unpopular sdir tool at sdir.codeplex.com. He can be found loitering on ntfsd.