One of the interesting challenges in building a robust isolation framework is the need to deal with file sizes. Our initial explorations into this area resulted from our work on our File Encryption Solution Framework (FESF). However, we expect our work on isolation to form the basis of a generalized data virtualization framework that will eventually become one of our commercially available Solution Frameworks.
The fact that file sizes present a major challenge might surprise the casual reader. After all, one might wonder why file sizes would be any more complex to manage than any other file attribute. However, because those file sizes are used by other parts of the OS, notably the VM system, understanding how these sizes are interpreted, and the rules around that interpretation, is an important aspect of ensuring correct behavior.
From the perspective of a file system isolation filter, there are six different “file sizes” that we are managing. These six sizes fall into two groups:
- Logical Sizes: allocation size, end of file (“file size” or EOF), and Valid Data Length (VDL)
- Physical Sizes: allocation size, end of file, and Valid Data Length
Logical sizes are what we present to the Virtual Memory system, and what applications perceive as the sizes of those files. What does each one represent?
Allocation Size represents the amount of disk space allocated. Typically, this is enough to ensure that we don’t need to allocate more space during critical processing. For example, during paging write operations, there shouldn’t be any need for allocating additional storage space. Of course, a file system (or isolation filter) could be written to handle this appropriately. For example, a file system might simply do space reservation and then actually pick the space when it’s really required. The primary benefit to deferring allocation is that some access patterns can cause significant fragmentation, which is an issue for typical rotating media, but is not a concern for solid-state devices, such as NVMe drives or SATA SSDs.
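As a concrete illustration, allocation size is typically the file size rounded up to the unit in which the file system allocates space. The sketch below models that arithmetic; the 4KB cluster size used in the test is just an illustrative assumption, not a property of any particular volume:

```c
#include <stdint.h>

/* Round a file size up to the next cluster boundary. The cluster size must
 * be a power of two, as it is on FAT and NTFS volumes. */
static uint64_t AllocationSizeFor(uint64_t endOfFile, uint64_t clusterSize)
{
    return (endOfFile + clusterSize - 1) & ~(clusterSize - 1);
}
```

A one-byte file on a 4KB-cluster volume thus reports a 4KB allocation size, while an empty file reports zero.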
End Of File is what we generally refer to as the “size” of the file. It indicates the number of bytes that this file contains. In isolation, a logical file might differ from the physical file because there is either additional data in the file (e.g., encryption headers, keys, additional state) or the data has different information density (e.g., compression or expanding encryption algorithms). Indeed, one of the primary justifications for constructing an isolation filter is to permit us to manage the file size independent of the physical storage. In that way, applications are blissfully unaware of the details of what we’re doing within the isolation layer.
Valid Data Length represents the “high water mark” of data that has been written to the file. This differs from End Of File (which can be quite a bit larger) because VDL marks only the region to which data has actually been written.
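It helps to see these six values as explicit state, along with the mapping between the logical and physical views. The sketch below is illustrative only: the structure, the field names, and the fixed 4KB header layout are assumptions for this example, not the layout FESF or any shipping framework uses:

```c
#include <stdint.h>

/* The six sizes an isolation filter tracks. The logical sizes are what the
 * VM system and applications see; the physical sizes describe the backing
 * file underneath the filter. */
typedef struct _ISOLATION_SIZES {
    uint64_t LogicalAllocation;
    uint64_t LogicalEndOfFile;
    uint64_t LogicalValidDataLength;
    uint64_t PhysicalAllocation;
    uint64_t PhysicalEndOfFile;
    uint64_t PhysicalValidDataLength;
} ISOLATION_SIZES;

/* For a simple layout that prepends a fixed-size header (e.g., encryption
 * metadata) and stores file data 1:1 after it, logical and physical offsets
 * differ by a constant. */
#define HEADER_SIZE 4096u

static uint64_t LogicalToPhysicalOffset(uint64_t logicalOffset)
{
    return logicalOffset + HEADER_SIZE;
}

static uint64_t PhysicalToLogicalEof(uint64_t physicalEof)
{
    return (physicalEof >= HEADER_SIZE) ? physicalEof - HEADER_SIZE : 0;
}
```

With compression or expanding encryption the mapping is no longer a constant offset, which is exactly why the filter must own these six values rather than derive one set from the other.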
Valid Data Length
Of these, Valid Data Length (VDL) is likely the most confusing. The reason Windows has this concept is to avoid an issue we refer to as disk scavenging. Traditional media file systems do not erase data contents on the drive when space is freed. Thus, to avoid exposing the contents of newly allocated space to the application, we track the region of the file that’s been written by the application. Space beyond this point may be allocated, but the data contents are not returned to the caller – instead, they receive known safe data – usually zero filled memory, though nothing requires that specific pattern. In order to ensure that the region from VDL to EOF returns zeros, the file systems employ various techniques. In our experience we’ve seen four such mechanisms:
- Data between VDL and EOF is zeroed as part of allocation. This is the most resilient in terms of crash recovery semantics (there’s never exposure of data inadvertently) and is also the most expensive in terms of performance.
- Data between VDL and EOF is zeroed as part of closing the file. This is the technique that FAT uses, doing so during IRP_MJ_CLEANUP handling for the last open handle on the file. This may expose data in the case of a crash, but it performs much better.
- The VDL of the file is stored as persistent information within the file information itself. This is the technique that NTFS uses. When a file is opened, NTFS can determine what the VDL of the file is and preserve correct behavior, even after a crash.
- Similar to FAT, VDL to EOF is zeroed on the last user handle close of the file, but with a journaling technique the file blocks are zeroed from VDL to EOF after a crash as well.
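The second (FAT-style) mechanism can be modeled in a few lines. In this sketch “the file” is just a memory buffer so the behavior is easy to observe; a real filter or file system would instead issue sector-aligned, non-cached writes of zeros during cleanup:

```c
#include <stdint.h>
#include <string.h>

/* Model of FAT-style cleanup: when the last handle closes, the region
 * between VDL and EOF is zeroed so stale disk contents are never exposed.
 * Returns the number of bytes zeroed. */
static uint64_t ZeroValidDataToEof(uint8_t *file, uint64_t vdl, uint64_t eof)
{
    if (eof <= vdl) {
        return 0;                       /* nothing to zero */
    }
    memset(file + vdl, 0, (size_t)(eof - vdl));
    return eof - vdl;
}

/* Exercise the model: a 16-byte "file" with VDL 4 and EOF 16. Bytes below
 * VDL must be untouched; bytes from VDL to EOF must read as zero. */
static int CleanupZeroingDemo(void)
{
    uint8_t file[16];
    memset(file, 0xAA, sizeof(file));
    uint64_t zeroed = ZeroValidDataToEof(file, 4, 16);
    return zeroed == 12 && file[3] == 0xAA && file[4] == 0 && file[15] == 0;
}
```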
The VM system (Cache Manager and Memory Manager) components are concerned about VDL as well. This points to the fact that the valid data length is not only an “on disk” concept, but also an “in memory” concept. Thus, the VDL for data stored in cache may differ from the VDL as recorded on the disk itself.
This can lead to interesting behaviors in an isolation filter. For example, if an isolation filter skips over a disk region – saving space for some of its own meta data – the file system beneath the isolation filter will typically zero out that region. Normally it would do so via the VM system, which then triggers re-entrant I/O operations. If your isolation filter is not expecting to observe paging I/O operations in the midst of its own non-cached I/O operations to the underlying file system, it will likely not handle the situation properly.
Thus, in addition to the logical and physical sizes we have an “in memory” and “on disk” valid data length to consider! If you review the FASTFAT file system example in the WDK, you can see how FAT handles this case both during write (zeroing holes in the file) as well as during cleanup processing when the file is closed and it zeros the region from VDL to EOF.
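The write-path half of what FASTFAT does (zeroing holes) reduces to two rules: a write that starts beyond the current VDL creates a hole from VDL to the write offset that must be zeroed first, and a write that ends past VDL advances it. A minimal model of those rules (the types and names here are ours, not from the WDK sample):

```c
#include <stdint.h>

typedef struct _ZERO_RANGE {
    uint64_t Offset;   /* start of the region to zero */
    uint64_t Length;   /* zero when no zeroing is needed */
} ZERO_RANGE;

/* A write starting beyond VDL leaves a hole [VDL, writeOffset) that must
 * be zeroed before (or as part of) completing the write. */
static ZERO_RANGE ComputeHoleToZero(uint64_t vdl, uint64_t writeOffset)
{
    ZERO_RANGE r = { vdl, 0 };
    if (writeOffset > vdl) {
        r.Length = writeOffset - vdl;
    }
    return r;
}

/* A write that ends past the current VDL moves the high water mark. */
static uint64_t AdvanceVdl(uint64_t vdl, uint64_t writeOffset, uint64_t writeLength)
{
    uint64_t end = writeOffset + writeLength;
    return (end > vdl) ? end : vdl;
}
```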
There are specific rules around the various file sizes and how they can be affected by I/O operations. To add even more complexity, these rules vary depending on the context in which the I/O is being performed:
- User Cached I/O – in this case, writes can – and do – extend the size of the file. The file size may be reduced (truncation) via a call to either “set EOF” or “set Allocation”. For NTFS and FAT, a user with appropriate privileges (SeManageVolumePrivilege) may set the VDL. Note that there is no mechanism to query the value of the VDL, which turns out to be a problem for isolation filters that change the file layout. Thus, a write past EOF moves the VDL, the EOF, and possibly the allocation size. The file system zeros data between the current VDL and the start of the new write. Filters will typically see such requests because the file systems use CcZeroData to perform this operation, which then issues a paging I/O (IoSynchronousPageWrite). But the zeroing is avoided when the user writes the entire region – even if it is written to the FSD out of order. This magic is possible because the FSD captures the identity of the lazy writer thread (though NTFS and FAT do this differently).
- User Non-Cached I/O – the operations must be sector aligned and sector sized, except at EOF, though the assumption is that we can use the remainder of the buffer, so it should be aligned to permit that as well. What we have noticed is that some applications (e.g., the xcopy utility) make worst-case assumptions here: they write a full sector and then truncate back to the correct EOF. Other applications behave differently. Note that extending writes are permitted in this case. Just like the cached case – but far more obvious – any holes in the file will be zero filled, unless the file permits sparse allocation.
- Paging I/O – all paging operations are done non-cached. There are special rules involved for paging operations with respect to file sizes. Most important is that paging writes do not move the EOF. Similarly, they do not move the allocation of the file. As a result, we are limited in what we can do in this path. This should be OK – but it means that any EOF management must be done in the set information path. Note that Paging I/O operations may move the VDL.
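The most important of the paging rules – that paging writes never extend EOF – can be captured as a small model. A write that straddles EOF is clamped, and (as we describe further below) a write entirely beyond EOF “succeeds” with zero bytes transferred, mirroring the behavior we have observed from NTFS and FAT:

```c
#include <stdint.h>

/* Model of the paging-write size rules: return the number of bytes the
 * file system will actually transfer for a paging write of the given
 * offset and length against a file with the given EOF. */
static uint64_t ClampPagingWrite(uint64_t eof, uint64_t offset, uint64_t length)
{
    if (offset >= eof) {
        return 0;                 /* "success", but zero bytes written */
    }
    if (offset + length > eof) {
        return eof - offset;      /* truncated so the write ends at EOF */
    }
    return length;
}
```

Note what this implies for a filter: any EOF extension must happen in the set-information path before the paging write arrives, because by the time the write reaches the file system the EOF is a hard limit.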
In working with our isolation framework, we’ve found some surprising behaviors, which we’ll attempt to capture in hopes they will help future developers in this area:
- The filter verifier has a bug where it will assert that your buffer is misaligned if you perform a sector aligned write, with a page aligned buffer, but a non-integral length at end of file. The file systems expect this case and all works as intended. For us, this meant that we had to avoid using FltWriteFile (or FltWriteFileEx if you’re using the MDL passing interface) and instead explicitly build our own callback data structures for the writes – that path works correctly. The error that Filter Manager raises is not only confusing (the buffer is aligned), but flags an operation that is actually permitted by the file system.
- Paging writes to NTFS or FAT that are beyond the EOF do not fail. Instead, they return STATUS_SUCCESS and indicate zero bytes written. This was surprising to us and we spent quite a bit of time understanding the circumstances around this behavior. Much of this has to do with serialization and parallel file activity, as well as the manner in which the file systems handle files that have been deleted and/or truncated.
- Unlike allocation size and file size, there is no mechanism for obtaining the VDL of a file from the underlying file system. This is a complicating issue for an isolation filter that is independently tracking this, as opening an existing NTFS file might cause re-entrant file zeroing. It is possible to set the VDL of a file, though it requires the privilege we mentioned earlier. That is because moving the VDL outward without zeroing the file could expose stale information to the application. We decided against this approach and instead fell back to the FAT model: zeroing the file at final cleanup or anytime a “hole” is created within the file. That’s unfortunate because it will create situations in which we perform I/O unnecessarily. Hopefully this is an oversight that Microsoft will rectify in a future version of Windows – that just doesn’t help us now.
Our approach to isolation is also complicated when it comes to size management because we explicitly do not do any serialization around calls to the file system. This makes it far less likely that we will end up with a deadlock – the original intent – but it also means that we can never assume that the file won’t change sizes while we are operating on it.
Of course, we have to assume file sizes can change at any time when accessing a file across the network. Simultaneous shared access to a file over the network is a reality that we must support (and will be a topic of future articles). Thus, we are prepared to deal with errors being returned to us when we issue I/O operations. But for the non-network case, how could this happen? We’ve observed several interesting cases:
- The file size changes while we are attempting to do I/O. This is the harsh reality of performing I/O operations without holding locks.
- The VM system may perform I/O operations while we are performing I/O. While in our own isolation filter we avoid cached writes, the file system beneath us is always free to convert a non-cached I/O into a cached I/O. We have observed this in practice with NTFS in recent versions of Windows.
- The file system may discard I/O operations to a file that is in the process of being deleted. This makes sense, as it’s inefficient to do I/O to a file that’s going to be discarded, but it leads to the peculiar situation of getting STATUS_SUCCESS back even though zero bytes were written to the file. In our case, we just treat it as expected behavior and continue on ahead.
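Because we hold no locks, our handling of that last case amounts to a disposition decision after the fact: re-query the (possibly shrunken) EOF, and if the write now lies at or past it, the file was truncated or is being deleted and the zero-byte “success” is expected. The sketch below models only that decision; the enum and helper name are ours, not a Windows API:

```c
#include <stdint.h>

typedef enum _WRITE_DISPOSITION {
    WriteDone,     /* the zero-byte "success" is expected; move on */
    WriteRetry     /* the sizes say the write should land; try again */
} WRITE_DISPOSITION;

/* Decide what to do when a write to the underlying file completes with
 * STATUS_SUCCESS but zero bytes transferred, given a freshly re-queried
 * EOF for the file. */
static WRITE_DISPOSITION DisposeZeroByteWrite(uint64_t requeriedEof,
                                              uint64_t writeOffset)
{
    return (writeOffset >= requeriedEof) ? WriteDone : WriteRetry;
}
```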
In our own work, we’ve found that managing file sizes was more complex than originally anticipated. While we are now successfully managing this, it took many weeks of concentrated effort as we constructed a viable model for how to manage this. Hopefully this brief description will help future developers in this area deal with the surprises that are in store for them.