By Daniel Terhell
“Windows is not a real-time operating system” is a phrase that’s often echoed on the NTDEV forum. It frequently comes up when someone runs into trouble trying to write a Windows driver for a device that was not designed with Windows compatibility in mind, such as a device that expects the software to respond within a short time frame.
The defining characteristic of a real-time operating system is that it offers predictable execution that can meet deadline requirements. It must be deterministic, guaranteeing that it can respond to requests within short time frames. Often, a distinction is made between soft real-time and hard real-time environments. For soft real-time, responsiveness is highly desired but a missed deadline does not count as total failure. In hard real-time environments, by comparison, there is no tolerance for unexpected latencies and missing any deadline means total failure. Often such environments are responsible for handling life-critical tasks.
On Windows, all requests to the operating system are processed on a best-effort basis, without any response time guarantee. While Windows is certainly capable of servicing hundreds of thousands of large requests per second, in general, it won’t be able to guarantee that every single request is processed within a specifically defined short time frame.
Windows is designed and optimized with performance and throughput in mind for general purpose uses, not for real-time tasks and low latency, which often have conflicting interests. Whereas a solution that is optimized for general purposes cares about things such as average throughput performance, a solution that has real-time requirements instead cares about maximum response times, the worst cases that can possibly lead to a deadline being missed. This means that every individual request or operation matters.
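The difference between the two viewpoints can be made concrete with a small sketch. The sample values and the 1000 µs deadline below are hypothetical and chosen only for illustration; the function name `summarize_latencies` is likewise an assumption, not part of any Windows tool.

```python
from statistics import mean

def summarize_latencies(samples_us, deadline_us):
    """Return the average, the worst case, and the number of missed deadlines."""
    return {
        "average_us": mean(samples_us),
        "worst_us": max(samples_us),
        "missed": sum(1 for s in samples_us if s > deadline_us),
    }

# Hypothetical response times in microseconds: 999 fast requests and one slow one.
samples = [20] * 999 + [5000]
report = summarize_latencies(samples, deadline_us=1000)
print(report)
```

A throughput-oriented view sees an average of about 25 µs and calls this system fast. A real-time view sees one response of 5000 µs against a 1000 µs deadline and calls it a failure.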
There exist many types of solutions that may have real-time requirements of some sort, ranging from industrial, medical, military, aviation, research, and high precision to multimedia applications. A common example of an application that has real-time demands is “real-time audio”, such as a software synthesizer that needs to produce sound in response to keys being pressed on a MIDI keyboard with no more than a few milliseconds of latency. An audio stream must be continuous and any interruption of the audio stream has audible consequences recognized as clicks, pops and dropouts. This means all requests must meet their deadlines.
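To get a feel for how tight these audio deadlines are, here is a rough sketch. It assumes a buffer-based audio pipeline in which each buffer must be filled before the previous one finishes playing; the buffer sizes and the 48 kHz sample rate are illustrative values, not figures from any particular product.

```python
def buffer_deadline_ms(frames_per_buffer, sample_rate_hz):
    """Time in milliseconds that one buffer of audio lasts, which is
    also the time available to produce the next buffer."""
    return 1000.0 * frames_per_buffer / sample_rate_hz

# Hypothetical buffer sizes at an assumed sample rate of 48 kHz.
for frames in (64, 128, 256, 512):
    print(frames, "frames:", round(buffer_deadline_ms(frames, 48_000), 3), "ms")
```

Under these assumptions a 64-frame buffer leaves about 1.3 ms per request, so a single delay of a few milliseconds anywhere in the processing path is enough to cause an audible dropout.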
It’s interesting to note that the problems of latency in real-time applications as well as the “key press before sound” latency problems are ancient and not merely artifacts of the computer era. Organ players, for example, must read ahead and hear back what they play with many seconds delay after the air has been processed through the pipes using sophisticated mechanisms and the sound has travelled from the corner of the church to the ear.
Latency problems are common in real-time audio applications because software synthesizers and audio plugins usually have their logic implemented in user-mode. This means they are more exposed to interruptions (e.g. paging, scheduling) than a device driver that runs at elevated IRQL.
Finally, note that for the purposes of this article, we’ll ignore certain Windows features such as IRQL PASSIVE_LEVEL interrupt handling and UMDF which, while handy for certain specific uses, are generally unsuitable designs for real-time processing.
Enemies of Low Latency
In order to better understand why Windows is not always a suitable operating system for real-time processing, we will first take a closer look at the path of an interrupt-driven device with deadline requirements that has its requests processed in user-mode. We will then examine possible problems that can cause undesired latencies on the way.
For a real-time sensitive device that is interrupt driven, it is essential that Windows can respond to requests in a timely manner. Because of the enemies of real-time processing discussed later, latency can be introduced during the processing path. Logically, during the processing of an interrupt, the lower the IRQL drops as processing proceeds from ISR to DPC and then on to user process servicing, the larger and more frequent the latency spikes that are to be expected.
A common way for a device to indicate that it requires attention from the software is to generate an interrupt. After a device fires an interrupt signal, it is received by a CPU which will execute the Interrupt Service Routine (ISR) that was registered for that interrupt.
The execution of the ISR can be delayed by various factors. The CPU may be temporarily deferring interrupts, or the operating system could be servicing a higher level interrupt or temporarily have its maskable interrupts disabled as a result of some system-wide lock state or some other condition. The interrupt may also be delayed by hardware factors. Unfortunately, it is not possible to measure such interrupt latencies through software alone; doing so requires assistance from the hardware or the use of a bus analyzer [In modern Intel architecture systems, we know that this hardware latency is very low, almost always less than one microsecond – Editors]. In this article, we will only discuss latencies during the software part of interrupt processing, that is, from the moment an ISR starts executing until the moment the interrupt has been serviced by the software.
As Windows does not allow prioritization control of device interrupts, at any time while an ISR routine is executing it can be preempted by a higher level interrupt unless it takes measures to prevent such preemption by raising IRQL. Latencies during ISR processing can also be introduced by other factors such as the operating system interrupt processing of the system clock, Inter Processor Interrupt (IPI) routines or even factors beyond the control of the operating system such as System Management Interrupt (SMI) routines and various hardware factors which will be discussed later.
If the servicing of the interrupt demands lengthy processing, then the ISR typically schedules a Deferred Procedure Call (DPC). This DPC continues processing the I/O operation at a lower IRQL where device and system interrupts can preempt its execution. A DPC is also required if interrupt servicing needs to take place in user-mode, as it is not permissible to signal waitable objects or wake up user-mode threads under the IRQL restrictions imposed on ISR routines.
Generally, the DPC starts executing immediately on the same processor on which it was requested. The operating system maintains a DPC queue per logical processor, so other DPC routines may execute before yours while the DPC queue on your processor is being drained, thus adding latency.
Because DPC routines execute at IRQL DISPATCH_LEVEL (an exception to this is threaded DPCs which execute at PASSIVE_LEVEL), latency can also be introduced by any type of device or hardware interrupt that can occur during the processing of the DPC.
The time interval from the moment an interrupt is received on a CPU until a DPC routine starts executing is often referred to as “ISR to DPC latency”. If the driver has deadline requirements, it will want to minimize the worst case ISR to DPC latencies. This is an important point, so don’t miss it: for real-time processing, what we care about minimizing is the worst case latency, not the average latency. Average ISR to DPC latency is almost always very good. Worst case ISR to DPC latency, on the other hand, can sometimes be astonishingly bad, and this is where artifacts are most often introduced. (See sidebar, Play By the Rules, below.)
On a side note, as a kernel developer, even if your driver has no real-time requirements whatsoever, you should still care to play fair and avoid spending too much time at elevated IRQL so that you won’t compromise the real-time capabilities of the system that your driver is running on. That includes time spent in ISRs, in DPCs as well as code executed while a spinlock is held or the IRQL is raised through other means. If you are a BIOS or firmware developer, this matters even more.
The MSDN guidelines on this are that no DPC or ISR routine should ever be executing for longer than 100 µs. In practice, many drivers are violating this rule, at least when run on certain hardware. If a driver is causing high latencies, it doesn’t necessarily mean the software is at fault; in many cases such problems can be fully attributed to the hardware. Network drivers (WiFi in particular), storage port drivers and ACPI battery drivers are among the most notorious for causing high latencies during interrupt processing. Often such drivers need to be disabled when configuring a system to handle tasks with real-time demands.
Executing user-mode code in the request processing path of a real-time solution should generally be avoided. However, there are situations and solutions that require this. As the user-mode code executes in a critical path, obviously that code should execute as quickly as possible. The user-mode code should preferably not block, wait or even call any Windows API library functions at all. It should only make use of resident memory, as a hard page fault can suspend execution of a thread for a long time while the page fault handler resolves it by synchronously reading in data from disk, which can take seconds.
Note that Windows API functions such as VirtualLock only allow an allocation to be assigned to the working set of a process and do not provide the application with memory which is guaranteed to be resident in RAM. A sure way to provide an application with resident memory is to have a driver lock buffers provided by the user into memory. There are numerous ways to do this, some more complex than others. One simple method is for the user application to allocate the buffer and provide it to the driver as the OutBuffer in an IOCTL using METHOD_OUT_DIRECT. The driver then keeps this request in progress, and with it the locked memory of the buffer, until the application exits. There are other, more complex, ways of locking shared memory between user-mode and kernel-mode as well. The advantages and disadvantages of each are outside the scope of this article.
User-mode threads that execute as part of interrupt processing are usually kept in a pool of real-time priority threads suspended in an idle wait state and woken up by a dispatcher object signal such as an event or I/O completion, set from a DPC routine.
The time interval from the moment an interrupt was received on a processor until a user-mode thread starts executing after having been woken up is often referred to as “ISR to process latency”. This includes the time required to schedule and execute a DPC routine as part of this process. In operating system courses this is sometimes referred to as “Process Dispatch Latency”. A solution with deadline requirements that does part of its processing in user-mode must minimize its ISR to process latencies.
In case a real-time critical user-mode thread needs to execute for extended periods of time, the scheduler of the operating system offers techniques that allow non-preemptive scheduling of a user-mode thread by giving it a “super” real-time priority so that its quantum never expires. This involves assigning the process to a job object with specific scheduler class settings.
Apart from problems with hardware interrupts, ISR routines, DPC routines, user-mode processing and hard page faults, there are some other enemies of low latency under the hood that cannot be neglected, as they can cause significant delays. Let’s discuss some of these now.
Inter Processor Interrupts are OS initiated interrupts that block all device interrupts while a specified routine is executed on all logical processors. They can be triggered by both hardware and software. Drivers such as ndis.sys make use of this technique which, along with heavy duty DPC processing, is one of the reasons why systems configured with networking enabled are often not suitable for real-time processing.
System Management Mode (SMM) is a special-purpose operating mode of the processor provided for handling system-wide functions like power management, system hardware control, or proprietary OEM provided code. It is intended for use only by the BIOS or firmware interface, not by applications or the operating system. System Management Interrupts are interrupt routines that the CPU executes in SMM beyond the control of the operating system. SMIs take precedence over any other maskable or non-maskable interrupt. Since an SMI can interrupt any processing path at any time, it needs to execute as quickly as possible or it will render an entire system unusable for processing real-time tasks.
A CPU core with a variable speed setting activated can reach a high temperature trip point after which it is temporarily kept in a “stop-clock mode” for several milliseconds to have it cool down. Interrupts occurring on this processor will be deferred until the CPU is allowed to run again.
Unexplainable CPU stalls can also be caused by bugs in the design of the CPU. Several CPU bugs in modern CPUs are related to processor C-states (which implement CPU idle behavior) and P-states (which implement CPU speed changes) and can freeze a processor indefinitely until certain conditions are met. CPU manufacturers publish errata (spec updates) with such information.
Many of the factors which cause latencies that have been mentioned are impossible for a developer to control from software without controlling the system configuration. It remains possible, however, to measure most of the capabilities of a system for real-time processing through software, so that end users can be notified in case their system appears not to be compliant.
We have outlined a great number of potential dangers for a driver with real-time requirements. If a driver is to support an off-the-shelf piece of hardware that has any sort of real-time requirements and that is to run on an arbitrary system, basically all bets are off for its success. Therefore it may help to test the capabilities of a customer system before installation or acquisition to prevent failure and customer disappointment, or to help end users overcome the obstacles of running your solution by helping them configure their system.
There are several utilities that allow you to measure execution times of ISRs and DPCs. One of them is XPERF which comes with the optionally installed Windows Performance Toolkit. The Windows Performance Toolkit is included with both the Windows WDK and SDK. For more information about how to use XPERF, check out The NT Insider article Get Low – Collecting Detailed Performance Data with Xperf.
Another utility is LatencyMon which reports ISR and DPC execution times as well as hard page faults. It also offers some different latency measuring tools including ISR to DPC and ISR to user process latencies. LatencyMon is written by the author of this article.
As the author of the driver code, it is fairly trivial to add measuring points yourself by using the KeQueryPerformanceCounter function. To measure the execution time of your DPC routine, you can simply query the performance counter at the beginning and end of the routine and check the difference. You can use the same technique to measure ISR to DPC and ISR to user process latencies as well.
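The arithmetic behind this technique can be sketched as follows. Real driver code would be written in C and would take the tick values and the counter frequency from KeQueryPerformanceCounter. The Python below only models the conversion and the bookkeeping, and the `LatencyTracker` class, its field names, the 10 MHz frequency, and the tick values are all hypothetical.

```python
def ticks_to_us(start_ticks, end_ticks, frequency_hz):
    """Convert a performance counter difference to microseconds."""
    return (end_ticks - start_ticks) * 1_000_000 / frequency_hz

class LatencyTracker:
    """Keeps the worst case values seen so far, since the worst case
    is what matters for real-time processing."""

    def __init__(self, frequency_hz):
        self.frequency_hz = frequency_hz
        self.worst_isr_to_dpc_us = 0.0
        self.worst_dpc_exec_us = 0.0

    def record(self, isr_ticks, dpc_start_ticks, dpc_end_ticks):
        # isr_ticks: counter value captured in the ISR.
        # dpc_start_ticks / dpc_end_ticks: captured at DPC entry and exit.
        isr_to_dpc = ticks_to_us(isr_ticks, dpc_start_ticks, self.frequency_hz)
        dpc_exec = ticks_to_us(dpc_start_ticks, dpc_end_ticks, self.frequency_hz)
        self.worst_isr_to_dpc_us = max(self.worst_isr_to_dpc_us, isr_to_dpc)
        self.worst_dpc_exec_us = max(self.worst_dpc_exec_us, dpc_exec)
        return isr_to_dpc, dpc_exec

# Hypothetical counter running at 10 MHz, with made-up tick values.
tracker = LatencyTracker(frequency_hz=10_000_000)
print(tracker.record(isr_ticks=1000, dpc_start_ticks=1250, dpc_end_ticks=2250))
```

With these made-up values, the DPC started 25 µs after the ISR and ran for 100 µs. A third timestamp taken when the user-mode thread wakes up would extend the same approach to ISR to user process latency.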
A technique that has been used in the past to measure “DPC latencies” was to install a periodic kernel timer and measure the accuracy of its interval. Timers in Windows are software based and depend on the resolution of the clock interrupt. The resolution of the system clock is a global resource that several software components and applications compete for, in such a way that only requests to lower the clock interval will be honored (i.e. the application with the lowest request wins). Because timers on Windows do not have direct support from hardware, they are not very accurate. That is even more so on Windows 8, which introduces a new feature called “dynamic clock tick”, where the system clock does not interrupt at fixed intervals but rather when the operating system deems it necessary, for power saving reasons. This has caused any measuring method dependent on a kernel timer to become unreliable. If a device needs accurate periodic attention from software, the hardware should come equipped with its own timer that can fire interrupts so that it can be serviced by software in a timely fashion. This is true despite the fact that the Windows 8.1 WDK has a new feature called high resolution timers.
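The shape of that older measuring technique can be illustrated with a user-mode analogue. This is only a sketch: it uses Python’s `time.sleep` as a stand-in for a periodic kernel timer, and the function name `measure_timer_jitter` and its parameters are assumptions made for this example. It inherits the weakness described above, because its result depends on the current clock resolution as much as on any latency in the system.

```python
import time

def measure_timer_jitter(interval_s, iterations,
                         clock=time.perf_counter, sleep=time.sleep):
    """Return the largest deviation, in seconds, between the requested
    period and the period that was actually observed."""
    worst = 0.0
    previous = clock()
    for _ in range(iterations):
        sleep(interval_s)  # stand-in for waiting on a periodic timer
        now = clock()
        worst = max(worst, abs((now - previous) - interval_s))
        previous = now
    return worst

# Request a 1 ms period 20 times and report the worst deviation seen.
print("worst deviation (s):", measure_timer_jitter(0.001, 20))
```

The `clock` and `sleep` parameters exist only so the function can be exercised with substitute time sources. On a system with a coarse clock interval, the reported deviation will mostly reflect timer granularity rather than ISR or DPC activity.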
Starting with Windows Vista, Windows offers a set of functions and classes to retrieve event tracing information which is collected by the operating system kernel. Among other things, this will allow you to obtain information about ISRs and DPCs as well as hard page faults executed in the system. Unfortunately, these classes do not allow you to collect information about time spent at elevated IRQL due to causes other than ISRs and DPCs, such as for example code running under a spinlock and IPIs.
One method used to measure the maximum execution times of SMIs and unexplainable stalls of the processor is to measure the longest interruption of a tight loop that spins at IRQL HIGH_LEVEL. This opportunistic method of polling will not, however, measure the execution of SMI routines initiated by software.
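The core of the spinning-loop method is simple enough to sketch. A real implementation would run in a driver at HIGH_LEVEL, so that nothing maskable could interrupt the loop. The Python version below runs in user mode, where the scheduler, ordinary interrupts and the interpreter itself all add gaps, so it only illustrates the measurement logic. The function name `longest_gap_ns` is an assumption made for this example.

```python
import time

def longest_gap_ns(duration_ns, clock=time.perf_counter_ns):
    """Spin for roughly duration_ns and return the largest gap observed
    between two consecutive readings of the clock."""
    start = previous = clock()
    worst = 0
    while previous - start < duration_ns:
        now = clock()
        # A large gap means something took the processor away from the loop.
        worst = max(worst, now - previous)
        previous = now
    return worst

# Spin for about 50 ms and report the longest interruption that was seen.
print("longest gap (ns):", longest_gap_ns(50_000_000))
```

Since the loop can only notice interruptions that happen while it is running, a long measuring period is needed before the reported maximum says much about the worst case.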
Similarly, you can measure the execution times of IPIs by performing the spinning loop at an IRQL of just below IPI_LEVEL. One thing to take into consideration here is that some versions of Windows can use an IRQL management technique known as “lazy IRQL.” In this technique the IRQL value stored by the system does not always correspond to the actual processor’s task priority state.
Interrupts are bound to processors. While Windows allows full control over which processors in your system execute threads by setting affinities, there is little control over which processors handle device interrupts. Which processors execute the ISR associated with a particular device depends mostly on the hardware, and Windows normally honors the BIOS settings for interrupt affinity. Some chipsets cause interrupts to be spread across all processors while others cause interrupts to be exclusively executed on CPU 0. This makes it not very useful for software that is to run on arbitrary hardware to choose the processors on which it is to run through affinities. The situation is different if you have the option of supplying your own system, which is discussed below. The GetProcessorSystemCycleTime API function is a quick and easy way to find out which processors in the system are actually servicing interrupts and DPCs.
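One way to use such per-processor figures is to take two snapshots some time apart and compare them. The sketch below assumes the snapshots have already been collected as one cycle count per processor; it does not call the Windows API itself, and the function name `interrupt_load_by_cpu` and the numbers are hypothetical.

```python
def interrupt_load_by_cpu(before, after):
    """Given two snapshots of per-processor ISR/DPC cycle counts, return
    each processor's share of the cycles accumulated between them."""
    deltas = [later - earlier for earlier, later in zip(before, after)]
    total = sum(deltas)
    if total == 0:
        return [0.0] * len(deltas)
    return [delta / total for delta in deltas]

# Hypothetical snapshots for a four-processor system.
before = [1000, 500, 500, 200]
after = [9000, 500, 2500, 200]
print(interrupt_load_by_cpu(before, after))
```

In this made-up example, CPU 0 handled 80% of the interrupt and DPC work, CPU 2 handled the remaining 20%, and CPUs 1 and 3 handled none during the interval.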
Controlling the Hardware
As was mentioned previously, if you are developing drivers for a device or solution with real-time requirements, chances are that it may not meet its goals on certain systems it’s deployed on, due to the specifics of those systems’ configuration.
If you have the luxury of supplying a controlled configuration that includes all the hardware on which your solution is to be run, then you have many options at your disposal to avoid failure. These options are not available to developers of an off-the-shelf product. Depending on the deadline and market requirements of your solution, choosing a specific system configuration may allow you to deliver an end solution that is guaranteed to work.
Once the system configuration is entirely under your control, you have the option of adding hardware to assist your latency sensitive tasks. One option is to implement the real-time logic in hardware, for instance, by using an FPGA or DSP board. There are also real-time (software) extensions for Windows that allow you to meet real-time requirements without additional hardware, such as IntervalZero RTX and Tenasys Intime. These solutions work by running a separate operating system alongside Windows which will be responsible for executing the real-time logic.
But even without exotic hardware or real-time extensions there is a lot that can be done if you control the overall system configuration. If the required response time (or accuracy) of your solution is not very tight, say ten microseconds or longer, then you also have the option of configuring a Windows system with selected hardware and drivers that are known not to introduce high latencies. This lets you deliver a solution that runs entirely on Windows and is still guaranteed to meet its deadline requirements.
By carefully selecting the motherboard chipset and drivers in the configuration, it’s possible to “control” which processors will be connected to interrupts, allowing you to reserve one or more processors for real-time critical tasks through affinities while effectively avoiding latencies induced by ISRs and DPCs. Control over which CPUs handle ISRs, DPCs and threads can be taken even further by booting the system with special parameters for processor group awareness testing.
An important part of a custom configuration is power management. As previously discussed, several power management features of the CPU and the BIOS can cause severe real-time processing latencies. Disabling these features, where possible, can avoid unsuitable latencies in real-time processing.
People in the audio industry have done lots of research on how to configure their Windows workstations for low latency so there is plenty of information available on the Internet. A good keyword here is DAW (Digital Audio Workstation). Also there are computer manufacturers who specialize in tailoring and delivering desktops and notebooks for low latency tasks running on Windows.
As you can see, getting a Windows system ready to handle real-time tasks requires some measuring, puzzling, and even taking into consideration some factors that are quite uncertain. After having tweaked and configured a system along with your solution so that it works, and having tested it, you may still feel uncomfortable providing a “guarantee” to your customer when the proof of that guarantee rests on statistical analysis of measurements taken during operation.
Remember: Windows is not an RTOS. If your solution requires guaranteed response times from Windows, you may have your solution working most of the time, depending on the end user’s configuration. The questions then become how frequently your real-time requirements are not met, what the consequences of not meeting those requirements are, and whether any further tailoring or system configuration needs to be done to improve the result. Hopefully this article explains the problems, how to overcome some of the issues, and what resources are available to you as a kernel developer who needs to deliver a real-time sensitive solution on Microsoft Windows.
Daniel started out programming at a young age on a ZX-Spectrum on which he soon learnt BASIC and assembly language and later C. After a career as an application developer he mainly specialized in system software and device drivers. Daniel is founder of Resplendence Software Projects which develops advanced system tools and developer components for Microsoft Windows. He can be contacted at firstname.lastname@example.org