1 2hrtimers - subsystem for high-resolution kernel timers 3---------------------------------------------------- 4 5This patch introduces a new subsystem for high-resolution kernel timers. 6 7One might ask the question: we already have a timer subsystem 8(kernel/timers.c), why do we need two timer subsystems? After a lot of 9back and forth trying to integrate high-resolution and high-precision 10features into the existing timer framework, and after testing various 11such high-resolution timer implementations in practice, we came to the 12conclusion that the timer wheel code is fundamentally not suitable for 13such an approach. We initially didn't believe this ('there must be a way 14to solve this'), and spent a considerable effort trying to integrate 15things into the timer wheel, but we failed. In hindsight, there are 16several reasons why such integration is hard/impossible: 17 18- the forced handling of low-resolution and high-resolution timers in 19 the same way leads to a lot of compromises, macro magic and #ifdef 20 mess. The timers.c code is very "tightly coded" around jiffies and 21 32-bitness assumptions, and has been honed and micro-optimized for a 22 relatively narrow use case (jiffies in a relatively narrow HZ range) 23 for many years - and thus even small extensions to it easily break 24 the wheel concept, leading to even worse compromises. The timer wheel 25 code is very good and tight code, there's zero problems with it in its 26 current usage - but it is simply not suitable to be extended for 27 high-res timers. 28 29- the unpredictable [O(N)] overhead of cascading leads to delays which 30 necessitate a more complex handling of high resolution timers, which 31 in turn decreases robustness. Such a design still led to rather large 32 timing inaccuracies. Cascading is a fundamental property of the timer 33 wheel concept, it cannot be 'designed out' without unevitably 34 degrading other portions of the timers.c code in an unacceptable way. 35 36- the implementation of the current posix-timer subsystem on top of 37 the timer wheel has already introduced a quite complex handling of 38 the required readjusting of absolute CLOCK_REALTIME timers at 39 settimeofday or NTP time - further underlying our experience by 40 example: that the timer wheel data structure is too rigid for high-res 41 timers. 42 43- the timer wheel code is most optimal for use cases which can be 44 identified as "timeouts". Such timeouts are usually set up to cover 45 error conditions in various I/O paths, such as networking and block 46 I/O. The vast majority of those timers never expire and are rarely 47 recascaded because the expected correct event arrives in time so they 48 can be removed from the timer wheel before any further processing of 49 them becomes necessary. Thus the users of these timeouts can accept 50 the granularity and precision tradeoffs of the timer wheel, and 51 largely expect the timer subsystem to have near-zero overhead. 52 Accurate timing for them is not a core purpose - in fact most of the 53 timeout values used are ad-hoc. For them it is at most a necessary 54 evil to guarantee the processing of actual timeout completions 55 (because most of the timeouts are deleted before completion), which 56 should thus be as cheap and unintrusive as possible. 57 58The primary users of precision timers are user-space applications that 59utilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel 60users like drivers and subsystems which require precise timed events 61(e.g. multimedia) can benefit from the availability of a separate 62high-resolution timer subsystem as well. 63 64While this subsystem does not offer high-resolution clock sources just 65yet, the hrtimer subsystem can be easily extended with high-resolution 66clock capabilities, and patches for that exist and are maturing quickly. 67The increasing demand for realtime and multimedia applications along 68with other potential users for precise timers gives another reason to 69separate the "timeout" and "precise timer" subsystems. 70 71Another potential benefit is that such a separation allows even more 72special-purpose optimization of the existing timer wheel for the low 73resolution and low precision use cases - once the precision-sensitive 74APIs are separated from the timer wheel and are migrated over to 75hrtimers. E.g. we could decrease the frequency of the timeout subsystem 76from 250 Hz to 100 HZ (or even smaller). 77 78hrtimer subsystem implementation details 79---------------------------------------- 80 81the basic design considerations were: 82 83- simplicity 84 85- data structure not bound to jiffies or any other granularity. All the 86 kernel logic works at 64-bit nanoseconds resolution - no compromises. 87 88- simplification of existing, timing related kernel code 89 90another basic requirement was the immediate enqueueing and ordering of 91timers at activation time. After looking at several possible solutions 92such as radix trees and hashes, we chose the red black tree as the basic 93data structure. Rbtrees are available as a library in the kernel and are 94used in various performance-critical areas of e.g. memory management and 95file systems. The rbtree is solely used for time sorted ordering, while 96a separate list is used to give the expiry code fast access to the 97queued timers, without having to walk the rbtree. 98 99(This separate list is also useful for later when we'll introduce 100high-resolution clocks, where we need separate pending and expired 101queues while keeping the time-order intact.) 102 103Time-ordered enqueueing is not purely for the purposes of 104high-resolution clocks though, it also simplifies the handling of 105absolute timers based on a low-resolution CLOCK_REALTIME. The existing 106implementation needed to keep an extra list of all armed absolute 107CLOCK_REALTIME timers along with complex locking. In case of 108settimeofday and NTP, all the timers (!) had to be dequeued, the 109time-changing code had to fix them up one by one, and all of them had to 110be enqueued again. The time-ordered enqueueing and the storage of the 111expiry time in absolute time units removes all this complex and poorly 112scaling code from the posix-timer implementation - the clock can simply 113be set without having to touch the rbtree. This also makes the handling 114of posix-timers simpler in general. 115 116The locking and per-CPU behavior of hrtimers was mostly taken from the 117existing timer wheel code, as it is mature and well suited. Sharing code 118was not really a win, due to the different data structures. Also, the 119hrtimer functions now have clearer behavior and clearer names - such as 120hrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly 121equivalent to del_timer() and del_timer_sync()] - so there's no direct 1221:1 mapping between them on the algorithmical level, and thus no real 123potential for code sharing either. 124 125Basic data types: every time value, absolute or relative, is in a 126special nanosecond-resolution type: ktime_t. The kernel-internal 127representation of ktime_t values and operations is implemented via 128macros and inline functions, and can be switched between a "hybrid 129union" type and a plain "scalar" 64bit nanoseconds representation (at 130compile time). The hybrid union type optimizes time conversions on 32bit 131CPUs. This build-time-selectable ktime_t storage format was implemented 132to avoid the performance impact of 64-bit multiplications and divisions 133on 32bit CPUs. Such operations are frequently necessary to convert 134between the storage formats provided by kernel and userspace interfaces 135and the internal time format. (See include/linux/ktime.h for further 136details.) 137 138hrtimers - rounding of timer values 139----------------------------------- 140 141the hrtimer code will round timer events to lower-resolution clocks 142because it has to. Otherwise it will do no artificial rounding at all. 143 144one question is, what resolution value should be returned to the user by 145the clock_getres() interface. This will return whatever real resolution 146a given clock has - be it low-res, high-res, or artificially-low-res. 147 148hrtimers - testing and verification 149---------------------------------- 150 151We used the high-resolution clock subsystem ontop of hrtimers to verify 152the hrtimer implementation details in praxis, and we also ran the posix 153timer tests in order to ensure specification compliance. We also ran 154tests on low-resolution clocks. 155 156The hrtimer patch converts the following kernel functionality to use 157hrtimers: 158 159 - nanosleep 160 - itimers 161 - posix-timers 162 163The conversion of nanosleep and posix-timers enabled the unification of 164nanosleep and clock_nanosleep. 165 166The code was successfully compiled for the following platforms: 167 168 i386, x86_64, ARM, PPC, PPC64, IA64 169 170The code was run-tested on the following platforms: 171 172 i386(UP/SMP), x86_64(UP/SMP), ARM, PPC 173 174hrtimers were also integrated into the -rt tree, along with a 175hrtimers-based high-resolution clock implementation, so the hrtimers 176code got a healthy amount of testing and use in practice. 177 178 Thomas Gleixner, Ingo Molnar 179