<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook V3.1//EN"[]>

<book id="LKLockingGuide">
 <bookinfo>
  <title>Unreliable Guide To Locking</title>

  <authorgroup>
   <author>
    <firstname>Paul</firstname>
    <othername>Rusty</othername>
    <surname>Russell</surname>
    <affiliation>
     <address>
      <email>rusty@rustcorp.com.au</email>
     </address>
    </affiliation>
   </author>
  </authorgroup>

  <copyright>
   <year>2000</year>
   <holder>Paul Russell</holder>
  </copyright>

  <legalnotice>
   <para>
     This documentation is free software; you can redistribute
     it and/or modify it under the terms of the GNU General Public
     License as published by the Free Software Foundation; either
     version 2 of the License, or (at your option) any later
     version.
   </para>

   <para>
     This program is distributed in the hope that it will be
     useful, but WITHOUT ANY WARRANTY; without even the implied
     warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
     See the GNU General Public License for more details.
   </para>

   <para>
     You should have received a copy of the GNU General Public
     License along with this program; if not, write to the Free
     Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
     MA 02111-1307 USA
   </para>

   <para>
     For more details see the file COPYING in the source
     distribution of Linux.
   </para>
  </legalnotice>
 </bookinfo>

 <toc></toc>
 <chapter id="intro">
  <title>Introduction</title>
  <para>
    Welcome to Rusty's Remarkably Unreliable Guide to Kernel
    Locking issues. This document describes the locking systems in
    the Linux Kernel as we approach 2.4.
  </para>
  <para>
    It looks like <firstterm linkend="gloss-smp"><acronym>SMP</acronym>
    </firstterm> is here to stay; so everyone hacking on the kernel
    these days needs to know the fundamentals of concurrency and locking
    for SMP.
  </para>

  <sect1 id="races">
   <title>The Problem With Concurrency</title>
   <para>
     (Skip this if you know what a Race Condition is).
   </para>
   <para>
     In a normal program, you can increment a counter like so:
   </para>
   <programlisting>
very_important_count++;
   </programlisting>

   <para>
     This is what you would expect to happen:
   </para>

   <table>
    <title>Expected Results</title>

    <tgroup cols=2 align=left>

     <thead>
      <row>
       <entry>Instance 1</entry>
       <entry>Instance 2</entry>
      </row>
     </thead>

     <tbody>
      <row>
       <entry>read very_important_count (5)</entry>
       <entry></entry>
      </row>
      <row>
       <entry>add 1 (6)</entry>
       <entry></entry>
      </row>
      <row>
       <entry>write very_important_count (6)</entry>
       <entry></entry>
      </row>
      <row>
       <entry></entry>
       <entry>read very_important_count (6)</entry>
      </row>
      <row>
       <entry></entry>
       <entry>add 1 (7)</entry>
      </row>
      <row>
       <entry></entry>
       <entry>write very_important_count (7)</entry>
      </row>
     </tbody>

    </tgroup>
   </table>

   <para>
     This is what might happen:
   </para>

   <table>
    <title>Possible Results</title>

    <tgroup cols=2 align=left>
     <thead>
      <row>
       <entry>Instance 1</entry>
       <entry>Instance 2</entry>
      </row>
     </thead>

     <tbody>
      <row>
       <entry>read very_important_count (5)</entry>
       <entry></entry>
      </row>
      <row>
       <entry></entry>
       <entry>read very_important_count (5)</entry>
      </row>
      <row>
       <entry>add 1 (6)</entry>
       <entry></entry>
      </row>
      <row>
       <entry></entry>
       <entry>add 1 (6)</entry>
      </row>
      <row>
       <entry>write very_important_count (6)</entry>
       <entry></entry>
      </row>
      <row>
       <entry></entry>
       <entry>write very_important_count (6)</entry>
      </row>
     </tbody>
    </tgroup>
   </table>

   <para>
     This overlap, where what actually happens depends on the
     relative timing of multiple tasks, is called a race condition.
     The piece of code containing the concurrency issue is called a
     critical region. Especially since Linux started running on SMP
     machines, race conditions have become one of the major issues
     in kernel design and implementation.
   </para>
   <para>
     The solution is to recognize when these simultaneous accesses
     occur, and use locks to make sure that only one instance can
     enter the critical region at any time. There are many
     friendly primitives in the Linux kernel to help you do this.
     And then there are the unfriendly primitives, but I'll pretend
     they don't exist.
   </para>
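   <para>
     For example, a minimal sketch of fixing the counter above with a
     spinlock, assuming every user runs in user context on SMP (the
     lock name here is invented for illustration):
   </para>
   <programlisting>
static spinlock_t very_important_lock = SPIN_LOCK_UNLOCKED;

/* Only one CPU can hold the lock at a time, so the
   read-modify-write sequence is no longer racy. */
spin_lock(&very_important_lock);
very_important_count++;
spin_unlock(&very_important_lock);
   </programlisting>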
  </sect1>
 </chapter>

 <chapter id="locks">
  <title>Two Main Types of Kernel Locks: Spinlocks and Semaphores</title>

  <para>
    There are two main types of kernel locks. The fundamental type
    is the spinlock
    (<filename class=headerfile>include/asm/spinlock.h</filename>),
    which is a very simple single-holder lock: if you can't get the
    spinlock, you keep trying (spinning) until you can. Spinlocks are
    very small and fast, and can be used anywhere.
  </para>
  <para>
    The second type is a semaphore
    (<filename class=headerfile>include/asm/semaphore.h</filename>): it
    can have more than one holder at any time (the number decided at
    initialization time), although it is most commonly used as a
    single-holder lock (a mutex). If you can't get a semaphore,
    your task will put itself on the queue, and be woken up when the
    semaphore is released. This means the CPU will do something
    else while you are waiting, but there are many cases when you
    simply can't sleep (see <xref linkend="sleeping-things">), and so
    have to use a spinlock instead.
  </para>
  <para>
    Neither type of lock is recursive: see
    <xref linkend="techniques-deadlocks">.
  </para>

  <sect1 id="uniprocessor">
   <title>Locks and Uniprocessor Kernels</title>

   <para>
     For kernels compiled without <symbol>CONFIG_SMP</symbol>, spinlocks
     do not exist at all. This is an excellent design decision: when
     no-one else can run at the same time, there is no reason to
     have a lock at all.
   </para>

   <para>
     You should always test your locking code with <symbol>CONFIG_SMP</symbol>
     enabled, even if you don't have an SMP test box, because it
     will still catch some (simple) kinds of deadlock.
   </para>

   <para>
     Semaphores still exist, because they are required for
     synchronization between <firstterm linkend="gloss-usercontext">user
     contexts</firstterm>, as we will see below.
   </para>
  </sect1>

  <sect1 id="rwlocks">
   <title>Read/Write Lock Variants</title>

   <para>
     Both spinlocks and semaphores have read/write variants:
     <type>rwlock_t</type> and <structname>struct rw_semaphore</structname>.
     These divide users into two classes: the readers and the writers. If
     you are only reading the data, you can get a read lock, but to write to
     the data you need the write lock. Many people can hold a read lock,
     but a writer must be the sole holder.
   </para>

   <para>
     This means much smoother locking if your code divides up
     neatly along reader/writer lines. All the discussions below
     also apply to read/write variants.
   </para>
  </sect1>

  <sect1 id="usercontextlocking">
   <title>Locking Only In User Context</title>

   <para>
     If you have a data structure which is only ever accessed from
     user context, then you can use a simple semaphore
     (<filename>include/asm/semaphore.h</filename>) to protect it. This
     is the most trivial case: you initialize the semaphore to the number
     of resources available (usually 1), and call
     <function>down_interruptible()</function> to grab the semaphore, and
     <function>up()</function> to release it. There is also a
     <function>down()</function>, which should be avoided, because it
     will not return if a signal is received.
   </para>
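   <para>
     A minimal sketch of this pattern (the semaphore and data names are
     invented for illustration):
   </para>
   <programlisting>
static DECLARE_MUTEX(my_data_sem);      /* semaphore initialized to 1 */
static int my_data;

int set_my_data(int new_value)
{
        /* Returns non-zero if a signal arrives while we sleep. */
        if (down_interruptible(&my_data_sem))
                return -EINTR;
        my_data = new_value;
        up(&my_data_sem);
        return 0;
}
   </programlisting>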
   <para>
     Example: <filename>linux/net/core/netfilter.c</filename> allows
     registration of new <function>setsockopt()</function> and
     <function>getsockopt()</function> calls, with
     <function>nf_register_sockopt()</function>. Registration and
     de-registration are only done on module load and unload (and boot
     time, where there is no concurrency), and the list of registrations
     is only consulted for an unknown <function>setsockopt()</function>
     or <function>getsockopt()</function> system call. The
     <varname>nf_sockopt_mutex</varname> is perfect to protect this,
     especially since the setsockopt and getsockopt calls may well
     sleep.
   </para>
  </sect1>

  <sect1 id="lock-user-bh">
   <title>Locking Between User Context and BHs</title>

   <para>
     If a <firstterm linkend="gloss-bh">bottom half</firstterm> shares
     data with user context, you have two problems. Firstly, the current
     user context can be interrupted by a bottom half, and secondly, the
     critical region could be entered from another CPU. This is where
     <function>spin_lock_bh()</function>
     (<filename class=headerfile>include/linux/spinlock.h</filename>) is
     used. It disables bottom halves on that CPU, then grabs the lock.
     <function>spin_unlock_bh()</function> does the reverse.
   </para>

   <para>
     This works perfectly for <firstterm linkend="gloss-up"><acronym>UP
     </acronym></firstterm> as well: the spin lock vanishes, and this macro
     simply becomes <function>local_bh_disable()</function>
     (<filename class=headerfile>include/asm/softirq.h</filename>), which
     protects you from the bottom half being run.
   </para>
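   <para>
     A minimal sketch of the user-context side (the lock and queue names
     are invented for illustration; the bottom half itself would take the
     same lock with plain <function>spin_lock()</function>):
   </para>
   <programlisting>
static spinlock_t queue_lock = SPIN_LOCK_UNLOCKED;
static struct foo *queue_head;

/* Called from user context only. */
void queue_add(struct foo *new)
{
        /* Block the bottom half on this CPU, lock out other CPUs. */
        spin_lock_bh(&queue_lock);
        new->next = queue_head;
        queue_head = new;
        spin_unlock_bh(&queue_lock);
}
   </programlisting>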
  </sect1>

  <sect1 id="lock-user-tasklet">
   <title>Locking Between User Context and Tasklets/Soft IRQs</title>

   <para>
     This is exactly the same as above, because
     <function>local_bh_disable()</function> actually disables all
     softirqs and <firstterm linkend="gloss-tasklet">tasklets</firstterm>
     on that CPU as well. It should really be
     called `local_softirq_disable()', but the name has been preserved
     for historical reasons. Similarly,
     <function>spin_lock_bh()</function> would now be called
     spin_lock_softirq() in a perfect world.
   </para>
  </sect1>

  <sect1 id="lock-bh">
   <title>Locking Between Bottom Halves</title>

   <para>
     Sometimes a bottom half might want to share data with
     another bottom half (especially remember that timers are run
     off a bottom half).
   </para>

   <sect2 id="lock-bh-same">
    <title>The Same BH</title>

    <para>
      Since a bottom half is never run on two CPUs at once, you
      don't need to worry about your bottom half being run twice
      at once, even on SMP.
    </para>
   </sect2>

   <sect2 id="lock-bh-different">
    <title>Different BHs</title>

    <para>
      Since only one bottom half ever runs at a time, you
      don't need to worry about race conditions with other bottom
      halves. Beware that things might change under you, however,
      if someone changes your bottom half to a tasklet. If you
      want to make your code future-proof, pretend you're already
      running from a tasklet (see below), and do the extra
      locking. Of course, if it's five years before that happens,
      you're gonna look like a damn fool.
    </para>
   </sect2>
  </sect1>

  <sect1 id="lock-tasklets">
   <title>Locking Between Tasklets</title>

   <para>
     Sometimes a tasklet might want to share data with another
     tasklet, or a bottom half.
   </para>

   <sect2 id="lock-tasklets-same">
    <title>The Same Tasklet</title>
    <para>
      Since a tasklet is never run on two CPUs at once, you don't
      need to worry about your tasklet being reentrant (running
      twice at once), even on SMP.
    </para>
   </sect2>

   <sect2 id="lock-tasklets-different">
    <title>Different Tasklets</title>
    <para>
      If another tasklet (or bottom half, such as a timer) wants
      to share data with your tasklet, you will both need to use
      <function>spin_lock()</function> and
      <function>spin_unlock()</function> calls.
      <function>spin_lock_bh()</function> is
      unnecessary here, as you are already in a tasklet, and
      none will be run at the same time on the same CPU.
    </para>
   </sect2>
  </sect1>

  <sect1 id="lock-softirqs">
   <title>Locking Between Softirqs</title>

   <para>
     Often a <firstterm linkend="gloss-softirq">softirq</firstterm> might
     want to share data with itself, a tasklet, or a bottom half.
   </para>

   <sect2 id="lock-softirqs-same">
    <title>The Same Softirq</title>

    <para>
      The same softirq can run on the other CPUs: you can use a
      per-CPU array (see <xref linkend="per-cpu">) for better
      performance. If you're going so far as to use a softirq,
      you probably care about scalable performance enough
      to justify the extra complexity.
    </para>

    <para>
      You'll need to use <function>spin_lock()</function> and
      <function>spin_unlock()</function> for shared data.
    </para>
   </sect2>

   <sect2 id="lock-softirqs-different">
    <title>Different Softirqs</title>

    <para>
      You'll need to use <function>spin_lock()</function> and
      <function>spin_unlock()</function> for shared data, whether it
      be a timer (which can be running on a different CPU), bottom half,
      tasklet or the same or another softirq.
    </para>
   </sect2>
  </sect1>
 </chapter>

 <chapter id="hardirq-context">
  <title>Hard IRQ Context</title>

  <para>
    Hardware interrupts usually communicate with a bottom half,
    tasklet or softirq. Frequently this involves putting work in a
    queue, which the BH/softirq will take out.
  </para>

  <sect1 id="hardirq-softirq">
   <title>Locking Between Hard IRQ and Softirqs/Tasklets/BHs</title>

   <para>
     If a hardware irq handler shares data with a softirq, you have
     two concerns. Firstly, the softirq processing can be
     interrupted by a hardware interrupt, and secondly, the
     critical region could be entered by a hardware interrupt on
     another CPU. This is where <function>spin_lock_irq()</function> is
     used. It is defined to disable interrupts on that CPU, then grab
     the lock. <function>spin_unlock_irq()</function> does the reverse.
   </para>

   <para>
     This works perfectly for UP as well: the spin lock vanishes,
     and this macro simply becomes <function>local_irq_disable()</function>
     (<filename class=headerfile>include/asm/smp.h</filename>), which
     protects you from the softirq/tasklet/BH being run.
   </para>

   <para>
     <function>spin_lock_irqsave()</function>
     (<filename>include/linux/spinlock.h</filename>) is a variant
     which saves whether interrupts were on or off in a flags word,
     which is passed to <function>spin_unlock_irqrestore()</function>. This
     means that the same code can be used inside a hard irq handler (where
     interrupts are already off) and in softirqs (where the irq
     disabling is required).
   </para>
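   <para>
     A minimal sketch of the irqsave variant (the lock name and function
     are invented for illustration):
   </para>
   <programlisting>
static spinlock_t dev_lock = SPIN_LOCK_UNLOCKED;

/* Safe whether called from a hard irq handler, a softirq
   or user context: interrupt state is saved and restored. */
void dev_queue_work(struct foo *new)
{
        unsigned long flags;

        spin_lock_irqsave(&dev_lock, flags);
        ... add new to the device queue ...
        spin_unlock_irqrestore(&dev_lock, flags);
}
   </programlisting>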
  </sect1>
 </chapter>

 <chapter id="common-techniques">
  <title>Common Techniques</title>

  <para>
    This section lists some common dilemmas and the standard
    solutions used in the Linux kernel code. If you use these,
    people will find your code simpler to understand.
  </para>

  <para>
    If I could give you one piece of advice: never sleep with anyone
    crazier than yourself. But if I had to give you advice on
    locking: <emphasis>keep it simple</emphasis>.
  </para>

  <para>
    Lock data, not code.
  </para>

  <para>
    Be reluctant to introduce new locks.
  </para>

  <para>
    Strangely enough, this is the exact reverse of my advice when
    you <emphasis>have</emphasis> slept with someone crazier than yourself.
  </para>

  <sect1 id="techniques-no-writers">
   <title>No Writers in Interrupt Context</title>

   <para>
     There is a fairly common case where an interrupt handler needs
     access to a critical region, but does not need write access.
     In this case, you do not need to use
     <function>read_lock_irq()</function>, but only
     <function>read_lock()</function> everywhere (since if an interrupt
     occurs, the irq handler will only try to grab a read lock, which
     won't deadlock). You will still need to use
     <function>write_lock_irq()</function>.
   </para>

   <para>
     Similar logic applies to locking between softirqs/tasklets/BHs
     which never need a write lock, and user context:
     <function>read_lock()</function> and
     <function>write_lock_bh()</function>.
   </para>
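   <para>
     A minimal sketch, with invented names, assuming the interrupt
     handler only ever reads:
   </para>
   <programlisting>
static rwlock_t stats_lock = RW_LOCK_UNLOCKED;
static int stats_total;

/* Reader: safe in the irq handler and in user context alike. */
int stats_read(void)
{
        int total;

        read_lock(&stats_lock);
        total = stats_total;
        read_unlock(&stats_lock);
        return total;
}

/* Writer: user context only; must lock out the reading irq handler. */
void stats_reset(void)
{
        write_lock_irq(&stats_lock);
        stats_total = 0;
        write_unlock_irq(&stats_lock);
}
   </programlisting>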
  </sect1>

  <sect1 id="techniques-deadlocks">
   <title>Deadlock: Simple and Advanced</title>

   <para>
     There is a coding bug where a piece of code tries to grab a
     spinlock twice: it will spin forever, waiting for the lock to
     be released (spinlocks, rwlocks and semaphores are not
     recursive in Linux). This is trivial to diagnose: not a
     stay-up-five-nights-talk-to-fluffy-code-bunnies kind of
     problem.
   </para>

   <para>
     For a slightly more complex case, imagine you have a region
     shared by a bottom half and user context. If you use a
     <function>spin_lock()</function> call to protect it, it is
     possible that the user context will be interrupted by the bottom
     half while it holds the lock, and the bottom half will then spin
     forever trying to get the same lock.
   </para>

   <para>
     Both of these are called deadlock, and as shown above, deadlock can
     occur even with a single CPU (although not on UP compiles,
     since spinlocks vanish on kernel compiles with
     <symbol>CONFIG_SMP</symbol>=n). You'll still get data corruption
     in the second example, though.
   </para>

   <para>
     This complete lockup is easy to diagnose: on SMP boxes the
     watchdog timer or compiling with <symbol>DEBUG_SPINLOCKS</symbol> set
     (<filename>include/linux/spinlock.h</filename>) will show this up
     immediately when it happens.
   </para>

   <para>
     A more complex problem is the so-called `deadly embrace',
     involving two or more locks. Say you have a hash table: each
     entry in the table is a spinlock, and a chain of hashed
     objects. Inside a softirq handler, you sometimes want to
     alter an object from one place in the hash to another: you
     grab the spinlock of the old hash chain and the spinlock of
     the new hash chain, and delete the object from the old one,
     and insert it in the new one.
   </para>

   <para>
     There are two problems here. First, if your code ever
     tries to move the object to the same chain, it will deadlock
     with itself as it tries to lock it twice. Secondly, if the
     same softirq on another CPU is trying to move another object
     in the reverse direction, the following could happen:
   </para>

   <table>
    <title>Consequences</title>

    <tgroup cols=2 align=left>

     <thead>
      <row>
       <entry>CPU 1</entry>
       <entry>CPU 2</entry>
      </row>
     </thead>

     <tbody>
      <row>
       <entry>Grab lock A -> OK</entry>
       <entry>Grab lock B -> OK</entry>
      </row>
      <row>
       <entry>Grab lock B -> spin</entry>
       <entry>Grab lock A -> spin</entry>
      </row>
     </tbody>
    </tgroup>
   </table>

   <para>
     The two CPUs will spin forever, waiting for the other to give up
     its lock. It will look, smell, and feel like a crash.
   </para>

   <sect2 id="techs-deadlock-prevent">
    <title>Preventing Deadlock</title>

    <para>
      Textbooks will tell you that if you always lock in the same
      order, you will never get this kind of deadlock. Practice
      will tell you that this approach doesn't scale: when I
      create a new lock, I don't understand enough of the kernel
      to figure out where in the 5000 lock hierarchy it will fit.
    </para>

    <para>
      The best locks are encapsulated: they never get exposed in
      headers, and are never held around calls to non-trivial
      functions outside the same file. You can read through this
      code and see that it will never deadlock, because it never
      tries to grab another lock while it has that one. People
      using your code don't even need to know you are using a
      lock.
    </para>

    <para>
      A classic problem here is when you provide callbacks or
      hooks: if you call these with the lock held, you risk simple
      deadlock, or a deadly embrace (who knows what the callback
      will do?). Remember, the other programmers are out to get
      you, so don't do this.
    </para>
   </sect2>

   <sect2 id="techs-deadlock-overprevent">
    <title>Overzealous Prevention Of Deadlocks</title>

    <para>
      Deadlocks are problematic, but not as bad as data
      corruption. Code which grabs a read lock, searches a list,
      fails to find what it wants, drops the read lock, grabs a
      write lock and inserts the object has a race condition.
    </para>

    <para>
      If you don't see why, please stay the fuck away from my code.
    </para>
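    <para>
      For the record, a sketch of the safe version (names invented):
      the search must be repeated under the write lock, because another
      CPU may have inserted the object in the window between dropping
      the read lock and grabbing the write lock.
    </para>
    <programlisting>
struct obj *obj_find_or_add(int id)
{
        struct obj *o;

        read_lock(&obj_list_lock);
        o = ... search the list for id ...;
        read_unlock(&obj_list_lock);
        if (o)
                return o;

        /* Someone may have inserted it meanwhile: search again. */
        write_lock(&obj_list_lock);
        o = ... search the list for id ...;
        if (!o)
                o = ... allocate and insert a new object ...;
        write_unlock(&obj_list_lock);
        return o;
}
    </programlisting>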
   </sect2>
  </sect1>

  <sect1 id="per-cpu">
   <title>Per-CPU Data</title>

   <para>
     A great technique for avoiding locking which is used fairly
     widely is to duplicate information for each CPU. For example,
     if you wanted to keep a count of a common condition, you could
     use a spin lock and a single counter. Nice and simple.
   </para>

   <para>
     If that was too slow [it's probably not], you could instead
     use a counter for each CPU [don't], then none of them need an
     exclusive lock [you're wasting your time here]. To make sure
     the CPUs don't have to synchronize caches all the time, align
     the counters to cache boundaries by appending
     `__cacheline_aligned' to the declaration
     (<filename class=headerfile>include/linux/cache.h</filename>).
     [Can't you think of anything better to do?]
   </para>

   <para>
     The CPUs will need a read lock to access their own counters,
     however. That way you can use a write lock to grant exclusive
     access to all of them at once, to tally them up.
   </para>
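   <para>
     A sketch of what that might look like, with invented names
     (<symbol>NR_CPUS</symbol>, <symbol>smp_num_cpus</symbol> and
     <function>smp_processor_id()</function> are the real kernel
     symbols):
   </para>
   <programlisting>
/* One counter per CPU, each padded out to its own cache line
   (__cacheline_aligned is from include/linux/cache.h). */
static struct {
        long count;
} __cacheline_aligned my_counters[NR_CPUS];

static rwlock_t counter_lock = RW_LOCK_UNLOCKED;

/* Fast path: each CPU only touches its own counter. */
void count_event(void)
{
        read_lock(&counter_lock);
        my_counters[smp_processor_id()].count++;
        read_unlock(&counter_lock);
}

/* Slow path: exclusive access to all the counters at once. */
long count_total(void)
{
        long total = 0;
        int i;

        write_lock(&counter_lock);
        for (i = 0; i != smp_num_cpus; i++)
                total += my_counters[i].count;
        write_unlock(&counter_lock);
        return total;
}
   </programlisting>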
  </sect1>

  <sect1 id="brlock">
   <title>Big Reader Locks</title>

   <para>
     A classic example of per-CPU information is Ingo's `big
     reader' locks
     (<filename class=headerfile>linux/include/brlock.h</filename>). These
     use the Per-CPU Data techniques described above to create a lock which
     is very fast to get a read lock, but agonizingly slow for a write
     lock.
   </para>

   <para>
     Fortunately, there are a limited number of these locks
     available: you have to go through a strict interview process
     to get one.
   </para>
  </sect1>

  <sect1 id="lock-avoidance-rw">
   <title>Avoiding Locks: Read And Write Ordering</title>

   <para>
     Sometimes it is possible to avoid locking. Consider the
     following case from the 2.2 firewall code, which inserted an
     element into a single linked list in user context:
   </para>

   <programlisting>
new->next = i->next;
i->next = new;
   </programlisting>

   <para>
     Here the author (Alan Cox, who knows what he's doing) assumes
     that the pointer assignments are atomic. This is important,
     because networking packets would traverse this list on bottom
     halves without a lock. Depending on their exact timing, they
     would either see the new element in the list with a valid
     <structfield>next</structfield> pointer, or it would not be in the
     list yet. A lock is still required against other CPUs inserting
     or deleting from the list, of course.
   </para>

   <para>
     Of course, the writes <emphasis>must</emphasis> be in this
     order, otherwise the new element appears in the list with an
     invalid <structfield>next</structfield> pointer, and any other
     CPU iterating at the wrong time will jump through it into garbage.
     Because modern CPUs reorder writes, Alan's code actually reads as
     follows:
   </para>

   <programlisting>
new->next = i->next;
wmb();
i->next = new;
   </programlisting>

   <para>
     The <function>wmb()</function> is a write memory barrier
     (<filename class=headerfile>include/asm/system.h</filename>): neither
     the compiler nor the CPU will allow any writes to memory after the
     <function>wmb()</function> to be visible to other hardware
     before any of the writes before the <function>wmb()</function>.
   </para>

   <para>
     As i386 does not do write reordering, this bug would never
     show up on that platform. On other SMP platforms, however, it
     will.
   </para>

   <para>
     There is also <function>rmb()</function> for read ordering: to ensure
     any previous variable reads occur before following reads. The simple
     <function>mb()</function> macro combines both
     <function>rmb()</function> and <function>wmb()</function>.
   </para>

   <para>
     Some atomic operations are defined to act as a memory barrier
     (ie. as per the <function>mb()</function> macro), but if in
     doubt, be explicit.
     <!-- Rusty Russell 2 May 2001, 2.4.4 -->
     Also,
     spinlock operations act as partial barriers: operations after
     gaining a spinlock will never be moved to precede the
     <function>spin_lock()</function> call, and operations before
     releasing a spinlock will never be moved after the
     <function>spin_unlock()</function> call.
     <!-- Manfred Spraul <manfreds@colorfullife.com>
        24 May 2000 2.3.99-pre9 -->
   </para>
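   <para>
     Note that barriers come in pairs: if the writer orders its stores
     with <function>wmb()</function>, the reader must order its loads
     with <function>rmb()</function>. A sketch with invented variable
     names:
   </para>
   <programlisting>
/* Writer. */
shared_data = 42;
wmb();                  /* data must be visible before the flag */
data_ready = 1;

/* Reader (possibly on another CPU). */
if (data_ready) {
        rmb();          /* flag must be read before the data */
        ... use shared_data ...
}
   </programlisting>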
  </sect1>

  <sect1 id="lock-avoidance-atomic-ops">
   <title>Avoiding Locks: Atomic Operations</title>

   <para>
     There are a number of atomic operations defined in
     <filename class=headerfile>include/asm/atomic.h</filename>: these
     are guaranteed to be seen atomically from all CPUs in the system, thus
     avoiding races. If your shared data consists of a single counter, say,
     these operations might be simpler than using spinlocks (although for
     anything non-trivial using spinlocks is clearer).
   </para>

   <para>
     Note that the atomic operations do not, in general, act as memory
     barriers. Instead you can insert a memory barrier before or
     after <function>atomic_inc()</function> or
     <function>atomic_dec()</function> by inserting
     <function>smp_mb__before_atomic_inc()</function>,
     <function>smp_mb__after_atomic_inc()</function>,
     <function>smp_mb__before_atomic_dec()</function> or
     <function>smp_mb__after_atomic_dec()</function>
     respectively. The advantage of using those macros instead of
     <function>smp_mb()</function> is that they are cheaper on some
     platforms.
     <!-- Sebastian Wilhelmi <seppi@seppi.de> 2002-03-04 -->
   </para>
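   <para>
     A minimal sketch: the racy counter from the introduction, redone
     with an <type>atomic_t</type> so no lock is needed at all:
   </para>
   <programlisting>
static atomic_t very_important_count = ATOMIC_INIT(0);

/* Safe from any context, on any CPU, without a lock. */
atomic_inc(&very_important_count);

/* Reading it back is atomic too. */
printk("count is %d\n", atomic_read(&very_important_count));
   </programlisting>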
  </sect1>

  <sect1 id="ref-counts">
   <title>Protecting A Collection of Objects: Reference Counts</title>

   <para>
     Locking a collection of objects is fairly easy: you get a
     single spinlock, and you make sure you grab it before
     searching, adding or deleting an object.
   </para>

   <para>
     The purpose of this lock is not to protect the individual
     objects: you might have a separate lock inside each one for
     that. It is to protect the <emphasis>data structure
     containing the objects</emphasis> from race conditions. Often
     the same lock is used to protect the contents of all the
     objects as well, for simplicity, but they are inherently
     orthogonal (and many other big words designed to confuse).
   </para>

   <para>
     Changing this to a read-write lock will often help markedly if
     reads are far more common than writes. If not, there is
     another approach you can use to reduce the time the lock is
     held: reference counts.
   </para>

   <para>
     In this approach, an object has an owner, who sets the
     reference count to one. Whenever you get a pointer to the
     object, you increment the reference count (a `get' operation).
     Whenever you relinquish a pointer, you decrement the reference
     count (a `put' operation). When the owner wants to destroy
     it, they mark it dead, and do a put.
   </para>

   <para>
     Whoever drops the reference count to zero (usually implemented
     with <function>atomic_dec_and_test()</function>) actually cleans
     up and frees the object.
   </para>

   <para>
     This means that you are guaranteed that the object won't
     vanish underneath you, even though you no longer have a lock
     for the collection.
   </para>

   <para>
     Here's some skeleton code:
   </para>

   <programlisting>
void create_foo(struct foo *x)
{
        atomic_set(&x->use, 1);
        spin_lock_bh(&list_lock);
        ... insert in list ...
        spin_unlock_bh(&list_lock);
}

struct foo *get_foo(int desc)
{
        struct foo *ret;

        spin_lock_bh(&list_lock);
        ... find in list ...
        if (ret) atomic_inc(&ret->use);
        spin_unlock_bh(&list_lock);

        return ret;
}

void put_foo(struct foo *x)
{
        if (atomic_dec_and_test(&x->use))
                kfree(x);
}

void destroy_foo(struct foo *x)
{
        spin_lock_bh(&list_lock);
        ... remove from list ...
        spin_unlock_bh(&list_lock);

        put_foo(x);
}
   </programlisting>

   <sect2 id="helpful-macros">
    <title>Macros To Help You</title>
    <para>
      There is a set of debugging macros tucked inside
      <filename class=headerfile>include/linux/netfilter_ipv4/lockhelp.h</filename>
      and <filename class=headerfile>listhelp.h</filename>: these are very
      useful for ensuring that locks are held in the right places to protect
      infrastructure.
    </para>
   </sect2>
  </sect1>

  <sect1 id="sleeping-things">
   <title>Things Which Sleep</title>

   <para>
     You can never call the following routines while holding a
     spinlock, as they may sleep. This also means you need to be in
     user context.
   </para>

   <itemizedlist>
    <listitem>
     <para>
       Accesses to
       <firstterm linkend="gloss-userspace">userspace</firstterm>:
     </para>
     <itemizedlist>
      <listitem>
       <para>
         <function>copy_from_user()</function>
       </para>
      </listitem>
      <listitem>
       <para>
         <function>copy_to_user()</function>
       </para>
      </listitem>
      <listitem>
       <para>
         <function>get_user()</function>
       </para>
      </listitem>
      <listitem>
       <para>
         <function>put_user()</function>
       </para>
      </listitem>
     </itemizedlist>
    </listitem>

    <listitem>
     <para>
       <function>kmalloc(GFP_KERNEL)</function>
     </para>
    </listitem>

    <listitem>
     <para>
       <function>down_interruptible()</function> and
       <function>down()</function>
     </para>
     <para>
       There is a <function>down_trylock()</function> which can be
       used inside interrupt context, as it will not sleep.
       <function>up()</function> will also never sleep.
     </para>
    </listitem>
   </itemizedlist>

   <para>
     <function>printk()</function> can be called in
     <emphasis>any</emphasis> context, interestingly enough.
   </para>
  </sect1>

  <sect1 id="sparc">
   <title>The Fucked Up Sparc</title>

   <para>
     Alan Cox says <quote>the irq disable/enable is in the register
     window on a sparc</quote>. Andi Kleen says <quote>when you do
     restore_flags in a different function you mess up all the
     register windows</quote>.
   </para>

   <para>
     So never pass the flags word set by
     <function>spin_lock_irqsave()</function> and brethren to another
     function (unless it's declared <type>inline</type>). Usually no-one
     does this, but now you've been warned. Dave Miller can never do
     anything in a straightforward manner (I can say that, because I have
     pictures of him and a certain PowerPC maintainer in a compromising
     position).
   </para>
  </sect1>

  <sect1 id="racing-timers">
   <title>Racing Timers: A Kernel Pastime</title>

   <para>
     Timers can produce their own special problems with races.
     Consider a collection of objects (list, hash, etc) where each
     object has a timer which is due to destroy it.
   </para>

   <para>
     If you want to destroy the entire collection (say on module
     removal), you might do the following:
   </para>

   <programlisting>
/* THIS CODE BAD BAD BAD BAD: IF IT WAS ANY WORSE IT WOULD USE
   HUNGARIAN NOTATION */
spin_lock_bh(&list_lock);

while (list) {
        struct foo *next = list->next;
        del_timer(&list->timer);
        kfree(list);
        list = next;
}

spin_unlock_bh(&list_lock);
   </programlisting>

   <para>
     Sooner or later, this will crash on SMP, because a timer can
     have just gone off before the <function>spin_lock_bh()</function>,
     and it will only get the lock after we
     <function>spin_unlock_bh()</function>, and then try to free
     the element (which has already been freed!).
   </para>

   <para>
     This can be avoided by checking the result of
     <function>del_timer()</function>: if it returns
     <returnvalue>1</returnvalue>, the timer has been deleted.
     If <returnvalue>0</returnvalue>, it means (in this
     case) that it is currently running, so we can do:
   </para>

   <programlisting>
retry:
        spin_lock_bh(&list_lock);

        while (list) {
                struct foo *next = list->next;
                if (!del_timer(&list->timer)) {
                        /* Give timer a chance to delete this */
                        spin_unlock_bh(&list_lock);
                        goto retry;
                }
                kfree(list);
                list = next;
        }

        spin_unlock_bh(&list_lock);
   </programlisting>

   <para>
     Another common problem is deleting timers which restart
     themselves (by calling <function>add_timer()</function> at the end
     of their timer function). Because this is a fairly common case
     which is prone to races, you should use <function>del_timer_sync()</function>
     (<filename class=headerfile>include/linux/timer.h</filename>)
     to handle this case. It returns the number of times the timer
     had to be deleted before we finally stopped it from adding itself back
     in.
   </para>
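   <para>
     A sketch of the shutdown path for a self-restarting timer (names
     invented for illustration): a `dead' flag stops the timer function
     re-adding itself, then <function>del_timer_sync()</function> waits
     out any instance that is already running.
   </para>
   <programlisting>
static struct timer_list poll_timer;
static int dead;

/* The timer function: re-adds itself until told to stop. */
static void poll_stuff(unsigned long data)
{
        ... do the work ...
        if (!dead)
                mod_timer(&poll_timer, jiffies + HZ);
}

void shutdown_poll(void)
{
        dead = 1;                       /* stop it re-adding itself */
        wmb();                          /* make the flag visible first */
        del_timer_sync(&poll_timer);
}
   </programlisting>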
  </sect1>
 </chapter>

 <chapter id="references">
  <title>Further reading</title>

  <itemizedlist>
   <listitem>
    <para>
      <filename>Documentation/spinlocks.txt</filename>:
      Linus Torvalds' spinlocking tutorial in the kernel sources.
    </para>
   </listitem>

   <listitem>
    <para>
      Unix Systems for Modern Architectures: Symmetric
      Multiprocessing and Caching for Kernel Programmers:
    </para>

    <para>
      Curt Schimmel's very good introduction to kernel level
      locking (not written for Linux, but nearly everything
      applies). The book is expensive, but really worth every
      penny to understand SMP locking. [ISBN: 0201633388]
    </para>
   </listitem>
  </itemizedlist>
 </chapter>

 <chapter id="thanks">
  <title>Thanks</title>

  <para>
    Thanks to Telsa Gwynne for DocBooking, neatening and adding
    style.
  </para>

  <para>
    Thanks to Martin Pool, Philipp Rumpf, Stephen Rothwell, Paul
    Mackerras, Ruedi Aschwanden, Alan Cox, Manfred Spraul and Tim
    Waugh for proofreading, correcting, flaming, commenting.
  </para>

  <para>
    Thanks to the cabal for having no influence on this document.
  </para>
 </chapter>

 <glossary id="glossary">
  <title>Glossary</title>

  <glossentry id="gloss-bh">
   <glossterm>bh</glossterm>
   <glossdef>
    <para>
      Bottom Half: for historical reasons, functions with
      `_bh' in them often now refer to any software interrupt, e.g.
      <function>spin_lock_bh()</function> blocks any software interrupt
      on the current CPU. Bottom halves are deprecated, and will
      eventually be replaced by tasklets. Only one bottom half will be
      running at any time.
    </para>
   </glossdef>
  </glossentry>

  <glossentry id="gloss-hwinterrupt">
   <glossterm>Hardware Interrupt / Hardware IRQ</glossterm>
   <glossdef>
    <para>
      Hardware interrupt request. <function>in_irq()</function> returns
      <returnvalue>true</returnvalue> in a hardware interrupt handler (it
      also returns true when interrupts are blocked).
    </para>
   </glossdef>
  </glossentry>

  <glossentry id="gloss-interruptcontext">
   <glossterm>Interrupt Context</glossterm>
   <glossdef>
    <para>
      Not user context: processing a hardware irq or software irq.
      Indicated by the <function>in_interrupt()</function> macro
      returning <returnvalue>true</returnvalue> (although it also
      returns true when interrupts or BHs are blocked).
    </para>
   </glossdef>
  </glossentry>

  <glossentry id="gloss-smp">
   <glossterm><acronym>SMP</acronym></glossterm>
   <glossdef>
    <para>
      Symmetric Multi-Processor: kernels compiled for multiple-CPU
      machines. (CONFIG_SMP=y).
    </para>
   </glossdef>
  </glossentry>

  <glossentry id="gloss-softirq">
   <glossterm>softirq</glossterm>
   <glossdef>
    <para>
      Strictly speaking, one of up to 32 enumerated software
      interrupts which can run on multiple CPUs at once.
      Sometimes used to refer to tasklets and bottom halves as
      well (ie. all software interrupts).
    </para>
   </glossdef>
  </glossentry>

  <glossentry id="gloss-swinterrupt">
   <glossterm>Software Interrupt / Software IRQ</glossterm>
   <glossdef>
    <para>
      Software interrupt handler. <function>in_irq()</function> returns
      <returnvalue>false</returnvalue>; <function>in_softirq()</function>
      returns <returnvalue>true</returnvalue>. Tasklets, softirqs and
      bottom halves all fall into the category of `software interrupts'.
    </para>
   </glossdef>
  </glossentry>

  <glossentry id="gloss-tasklet">
   <glossterm>tasklet</glossterm>
   <glossdef>
    <para>
      A dynamically-registrable software interrupt,
      which is guaranteed to only run on one CPU at a time.
    </para>
   </glossdef>
  </glossentry>

  <glossentry id="gloss-up">
   <glossterm><acronym>UP</acronym></glossterm>
   <glossdef>
    <para>
      Uni-Processor: Non-SMP. (CONFIG_SMP=n).
    </para>
   </glossdef>
  </glossentry>

  <glossentry id="gloss-usercontext">
   <glossterm>User Context</glossterm>
   <glossdef>
    <para>
      The kernel executing on behalf of a particular
      process or kernel thread (given by the <function>current()</function>
      macro). Not to be confused with userspace. Can be interrupted by
      software or hardware interrupts.
    </para>
   </glossdef>
  </glossentry>

  <glossentry id="gloss-userspace">
   <glossterm>Userspace</glossterm>
   <glossdef>
    <para>
      A process executing its own code outside the kernel.
    </para>
   </glossdef>
  </glossentry>

 </glossary>
</book>