<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook V3.1//EN"[]>

<book id="lk-hacking-guide">
 <bookinfo>
  <title>Unreliable Guide To Hacking The Linux Kernel</title>

  <authorgroup>
   <author>
    <firstname>Paul</firstname>
    <othername>Rusty</othername>
    <surname>Russell</surname>
    <affiliation>
     <address>
      <email>rusty@rustcorp.com.au</email>
     </address>
    </affiliation>
   </author>
  </authorgroup>

  <copyright>
   <year>2001</year>
   <holder>Rusty Russell</holder>
  </copyright>

  <legalnotice>
   <para>
    This documentation is free software; you can redistribute
    it and/or modify it under the terms of the GNU General Public
    License as published by the Free Software Foundation; either
    version 2 of the License, or (at your option) any later
    version.
   </para>

   <para>
    This program is distributed in the hope that it will be
    useful, but WITHOUT ANY WARRANTY; without even the implied
    warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
    See the GNU General Public License for more details.
   </para>

   <para>
    You should have received a copy of the GNU General Public
    License along with this program; if not, write to the Free
    Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
    MA 02111-1307 USA
   </para>

   <para>
    For more details see the file COPYING in the source
    distribution of Linux.
   </para>
  </legalnotice>

  <releaseinfo>
   This is the first release of this document as part of the kernel tarball.
  </releaseinfo>

 </bookinfo>

 <toc></toc>

 <chapter id="introduction">
  <title>Introduction</title>
  <para>
   Welcome, gentle reader, to Rusty's Unreliable Guide to Linux
   Kernel Hacking. This document describes the common routines and
   general requirements for kernel code: its goal is to serve as a
   primer for Linux kernel development for experienced C
   programmers.
   I avoid implementation details: that's what the
   code is for, and I ignore whole tracts of useful routines.
  </para>
  <para>
   Before you read this, please understand that I never wanted to
   write this document, being grossly under-qualified, but I always
   wanted to read it, and this was the only way. I hope it will
   grow into a compendium of best practice, common starting points
   and random information.
  </para>
 </chapter>

 <chapter id="basic-players">
  <title>The Players</title>

  <para>
   At any time each of the CPUs in a system can be:
  </para>

  <itemizedlist>
   <listitem>
    <para>
     not associated with any process, serving a hardware interrupt;
    </para>
   </listitem>

   <listitem>
    <para>
     not associated with any process, serving a softirq, tasklet or bh;
    </para>
   </listitem>

   <listitem>
    <para>
     running in kernel space, associated with a process;
    </para>
   </listitem>

   <listitem>
    <para>
     running a process in user space.
    </para>
   </listitem>
  </itemizedlist>

  <para>
   There is a strict ordering between these: other than the last
   category (userspace) each can only be pre-empted by those above.
   For example, while a softirq is running on a CPU, no other
   softirq will pre-empt it, but a hardware interrupt can. However,
   any other CPUs in the system execute independently.
  </para>

  <para>
   We'll see a number of ways that the user context can block
   interrupts, to become truly non-preemptable.
  </para>

  <sect1 id="basics-usercontext">
   <title>User Context</title>

   <para>
    User context is when you are coming in from a system call or
    other trap: you can sleep, and you own the CPU (except for
    interrupts) until you call <function>schedule()</function>.
    In other words, user context (unlike userspace) is not pre-emptable.
   </para>

   <note>
    <para>
     You are always in user context on module load and unload,
     and on operations on the block device layer.
    </para>
   </note>

   <para>
    In user context, the <varname>current</varname> pointer (indicating
    the task we are currently executing) is valid, and
    <function>in_interrupt()</function>
    (<filename>include/asm/hardirq.h</filename>) is
    <returnvalue>false</returnvalue>.
   </para>

   <caution>
    <para>
     Beware that if you have interrupts or bottom halves disabled
     (see below), <function>in_interrupt()</function> will return a
     false positive.
    </para>
   </caution>
  </sect1>

  <sect1 id="basics-hardirqs">
   <title>Hardware Interrupts (Hard IRQs)</title>

   <para>
    Timer ticks, <hardware>network cards</hardware> and
    <hardware>keyboard</hardware> are examples of real
    hardware which produce interrupts at any time. The kernel runs
    interrupt handlers, which service the hardware. The kernel
    guarantees that a given handler is never re-entered: if another
    interrupt arrives, it is queued (or dropped). Because it
    disables interrupts, the handler has to be fast: frequently it
    simply acknowledges the interrupt, marks a `software interrupt'
    for execution and exits.
   </para>

   <para>
    You can tell you are in a hardware interrupt, because
    <function>in_irq()</function> returns <returnvalue>true</returnvalue>.
   </para>
   <caution>
    <para>
     Beware that this will return a false positive if interrupts are disabled
     (see below).
    </para>
   </caution>
  </sect1>

  <sect1 id="basics-softirqs">
   <title>Software Interrupt Context: Bottom Halves, Tasklets, softirqs</title>

   <para>
    Whenever a system call is about to return to userspace, or a
    hardware interrupt handler exits, any `software interrupts'
    which are marked pending (usually by hardware interrupts) are
    run (<filename>kernel/softirq.c</filename>).
   </para>

   <para>
    Much of the real interrupt handling work is done here. Early in
    the transition to <acronym>SMP</acronym>, there were only `bottom
    halves' (BHs), which didn't take advantage of multiple CPUs. Shortly
    after we switched from wind-up computers made of match-sticks and snot,
    we abandoned this limitation.
   </para>

   <para>
    <filename class=headerfile>include/linux/interrupt.h</filename> lists the
    different BHs. No matter how many CPUs you have, no two BHs will run at
    the same time. This made the transition to SMP simpler, but sucks hard for
    scalable performance. A very important bottom half is the timer
    BH (<filename class=headerfile>include/linux/timer.h</filename>): you
    can register to have it call functions for you in a given length of time.
   </para>

   <para>
    2.3.43 introduced softirqs, and re-implemented the (now
    deprecated) BHs underneath them. Softirqs are fully-SMP
    versions of BHs: they can run on as many CPUs at once as
    required. This means they need to deal with any races in shared
    data using their own locks. A bitmask is used to keep track of
    which are enabled, so the 32 available softirqs should not be
    used up lightly. (<emphasis>Yes</emphasis>, people will
    notice).
   </para>

   <para>
    Tasklets (<filename class=headerfile>include/linux/interrupt.h</filename>)
    are like softirqs, except they are dynamically-registrable (meaning you
    can have as many as you want), and they also guarantee that any tasklet
    will only run on one CPU at any time, although different tasklets can
    run simultaneously (unlike different BHs).
   </para>
   <caution>
    <para>
     The name `tasklet' is misleading: they have nothing to do with `tasks',
     and probably more to do with some bad vodka Alexey Kuznetsov had at the
     time.
    </para>
   </caution>

   <para>
    You can tell you are in a softirq (or bottom half, or tasklet)
    using the <function>in_softirq()</function> macro
    (<filename class=headerfile>include/asm/softirq.h</filename>).
   </para>
   <caution>
    <para>
     Beware that this will return a false positive if a bh lock (see below)
     is held.
    </para>
   </caution>
  </sect1>
 </chapter>

 <chapter id="basic-rules">
  <title>Some Basic Rules</title>

  <variablelist>
   <varlistentry>
    <term>No memory protection</term>
    <listitem>
     <para>
      If you corrupt memory, whether in user context or
      interrupt context, the whole machine will crash. Are you
      sure you can't do what you want in userspace?
     </para>
    </listitem>
   </varlistentry>

   <varlistentry>
    <term>No floating point or <acronym>MMX</acronym></term>
    <listitem>
     <para>
      The <acronym>FPU</acronym> context is not saved; even in user
      context the <acronym>FPU</acronym> state probably won't
      correspond with the current process: you would mess with some
      user process' <acronym>FPU</acronym> state. If you really want
      to do this, you would have to explicitly save/restore the full
      <acronym>FPU</acronym> state (and avoid context switches). It
      is generally a bad idea; use fixed point arithmetic first.
     </para>
    </listitem>
   </varlistentry>

   <varlistentry>
    <term>A rigid stack limit</term>
    <listitem>
     <para>
      The kernel stack is about 6K in 2.2 (for most
      architectures: it's about 14K on the Alpha), and shared
      with interrupts so you can't use it all. Avoid deep
      recursion and huge local arrays on the stack (allocate
      them dynamically instead).
     </para>
    </listitem>
   </varlistentry>

   <varlistentry>
    <term>The Linux kernel is portable</term>
    <listitem>
     <para>
      Let's keep it that way. Your code should be 64-bit clean,
      and endian-independent. You should also minimize CPU
      specific stuff, e.g. inline assembly should be cleanly
      encapsulated and minimized to ease porting. Generally it
      should be restricted to the architecture-dependent part of
      the kernel tree.
     </para>
    </listitem>
   </varlistentry>
  </variablelist>
 </chapter>

 <chapter id="ioctls">
  <title>ioctls: Not writing a new system call</title>

  <para>
   A system call generally looks like this:
  </para>

  <programlisting>
asmlinkage int sys_mycall(int arg)
{
        return 0;
}
  </programlisting>

  <para>
   First, in most cases you don't want to create a new system call.
   You create a character device and implement an appropriate ioctl
   for it. This is much more flexible than system calls, doesn't have
   to be entered in every architecture's
   <filename class=headerfile>include/asm/unistd.h</filename> and
   <filename>arch/kernel/entry.S</filename> file, and is much more
   likely to be accepted by Linus.
  </para>

  <para>
   If all your routine does is read or write some parameter, consider
   implementing a <function>sysctl</function> interface instead.
  </para>

  <para>
   Inside the ioctl you're in user context to a process.
   When an
   error occurs you return a negated errno (see
   <filename class=headerfile>include/linux/errno.h</filename>),
   otherwise you return <returnvalue>0</returnvalue>.
  </para>

  <para>
   After you have slept you should check if a signal occurred: the
   Unix/Linux way of handling signals is to temporarily exit the
   system call with the <constant>-ERESTARTSYS</constant> error. The
   system call entry code will switch back to user context, process
   the signal handler and then your system call will be restarted
   (unless the user disabled that). So you should be prepared to
   process the restart, e.g. if you're in the middle of manipulating
   some data structure.
  </para>

  <programlisting>
if (signal_pending(current))
        return -ERESTARTSYS;
  </programlisting>

  <para>
   If you're doing longer computations: first think userspace. If you
   <emphasis>really</emphasis> want to do it in kernel you should
   regularly check if you need to give up the CPU (remember there is
   cooperative multitasking per CPU). Idiom:
  </para>

  <programlisting>
if (current->need_resched)
        schedule(); /* Will sleep */
  </programlisting>

  <para>
   A short note on interface design: the UNIX system call motto is
   "Provide mechanism not policy".
  </para>
 </chapter>

 <chapter id="deadlock-recipes">
  <title>Recipes for Deadlock</title>

  <para>
   You cannot call any routines which may sleep, unless:
  </para>
  <itemizedlist>
   <listitem>
    <para>
     You are in user context.
    </para>
   </listitem>

   <listitem>
    <para>
     You do not own any spinlocks.
    </para>
   </listitem>

   <listitem>
    <para>
     You have interrupts enabled (actually, Andi Kleen says
     that the scheduling code will enable them for you, but
     that's probably not what you wanted).
    </para>
   </listitem>
  </itemizedlist>

  <para>
   Note that some functions may sleep implicitly: common ones are
   the user space access functions (*_user) and memory allocation
   functions without <symbol>GFP_ATOMIC</symbol>.
  </para>

  <para>
   You will eventually lock up your box if you break these rules.
  </para>

  <para>
   Really.
  </para>
 </chapter>

 <chapter id="common-routines">
  <title>Common Routines</title>

  <sect1 id="routines-printk">
   <title>
    <function>printk()</function>
    <filename class=headerfile>include/linux/kernel.h</filename>
   </title>

   <para>
    <function>printk()</function> feeds kernel messages to the
    console, dmesg, and the syslog daemon. It is useful for debugging
    and reporting errors, and can be used inside interrupt context,
    but use with caution: a machine which has its console flooded with
    printk messages is unusable. It uses a format string mostly
    compatible with ANSI C printf, and C string concatenation to give
    it a first "priority" argument:
   </para>

   <programlisting>
printk(KERN_INFO "i = %u\n", i);
   </programlisting>

   <para>
    See <filename class=headerfile>include/linux/kernel.h</filename>
    for other KERN_ values; these are interpreted by syslog as the
    level. Special case: for printing an IP address use
   </para>

   <programlisting>
__u32 ipaddress;
printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
   </programlisting>

   <para>
    <function>printk()</function> internally uses a 1K buffer and does
    not catch overruns. Make sure that will be enough.
   </para>

   <note>
    <para>
     You will know when you are a real kernel hacker
     when you start typoing printf as printk in your user programs :)
    </para>
   </note>

   <!--- From the Lions book reader department -->

   <note>
    <para>
     Another sidenote: the original Unix Version 6 sources had a
     comment on top of its printf function: "Printf should not be
     used for chit-chat". You should follow that advice.
    </para>
   </note>
  </sect1>

  <sect1 id="routines-copy">
   <title>
    <function>copy_[to/from]_user()</function>
    /
    <function>get_user()</function>
    /
    <function>put_user()</function>
    <filename class=headerfile>include/asm/uaccess.h</filename>
   </title>

   <para>
    <emphasis>[SLEEPS]</emphasis>
   </para>

   <para>
    <function>put_user()</function> and <function>get_user()</function>
    are used to put and get single values (such as an int, char, or
    long) to and from userspace. A pointer into userspace should
    never be simply dereferenced: data should be copied using these
    routines. Both return <constant>-EFAULT</constant> or 0.
   </para>
   <para>
    <function>copy_to_user()</function> and
    <function>copy_from_user()</function> are more general: they copy
    an arbitrary amount of data to and from userspace.
    <caution>
     <para>
      Unlike <function>put_user()</function> and
      <function>get_user()</function>, they return the amount of
      uncopied data (i.e. <returnvalue>0</returnvalue> still means
      success).
     </para>
    </caution>
    [Yes, this moronic interface makes me cringe. Please submit a
    patch and become my hero --RR.]
   </para>
   <para>
    The functions may sleep implicitly. They should never be called
    outside user context (it makes no sense), with interrupts
    disabled, or with a spinlock held.
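    As a minimal sketch of these conventions (not taken from any
    real driver), the following 2.4-style ioctl handler copies a
    structure in from userspace and back out again. The names
    <function>foo_ioctl()</function> and
    <structname>foo_params</structname> are hypothetical and exist
    only for illustration:
    <programlisting>
#include &lt;linux/fs.h&gt;
#include &lt;linux/errno.h&gt;
#include &lt;asm/uaccess.h&gt;

/* Hypothetical parameter block shared with userspace. */
struct foo_params {
        int speed;
        int mode;
};

/* Called in user context, so the copy routines are allowed to sleep. */
static int foo_ioctl(struct inode *inode, struct file *file,
                     unsigned int cmd, unsigned long arg)
{
        struct foo_params p;

        /* Non-zero means some bytes were NOT copied: treat as a fault. */
        if (copy_from_user(&amp;p, (void *)arg, sizeof(p)))
                return -EFAULT;

        p.speed *= 2;   /* stand-in for real work */

        if (copy_to_user((void *)arg, &amp;p, sizeof(p)))
                return -EFAULT;

        return 0;
}
    </programlisting>
    Testing only for a non-zero result sidesteps the bytes-uncopied
    return convention described in the caution above.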
   </para>
  </sect1>

  <sect1 id="routines-kmalloc">
   <title><function>kmalloc()</function>/<function>kfree()</function>
    <filename class=headerfile>include/linux/slab.h</filename></title>

   <para>
    <emphasis>[MAY SLEEP: SEE BELOW]</emphasis>
   </para>

   <para>
    These routines are used to dynamically request pointer-aligned
    chunks of memory, like malloc and free do in userspace, but
    <function>kmalloc()</function> takes an extra flag word.
    Important values:
   </para>

   <variablelist>
    <varlistentry>
     <term>
      <constant>
       GFP_KERNEL
      </constant>
     </term>
     <listitem>
      <para>
       May sleep and swap to free memory. Only allowed in user
       context, but is the most reliable way to allocate memory.
      </para>
     </listitem>
    </varlistentry>

    <varlistentry>
     <term>
      <constant>
       GFP_ATOMIC
      </constant>
     </term>
     <listitem>
      <para>
       Don't sleep. Less reliable than <constant>GFP_KERNEL</constant>,
       but may be called from interrupt context. You should
       <emphasis>really</emphasis> have a good out-of-memory
       error-handling strategy.
      </para>
     </listitem>
    </varlistentry>

    <varlistentry>
     <term>
      <constant>
       GFP_DMA
      </constant>
     </term>
     <listitem>
      <para>
       Allocate memory suitable for ISA DMA (below 16MB). If you
       don't know what that is you don't need it. Very unreliable.
      </para>
     </listitem>
    </varlistentry>
   </variablelist>

   <para>
    If you see a <errorname>kmem_grow: Called nonatomically from int</errorname>
    warning message, you called a memory allocation function
    from interrupt context without <constant>GFP_ATOMIC</constant>.
    You should really fix that. Run, don't walk.
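    The choice between the two common flags might be sketched as
    follows. This is an illustration only:
    <structname>foo_dev</structname>,
    <function>foo_alloc()</function> and
    <function>foo_free()</function> are hypothetical names, and the
    caller is assumed to know whether it can be reached from
    interrupt context:
    <programlisting>
#include &lt;linux/slab.h&gt;
#include &lt;linux/string.h&gt;

/* Hypothetical per-device state. */
struct foo_dev {
        int id;
        char name[16];
};

/*
 * from_irq is non-zero when the caller may be in interrupt context,
 * where sleeping is forbidden and GFP_ATOMIC must be used.
 */
static struct foo_dev *foo_alloc(int from_irq)
{
        struct foo_dev *dev;

        dev = kmalloc(sizeof(*dev), from_irq ? GFP_ATOMIC : GFP_KERNEL);
        if (!dev)
                return NULL;    /* always handle failure, above all with GFP_ATOMIC */

        memset(dev, 0, sizeof(*dev));
        return dev;
}

static void foo_free(struct foo_dev *dev)
{
        kfree(dev);
}
    </programlisting>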
   </para>

   <para>
    If you are allocating at least <constant>PAGE_SIZE</constant>
    (<filename class=headerfile>include/asm/page.h</filename>) bytes,
    consider using <function>__get_free_pages()</function>
    (<filename class=headerfile>include/linux/mm.h</filename>). It
    takes an order argument (0 for page sized, 1 for double page, 2
    for four pages etc.) and the same memory priority flag word as
    above.
   </para>

   <para>
    If you are allocating more than a page worth of bytes you can use
    <function>vmalloc()</function>. It'll allocate virtual memory in
    the kernel map. This block is not contiguous in physical memory,
    but the <acronym>MMU</acronym> makes it look like it is for you
    (so it'll only look contiguous to the CPUs, not to external device
    drivers). If you really need large physically contiguous memory
    for some weird device, you have a problem: it is poorly supported
    in Linux because after some time memory fragmentation in a running
    kernel makes it hard. The best way is to allocate the block early
    in the boot process via the <function>alloc_bootmem()</function>
    routine.
   </para>

   <para>
    Before inventing your own cache of often-used objects consider
    using a slab cache in
    <filename class=headerfile>include/linux/slab.h</filename>.
   </para>
  </sect1>

  <sect1 id="routines-current">
   <title><function>current</function>
    <filename class=headerfile>include/asm/current.h</filename></title>

   <para>
    This global variable (really a macro) contains a pointer to
    the current task structure, so is only valid in user context.
    For example, when a process makes a system call, this will
    point to the task structure of the calling process. It is
    <emphasis>not NULL</emphasis> in interrupt context.
   </para>
  </sect1>

  <sect1 id="routines-udelay">
   <title><function>udelay()</function>/<function>mdelay()</function>
    <filename class=headerfile>include/asm/delay.h</filename>
    <filename class=headerfile>include/linux/delay.h</filename>
   </title>

   <para>
    The <function>udelay()</function> function can be used for small pauses.
    Do not use large values with <function>udelay()</function> as you risk
    overflow - the helper function <function>mdelay()</function> is useful
    here, or even consider <function>schedule_timeout()</function>.
   </para>
  </sect1>

  <sect1 id="routines-endian">
   <title><function>cpu_to_be32()</function>/<function>be32_to_cpu()</function>/<function>cpu_to_le32()</function>/<function>le32_to_cpu()</function>
    <filename class=headerfile>include/asm/byteorder.h</filename>
   </title>

   <para>
    The <function>cpu_to_be32()</function> family (where the "32" can
    be replaced by 64 or 16, and the "be" can be replaced by "le") are
    the general way to do endian conversions in the kernel: they
    return the converted value. All variations supply the reverse as
    well: <function>be32_to_cpu()</function>, etc.
   </para>

   <para>
    There are two major variations of these functions: the pointer
    variation, such as <function>cpu_to_be32p()</function>, which takes
    a pointer to the given type, and returns the converted value. The
    other variation is the "in-situ" family, such as
    <function>cpu_to_be32s()</function>, which converts the value referred
    to by the pointer, and returns void.
   </para>
  </sect1>

  <sect1 id="routines-local-irqs">
   <title><function>local_irq_save()</function>/<function>local_irq_restore()</function>
    <filename class=headerfile>include/asm/system.h</filename>
   </title>

   <para>
    These routines disable hard interrupts on the local CPU, and
    restore them.
    They are reentrant, saving the previous state in
    their one <varname>unsigned long flags</varname> argument. If you
    know that interrupts are enabled, you can simply use
    <function>local_irq_disable()</function> and
    <function>local_irq_enable()</function>.
   </para>
  </sect1>

  <sect1 id="routines-softirqs">
   <title><function>local_bh_disable()</function>/<function>local_bh_enable()</function>
    <filename class=headerfile>include/asm/softirq.h</filename></title>

   <para>
    These routines disable soft interrupts on the local CPU, and
    restore them. They are reentrant; if soft interrupts were
    disabled before, they will still be disabled after this pair
    of functions has been called. They prevent softirqs, tasklets
    and bottom halves from running on the current CPU.
   </para>
  </sect1>

  <sect1 id="routines-processorids">
   <title><function>smp_processor_id</function>()/<function>cpu_[number/logical]_map()</function>
    <filename class=headerfile>include/asm/smp.h</filename></title>

   <para>
    <function>smp_processor_id()</function> returns the current
    processor number, between 0 and <symbol>NR_CPUS</symbol> (the
    maximum number of CPUs supported by Linux, currently 32). These
    values are not necessarily contiguous: to get a number between 0
    and <function>smp_num_cpus()</function> (the number of actual
    processors in this machine), the
    <function>cpu_number_map()</function> function is used to map the
    processor id to a logical number.
    <function>cpu_logical_map()</function> does the reverse.
   </para>
  </sect1>

  <sect1 id="routines-init">
   <title><type>__init</type>/<type>__exit</type>/<type>__initdata</type>
    <filename class=headerfile>include/linux/init.h</filename></title>

   <para>
    After boot, the kernel frees up a special section; functions
    marked with <type>__init</type> and data structures marked with
    <type>__initdata</type> are dropped after boot is complete (within
    modules this directive is currently ignored). <type>__exit</type>
    is used to declare a function which is only required on exit: the
    function will be dropped if this file is not compiled as a module.
    See the header file for use. Note that it makes no sense for a function
    marked with <type>__init</type> to be exported to modules with
    <function>EXPORT_SYMBOL()</function> - this will break.
   </para>
   <para>
    Static data structures marked as <type>__initdata</type> must be initialised
    (as opposed to ordinary static data which is zeroed BSS) and cannot be
    <type>const</type>.
   </para>

  </sect1>

  <sect1 id="routines-init-again">
   <title><function>__initcall()</function>/<function>module_init()</function>
    <filename class=headerfile>include/linux/init.h</filename></title>
   <para>
    Many parts of the kernel are well served as a module
    (dynamically-loadable parts of the kernel). Using the
    <function>module_init()</function> and
    <function>module_exit()</function> macros it is easy to write code
    without #ifdefs which can operate both as a module or built into
    the kernel.
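    A minimal skeleton using these macros might look like the
    following sketch; <function>foo_init()</function> and
    <function>foo_exit()</function> are hypothetical names, and a
    real driver would register its devices where the
    <function>printk()</function> calls are:
    <programlisting>
#include &lt;linux/module.h&gt;
#include &lt;linux/init.h&gt;
#include &lt;linux/kernel.h&gt;

/* Runs at module insertion, or at boot if built into the kernel. */
static int __init foo_init(void)
{
        printk(KERN_INFO "foo: initialised\n");
        return 0;       /* a negative errno here makes module loading fail */
}

/* Runs at module removal; never called when built into the kernel. */
static void __exit foo_exit(void)
{
        printk(KERN_INFO "foo: removed\n");
}

module_init(foo_init);
module_exit(foo_exit);
    </programlisting>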
   </para>

   <para>
    The <function>module_init()</function> macro defines which
    function is to be called at module insertion time (if the file is
    compiled as a module), or at boot time: if the file is not
    compiled as a module the <function>module_init()</function> macro
    becomes equivalent to <function>__initcall()</function>, which
    through linker magic ensures that the function is called on boot.
   </para>

   <para>
    The function can return a negative error number to cause
    module loading to fail (unfortunately, this has no effect if
    the module is compiled into the kernel). For modules, this is
    called in user context, with interrupts enabled, and the
    kernel lock held, so it can sleep.
   </para>
  </sect1>

  <sect1 id="routines-moduleexit">
   <title> <function>module_exit()</function>
    <filename class=headerfile>include/linux/init.h</filename> </title>

   <para>
    This macro defines the function to be called at module removal
    time (or never, in the case of the file compiled into the
    kernel). It will only be called if the module usage count has
    reached zero. This function can also sleep, but cannot fail:
    everything must be cleaned up by the time it returns.
   </para>
  </sect1>

  <sect1 id="routines-module-use-counters">
   <title> <function>MOD_INC_USE_COUNT</function>/<function>MOD_DEC_USE_COUNT</function>
    <filename class=headerfile>include/linux/module.h</filename></title>

   <para>
    These manipulate the module usage count, to protect against
    removal (a module also can't be removed if another module uses
    one of its exported symbols: see below). Every reference to
    the module from user context should be reflected by this
    counter (e.g. for every data structure or socket) before the
    function sleeps. To quote Tim Waugh:
   </para>

   <programlisting>
/* THIS IS BAD */
foo_open (...)
{
        stuff..
        if (fail)
                return -EBUSY;
        sleep.. (might get unloaded here)
        stuff..
        MOD_INC_USE_COUNT;
        return 0;
}

/* THIS IS GOOD */
foo_open (...)
{
        MOD_INC_USE_COUNT;
        stuff..
        if (fail) {
                MOD_DEC_USE_COUNT;
                return -EBUSY;
        }
        sleep.. (safe now)
        stuff..
        return 0;
}
   </programlisting>

   <para>
    You can often avoid having to deal with these problems by using the
    <structfield>owner</structfield> field of the
    <structname>file_operations</structname> structure. Set this field
    to the macro <symbol>THIS_MODULE</symbol>.
   </para>

   <para>
    For more complicated module unload locking requirements, you can set the
    <structfield>can_unload</structfield> function pointer to your own routine,
    which should return <returnvalue>0</returnvalue> if the module is
    unloadable, or <returnvalue>-EBUSY</returnvalue> otherwise.
   </para>

  </sect1>
 </chapter>

 <chapter id="queues">
  <title>Wait Queues
   <filename class=headerfile>include/linux/wait.h</filename>
  </title>
  <para>
   <emphasis>[SLEEPS]</emphasis>
  </para>

  <para>
   A wait queue is used to wait for someone to wake you up when a
   certain condition is true. They must be used carefully to ensure
   there is no race condition. You declare a
   <type>wait_queue_head_t</type>, and then processes which want to
   wait for that condition declare a <type>wait_queue_t</type>
   referring to themselves, and place that in the queue.
  </para>

  <sect1 id="queue-declaring">
   <title>Declaring</title>

   <para>
    You declare a <type>wait_queue_head_t</type> using the
    <function>DECLARE_WAIT_QUEUE_HEAD()</function> macro, or using the
    <function>init_waitqueue_head()</function> routine in your
    initialization code.
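    Both styles are sketched below, purely as an illustration:
    <varname>foo_wait</varname>, <structname>foo_dev</structname>
    and <function>foo_setup()</function> are hypothetical names.
    <programlisting>
#include &lt;linux/wait.h&gt;
#include &lt;linux/sched.h&gt;

/* Static declaration: the queue head is ready to use at once. */
static DECLARE_WAIT_QUEUE_HEAD(foo_wait);

/* A queue head embedded in a structure must be set up at run time. */
struct foo_dev {
        wait_queue_head_t wait;
        int data_ready;
};

static void foo_setup(struct foo_dev *dev)
{
        init_waitqueue_head(&amp;dev->wait);
        dev->data_ready = 0;
}
    </programlisting>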
   </para>
  </sect1>

  <sect1 id="queue-waitqueue">
   <title>Queuing</title>

   <para>
    Placing yourself in the waitqueue is fairly complex, because you
    must put yourself in the queue before checking the condition.
    There is a macro to do this:
    <function>wait_event_interruptible()</function>
    (<filename class=headerfile>include/linux/sched.h</filename>). The
    first argument is the wait queue head, and the second is an
    expression which is evaluated; the macro returns
    <returnvalue>0</returnvalue> when this expression is true, or
    <returnvalue>-ERESTARTSYS</returnvalue> if a signal is received.
    The <function>wait_event()</function> version ignores signals.
   </para>
   <para>
    Do not use the <function>sleep_on()</function> function family -
    it is very easy to accidentally introduce races; almost certainly
    one of the <function>wait_event()</function> family will do, or a
    loop around <function>schedule_timeout()</function>. If you choose
    to loop around <function>schedule_timeout()</function> remember
    you must set the task state (with
    <function>set_current_state()</function>) on each iteration to avoid
    busy-looping.
   </para>

  </sect1>

  <sect1 id="queue-waking">
   <title>Waking Up Queued Tasks</title>

   <para>
    Call <function>wake_up()</function>
    (<filename class=headerfile>include/linux/sched.h</filename>),
    which will wake up every process in the queue. The exception is
    if one has <constant>TASK_EXCLUSIVE</constant> set, in which case
    the remainder of the queue will not be woken.
   </para>
  </sect1>
 </chapter>

 <chapter id="atomic-ops">
  <title>Atomic Operations</title>

  <para>
   Certain operations are guaranteed atomic on all platforms.
   The
   first class of operations work on <type>atomic_t</type>
   (<filename class=headerfile>include/asm/atomic.h</filename>); this
   contains a signed integer (at least 24 bits long), and you must use
   these functions to manipulate or read atomic_t variables.
   <function>atomic_read()</function> and
   <function>atomic_set()</function> get and set the counter;
   the others are
   <function>atomic_add()</function>,
   <function>atomic_sub()</function>,
   <function>atomic_inc()</function>,
   <function>atomic_dec()</function>, and
   <function>atomic_dec_and_test()</function> (returns
   <returnvalue>true</returnvalue> if it was decremented to zero).
  </para>

  <para>
   Yes. It returns <returnvalue>true</returnvalue> (i.e. != 0) if the
   atomic variable is zero.
  </para>

  <para>
   Note that these functions are slower than normal arithmetic, and
   so should not be used unnecessarily. On some platforms they
   are much slower, like 32-bit Sparc where they use a spinlock.
  </para>

  <para>
   The second class of atomic operations is atomic bit operations on a
   <type>long</type>, defined in
   <filename class=headerfile>include/asm/bitops.h</filename>. These
   operations generally take a pointer to the bit pattern, and a bit
   number: 0 is the least significant bit.
   <function>set_bit()</function>, <function>clear_bit()</function>
   and <function>change_bit()</function> set, clear, and flip the
   given bit. <function>test_and_set_bit()</function>,
   <function>test_and_clear_bit()</function> and
   <function>test_and_change_bit()</function> do the same thing,
   except return true if the bit was previously set; these are
   particularly useful for very simple locking.
  </para>

  <para>
   It is possible to call these operations with bit indices greater
   than BITS_PER_LONG. The resulting behavior is strange on big-endian
   platforms though so it is a good idea not to do this.
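   The "very simple locking" use of
   <function>test_and_set_bit()</function> mentioned above might be
   sketched like this. It is only an illustration:
   <constant>FOO_BUSY</constant>, <varname>foo_flags</varname>,
   <function>foo_try_claim()</function> and
   <function>foo_release()</function> are hypothetical names, and
   the sketch assumes a single owner at a time is all that is
   needed:
   <programlisting>
#include &lt;linux/errno.h&gt;
#include &lt;asm/bitops.h&gt;

#define FOO_BUSY 0      /* bit number within foo_flags */

static unsigned long foo_flags; /* the operations act on a long */

/* Returns 0 if we claimed the device, -EBUSY if someone else has it. */
static int foo_try_claim(void)
{
        /* Atomically sets the bit and reports whether it was already set. */
        if (test_and_set_bit(FOO_BUSY, &amp;foo_flags))
                return -EBUSY;
        return 0;
}

static void foo_release(void)
{
        clear_bit(FOO_BUSY, &amp;foo_flags);
}
   </programlisting>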
  </para>

  <para>
   Note that the order of bits depends on the architecture, and in
   particular, the bitfield passed to these operations must be at
   least as large as a <type>long</type>.
  </para>
 </chapter>

 <chapter id="symbols">
  <title>Symbols</title>

  <para>
   Within the kernel proper, the normal linking rules apply
   (ie. unless a symbol is declared to be file scope with the
   <type>static</type> keyword, it can be used anywhere in the
   kernel). However, for modules, a special exported symbol table is
   kept which limits the entry points to the kernel proper. Modules
   can also export symbols.
  </para>

  <sect1 id="sym-exportsymbols">
   <title><function>EXPORT_SYMBOL()</function>
    <filename class=headerfile>include/linux/module.h</filename></title>

   <para>
    This is the classic method of exporting a symbol, and it works
    for both modules and non-modules. In the kernel all these
    declarations are often bundled into a single file to help
    genksyms (which searches source files for these declarations).
    See the comment on genksyms and Makefiles below.
   </para>
  </sect1>

  <sect1 id="sym-exportnosymbols">
   <title><symbol>EXPORT_NO_SYMBOLS</symbol>
    <filename class=headerfile>include/linux/module.h</filename></title>

   <para>
    If a module exports no symbols then you can specify
    <programlisting>
EXPORT_NO_SYMBOLS;
    </programlisting>
    anywhere in the module.
    In kernel 2.4 and earlier, if a module contains neither
    <function>EXPORT_SYMBOL()</function> nor
    <symbol>EXPORT_NO_SYMBOLS</symbol> then the module defaults to
    exporting all non-static global symbols.
    In kernel 2.5 onwards you must explicitly specify whether a module
    exports symbols or not.
   </para>
  </sect1>

  <sect1 id="sym-exportsymbols-gpl">
   <title><function>EXPORT_SYMBOL_GPL()</function>
    <filename class=headerfile>include/linux/module.h</filename></title>

   <para>
    Similar to <function>EXPORT_SYMBOL()</function> except that the
    symbols exported by <function>EXPORT_SYMBOL_GPL()</function> can
    only be seen by modules with a
    <function>MODULE_LICENSE()</function> that specifies a GPL
    compatible license.
   </para>
  </sect1>
 </chapter>

 <chapter id="conventions">
  <title>Routines and Conventions</title>

  <sect1 id="conventions-doublelinkedlist">
   <title>Double-linked lists
    <filename class=headerfile>include/linux/list.h</filename></title>

   <para>
    There are three sets of linked-list routines in the kernel
    headers, but this one seems to be winning out (and Linus has
    used it). If you don't have some particular pressing need for
    a singly-linked list, it's a good choice. In fact, I don't care
    whether it's a good choice or not, just use it so we can get
    rid of the others.
   </para>
  </sect1>

  <sect1 id="convention-returns">
   <title>Return Conventions</title>

   <para>
    For code called in user context, it's very common to defy C
    convention, and return <returnvalue>0</returnvalue> for success,
    and a negative error number
    (eg. <returnvalue>-EFAULT</returnvalue>) for failure. This can be
    unintuitive at first, but it's fairly widespread in the networking
    code, for example.
   </para>

   <para>
    The filesystem code uses <function>ERR_PTR()</function>
    <filename class=headerfile>include/linux/fs.h</filename> to
    encode a negative error number into a pointer, and
    <function>IS_ERR()</function> and <function>PTR_ERR()</function>
    to get it back out again: this avoids a separate pointer parameter
    for the error number. Icky, but in a good way.
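    As a rough userspace sketch of that convention, a lookup routine
    and its caller might look like the following. The three macros are
    copied from the <filename>include/linux/fs.h</filename> excerpt
    quoted in the Kernel Cantrips chapter; <type>struct widget</type>,
    <function>find_widget()</function> and
    <function>widget_id_or_error()</function> are made up for
    illustration and do not exist in the kernel:
    <programlisting><![CDATA[
#include <errno.h>

/* Copied from the include/linux/fs.h excerpt quoted in this guide. */
#define ERR_PTR(err)    ((void *)((long)(err)))
#define PTR_ERR(ptr)    ((long)(ptr))
#define IS_ERR(ptr)     ((unsigned long)(ptr) > (unsigned long)(-1000))

/* Hypothetical object type, for illustration only. */
struct widget {
        int id;
};

static struct widget the_widget = { .id = 7 };

/* Returns a valid pointer on success, or an encoded -ENOENT on failure. */
static struct widget *find_widget(int id)
{
        if (id != the_widget.id)
                return ERR_PTR(-ENOENT);
        return &the_widget;
}

/* Caller side: one return value carries either the object or the error. */
static long widget_id_or_error(int id)
{
        struct widget *w = find_widget(id);

        if (IS_ERR(w))
                return PTR_ERR(w);      /* negative error number */
        return w->id;
}
]]></programlisting>
    Here <function>widget_id_or_error(7)</function> yields 7, while any
    other id yields <returnvalue>-ENOENT</returnvalue>.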
   </para>
  </sect1>

  <sect1 id="conventions-borkedcompile">
   <title>Breaking Compilation</title>

   <para>
    Linus and the other developers sometimes change function or
    structure names in development kernels; this is not done just to
    keep everyone on their toes: it reflects a fundamental change
    (eg. the function can no longer be called with interrupts on, or
    does extra checks, or doesn't do checks which were caught before).
    Usually this is accompanied by a fairly complete note to the
    linux-kernel mailing list; search the archive. Simply doing a
    global replace on the file usually makes things
    <emphasis>worse</emphasis>.
   </para>
  </sect1>

  <sect1 id="conventions-initialising">
   <title>Initializing structure members</title>

   <para>
    The preferred method of initializing structures is to use
    designated initializers, as defined by ISO C99, eg:
   </para>
   <programlisting>
static struct block_device_operations opt_fops = {
        .open               = opt_open,
        .release            = opt_release,
        .ioctl              = opt_ioctl,
        .check_media_change = opt_media_change,
};
   </programlisting>
   <para>
    This makes it easy to grep for, and makes it clear which
    structure fields are set. You should do this because it looks
    cool.
   </para>
  </sect1>

  <sect1 id="conventions-gnu-extns">
   <title>GNU Extensions</title>

   <para>
    GNU Extensions are explicitly allowed in the Linux kernel.
    Note that some of the more complex ones are not very well
    supported, due to lack of general use, but the following are
    considered standard (see the GCC info page section "C
    Extensions" for more details - Yes, really the info page, the
    man page is only a short summary of the stuff in info):
   </para>
   <itemizedlist>
    <listitem>
     <para>
      Inline functions
     </para>
    </listitem>
    <listitem>
     <para>
      Statement expressions (ie. the ({ and }) constructs).
     </para>
    </listitem>
    <listitem>
     <para>
      Declaring attributes of a function / variable / type
      (__attribute__)
     </para>
    </listitem>
    <listitem>
     <para>
      typeof
     </para>
    </listitem>
    <listitem>
     <para>
      Zero length arrays
     </para>
    </listitem>
    <listitem>
     <para>
      Macro varargs
     </para>
    </listitem>
    <listitem>
     <para>
      Arithmetic on void pointers
     </para>
    </listitem>
    <listitem>
     <para>
      Non-Constant initializers
     </para>
    </listitem>
    <listitem>
     <para>
      Assembler Instructions (not outside arch/ and include/asm/)
     </para>
    </listitem>
    <listitem>
     <para>
      Function names as strings (__FUNCTION__)
     </para>
    </listitem>
    <listitem>
     <para>
      __builtin_constant_p()
     </para>
    </listitem>
   </itemizedlist>

   <para>
    Be wary when using long long in the kernel: the code gcc generates
    for it is horrible and, worse, division and multiplication do not
    work on i386 because the GCC runtime functions for them are missing
    from the kernel environment.
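    Returning to the list above, a few of those extensions (statement
    expressions, typeof, zero length arrays and __FUNCTION__) can be
    tried out in a small userspace sketch compiled with gcc. The names
    <function>max_of()</function>, <type>struct packet</type> and
    <function>who_am_i()</function> are invented for this example, and
    <function>__typeof__</function> is used as the spelling of typeof
    that gcc also accepts in its strict ISO modes:
    <programlisting><![CDATA[
#include <stddef.h>

/*
 * Statement expression plus typeof: each argument is evaluated
 * exactly once, so max_of(i++, j) increments i only once.
 */
#define max_of(a, b) ({                 \
        __typeof__(a) _a = (a);         \
        __typeof__(b) _b = (b);         \
        _a > _b ? _a : _b;              \
})

/* Zero length array: a header followed by a variable-sized payload. */
struct packet {
        size_t len;
        char data[0];
};

/* __FUNCTION__ gives the enclosing function's name as a string. */
static const char *who_am_i(void)
{
        return __FUNCTION__;
}
]]></programlisting>
    Because the zero length array takes no space of its own, the
    payload would normally be allocated together with the header in a
    single allocation sized for both.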
   </para>

   <!-- FIXME: add a note about ANSI aliasing cleanness -->
  </sect1>

  <sect1 id="conventions-cplusplus">
   <title>C++</title>

   <para>
    Using C++ in the kernel is usually a bad idea, because the
    kernel does not provide the necessary runtime environment
    and the include files are not tested for it. It is still
    possible, but not recommended. If you really want to do
    this, forget about exceptions at least.
   </para>
  </sect1>

  <sect1 id="conventions-ifdef">
   <title>#if</title>

   <para>
    It is generally considered cleaner to use macros in header files
    (or at the top of .c files) to abstract away functions rather than
    using `#if' pre-processor statements throughout the source code.
   </para>
  </sect1>
 </chapter>

 <chapter id="submitting">
  <title>Putting Your Stuff in the Kernel</title>

  <para>
   In order to get your stuff into shape for official inclusion, or
   even to make a neat patch, there's administrative work to be
   done:
  </para>
  <itemizedlist>
   <listitem>
    <para>
     Figure out whose pond you've been pissing in. Look at the top of
     the source files, inside the <filename>MAINTAINERS</filename>
     file, and last of all in the <filename>CREDITS</filename> file.
     You should coordinate with this person to make sure you're not
     duplicating effort, or trying something that's already been
     rejected.
    </para>

    <para>
     Make sure you put your name and email address at the top of
     any files you create or mangle significantly. This is the
     first place people will look when they find a bug, or when
     <emphasis>they</emphasis> want to make a change.
    </para>
   </listitem>

   <listitem>
    <para>
     Usually you want a configuration option for your kernel hack.
     Edit <filename>Config.in</filename> in the appropriate directory
     (but under <filename>arch/</filename> it's called
     <filename>config.in</filename>). The Config Language used is not
     bash, even though it looks like bash; the safe way is to use only
     the constructs that you already see in
     <filename>Config.in</filename> files (see
     <filename>Documentation/kbuild/config-language.txt</filename>).
     It's good to run "make xconfig" at least once to test (because
     it's the only one with a static parser).
    </para>

    <para>
     Variables which can be Y or N use <type>bool</type> followed by a
     tagline and the config define name (which must start with
     CONFIG_). The <type>tristate</type> function is the same, but
     allows the answer M (which defines
     <symbol>CONFIG_FOO_MODULE</symbol> in your source, instead of
     <symbol>CONFIG_FOO</symbol>) if <symbol>CONFIG_MODULES</symbol>
     is enabled.
    </para>

    <para>
     You may well want to make your CONFIG option only visible if
     <symbol>CONFIG_EXPERIMENTAL</symbol> is enabled: this serves as a
     warning to users. There are many other fancy things you can do: see
     the various <filename>Config.in</filename> files for ideas.
    </para>
   </listitem>

   <listitem>
    <para>
     Edit the <filename>Makefile</filename>: the CONFIG variables are
     exported here so you can conditionalize compilation with `ifeq'.
     If your file exports symbols then add the names to
     <varname>export-objs</varname> so that genksyms will find them.
     <caution>
      <para>
       There is a restriction on the kernel build system that objects
       which export symbols must have globally unique names.
       If your object does not have a globally unique name then the
       standard fix is to move the
       <function>EXPORT_SYMBOL()</function> statements to their own
       object with a unique name.
       This is why several systems have separate exporting objects,
       usually suffixed with ksyms.
      </para>
     </caution>
    </para>
   </listitem>

   <listitem>
    <para>
     Document your option in
     <filename>Documentation/Configure.help</filename>. Mention
     incompatibilities and issues here.
     <emphasis>Definitely</emphasis> end your description with
     <quote>if in doubt, say N</quote> (or, occasionally, `Y'); this is
     for people who have no idea what you are talking about.
    </para>
   </listitem>

   <listitem>
    <para>
     Put yourself in <filename>CREDITS</filename> if you've done
     something noteworthy, usually beyond a single file (your name
     should be at the top of the source files anyway).
     <filename>MAINTAINERS</filename> means you want to be consulted
     when changes are made to a subsystem, and hear about bugs; it
     implies a more-than-passing commitment to some part of the code.
    </para>
   </listitem>

   <listitem>
    <para>
     Finally, don't forget to read
     <filename>Documentation/SubmittingPatches</filename>
     and possibly <filename>Documentation/SubmittingDrivers</filename>.
    </para>
   </listitem>
  </itemizedlist>
 </chapter>

 <chapter id="cantrips">
  <title>Kernel Cantrips</title>

  <para>
   Some favorites from browsing the source. Feel free to add to this
   list.
  </para>

  <para>
   <filename>include/linux/brlock.h</filename>:
  </para>
  <programlisting>
extern inline void br_read_lock (enum brlock_indices idx)
{
        /*
         * This causes a link-time bug message if an
         * invalid index is used:
         */
        if (idx >= __BR_END)
                __br_lock_usage_bug();

        read_lock(&__brlock_array[smp_processor_id()][idx]);
}
  </programlisting>

  <para>
   <filename>include/linux/fs.h</filename>:
  </para>
  <programlisting>
/*
 * Kernel pointers have redundant information, so we can use a
 * scheme where we can return either an error code or a dentry
 * pointer with the same return value.
 *
 * This should be a per-architecture thing, to allow different
 * error and pointer decisions.
 */
 #define ERR_PTR(err)    ((void *)((long)(err)))
 #define PTR_ERR(ptr)    ((long)(ptr))
 #define IS_ERR(ptr)     ((unsigned long)(ptr) > (unsigned long)(-1000))
</programlisting>

  <para>
   <filename>include/asm-i386/uaccess.h</filename>:
  </para>

  <programlisting>
#define copy_to_user(to,from,n)                         \
        (__builtin_constant_p(n) ?                      \
         __constant_copy_to_user((to),(from),(n)) :     \
         __generic_copy_to_user((to),(from),(n)))
  </programlisting>

  <para>
   <filename>arch/sparc/kernel/head.S</filename>:
  </para>

  <programlisting>
/*
 * Sun people can't spell worth damn. "compatability" indeed.
 * At least we *know* we can't spell, and use a spell-checker.
 */

/* Uh, actually Linus it is I who cannot spell. Too much murky
 * Sparc assembly will do this to ya.
 */
C_LABEL(cputypvar):
        .asciz "compatability"

/* Tested on SS-5, SS-10. Probably someone at Sun applied a spell-checker.
 */
        .align 4
C_LABEL(cputypvar_sun4m):
        .asciz "compatible"
  </programlisting>

  <para>
   <filename>arch/sparc/lib/checksum.S</filename>:
  </para>

  <programlisting>
        /* Sun, you just can't beat me, you just can't.  Stop trying,
         * give up.  I'm serious, I am going to kick the living shit
         * out of you, game over, lights out.
         */
  </programlisting>
 </chapter>

 <chapter id="credits">
  <title>Thanks</title>

  <para>
   Thanks to Andi Kleen for the idea, answering my questions, fixing
   my mistakes, filling content, etc. Philipp Rumpf for more spelling
   and clarity fixes, and some excellent non-obvious points. Werner
   Almesberger for giving me a great summary of
   <function>disable_irq()</function>, and Jes Sorensen and Andrea
   Arcangeli added caveats. Michael Elizabeth Chastain for checking
   and adding to the Configure section. <!-- Rusty insisted on this
   bit; I didn't do it! --> Telsa Gwynne for teaching me DocBook.
  </para>
 </chapter>
</book>