1@node Resource Usage And Limitation, Non-Local Exits, Date and Time, Top 2@c %MENU% Functions for examining resource usage and getting and setting limits 3@chapter Resource Usage And Limitation 4This chapter describes functions for examining how much of various kinds of 5resources (CPU time, memory, etc.) a process has used and getting and setting 6limits on future usage. 7 8@menu 9* Resource Usage:: Measuring various resources used. 10* Limits on Resources:: Specifying limits on resource usage. 11* Priority:: Reading or setting process run priority. 12* Memory Resources:: Querying memory available resources. 13* Processor Resources:: Learn about the processors available. 14@end menu 15 16 17@node Resource Usage 18@section Resource Usage 19 20@pindex sys/resource.h 21The function @code{getrusage} and the data type @code{struct rusage} 22are used to examine the resource usage of a process. They are declared 23in @file{sys/resource.h}. 24 25@deftypefun int getrusage (int @var{processes}, struct rusage *@var{rusage}) 26@standards{BSD, sys/resource.h} 27@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} 28@c On HURD, this calls task_info 3 times. On UNIX, it's a syscall. 29This function reports resource usage totals for processes specified by 30@var{processes}, storing the information in @code{*@var{rusage}}. 31 32In most systems, @var{processes} has only two valid values: 33 34@vtable @code 35@item RUSAGE_SELF 36@standards{BSD, sys/resource.h} 37Just the current process. 38 39@item RUSAGE_CHILDREN 40@standards{BSD, sys/resource.h} 41All child processes (direct and indirect) that have already terminated. 42@end vtable 43 44The return value of @code{getrusage} is zero for success, and @code{-1} 45for failure. 46 47@table @code 48@item EINVAL 49The argument @var{processes} is not valid. 50@end table 51@end deftypefun 52 53One way of getting resource usage for a particular child process is with 54the function @code{wait4}, which returns totals for a child when it 55terminates. @xref{BSD Wait Functions}. 56 57@deftp {Data Type} {struct rusage} 58@standards{BSD, sys/resource.h} 59This data type stores various resource usage statistics. It has the 60following members, and possibly others: 61 62@table @code 63@item struct timeval ru_utime 64Time spent executing user instructions. 65 66@item struct timeval ru_stime 67Time spent in operating system code on behalf of @var{processes}. 68 69@item long int ru_maxrss 70The maximum resident set size used, in kilobytes. That is, the maximum 71number of kilobytes of physical memory that @var{processes} used 72simultaneously. 73 74@item long int ru_ixrss 75An integral value expressed in kilobytes times ticks of execution, which 76indicates the amount of memory used by text that was shared with other 77processes. 78 79@item long int ru_idrss 80An integral value expressed the same way, which is the amount of 81unshared memory used for data. 82 83@item long int ru_isrss 84An integral value expressed the same way, which is the amount of 85unshared memory used for stack space. 86 87@item long int ru_minflt 88The number of page faults which were serviced without requiring any I/O. 89 90@item long int ru_majflt 91The number of page faults which were serviced by doing I/O. 92 93@item long int ru_nswap 94The number of times @var{processes} was swapped entirely out of main memory. 95 96@item long int ru_inblock 97The number of times the file system had to read from the disk on behalf 98of @var{processes}. 
99 100@item long int ru_oublock 101The number of times the file system had to write to the disk on behalf 102of @var{processes}. 103 104@item long int ru_msgsnd 105Number of IPC messages sent. 106 107@item long int ru_msgrcv 108Number of IPC messages received. 109 110@item long int ru_nsignals 111Number of signals received. 112 113@item long int ru_nvcsw 114The number of times @var{processes} voluntarily invoked a context switch 115(usually to wait for some service). 116 117@item long int ru_nivcsw 118The number of times an involuntary context switch took place (because 119a time slice expired, or another process of higher priority was 120scheduled). 121@end table 122@end deftp 123 124@node Limits on Resources 125@section Limiting Resource Usage 126@cindex resource limits 127@cindex limits on resource usage 128@cindex usage limits 129 130You can specify limits for the resource usage of a process. When the 131process tries to exceed a limit, it may get a signal, or the system call 132by which it tried to do so may fail, depending on the resource. Each 133process initially inherits its limit values from its parent, but it can 134subsequently change them. 135 136There are two per-process limits associated with a resource: 137@cindex limit 138 139@table @dfn 140@item current limit 141The current limit is the value the system will not allow usage to 142exceed. It is also called the ``soft limit'' because the process being 143limited can generally raise the current limit at will. 144@cindex current limit 145@cindex soft limit 146 147@item maximum limit 148The maximum limit is the maximum value to which a process is allowed to 149set its current limit. It is also called the ``hard limit'' because 150there is no way for a process to get around it. A process may lower 151its own maximum limit, but only the superuser may increase a maximum 152limit. 153@cindex maximum limit 154@cindex hard limit 155@end table 156 157@pindex sys/resource.h 158The symbols for use with @code{getrlimit}, @code{setrlimit}, 159@code{getrlimit64}, and @code{setrlimit64} are defined in 160@file{sys/resource.h}. 161 162@deftypefun int getrlimit (int @var{resource}, struct rlimit *@var{rlp}) 163@standards{BSD, sys/resource.h} 164@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} 165@c Direct syscall on most systems. 166Read the current and maximum limits for the resource @var{resource} 167and store them in @code{*@var{rlp}}. 168 169The return value is @code{0} on success and @code{-1} on failure. The 170only possible @code{errno} error condition is @code{EFAULT}. 171 172When the sources are compiled with @code{_FILE_OFFSET_BITS == 64} on a 17332-bit system this function is in fact @code{getrlimit64}. Thus, the 174LFS interface transparently replaces the old interface. 175@end deftypefun 176 177@deftypefun int getrlimit64 (int @var{resource}, struct rlimit64 *@var{rlp}) 178@standards{Unix98, sys/resource.h} 179@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} 180@c Direct syscall on most systems, wrapper to getrlimit otherwise. 181This function is similar to @code{getrlimit} but its second parameter is 182a pointer to a variable of type @code{struct rlimit64}, which allows it 183to read values which wouldn't fit in the member of a @code{struct 184rlimit}. 185 186If the sources are compiled with @code{_FILE_OFFSET_BITS == 64} on a 18732-bit machine, this function is available under the name 188@code{getrlimit} and so transparently replaces the old interface. 
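
Here is a minimal sketch of how the @code{getrlimit}/@code{setrlimit}
pair (and, through the LFS mapping just described, their 64-bit
variants) is typically used: it raises the soft limit on open file
descriptors to the hard limit.  The helper name @code{maximize_nofile}
is only for illustration, and error handling is reduced to a simple
report.

@smallexample
#include <stdio.h>
#include <sys/resource.h>

/* Raise the soft limit on open files to the hard limit.
   Return 0 on success, -1 on failure.  */
int
maximize_nofile (void)
@{
  struct rlimit rl;

  if (getrlimit (RLIMIT_NOFILE, &rl) != 0)
    @{
      perror ("getrlimit");
      return -1;
    @}

  /* The current (soft) limit may be raised up to the maximum
     (hard) limit without special privileges.  */
  rl.rlim_cur = rl.rlim_max;

  if (setrlimit (RLIMIT_NOFILE, &rl) != 0)
    @{
      perror ("setrlimit");
      return -1;
    @}
  return 0;
@}
@end smallexample
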
@end deftypefun

@deftypefun int setrlimit (int @var{resource}, const struct rlimit *@var{rlp})
@standards{BSD, sys/resource.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
@c Direct syscall on most systems; lock-taking critical section on HURD.
Set the current and maximum limits for the resource @var{resource}
to the values supplied in @code{*@var{rlp}}.

The return value is @code{0} on success and @code{-1} on failure.  The
following @code{errno} error condition is possible:

@table @code
@item EPERM
@itemize @bullet
@item
The process tried to raise a current limit beyond the maximum limit.

@item
The process tried to raise a maximum limit, but is not superuser.
@end itemize
@end table

When the sources are compiled with @code{_FILE_OFFSET_BITS == 64} on a
32-bit system this function is in fact @code{setrlimit64}.  Thus, the
LFS interface transparently replaces the old interface.
@end deftypefun

@deftypefun int setrlimit64 (int @var{resource}, const struct rlimit64 *@var{rlp})
@standards{Unix98, sys/resource.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
@c Wrapper for setrlimit or direct syscall.
This function is similar to @code{setrlimit} but its second parameter is
a pointer to a variable of type @code{struct rlimit64}, which allows it
to set values which wouldn't fit in the member of a @code{struct
rlimit}.

If the sources are compiled with @code{_FILE_OFFSET_BITS == 64} on a
32-bit machine this function is available under the name
@code{setrlimit} and so transparently replaces the old interface.
@end deftypefun

@deftp {Data Type} {struct rlimit}
@standards{BSD, sys/resource.h}
This structure is used with @code{getrlimit} to receive limit values,
and with @code{setrlimit} to specify limit values for a particular process
and resource.  It has two fields:

@table @code
@item rlim_t rlim_cur
The current limit.

@item rlim_t rlim_max
The maximum limit.
@end table

For @code{getrlimit}, the structure is an output; it receives the current
values.  For @code{setrlimit}, it specifies the new values.
@end deftp

For the LFS functions a similar type is defined in @file{sys/resource.h}.

@deftp {Data Type} {struct rlimit64}
@standards{Unix98, sys/resource.h}
This structure is analogous to the @code{rlimit} structure above, but
its components have wider ranges.  It has two fields:

@table @code
@item rlim64_t rlim_cur
This is analogous to @code{rlimit.rlim_cur}, but with a different type.

@item rlim64_t rlim_max
This is analogous to @code{rlimit.rlim_max}, but with a different type.
@end table

@end deftp

Here is a list of resources for which you can specify a limit.  Memory
and file sizes are measured in bytes.

@vtable @code
@item RLIMIT_CPU
@standards{BSD, sys/resource.h}
The maximum amount of CPU time the process can use.  If it runs for
longer than this, it gets a signal: @code{SIGXCPU}.  The value is
measured in seconds.  @xref{Operation Error Signals}.

@item RLIMIT_FSIZE
@standards{BSD, sys/resource.h}
The maximum size of a file the process can create.  Trying to write a
larger file causes a signal: @code{SIGXFSZ}.  @xref{Operation Error
Signals}.

@item RLIMIT_DATA
@standards{BSD, sys/resource.h}
The maximum size of data memory for the process.
If the process tries 285to allocate data memory beyond this amount, the allocation function 286fails. 287 288@item RLIMIT_STACK 289@standards{BSD, sys/resource.h} 290The maximum stack size for the process. If the process tries to extend 291its stack past this size, it gets a @code{SIGSEGV} signal. 292@xref{Program Error Signals}. 293 294@item RLIMIT_CORE 295@standards{BSD, sys/resource.h} 296The maximum size core file that this process can create. If the process 297terminates and would dump a core file larger than this, then no core 298file is created. So setting this limit to zero prevents core files from 299ever being created. 300 301@item RLIMIT_RSS 302@standards{BSD, sys/resource.h} 303The maximum amount of physical memory that this process should get. 304This parameter is a guide for the system's scheduler and memory 305allocator; the system may give the process more memory when there is a 306surplus. 307 308@item RLIMIT_MEMLOCK 309@standards{BSD, sys/resource.h} 310The maximum amount of memory that can be locked into physical memory (so 311it will never be paged out). 312 313@item RLIMIT_NPROC 314@standards{BSD, sys/resource.h} 315The maximum number of processes that can be created with the same user ID. 316If you have reached the limit for your user ID, @code{fork} will fail 317with @code{EAGAIN}. @xref{Creating a Process}. 318 319@item RLIMIT_NOFILE 320@itemx RLIMIT_OFILE 321@standardsx{RLIMIT_NOFILE, BSD, sys/resource.h} 322The maximum number of files that the process can open. If it tries to 323open more files than this, its open attempt fails with @code{errno} 324@code{EMFILE}. @xref{Error Codes}. Not all systems support this limit; 325GNU does, and 4.4 BSD does. 326 327@item RLIMIT_AS 328@standards{Unix98, sys/resource.h} 329The maximum size of total memory that this process should get. If the 330process tries to allocate more memory beyond this amount with, for 331example, @code{brk}, @code{malloc}, @code{mmap} or @code{sbrk}, the 332allocation function fails. 333 334@item RLIM_NLIMITS 335@standards{BSD, sys/resource.h} 336The number of different resource limits. Any valid @var{resource} 337operand must be less than @code{RLIM_NLIMITS}. 338@end vtable 339 340@deftypevr Constant rlim_t RLIM_INFINITY 341@standards{BSD, sys/resource.h} 342This constant stands for a value of ``infinity'' when supplied as 343the limit value in @code{setrlimit}. 344@end deftypevr 345 346 347The following are historical functions to do some of what the functions 348above do. The functions above are better choices. 349 350@code{ulimit} and the command symbols are declared in @file{ulimit.h}. 351@pindex ulimit.h 352 353@deftypefun {long int} ulimit (int @var{cmd}, @dots{}) 354@standards{BSD, ulimit.h} 355@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} 356@c Wrapper for getrlimit, setrlimit or 357@c sysconf(_SC_OPEN_MAX)->getdtablesize->getrlimit. 358 359@code{ulimit} gets the current limit or sets the current and maximum 360limit for a particular resource for the calling process according to the 361command @var{cmd}. 362 363If you are getting a limit, the command argument is the only argument. 364If you are setting a limit, there is a second argument: 365@code{long int} @var{limit} which is the value to which you are setting 366the limit. 367 368The @var{cmd} values and the operations they specify are: 369@vtable @code 370 371@item GETFSIZE 372Get the current limit on the size of a file, in units of 512 bytes. 
373 374@item SETFSIZE 375Set the current and maximum limit on the size of a file to @var{limit} * 376512 bytes. 377 378@end vtable 379 380There are also some other @var{cmd} values that may do things on some 381systems, but they are not supported. 382 383Only the superuser may increase a maximum limit. 384 385When you successfully get a limit, the return value of @code{ulimit} is 386that limit, which is never negative. When you successfully set a limit, 387the return value is zero. When the function fails, the return value is 388@code{-1} and @code{errno} is set according to the reason: 389 390@table @code 391@item EPERM 392A process tried to increase a maximum limit, but is not superuser. 393@end table 394 395 396@end deftypefun 397 398@code{vlimit} and its resource symbols are declared in @file{sys/vlimit.h}. 399@pindex sys/vlimit.h 400 401@deftypefun int vlimit (int @var{resource}, int @var{limit}) 402@standards{BSD, sys/vlimit.h} 403@safety{@prelim{}@mtunsafe{@mtasurace{:setrlimit}}@asunsafe{}@acsafe{}} 404@c It calls getrlimit and modifies the rlim_cur field before calling 405@c setrlimit. There's a window for a concurrent call to setrlimit that 406@c modifies e.g. rlim_max, which will be lost if running as super-user. 407 408@code{vlimit} sets the current limit for a resource for a process. 409 410@var{resource} identifies the resource: 411 412@vtable @code 413@item LIM_CPU 414Maximum CPU time. Same as @code{RLIMIT_CPU} for @code{setrlimit}. 415@item LIM_FSIZE 416Maximum file size. Same as @code{RLIMIT_FSIZE} for @code{setrlimit}. 417@item LIM_DATA 418Maximum data memory. Same as @code{RLIMIT_DATA} for @code{setrlimit}. 419@item LIM_STACK 420Maximum stack size. Same as @code{RLIMIT_STACK} for @code{setrlimit}. 421@item LIM_CORE 422Maximum core file size. Same as @code{RLIMIT_COR} for @code{setrlimit}. 423@item LIM_MAXRSS 424Maximum physical memory. Same as @code{RLIMIT_RSS} for @code{setrlimit}. 425@end vtable 426 427The return value is zero for success, and @code{-1} with @code{errno} set 428accordingly for failure: 429 430@table @code 431@item EPERM 432The process tried to set its current limit beyond its maximum limit. 433@end table 434 435@end deftypefun 436 437@node Priority 438@section Process CPU Priority And Scheduling 439@cindex process priority 440@cindex cpu priority 441@cindex priority of a process 442 443When multiple processes simultaneously require CPU time, the system's 444scheduling policy and process CPU priorities determine which processes 445get it. This section describes how that determination is made and 446@glibcadj{} functions to control it. 447 448It is common to refer to CPU scheduling simply as scheduling and a 449process' CPU priority simply as the process' priority, with the CPU 450resource being implied. Bear in mind, though, that CPU time is not the 451only resource a process uses or that processes contend for. In some 452cases, it is not even particularly important. Giving a process a high 453``priority'' may have very little effect on how fast a process runs with 454respect to other processes. The priorities discussed in this section 455apply only to CPU time. 456 457CPU scheduling is a complex issue and different systems do it in wildly 458different ways. New ideas continually develop and find their way into 459the intricacies of the various systems' scheduling algorithms. This 460section discusses the general concepts, some specifics of systems 461that commonly use @theglibc{}, and some standards. 
462 463For simplicity, we talk about CPU contention as if there is only one CPU 464in the system. But all the same principles apply when a processor has 465multiple CPUs, and knowing that the number of processes that can run at 466any one time is equal to the number of CPUs, you can easily extrapolate 467the information. 468 469The functions described in this section are all defined by the POSIX.1 470and POSIX.1b standards (the @code{sched@dots{}} functions are POSIX.1b). 471However, POSIX does not define any semantics for the values that these 472functions get and set. In this chapter, the semantics are based on the 473Linux kernel's implementation of the POSIX standard. As you will see, 474the Linux implementation is quite the inverse of what the authors of the 475POSIX syntax had in mind. 476 477@menu 478* Absolute Priority:: The first tier of priority. Posix 479* Realtime Scheduling:: Scheduling among the process nobility 480* Basic Scheduling Functions:: Get/set scheduling policy, priority 481* Traditional Scheduling:: Scheduling among the vulgar masses 482* CPU Affinity:: Limiting execution to certain CPUs 483@end menu 484 485 486 487@node Absolute Priority 488@subsection Absolute Priority 489@cindex absolute priority 490@cindex priority, absolute 491 492Every process has an absolute priority, and it is represented by a number. 493The higher the number, the higher the absolute priority. 494 495@cindex realtime CPU scheduling 496On systems of the past, and most systems today, all processes have 497absolute priority 0 and this section is irrelevant. In that case, 498@xref{Traditional Scheduling}. Absolute priorities were invented to 499accommodate realtime systems, in which it is vital that certain processes 500be able to respond to external events happening in real time, which 501means they cannot wait around while some other process that @emph{wants 502to}, but doesn't @emph{need to} run occupies the CPU. 503 504@cindex ready to run 505@cindex preemptive scheduling 506When two processes are in contention to use the CPU at any instant, the 507one with the higher absolute priority always gets it. This is true even if the 508process with the lower priority is already using the CPU (i.e., the 509scheduling is preemptive). Of course, we're only talking about 510processes that are running or ``ready to run,'' which means they are 511ready to execute instructions right now. When a process blocks to wait 512for something like I/O, its absolute priority is irrelevant. 513 514@cindex runnable process 515@strong{NB:} The term ``runnable'' is a synonym for ``ready to run.'' 516 517When two processes are running or ready to run and both have the same 518absolute priority, it's more interesting. In that case, who gets the 519CPU is determined by the scheduling policy. If the processes have 520absolute priority 0, the traditional scheduling policy described in 521@ref{Traditional Scheduling} applies. Otherwise, the policies described 522in @ref{Realtime Scheduling} apply. 523 524You normally give an absolute priority above 0 only to a process that 525can be trusted not to hog the CPU. Such processes are designed to block 526(or terminate) after relatively short CPU runs. 527 528A process begins life with the same absolute priority as its parent 529process. Functions described in @ref{Basic Scheduling Functions} can 530change it. 531 532Only a privileged process can change a process' absolute priority to 533something other than @code{0}. 
Only a privileged process or the
target process' owner can change its absolute priority at all.

POSIX requires absolute priority values used with the realtime
scheduling policies to be consecutive with a range of at least 32.  On
Linux, they are 1 through 99.  The functions
@code{sched_get_priority_max} and @code{sched_get_priority_min} portably
tell you what the range is on a particular system.


@subsubsection Using Absolute Priority

One thing you must keep in mind when designing realtime applications is
that having higher absolute priority than any other process doesn't
guarantee the process can run continuously.  Two things that can wreck a
good CPU run are interrupts and page faults.

Interrupt handlers live in that limbo between processes.  The CPU is
executing instructions, but they aren't part of any process.  An
interrupt will stop even the highest priority process.  So you must
allow for slight delays and make sure that no device in the system has
an interrupt handler that could cause too long a delay between
instructions for your process.

Similarly, a page fault causes what looks like a straightforward
sequence of instructions to take a long time.  The fact that other
processes get to run while the page faults in is of no consequence,
because as soon as the I/O is complete, the higher priority process will
kick them out and run again, but the wait for the I/O itself could be a
problem.  To neutralize this threat, use @code{mlock} or
@code{mlockall}.

There are a few ramifications of the absoluteness of this priority on a
single-CPU system that you need to keep in mind when you choose to set a
priority and also when you're working on a program that runs with high
absolute priority.  Consider a process that has higher absolute priority
than any other process in the system and, due to a bug in its program,
gets into an infinite loop.  It will never cede the CPU.  You can't run
a command to kill it because your command would need to get the CPU in
order to run.  The errant program is in complete control.  It controls
the vertical, it controls the horizontal.

There are two ways to avoid this: 1) keep a shell running somewhere with
a higher absolute priority or 2) keep a controlling terminal attached to
the high priority process group.  All the priority in the world won't
stop an interrupt handler from running and delivering a signal to the
process if you hit Control-C.

Some systems use absolute priority as a means of allocating a fixed
percentage of CPU time to a process.  To do this, a super high priority
privileged process constantly monitors the process' CPU usage and raises
its absolute priority when the process isn't getting its entitled share
and lowers it when the process is exceeding it.

@strong{NB:} The absolute priority is sometimes called the ``static
priority.''  We don't use that term in this manual because it misses the
most important feature of the absolute priority: its absoluteness.


@node Realtime Scheduling
@subsection Realtime Scheduling
@cindex realtime scheduling

Whenever two processes with the same absolute priority are ready to run,
the kernel has a decision to make, because only one can run at a time.
If the processes have absolute priority 0, the kernel makes this decision
as described in @ref{Traditional Scheduling}.
Otherwise, the decision 600is as described in this section. 601 602If two processes are ready to run but have different absolute priorities, 603the decision is much simpler, and is described in @ref{Absolute 604Priority}. 605 606Each process has a scheduling policy. For processes with absolute 607priority other than zero, there are two available: 608 609@enumerate 610@item 611First Come First Served 612@item 613Round Robin 614@end enumerate 615 616The most sensible case is where all the processes with a certain 617absolute priority have the same scheduling policy. We'll discuss that 618first. 619 620In Round Robin, processes share the CPU, each one running for a small 621quantum of time (``time slice'') and then yielding to another in a 622circular fashion. Of course, only processes that are ready to run and 623have the same absolute priority are in this circle. 624 625In First Come First Served, the process that has been waiting the 626longest to run gets the CPU, and it keeps it until it voluntarily 627relinquishes the CPU, runs out of things to do (blocks), or gets 628preempted by a higher priority process. 629 630First Come First Served, along with maximal absolute priority and 631careful control of interrupts and page faults, is the one to use when a 632process absolutely, positively has to run at full CPU speed or not at 633all. 634 635Judicious use of @code{sched_yield} function invocations by processes 636with First Come First Served scheduling policy forms a good compromise 637between Round Robin and First Come First Served. 638 639To understand how scheduling works when processes of different scheduling 640policies occupy the same absolute priority, you have to know the nitty 641gritty details of how processes enter and exit the ready to run list. 642 643In both cases, the ready to run list is organized as a true queue, where 644a process gets pushed onto the tail when it becomes ready to run and is 645popped off the head when the scheduler decides to run it. Note that 646ready to run and running are two mutually exclusive states. When the 647scheduler runs a process, that process is no longer ready to run and no 648longer in the ready to run list. When the process stops running, it 649may go back to being ready to run again. 650 651The only difference between a process that is assigned the Round Robin 652scheduling policy and a process that is assigned First Come First Serve 653is that in the former case, the process is automatically booted off the 654CPU after a certain amount of time. When that happens, the process goes 655back to being ready to run, which means it enters the queue at the tail. 656The time quantum we're talking about is small. Really small. This is 657not your father's timesharing. For example, with the Linux kernel, the 658round robin time slice is a thousand times shorter than its typical 659time slice for traditional scheduling. 660 661A process begins life with the same scheduling policy as its parent process. 662Functions described in @ref{Basic Scheduling Functions} can change it. 663 664Only a privileged process can set the scheduling policy of a process 665that has absolute priority higher than 0. 666 667@node Basic Scheduling Functions 668@subsection Basic Scheduling Functions 669 670This section describes functions in @theglibc{} for setting the 671absolute priority and scheduling policy of a process. 

@strong{Portability Note:} On systems that have the functions in this
section, the macro @code{_POSIX_PRIORITY_SCHEDULING} is defined in
@file{unistd.h}.

If the scheduling policy is traditional scheduling, additional
functions for fine-tuning the scheduling are described in
@ref{Traditional Scheduling}.

Don't try to make too much out of the naming and structure of these
functions.  They don't match the concepts described in this manual
because the functions are as defined by POSIX.1b, but the implementation
on systems that use @theglibc{} is the inverse of what the POSIX
structure contemplates.  The POSIX scheme assumes that the primary
scheduling parameter is the scheduling policy and that the priority
value, if any, is a parameter of the scheduling policy.  In the
implementation, though, the priority value is king and the scheduling
policy, if anything, only fine tunes the effect of that priority.

The symbols in this section are declared by including the file
@file{sched.h}.

@strong{Portability Note:} In POSIX, the @code{pid_t} arguments of the
functions below refer to process IDs.  On Linux, they are actually
thread IDs, and control how specific threads are scheduled with
regard to the entire system.  The resulting behavior does not conform
to POSIX.  This is why the following description refers to tasks and
task IDs, and not processes and process IDs.
@c https://sourceware.org/bugzilla/show_bug.cgi?id=14829

@deftp {Data Type} {struct sched_param}
@standards{POSIX, sched.h}
This structure describes an absolute priority.
@table @code
@item int sched_priority
absolute priority value
@end table
@end deftp

@deftypefun int sched_setscheduler (pid_t @var{pid}, int @var{policy}, const struct sched_param *@var{param})
@standards{POSIX, sched.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
@c Direct syscall, Linux only.

This function sets both the absolute priority and the scheduling policy
for a task.

It assigns the absolute priority value given by @var{param} and the
scheduling policy @var{policy} to the task with ID @var{pid},
or the calling task if @var{pid} is zero.  If @var{policy} is
negative, @code{sched_setscheduler} keeps the existing scheduling policy.

The following macros represent the valid values for @var{policy}:

@vtable @code
@item SCHED_OTHER
Traditional Scheduling
@item SCHED_FIFO
First In First Out
@item SCHED_RR
Round Robin
@end vtable

@c The Linux kernel code (in sched.c) actually reschedules the process,
@c but it puts it at the head of the run queue, so I'm not sure just what
@c the effect is, but it must be subtle.

On success, the return value is @code{0}.  Otherwise, it is @code{-1}
and @code{errno} is set accordingly.  The @code{errno} values specific
to this function are:

@table @code
@item EPERM
@itemize @bullet
@item
The calling task does not have @code{CAP_SYS_NICE} permission and
@var{policy} is not @code{SCHED_OTHER} (or it is negative and the
existing policy is not @code{SCHED_OTHER}).

@item
The calling task does not have @code{CAP_SYS_NICE} permission and its
owner is not the target task's owner.  I.e., the effective uid of the
calling task is neither the effective nor the real uid of task
@var{pid}.
@c We need a cross reference to the capabilities section, when written.
@end itemize

@item ESRCH
There is no task with pid @var{pid} and @var{pid} is not zero.

@item EINVAL
@itemize @bullet
@item
@var{policy} does not identify an existing scheduling policy.

@item
The absolute priority value identified by *@var{param} is outside the
valid range for the scheduling policy @var{policy} (or the existing
scheduling policy if @var{policy} is negative) or @var{param} is
null.  @code{sched_get_priority_max} and @code{sched_get_priority_min}
tell you what the valid range is.

@item
@var{pid} is negative.
@end itemize
@end table

@end deftypefun


@deftypefun int sched_getscheduler (pid_t @var{pid})
@standards{POSIX, sched.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
@c Direct syscall, Linux only.

This function returns the scheduling policy assigned to the task with
ID @var{pid}, or the calling task if @var{pid} is zero.

The return value is the scheduling policy.  See
@code{sched_setscheduler} for the possible values.

If the function fails, the return value is instead @code{-1} and
@code{errno} is set accordingly.

The @code{errno} values specific to this function are:

@table @code

@item ESRCH
There is no task with pid @var{pid} and it is not zero.

@item EINVAL
@var{pid} is negative.

@end table

Note that this function is not an exact mate to @code{sched_setscheduler}
because while that function sets the scheduling policy and the absolute
priority, this function gets only the scheduling policy.  To get the
absolute priority, use @code{sched_getparam}.

@end deftypefun


@deftypefun int sched_setparam (pid_t @var{pid}, const struct sched_param *@var{param})
@standards{POSIX, sched.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
@c Direct syscall, Linux only.

This function sets a task's absolute priority.

It is functionally identical to @code{sched_setscheduler} with
@var{policy} = @code{-1}.

@c in fact, that's how it's implemented in Linux.

@end deftypefun

@deftypefun int sched_getparam (pid_t @var{pid}, struct sched_param *@var{param})
@standards{POSIX, sched.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
@c Direct syscall, Linux only.

This function returns a task's absolute priority.

@var{pid} is the task ID of the task whose absolute priority you want
to know.

@var{param} is a pointer to a structure in which the function stores the
absolute priority of the task.

On success, the return value is @code{0}.  Otherwise, it is @code{-1}
and @code{errno} is set accordingly.  The @code{errno} values specific
to this function are:

@table @code

@item ESRCH
There is no task with ID @var{pid} and it is not zero.

@item EINVAL
@var{pid} is negative.

@end table

@end deftypefun


@deftypefun int sched_get_priority_min (int @var{policy})
@standards{POSIX, sched.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
@c Direct syscall, Linux only.

This function returns the lowest absolute priority value that is
allowable for a task with scheduling policy @var{policy}.

On Linux, it is 0 for @code{SCHED_OTHER} and 1 for everything else.

On success, the return value is that minimum priority value.
Otherwise, it is @code{-1} and @code{errno} is set accordingly.
The @code{errno} values specific
to this function are:

@table @code
@item EINVAL
@var{policy} does not identify an existing scheduling policy.
@end table

@end deftypefun

@deftypefun int sched_get_priority_max (int @var{policy})
@standards{POSIX, sched.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
@c Direct syscall, Linux only.

This function returns the highest absolute priority value that is
allowable for a task with scheduling policy @var{policy}.

On Linux, it is 0 for @code{SCHED_OTHER} and 99 for everything else.

On success, the return value is that maximum priority value.
Otherwise, it is @code{-1} and @code{errno} is set accordingly.
The @code{errno} values specific to this function are:

@table @code
@item EINVAL
@var{policy} does not identify an existing scheduling policy.
@end table

@end deftypefun

@deftypefun int sched_rr_get_interval (pid_t @var{pid}, struct timespec *@var{interval})
@standards{POSIX, sched.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
@c Direct syscall, Linux only.

This function returns the length of the quantum (time slice) used with
the Round Robin scheduling policy, if it is used, for the task with
task ID @var{pid}.

It returns the length of time in @code{*@var{interval}}.
@c We need a cross-reference to where timespec is explained.  But that
@c section doesn't exist yet, and the time chapter needs to be slightly
@c reorganized so there is a place to put it (which will be right next
@c to timeval, which is presently misplaced).  2000.05.07.

With a Linux kernel, the round robin time slice is always 150
microseconds, and @var{pid} need not even be a real pid.

The return value is @code{0} on success and in the pathological case
that it fails, the return value is @code{-1} and @code{errno} is set
accordingly.  There is nothing specific that can go wrong with this
function, so there are no specific @code{errno} values.

@end deftypefun

@deftypefun int sched_yield (void)
@standards{POSIX, sched.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
@c Direct syscall on Linux; alias to swtch on HURD.

This function voluntarily gives up the task's claim on the CPU.

Technically, @code{sched_yield} causes the calling task to be made
immediately ready to run (as opposed to running, which is what it was
before).  This means that if it has absolute priority higher than 0, it
gets pushed onto the tail of the queue of tasks that share its
absolute priority and are ready to run, and it will run again when its
turn next arrives.  If its absolute priority is 0, it is more
complicated, but still has the effect of yielding the CPU to other
tasks.

If there are no other tasks that share the calling task's absolute
priority, this function doesn't have any effect.

To the extent that the containing program is oblivious to what other
processes in the system are doing and how fast it executes, this
function appears as a no-op.

The return value is @code{0} on success and in the pathological case
that it fails, the return value is @code{-1} and @code{errno} is set
accordingly.  There is nothing specific that can go wrong with this
function, so there are no specific @code{errno} values.
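
To illustrate how @code{sched_yield} combines with the First Come First
Served policy described in @ref{Realtime Scheduling}, here is a minimal
sketch of a cooperative realtime worker.  The helper
@code{do_work_unit} is hypothetical, and the @code{sched_setscheduler}
call needs the privileges described earlier in this section.

@smallexample
#include <sched.h>
#include <stdio.h>

extern int do_work_unit (void);   /* hypothetical application work */

/* Run under First Come First Served scheduling, but let other tasks
   of the same absolute priority run between work units.  */
int
cooperative_fifo_worker (void)
@{
  struct sched_param param;

  /* Use the lowest realtime priority; the valid range can be queried
     with sched_get_priority_min and sched_get_priority_max.  */
  param.sched_priority = sched_get_priority_min (SCHED_FIFO);

  /* Requires privilege (CAP_SYS_NICE on Linux).  */
  if (sched_setscheduler (0, SCHED_FIFO, &param) != 0)
    @{
      perror ("sched_setscheduler");
      return -1;
    @}

  while (do_work_unit ())
    /* Push this task to the tail of its priority queue so that
       other ready-to-run tasks get the CPU.  */
    sched_yield ();

  return 0;
@}
@end smallexample
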
952 953@end deftypefun 954 955@node Traditional Scheduling 956@subsection Traditional Scheduling 957@cindex scheduling, traditional 958 959This section is about the scheduling among processes whose absolute 960priority is 0. When the system hands out the scraps of CPU time that 961are left over after the processes with higher absolute priority have 962taken all they want, the scheduling described herein determines who 963among the great unwashed processes gets them. 964 965@menu 966* Traditional Scheduling Intro:: 967* Traditional Scheduling Functions:: 968@end menu 969 970@node Traditional Scheduling Intro 971@subsubsection Introduction To Traditional Scheduling 972 973Long before there was absolute priority (See @ref{Absolute Priority}), 974Unix systems were scheduling the CPU using this system. When POSIX came 975in like the Romans and imposed absolute priorities to accommodate the 976needs of realtime processing, it left the indigenous Absolute Priority 977Zero processes to govern themselves by their own familiar scheduling 978policy. 979 980Indeed, absolute priorities higher than zero are not available on many 981systems today and are not typically used when they are, being intended 982mainly for computers that do realtime processing. So this section 983describes the only scheduling many programmers need to be concerned 984about. 985 986But just to be clear about the scope of this scheduling: Any time a 987process with an absolute priority of 0 and a process with an absolute 988priority higher than 0 are ready to run at the same time, the one with 989absolute priority 0 does not run. If it's already running when the 990higher priority ready-to-run process comes into existence, it stops 991immediately. 992 993In addition to its absolute priority of zero, every process has another 994priority, which we will refer to as "dynamic priority" because it changes 995over time. The dynamic priority is meaningless for processes with 996an absolute priority higher than zero. 997 998The dynamic priority sometimes determines who gets the next turn on the 999CPU. Sometimes it determines how long turns last. Sometimes it 1000determines whether a process can kick another off the CPU. 1001 1002In Linux, the value is a combination of these things, but mostly it 1003just determines the length of the time slice. The higher a process' 1004dynamic priority, the longer a shot it gets on the CPU when it gets one. 1005If it doesn't use up its time slice before giving up the CPU to do 1006something like wait for I/O, it is favored for getting the CPU back when 1007it's ready for it, to finish out its time slice. Other than that, 1008selection of processes for new time slices is basically round robin. 1009But the scheduler does throw a bone to the low priority processes: A 1010process' dynamic priority rises every time it is snubbed in the 1011scheduling process. In Linux, even the fat kid gets to play. 1012 1013The fluctuation of a process' dynamic priority is regulated by another 1014value: The ``nice'' value. The nice value is an integer, usually in the 1015range -20 to 20, and represents an upper limit on a process' dynamic 1016priority. The higher the nice number, the lower that limit. 1017 1018On a typical Linux system, for example, a process with a nice value of 101920 can get only 10 milliseconds on the CPU at a time, whereas a process 1020with a nice value of -20 can achieve a high enough priority to get 400 1021milliseconds. 1022 1023The idea of the nice value is deferential courtesy. 
In the beginning,
in the Unix garden of Eden, all processes shared equally in the bounty
of the computer system.  But not all processes really need the same
share of CPU time, so the nice value gave a courteous process the
ability to refuse its equal share of CPU time that others might prosper.
Hence, the higher a process' nice value, the nicer the process is.
(Then a snake came along and offered some process a negative nice value
and the system became the crass resource allocation system we know
today.)

Dynamic priorities tend upward and downward with an objective of
smoothing out allocation of CPU time and giving quick response time to
infrequent requests.  But they never exceed their nice limits, so on a
heavily loaded CPU, the nice value effectively determines how fast a
process runs.

In keeping with the socialistic heritage of Unix process priority, a
process begins life with the same nice value as its parent process and
can raise it at will.  A process can also raise the nice value of any
other process owned by the same user (or effective user).  But only a
privileged process can lower its nice value.  A privileged process can
also raise or lower another process' nice value.

@glibcadj{} functions for getting and setting nice values are described
in @ref{Traditional Scheduling Functions}.

@node Traditional Scheduling Functions
@subsubsection Functions For Traditional Scheduling

@pindex sys/resource.h
This section describes how you can read and set the nice value of a
process.  All these symbols are declared in @file{sys/resource.h}.

The function and macro names are defined by POSIX, and refer to
"priority," but the functions actually have to do with nice values, as
the terms are used both in the manual and POSIX.

The range of valid nice values depends on the kernel, but typically it
runs from @code{-20} to @code{20}.  A lower nice value corresponds to
higher priority for the process.  These constants describe the range of
priority values:

@vtable @code
@item PRIO_MIN
@standards{BSD, sys/resource.h}
The lowest valid nice value.

@item PRIO_MAX
@standards{BSD, sys/resource.h}
The highest valid nice value.
@end vtable

@deftypefun int getpriority (int @var{class}, int @var{id})
@standards{BSD, sys/resource.h}
@standards{POSIX, sys/resource.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
@c Direct syscall on UNIX.  On HURD, calls _hurd_priority_which_map.
Return the nice value of a set of processes; @var{class} and @var{id}
specify which ones (see below).  If the processes specified do not all
have the same nice value, this returns the lowest value that any of them
has.

On success, the return value is the nice value.  Otherwise, it is
@code{-1} and @code{errno} is set accordingly.  The @code{errno} values
specific to this function are:

@table @code
@item ESRCH
The combination of @var{class} and @var{id} does not match any existing
process.

@item EINVAL
The value of @var{class} is not valid.
@end table

If the return value is @code{-1}, it could indicate failure, or it could
be the nice value.  The only way to make certain is to set @code{errno =
0} before calling @code{getpriority}, then use @code{errno != 0}
afterward as the criterion for failure.
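
Here is a minimal sketch of that idiom: it reads the calling process'
own nice value and distinguishes a legitimate return value of
@code{-1} from an error.  The helper name @code{my_nice_value} and the
@code{INT_MIN} error marker are only for illustration.

@smallexample
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <sys/resource.h>

/* Return the nice value of the calling process,
   or INT_MIN if getpriority fails.  */
int
my_nice_value (void)
@{
  int value;

  errno = 0;
  value = getpriority (PRIO_PROCESS, 0);
  if (value == -1 && errno != 0)
    @{
      perror ("getpriority");
      return INT_MIN;
    @}
  return value;
@}
@end smallexample
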
1102@end deftypefun 1103 1104@deftypefun int setpriority (int @var{class}, int @var{id}, int @var{niceval}) 1105@standards{BSD, sys/resource.h} 1106@standards{POSIX, sys/resource.h} 1107@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} 1108@c Direct syscall on UNIX. On HURD, calls _hurd_priority_which_map. 1109Set the nice value of a set of processes to @var{niceval}; @var{class} 1110and @var{id} specify which ones (see below). 1111 1112The return value is @code{0} on success, and @code{-1} on 1113failure. The following @code{errno} error condition are possible for 1114this function: 1115 1116@table @code 1117@item ESRCH 1118The combination of @var{class} and @var{id} does not match any existing 1119process. 1120 1121@item EINVAL 1122The value of @var{class} is not valid. 1123 1124@item EPERM 1125The call would set the nice value of a process which is owned by a different 1126user than the calling process (i.e., the target process' real or effective 1127uid does not match the calling process' effective uid) and the calling 1128process does not have @code{CAP_SYS_NICE} permission. 1129 1130@item EACCES 1131The call would lower the process' nice value and the process does not have 1132@code{CAP_SYS_NICE} permission. 1133@end table 1134 1135@end deftypefun 1136 1137The arguments @var{class} and @var{id} together specify a set of 1138processes in which you are interested. These are the possible values of 1139@var{class}: 1140 1141@vtable @code 1142@item PRIO_PROCESS 1143@standards{BSD, sys/resource.h} 1144One particular process. The argument @var{id} is a process ID (pid). 1145 1146@item PRIO_PGRP 1147@standards{BSD, sys/resource.h} 1148All the processes in a particular process group. The argument @var{id} is 1149a process group ID (pgid). 1150 1151@item PRIO_USER 1152@standards{BSD, sys/resource.h} 1153All the processes owned by a particular user (i.e., whose real uid 1154indicates the user). The argument @var{id} is a user ID (uid). 1155@end vtable 1156 1157If the argument @var{id} is 0, it stands for the calling process, its 1158process group, or its owner (real uid), according to @var{class}. 1159 1160@deftypefun int nice (int @var{increment}) 1161@standards{BSD, unistd.h} 1162@safety{@prelim{}@mtunsafe{@mtasurace{:setpriority}}@asunsafe{}@acsafe{}} 1163@c Calls getpriority before and after setpriority, using the result of 1164@c the first call to compute the argument for setpriority. This creates 1165@c a window for a concurrent setpriority (or nice) call to be lost or 1166@c exhibit surprising behavior. 1167Increment the nice value of the calling process by @var{increment}. 1168The return value is the new nice value on success, and @code{-1} on 1169failure. In the case of failure, @code{errno} will be set to the 1170same values as for @code{setpriority}. 1171 1172 1173Here is an equivalent definition of @code{nice}: 1174 1175@smallexample 1176int 1177nice (int increment) 1178@{ 1179 int result, old = getpriority (PRIO_PROCESS, 0); 1180 result = setpriority (PRIO_PROCESS, 0, old + increment); 1181 if (result != -1) 1182 return old + increment; 1183 else 1184 return -1; 1185@} 1186@end smallexample 1187@end deftypefun 1188 1189 1190@node CPU Affinity 1191@subsection Limiting execution to certain CPUs 1192 1193On a multi-processor system the operating system usually distributes 1194the different processes which are runnable on all available CPUs in a 1195way which allows the system to work most efficiently. 
Which processes
and threads run can, to some extent, be controlled with the scheduling
functionality described in the previous sections.  But which CPU finally
executes which process or thread is not covered.

There are a number of reasons why a program might want to have control
over this aspect of the system as well:

@itemize @bullet
@item
One thread or process is responsible for absolutely critical work
which must under no circumstances be interrupted or hindered from
making progress by other processes or threads using CPU resources.  In
this case the special process would be confined to a CPU which no
other process or thread is allowed to use.

@item
The access to certain resources (RAM, I/O ports) has different costs
from different CPUs.  This is the case in NUMA (Non-Uniform Memory
Architecture) machines.  Preferably memory should be accessed locally
but this requirement is usually not visible to the scheduler.
Therefore forcing a process or thread onto the CPUs which have local
access to the most-used memory can significantly boost the
performance.

@item
In controlled runtimes, resource allocation and book-keeping work (for
instance garbage collection) is performed locally on each processor.
This can help to reduce locking costs if the resources do not have to
be protected from concurrent accesses from different processors.
@end itemize

The POSIX standard has so far not been of much help in solving this
problem.  The Linux kernel provides a set of interfaces to allow
specifying @emph{affinity sets} for a process.  The scheduler will
schedule the thread or process on CPUs specified by the affinity
masks.  The interfaces which @theglibc{} defines follow the Linux
kernel interface to some extent.

@deftp {Data Type} cpu_set_t
@standards{GNU, sched.h}
This data type is a bitset where each bit represents a CPU.  How the
system's CPUs are mapped to bits in the bitset is system dependent.
The data type has a fixed size; in the unlikely case that the number
of bits is not sufficient to describe the CPUs of the system a
different interface has to be used.

This type is a GNU extension and is defined in @file{sched.h}.
@end deftp

To manipulate the bitset, to set and reset bits, a number of macros are
defined.  Some of the macros take a CPU number as a parameter.  Here
it is important to never exceed the size of the bitset.  The following
macro specifies the number of bits in the @code{cpu_set_t} bitset.

@deftypevr Macro int CPU_SETSIZE
@standards{GNU, sched.h}
The value of this macro is the maximum number of CPUs which can be
handled with a @code{cpu_set_t} object.
@end deftypevr

The type @code{cpu_set_t} should be considered opaque; all
manipulation should happen via the next four macros.

@deftypefn Macro void CPU_ZERO (cpu_set_t *@var{set})
@standards{GNU, sched.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
@c CPU_ZERO ok
@c __CPU_ZERO_S ok
@c memset dup ok
This macro initializes the CPU set @var{set} to be the empty set.

This macro is a GNU extension and is defined in @file{sched.h}.
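
As a brief illustration of how this macro combines with the macros and
functions described below, here is a minimal sketch that restricts the
calling thread to CPU 0.  It assumes a Linux system and @theglibc{}'s
affinity interface; real code should first check which CPUs are
actually available.

@smallexample
#define _GNU_SOURCE 1   /* for the CPU_* macros and sched_setaffinity */
#include <sched.h>
#include <stdio.h>

/* Confine the calling thread to CPU 0.  */
int
pin_to_cpu0 (void)
@{
  cpu_set_t set;

  CPU_ZERO (&set);      /* start with the empty set */
  CPU_SET (0, &set);    /* add CPU 0 (see CPU_SET below) */

  if (sched_setaffinity (0, sizeof (cpu_set_t), &set) != 0)
    @{
      perror ("sched_setaffinity");
      return -1;
    @}
  return 0;
@}
@end smallexample
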
1268@end deftypefn 1269 1270@deftypefn Macro void CPU_SET (int @var{cpu}, cpu_set_t *@var{set}) 1271@standards{GNU, sched.h} 1272@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} 1273@c CPU_SET ok 1274@c __CPU_SET_S ok 1275@c __CPUELT ok 1276@c __CPUMASK ok 1277This macro adds @var{cpu} to the CPU set @var{set}. 1278 1279The @var{cpu} parameter must not have side effects since it is 1280evaluated more than once. 1281 1282This macro is a GNU extension and is defined in @file{sched.h}. 1283@end deftypefn 1284 1285@deftypefn Macro void CPU_CLR (int @var{cpu}, cpu_set_t *@var{set}) 1286@standards{GNU, sched.h} 1287@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} 1288@c CPU_CLR ok 1289@c __CPU_CLR_S ok 1290@c __CPUELT dup ok 1291@c __CPUMASK dup ok 1292This macro removes @var{cpu} from the CPU set @var{set}. 1293 1294The @var{cpu} parameter must not have side effects since it is 1295evaluated more than once. 1296 1297This macro is a GNU extension and is defined in @file{sched.h}. 1298@end deftypefn 1299 1300@deftypefn Macro int CPU_ISSET (int @var{cpu}, const cpu_set_t *@var{set}) 1301@standards{GNU, sched.h} 1302@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} 1303@c CPU_ISSET ok 1304@c __CPU_ISSET_S ok 1305@c __CPUELT dup ok 1306@c __CPUMASK dup ok 1307This macro returns a nonzero value (true) if @var{cpu} is a member 1308of the CPU set @var{set}, and zero (false) otherwise. 1309 1310The @var{cpu} parameter must not have side effects since it is 1311evaluated more than once. 1312 1313This macro is a GNU extension and is defined in @file{sched.h}. 1314@end deftypefn 1315 1316 1317CPU bitsets can be constructed from scratch or the currently installed 1318affinity mask can be retrieved from the system. 1319 1320@deftypefun int sched_getaffinity (pid_t @var{pid}, size_t @var{cpusetsize}, cpu_set_t *@var{cpuset}) 1321@standards{GNU, sched.h} 1322@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} 1323@c Wrapped syscall to zero out past the kernel cpu set size; Linux 1324@c only. 1325 1326This function stores the CPU affinity mask for the process or thread 1327with the ID @var{pid} in the @var{cpusetsize} bytes long bitmap 1328pointed to by @var{cpuset}. If successful, the function always 1329initializes all bits in the @code{cpu_set_t} object and returns zero. 1330 1331If @var{pid} does not correspond to a process or thread on the system 1332the or the function fails for some other reason, it returns @code{-1} 1333and @code{errno} is set to represent the error condition. 1334 1335@table @code 1336@item ESRCH 1337No process or thread with the given ID found. 1338 1339@item EFAULT 1340The pointer @var{cpuset} does not point to a valid object. 1341@end table 1342 1343This function is a GNU extension and is declared in @file{sched.h}. 1344@end deftypefun 1345 1346Note that it is not portably possible to use this information to 1347retrieve the information for different POSIX threads. A separate 1348interface must be provided for that. 1349 1350@deftypefun int sched_setaffinity (pid_t @var{pid}, size_t @var{cpusetsize}, const cpu_set_t *@var{cpuset}) 1351@standards{GNU, sched.h} 1352@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} 1353@c Wrapped syscall to detect attempts to set bits past the kernel cpu 1354@c set size; Linux only. 1355 1356This function installs the @var{cpusetsize} bytes long affinity mask 1357pointed to by @var{cpuset} for the process or thread with the ID @var{pid}. 1358If successful the function returns zero and the scheduler will in the future 1359take the affinity information into account. 
1360 1361If the function fails it will return @code{-1} and @code{errno} is set 1362to the error code: 1363 1364@table @code 1365@item ESRCH 1366No process or thread with the given ID found. 1367 1368@item EFAULT 1369The pointer @var{cpuset} does not point to a valid object. 1370 1371@item EINVAL 1372The bitset is not valid. This might mean that the affinity set might 1373not leave a processor for the process or thread to run on. 1374@end table 1375 1376This function is a GNU extension and is declared in @file{sched.h}. 1377@end deftypefun 1378 1379@deftypefun int getcpu (unsigned int *cpu, unsigned int *node) 1380@standards{Linux, <sched.h>} 1381@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} 1382The @code{getcpu} function identifies the processor and node on which 1383the calling thread or process is currently running and writes them into 1384the integers pointed to by the @var{cpu} and @var{node} arguments. The 1385processor is a unique nonnegative integer identifying a CPU. The node 1386is a unique nonnegative integer identifying a NUMA node. When either 1387@var{cpu} or @var{node} is @code{NULL}, nothing is written to the 1388respective pointer. 1389 1390The return value is @code{0} on success and @code{-1} on failure. The 1391following @code{errno} error condition is defined for this function: 1392 1393@table @code 1394@item ENOSYS 1395The operating system does not support this function. 1396@end table 1397 1398This function is Linux-specific and is declared in @file{sched.h}. 1399@end deftypefun 1400 1401@node Memory Resources 1402@section Querying memory available resources 1403 1404The amount of memory available in the system and the way it is organized 1405determines oftentimes the way programs can and have to work. For 1406functions like @code{mmap} it is necessary to know about the size of 1407individual memory pages and knowing how much memory is available enables 1408a program to select appropriate sizes for, say, caches. Before we get 1409into these details a few words about memory subsystems in traditional 1410Unix systems will be given. 1411 1412@menu 1413* Memory Subsystem:: Overview about traditional Unix memory handling. 1414* Query Memory Parameters:: How to get information about the memory 1415 subsystem? 1416@end menu 1417 1418@node Memory Subsystem 1419@subsection Overview about traditional Unix memory handling 1420 1421@cindex address space 1422@cindex physical memory 1423@cindex physical address 1424Unix systems normally provide processes virtual address spaces. This 1425means that the addresses of the memory regions do not have to correspond 1426directly to the addresses of the actual physical memory which stores the 1427data. An extra level of indirection is introduced which translates 1428virtual addresses into physical addresses. This is normally done by the 1429hardware of the processor. 1430 1431@cindex shared memory 1432Using a virtual address space has several advantages. The most important 1433is process isolation. The different processes running on the system 1434cannot interfere directly with each other. No process can write into 1435the address space of another process (except when shared memory is used 1436but then it is wanted and controlled). 1437 1438Another advantage of virtual memory is that the address space the 1439processes see can actually be larger than the physical memory available. 1440The physical memory can be extended by storage on an external media 1441where the content of currently unused memory regions is stored. 
@node Memory Resources
@section Querying memory available resources

The amount of memory available in the system and the way it is organized
often determine the way programs can and must work.  For functions like
@code{mmap} it is necessary to know the size of individual memory pages,
and knowing how much memory is available enables a program to select
appropriate sizes for, say, caches.  Before we get into these details a
few words about the memory subsystem of traditional Unix systems are in
order.

@menu
* Memory Subsystem::           Overview about traditional Unix memory handling.
* Query Memory Parameters::    How to get information about the memory
                                 subsystem?
@end menu

@node Memory Subsystem
@subsection Overview about traditional Unix memory handling

@cindex address space
@cindex physical memory
@cindex physical address
Unix systems normally provide processes with virtual address spaces.
This means that the addresses of the memory regions do not have to
correspond directly to the addresses of the actual physical memory which
stores the data.  An extra level of indirection is introduced which
translates virtual addresses into physical addresses.  This is normally
done by the hardware of the processor.

@cindex shared memory
Using a virtual address space has several advantages.  The most
important is process isolation.  The different processes running on the
system cannot interfere directly with each other.  No process can write
into the address space of another process (except when shared memory is
used, but then it is wanted and controlled).

Another advantage of virtual memory is that the address space the
processes see can actually be larger than the physical memory available.
The physical memory can be extended by storage on an external medium
where the content of currently unused memory regions is stored.  The
address translation can then intercept accesses to these memory regions
and make the memory content available again by loading the data back
into memory.  This concept makes it necessary for programs which have to
use lots of memory to know the difference between available virtual
address space and available physical memory.  If the working set of
virtual memory of all the processes is larger than the available
physical memory the system will slow down dramatically due to constant
swapping of memory content between memory and the storage medium.  This
is called ``thrashing''.
@cindex thrashing

@cindex memory page
@cindex page, memory
A final aspect of virtual memory which is important and follows from
what is said in the last paragraph is the granularity of the virtual
address space handling.  When memory content is stored externally it
cannot be handled on a byte-by-byte basis; the administrative overhead
would not allow it (quite apart from the limitations of the processor
hardware).  Instead, several thousand bytes are handled together and
form a @dfn{page}.  The size of each page is always a power of two
bytes.  The smallest page size in use today is 4096 bytes, with 8192,
16384, and 65536 bytes being other popular sizes.

@node Query Memory Parameters
@subsection How to get information about the memory subsystem?

The page size of the virtual memory the process sees is essential to
know in several situations.  Some programming interfaces (e.g.,
@code{mmap}, @pxref{Memory-mapped I/O}) require the user to provide
information adjusted to the page size.  In the case of @code{mmap} it is
necessary to provide a length argument which is a multiple of the page
size.  Another place where knowledge of the page size is useful is in
memory allocation.  If one allocates pieces of memory in larger chunks
which are then subdivided by the application code, it is useful to
adjust the size of the larger blocks to the page size.  If the total
memory requirement for the block is close to (but not larger than) a
multiple of the page size, the kernel's memory handling can work more
effectively since it only has to allocate memory pages which are fully
used.  (To do this optimization it is necessary to know a bit about the
memory allocator, which will require a bit of memory itself for each
block, and this overhead must not push the total size over the page size
multiple.)

The page size was traditionally a compile-time constant, but recent
processor developments have changed this.  Processors now support
different page sizes, and the page size can possibly even vary among
different processes on the same system.  Therefore the system should be
queried at runtime about the current page size and no assumptions
(except about it being a power of two) should be made.

@vindex _SC_PAGESIZE
The correct interface to query the page size is @code{sysconf}
(@pxref{Sysconf Definition}) with the parameter @code{_SC_PAGESIZE}.
There is a much older interface available, too.

@deftypefun int getpagesize (void)
@standards{BSD, unistd.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
@c Obtained from the aux vec at program startup time.  GNU/Linux/m68k is
@c the exception, with the possibility of a syscall.
The @code{getpagesize} function returns the page size of the process.
This value is fixed for the runtime of the process but can vary in
different runs of the application.

The function is declared in @file{unistd.h}.
@end deftypefun
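
Returning to the allocation example above, a sketch of how the run-time
page size is typically used is shown below.  The helper name
@code{round_up_to_page} is only illustrative; it rounds a requested
length up to the next page multiple, as required for the length argument
of interfaces such as @code{mmap}.

@smallexample
#include <stddef.h>
#include <unistd.h>

/* Round LEN up to the next multiple of the page size.  Since the
   page size is always a power of two, masking can be used instead
   of a division.  */
static size_t
round_up_to_page (size_t len)
@{
  size_t pagesize = sysconf (_SC_PAGESIZE);
  return (len + pagesize - 1) & ~(pagesize - 1);
@}
@end smallexample

With a page size of 4096 bytes a request for 10000 bytes would, for
example, be rounded up to 12288 bytes, i.e., three full pages.
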
Widely available on @w{System V} derived systems is a method to get
information about the physical memory the system has.  The call

@vindex _SC_PHYS_PAGES
@cindex sysconf
@smallexample
  sysconf (_SC_PHYS_PAGES)
@end smallexample

@noindent
returns the total number of pages of physical memory the system has.
This does not mean all this memory is available.  The number of
currently available pages can be found using

@vindex _SC_AVPHYS_PAGES
@cindex sysconf
@smallexample
  sysconf (_SC_AVPHYS_PAGES)
@end smallexample

These two values help to optimize applications.  The value returned for
@code{_SC_AVPHYS_PAGES} is the amount of memory the application can use
without hindering any other process (given that no other process
increases its memory usage).  The value returned for
@code{_SC_PHYS_PAGES} is more or less a hard limit for the working set.
If all applications together constantly use more than that amount of
memory the system is in trouble.

In addition to the @code{sysconf} calls already described, @theglibc{}
provides two functions to get this information.  They are declared in
the file @file{sys/sysinfo.h}.  Programmers should prefer to use the
@code{sysconf} method described above.

@deftypefun {long int} get_phys_pages (void)
@standards{GNU, sys/sysinfo.h}
@safety{@prelim{}@mtsafe{}@asunsafe{@ascuheap{} @asulock{}}@acunsafe{@aculock{} @acsfd{} @acsmem{}}}
@c This fopens a /proc file and scans it for the requested information.
The @code{get_phys_pages} function returns the total number of pages of
physical memory the system has.  To get the amount of memory in bytes
this number has to be multiplied by the page size.

This function is a GNU extension.
@end deftypefun

@deftypefun {long int} get_avphys_pages (void)
@standards{GNU, sys/sysinfo.h}
@safety{@prelim{}@mtsafe{}@asunsafe{@ascuheap{} @asulock{}}@acunsafe{@aculock{} @acsfd{} @acsmem{}}}
The @code{get_avphys_pages} function returns the number of available
pages of physical memory the system has.  To get the amount of memory in
bytes this number has to be multiplied by the page size.

This function is a GNU extension.
@end deftypefun
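
The following sketch combines these calls to report the total and the
currently available physical memory in bytes.  It assumes the
@code{sysconf} parameters are supported; on systems where they are not,
@code{sysconf} returns @code{-1}.

@smallexample
#include <stdio.h>
#include <unistd.h>

int
main (void)
@{
  long pagesize = sysconf (_SC_PAGESIZE);
  long phys = sysconf (_SC_PHYS_PAGES);
  long avphys = sysconf (_SC_AVPHYS_PAGES);

  if (pagesize < 0 || phys < 0 || avphys < 0)
    @{
      fputs ("memory information not available\n", stderr);
      return 1;
    @}

  /* Multiply page counts by the page size to get bytes; use a wide
     type so the product cannot overflow on 32-bit systems.  */
  printf ("total physical memory:      %lld bytes\n",
          (long long) phys * pagesize);
  printf ("currently available memory: %lld bytes\n",
          (long long) avphys * pagesize);
  return 0;
@}
@end smallexample
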
@node Processor Resources
@section Learn about the processors available

The use of threads or processes with shared memory allows an application
to take advantage of all the processing power a system can provide.  If
the task can be parallelized, the optimal way to write an application is
to have at any time as many processes running as there are processors.
To determine the number of processors available to the system one can
run

@vindex _SC_NPROCESSORS_CONF
@cindex sysconf
@smallexample
  sysconf (_SC_NPROCESSORS_CONF)
@end smallexample

@noindent
which returns the number of processors the operating system configured.
But it might be possible for the operating system to disable individual
processors and so the call

@vindex _SC_NPROCESSORS_ONLN
@cindex sysconf
@smallexample
  sysconf (_SC_NPROCESSORS_ONLN)
@end smallexample

@noindent
returns the number of processors which are currently online (i.e.,
available).

For these two pieces of information @theglibc{} also provides functions
to get the information directly.  The functions are declared in
@file{sys/sysinfo.h}.

@deftypefun int get_nprocs_conf (void)
@standards{GNU, sys/sysinfo.h}
@safety{@prelim{}@mtsafe{}@asunsafe{@ascuheap{} @asulock{}}@acunsafe{@aculock{} @acsfd{} @acsmem{}}}
@c This function reads from /sys using dir streams (single user, so
@c no @mtasurace issue), and on some arches, from /proc using streams.
The @code{get_nprocs_conf} function returns the number of processors the
operating system configured.

This function is a GNU extension.
@end deftypefun

@deftypefun int get_nprocs (void)
@standards{GNU, sys/sysinfo.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{@acsfd{}}}
@c This function reads from /proc using file descriptor I/O.
The @code{get_nprocs} function returns the number of available processors.

This function is a GNU extension.
@end deftypefun

@cindex load average
Before starting more threads an application should check whether the
processors are already overloaded.  Unix systems calculate something
called the @dfn{load average}.  This is a number indicating how many
processes were running, averaged over different periods of time
(normally 1, 5, and 15 minutes).

@deftypefun int getloadavg (double @var{loadavg}[], int @var{nelem})
@standards{BSD, stdlib.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{@acsfd{}}}
@c Calls host_info on HURD; on Linux, opens /proc/loadavg, reads from
@c it, closes it, without cancellation point, and calls strtod_l with
@c the C locale to convert the strings to doubles.
This function gets the 1, 5 and 15 minute load averages of the
system.  The values are placed in @var{loadavg}.  @code{getloadavg} will
place at most @var{nelem} elements into the array, but never more than
three elements.  The return value is the number of elements written to
@var{loadavg}, or @code{-1} on error.

This function is declared in @file{stdlib.h}.
@end deftypefun
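
The sketch below shows one possible way to combine this information: it
starts from the number of online processors and reduces the number of
worker threads to create when the one-minute load average indicates that
the processors are already busy.  The policy shown here is only an
example; real applications will want to apply their own heuristics.

@smallexample
#include <stdio.h>
#include <stdlib.h>
#include <sys/sysinfo.h>

int
main (void)
@{
  double load[1];
  int nprocs = get_nprocs ();
  int nthreads = nprocs;

  /* If the one-minute load average is already close to the number
     of online processors, start fewer additional threads.  */
  if (getloadavg (load, 1) == 1)
    @{
      int busy = (int) load[0];
      if (busy < nprocs)
        nthreads = nprocs - busy;
      else
        nthreads = 1;
    @}

  printf ("online processors: %d, worker threads to start: %d\n",
          nprocs, nthreads);
  return 0;
@}
@end smallexample
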