Started Oct 1999 by Kanoj Sarcar <kanoj@sgi.com>

The intent of this file is to have an up-to-date, running commentary
from different people about how locking and synchronization are done
in the Linux vm code.

page_table_lock & mmap_sem
--------------------------------------

Page stealers pick processes out of the process pool and scan for
the best process to steal pages from. To guarantee the existence
of the victim mm, an mm_count increment and an mmdrop() are done in
swap_out(). Page stealers hold the kernel_lock to protect against a
number of races. The vma list of the victim mm is also scanned by the
stealer, and the page_table_lock is used to preserve list sanity
against the process adding to or deleting from the list. This also
guarantees existence of the vma. Vma existence is not guaranteed once
try_to_swap_out() drops the page_table_lock. To guarantee the existence
of the underlying file structure, a get_file is done before the
swapout() method is invoked. The page passed into swapout() is
guaranteed not to be reused for a different purpose, because the page
reference count due to its presence in the user's pte is not released
till after swapout() returns.

Any code that modifies the vmlist, or the vm_start/vm_end/
vm_flags:VM_LOCKED/vm_next of any vma *in the list*, must prevent
kswapd from looking at the chain.

The rules are:
1. To scan the vmlist (look but don't touch) you must hold the
   mmap_sem with read bias, i.e. down_read(&mm->mmap_sem).
2. To modify the vmlist you need to hold the mmap_sem with
   read&write bias, i.e. down_write(&mm->mmap_sem) *AND*
   you need to take the page_table_lock.
3. The swapper takes _just_ the page_table_lock; this is done
   because the mmap_sem can be an extremely long lived lock
   and the swapper just cannot sleep on that.
4. The exception to this rule is expand_stack, which just
   takes the read lock and the page_table_lock; this is ok
   because it doesn't really modify fields anybody relies on.
5. You must be able to guarantee that while holding the mmap_sem
   or the page_table_lock of mm A, you will not try to get either
   lock for mm B.

A short code sketch illustrating rules 1 and 2 appears at the end of
this section.

The caveats are:
1. find_vma() makes use of, and updates, the mmap_cache pointer hint.
The update of mmap_cache is racy (the page stealer can race with other
code that invokes find_vma with the mmap_sem held), but that is okay,
since it is only a hint. This can be fixed, if desired, by having
find_vma grab the page_table_lock.

Code paths that add/delete elements from the vmlist chain:
1. callers of insert_vm_struct
2. callers of merge_segments
3. callers of avl_remove

Code paths that change vm_start/vm_end/vm_flags:VM_LOCKED of vmas on
the list:
1. expand_stack
2. mprotect
3. mlock
4. mremap

It is advisable that changes to vm_start/vm_end be protected, although
in some cases it is not really needed. E.g., vm_start is modified by
expand_stack(), and it is hard to come up with a destructive scenario
in this case even without the vmlist protection.

The page_table_lock nests with the inode i_shared_lock and the kmem
cache c_spinlock spinlocks. This is okay, since code that holds
i_shared_lock never asks for memory, and the kmem code asks for pages
after dropping c_spinlock. The page_table_lock also nests with the
pagecache_lock and pagemap_lru_lock spinlocks, and no code asks for
memory with these locks held.

The page_table_lock is grabbed while holding the kernel_lock spinning
monitor.
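To make rules 1 and 2 above concrete, here is a minimal sketch of a
vmlist scan versus a vmlist modification, assuming the
down_read()/down_write() primitives and the mm->page_table_lock spin
lock named in this document. The helper names walk_vmas() and
grow_vma(), and the exact field layout used (mm->mmap, vma->vm_next),
are illustrative assumptions rather than a verbatim copy of any kernel
code path.

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/spinlock.h>

/* Rule 1: look but don't touch -- hold the mmap_sem with read bias. */
static void walk_vmas(struct mm_struct *mm)
{
	struct vm_area_struct *vma;

	down_read(&mm->mmap_sem);
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		/* inspect vma->vm_start, vma->vm_end, vma->vm_flags ... */
	}
	up_read(&mm->mmap_sem);
}

/*
 * Rule 2: changing a vma on the list needs the write-held mmap_sem
 * *and* the page_table_lock, so that the swapper (which takes only
 * the page_table_lock, rule 3) always sees a sane chain.
 */
static void grow_vma(struct mm_struct *mm, struct vm_area_struct *vma,
		     unsigned long new_end)
{
	down_write(&mm->mmap_sem);
	spin_lock(&mm->page_table_lock);
	vma->vm_end = new_end;		/* or add/delete a vma here */
	spin_unlock(&mm->page_table_lock);
	up_write(&mm->mmap_sem);
}

The reason the modify side needs both locks is rule 3: the swapper
walks the chain under the page_table_lock alone, so that lock is what
serializes it against modifiers, while the write-held mmap_sem keeps
the rule-1 scanners away.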
The page_table_lock is a spin lock.

swap_list_lock/swap_device_lock
-------------------------------
The swap devices are chained in priority order from the "swap_list"
header. The "swap_list" is used for the round-robin swaphandle
allocation strategy. The number of free swaphandles is maintained in
"nr_swap_pages". These two together are protected by the
swap_list_lock.

The swap_device_lock, which is per swap device, protects the reference
counts on the corresponding swaphandles, maintained in the "swap_map"
array, and the "highest_bit" and "lowest_bit" fields.

Both of these are spinlocks, and are never acquired from interrupt
level. The locking hierarchy is swap_list_lock -> swap_device_lock.
(A sketch of this ordering appears at the end of this file.)

To prevent races between swap space deletion or async readahead swapins
deciding whether a swap handle is being used, i.e. worthy of being read
in from disk, and an unmap -> swap_free making the handle unused, the
swap delete and readahead code grabs a temporary reference on the
swaphandle to suppress warning messages from swap_duplicate(), which is
called from read_swap_cache_async().

Swap cache locking
------------------
Pages are added into the swap cache with the kernel_lock held, to make
sure that multiple pages are not added (and hence lost) by associating
all of them with the same swaphandle.

Pages are guaranteed not to be removed from the swap cache ("scache")
if the page is "shared": i.e., other processes hold a reference on the
page or the associated swap handle. The only code that does not follow
this rule is shrink_mmap, which deletes pages from the swap cache if no
process has a reference on the page (multiple processes might have
references on the corresponding swap handle though). lookup_swap_cache()
races with shrink_mmap when establishing a reference on a scache page,
so it must check whether the page it located is still in the swapcache,
or whether shrink_mmap deleted it. (This race exists because shrink_mmap
looks at the page reference count with the pagecache_lock held, but then
drops the pagecache_lock before deleting the page from the scache.)

do_wp_page and do_swap_page have MP races in them while trying to
figure out whether a page is "shared" by looking at page_count +
swap_count. To preserve the sum of the counts, the page lock _must_ be
acquired before calling is_page_shared (else processes might switch
their swap_count refs to page_count refs after the page_count ref has
been snapshotted).

Swap device deletion code currently breaks all the scache assumptions,
since it grabs neither the mmap_sem nor the page_table_lock.
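Finally, here is a minimal, self-contained sketch of the
swap_list_lock -> swap_device_lock ordering described above. The
struct layout, field names (struct swap_device, sdev_lock) and the
helper scan_swap_device() are illustrative assumptions, not the
kernel's actual swap_info_struct; only the lock ordering and what each
lock protects follow the text above.

#include <linux/spinlock.h>

/* Stand-ins for the locks and fields described above (names assumed). */
static spinlock_t swap_list_lock = SPIN_LOCK_UNLOCKED;

struct swap_device {
	spinlock_t	sdev_lock;	/* the per-device swap_device_lock */
	unsigned short	*swap_map;	/* swaphandle reference counts */
	unsigned int	lowest_bit, highest_bit;
};

/*
 * Correct ordering: the global swap_list_lock first, then the device's
 * swap_device_lock, released in reverse.  Neither lock is ever taken
 * from interrupt level, so plain spin_lock/spin_unlock suffice.
 */
static void scan_swap_device(struct swap_device *dev)
{
	spin_lock(&swap_list_lock);	/* guards swap_list and nr_swap_pages */
	spin_lock(&dev->sdev_lock);	/* guards swap_map, lowest_bit, highest_bit */

	/* ... inspect or update swaphandle reference counts here ... */

	spin_unlock(&dev->sdev_lock);
	spin_unlock(&swap_list_lock);
}

Taking the two locks in the opposite order anywhere would open an
AB-BA deadlock, which is why the hierarchy above is fixed.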