The paging design used on the x86-64 Linux kernel port in 2.4.x provides:

o per-process virtual address space limit of 512 Gigabytes
o top of the userspace stack located at address 0x0000007fffffffff
o start of the kernel mapping = 0x0000010000000000
o global RAM per system 508*512GB = 254 Terabytes
o no need of any common code change
o 512GB of vmalloc/ioremap space

Description:
   x86-64 has a 4-level page structure, similar to ia32 PSE but with
   some extensions. Each level consists of a 4K page with 512 64-bit
   entries. The levels are named in Linux PML4, PGD, PMD, PTE; AMD calls
   them PML4E, PDPE, PDE, PTE respectively. For the direct and kernel
   mappings only 3 levels are used, with the PMD pointing to 2MB pages.

   Userspace is able to modify, and sees, only the 3rd/2nd/1st level
   pagetables (pgd_offset() implicitly walks the 1st slot of the 4th
   level pagetable and returns an entry into the 3rd level pagetable).
   This is where the per-process 512 Gigabyte limit comes from.

   The common code pgd is the PDPE, the pmd is the PDE, the
   pte is the PTE. The PML4 remains invisible to the common
   code.

   Since the per-process limit is 512 Gigabytes (due to the kernel
   common code's 3-level pagetable limitation), the highest virtual
   address mapped into userspace is 0x7fffffffff, and it makes sense to
   use it as the top of the userspace stack to allow the stack to grow
   as much as possible.

   The kernel mapping and the direct memory mapping are split. The
   direct memory mapping starts after userspace, separated by a 512GB
   gap, while the kernel mapping is at the end of the (negative) virtual
   address space, to exploit the kernel code model. There is no support
   for discontig memory; this implies that the kernel
   mapping/vmalloc/ioremap/module mappings are not represented at their
   "real" addresses in mem_map, but only via their direct mapped (but
   normally not used) aliases.

Future:

   During 2.5.x we can break the 512 Gigabyte per-process limit,
   possibly by removing from the common code any knowledge about the
   architecture-dependent physical layout of the virtual-to-physical
   mapping.

   Once the 512 Gigabyte limit is removed, the kernel stack will be
   moved (most probably to virtual address 0x00007fffffffffff).
   Nothing will break in userspace due to that move, just as nothing
   breaks on IA32 when compiling the kernel with CONFIG_2G.

Linus agreed on not breaking common code and on living with the 512
Gigabyte per-process limitation for the 2.4.x timeframe, and he has
given me and Andi some very useful hints... (thanks! :)

Thanks also to H. Peter Anvin for his interesting and useful suggestions
on the x86-64-discuss lists!

Current PML4 Layout:
   Each CPU has a PML4 page that never changes.
   Each slot is 512GB of virtual memory.

   0    user space pgd, or the 40MB low mapping at bootup. Changed at
        context switch.
   1    unmapped
   2    __PAGE_OFFSET - start of the direct mapping of physical memory
   ...  direct mapping in further slots as needed
   509  some IO mappings (others are in a memory hole below 4GB)
   510  vmalloc and ioremap space
   511  kernel code mapping, fixmaps and modules

Other memory management related issues follow:

PAGE_SIZE:

   If somebody is wondering why these days we still have such a small
   4k pagesize (16 or 32 kbytes would of course be much better for
   performance): PAGE_SIZE has to remain 4k for 32-bit apps, to provide
   a 100% backwards compatible IA32 API (we can't allow silent fs
   corruption, or at best a loss of coherency with the page cache, by
   allocating MAP_SHARED areas in MAP_ANONYMOUS memory with a
   do_mmap_fake).
I think it could be possible to have a dynamic page
   size between 32-bit and 64-bit apps, but it would need extremely
   intrusive changes in the common code, first of all in the page
   cache, and we surely don't want to depend on those right now, even
   if the hardware supported it.

PAGETABLE SIZE:

   In turn we can't afford to have pagetables larger than 4k, because
   we might not be able to allocate them due to physical memory
   fragmentation, and failing to allocate the kernel stack is a minor
   issue compared to failing the allocation of a pagetable. If we fail
   the allocation of a pagetable, the only thing we can do is to
   sched_yield polling the freelist (deadlock prone) or to segfault the
   task (not even the sighandler would be sure to run).

KERNEL STACK:

 1st stage:

   The kernel stack will at first be allocated with an order 2
   allocation (16k) (the utilization of the stack on a 64-bit platform
   really isn't exactly double that of a 32-bit platform, because the
   local variables may not all be 64 bits wide, but it is not much
   less). This will make things even worse than they are right now on
   IA32 with respect to failing fork/clone due to memory fragmentation.

 2nd stage:

   We'll benchmark whether reserving one register as the task_struct
   pointer will improve the performance of the kernel (instead of
   recalculating the task_struct pointer from the stack pointer each
   time). My guess is that recalculating will be faster, but it's
   worth a try.

   If reserving one register for the task_struct pointer turns out to
   be faster, we can as well split task_struct and the kernel stack.
   task_struct can be a slab allocation or a PAGE_SIZEd allocation,
   and the kernel stack can then be an order 1 allocation. Really this
   is risky, since 8k on a 64-bit platform is going to be less than 7k
   on a 32-bit platform, but we could try it out.
This would reduce the fragmentation problem by an order of
   magnitude, making it equal to the current IA32 situation.

   We must also consider that x86-64 seems to provide in hardware a
   per-irq stack, which could allow us to remove the irq handler
   footprint from the regular per-process stack, and so could allow us
   to live with a smaller kernel stack than the other Linux
   architectures.

 3rd stage:

   Before going into production, if we still have the order 2
   allocation we can add a sysctl that allows the kernel stack to be
   allocated with vmalloc during memory fragmentation. This has to
   remain turned off during benchmarks :) but it should be ok in real
   life.

Order of PAGE_CACHE_SIZE and other allocations:

   In the long run we can increase PAGE_CACHE_SIZE to an order 2
   allocation, and the slab/buffercache etc. could also all be done
   with order 2 allocations. To make the above work we would have to
   change lots of common code, so it can be done only once the basic
   port is in a production state. A working larger PAGE_CACHE_SIZE
   would of course be a benefit for IA32 and the other architectures
   too.

vmalloc:
   vmalloc should be outside the first 512GB, to keep that space free
   for user space. It needs its own pgd to work on in common code. It
   currently gets its own pgd in the 510th slot of the per-CPU PML4.

PML4:
   Each CPU has its own PML4 (= the top level of the 4-level page
   hierarchy). On context switch the first slot is rewritten to the
   pgd of the new process and CR3 is flushed.

Modules:
   Modules need to be in the same 4GB range as the core kernel;
   otherwise a GOT would be needed. Modules are currently at
   0xffffffffa0000000 to 0xffffffffafffffff. This is in between the
   kernel text and the vsyscall/fixmap mappings.

Vsyscalls:
   Vsyscalls have a reserved space near the end of the virtual address
   space that is accessible by user space. This address is part of the
   ABI and cannot be changed. They occupy ffffffffff600000 to
   ffffffffffe00000 (but currently only some small space at the
   beginning is allocated and known to user space). See vsyscall.c for
   more details.

Fixmaps:
   Fixed mappings set up at boot, used to access the IO-APIC and some
   other hardware. These run downwards from the end of the vsyscall
   space (ffffffffffe00000), but are of course not accessible by user
   space.

Early mapping:
   On a 120TB memory system bootmem could use up to 3.5GB of memory
   for its bootmem bitmap. To avoid having to map 3.5GB by hand for
   bootmem's purposes, the full direct mapping is created before
   bootmem is initialized. The direct mapping needs some memory for
   its page tables; these are taken directly from the physical memory
   after the kernel. To access these pages they need to be mapped,
   which is done by a temporary mapping using a few spare static 2MB
   PMD entries.

Unsolved issues:
   2MB pages for user space - may need to add a highmem zone for that
   again to avoid fragmentation.

Andrea <andrea@suse.de> SuSE
Andi Kleen <ak@suse.de> SuSE

$Id$