https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt
https://www.kernel.org/doc/html/latest/x86/x86_64/5level-paging.html
1. Virtual memory mapping with 4-level page tables
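The original x86-64 architecture was limited by 4-level paging to 256 TiB of virtual address space and 64 TiB of physical address space. We are already running into this limit: some vendors offer machines with 64 TiB of memory.
To overcome this limitation, upcoming hardware introduces support for 5-level paging. It is a straightforward extension of the current page table structure that adds one more layer of translation.
It raises the limits to 128 PiB of virtual address space and 4 PiB of physical address space.
QEMU 2.9 and later support 5-level paging.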
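The limits quoted above follow from simple bit arithmetic. The sketch below assumes 4 KiB pages (12 offset bits) and 9 index bits per page table level; the physical widths of 46 and 52 bits are inferred from the 64 TiB and 4 PiB figures in the text, not taken from a hardware manual. The helper names are illustrative only.

```python
PAGE_SHIFT = 12       # assumption: 4 KiB pages, so 12 bits of page offset
BITS_PER_LEVEL = 9    # assumption: 512 entries per page table


def virtual_address_bits(levels: int) -> int:
    """Number of virtual address bits translated by `levels` page table levels."""
    return PAGE_SHIFT + BITS_PER_LEVEL * levels


def format_size(num_bytes: int) -> str:
    """Render a byte count with the largest binary unit that divides it evenly."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]
    index = 0
    while num_bytes >= 1024 and num_bytes % 1024 == 0 and index < len(units) - 1:
        num_bytes //= 1024
        index += 1
    return f"{num_bytes} {units[index]}"


for levels in (4, 5):
    bits = virtual_address_bits(levels)
    print(f"{levels}-level paging: {bits}-bit virtual addresses, {format_size(1 << bits)}")

# Physical limits quoted in the text, expressed as powers of two.
print(format_size(1 << 46))  # 64 TiB
print(format_size(1 << 52))  # 4 PiB
```

Each extra level multiplies the virtual address space by 512, which is why one more level moves the limit from 256 TiB to 128 PiB.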
========================================================================================================================
Start addr | Offset | End addr | Size | VM area description
========================================================================================================================
| | | |
0000000000000000 | 0 | 00007fffffffffff | 128 TB | user-space virtual memory, different per mm
__________________|____________|__________________|_________|___________________________________________________________
| | | |
0000800000000000 | +128 TB | ffff7fffffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical
| | | | virtual memory addresses up to the -128 TB
| | | | starting offset of kernel mappings.
__________________|____________|__________________|_________|___________________________________________________________
|
| Kernel-space virtual memory, shared between all processes:
____________________________________________________________|___________________________________________________________
| | | |
ffff800000000000 | -128 TB | ffff87ffffffffff | 8 TB | ... guard hole, also reserved for hypervisor
ffff880000000000 | -120 TB | ffff887fffffffff | 0.5 TB | LDT remap for PTI
ffff888000000000 | -119.5 TB | ffffc87fffffffff | 64 TB | direct mapping of all physical memory (page_offset_base)
ffffc88000000000 | -55.5 TB | ffffc8ffffffffff | 0.5 TB | ... unused hole
ffffc90000000000 | -55 TB | ffffe8ffffffffff | 32 TB | vmalloc/ioremap space (vmalloc_base)
ffffe90000000000 | -23 TB | ffffe9ffffffffff | 1 TB | ... unused hole
ffffea0000000000 | -22 TB | ffffeaffffffffff | 1 TB | virtual memory map (vmemmap_base)
ffffeb0000000000 | -21 TB | ffffebffffffffff | 1 TB | ... unused hole
ffffec0000000000 | -20 TB | fffffbffffffffff | 16 TB | KASAN shadow memory
__________________|____________|__________________|_________|____________________________________________________________
|
| Identical layout to the 56-bit one from here on:
____________________________________________________________|____________________________________________________________
| | | |
fffffc0000000000 | -4 TB | fffffdffffffffff | 2 TB | ... unused hole
| | | | vaddr_end for KASLR
fffffe0000000000 | -2 TB | fffffe7fffffffff | 0.5 TB | cpu_entry_area mapping
fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | ... unused hole
ffffff0000000000 | -1 TB | ffffff7fffffffff | 0.5 TB | %esp fixup stacks
ffffff8000000000 | -512 GB | ffffffeeffffffff | 444 GB | ... unused hole
ffffffef00000000 | -68 GB | fffffffeffffffff | 64 GB | EFI region mapping space
ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | ... unused hole
ffffffff80000000 | -2 GB | ffffffff9fffffff | 512 MB | kernel text mapping, mapped to physical address 0
ffffffff80000000 |-2048 MB | | |
ffffffffa0000000 |-1536 MB | fffffffffeffffff | 1520 MB | module mapping space
ffffffffff000000 | -16 MB | | |
FIXADDR_START | ~-11 MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
ffffffffff600000 | -10 MB | ffffffffff600fff | 4 kB | legacy vsyscall ABI
ffffffffffe00000 | -2 MB | ffffffffffffffff | 2 MB | ... unused hole
__________________|____________|__________________|_________|___________________________________________________________
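The Offset and Size columns of the table can be recomputed from the start and end addresses. The sketch below treats the start address as a signed 64-bit value to get the offset, and uses the inclusive end address to get the size. The function names are made up for this illustration.

```python
TIB = 1 << 40


def signed_offset(addr: int) -> int:
    """Interpret a 64-bit address as a signed value (the table's Offset column)."""
    return addr - (1 << 64) if addr >= (1 << 63) else addr


def region_size(start: int, end: int) -> int:
    """Size in bytes of a region whose end address is inclusive."""
    return end - start + 1


# Direct mapping of all physical memory in the 4-level layout.
start, end = 0xFFFF888000000000, 0xFFFFC87FFFFFFFFF
print(signed_offset(start) / TIB)     # offset in TB
print(region_size(start, end) / TIB)  # size in TB
```

For the direct-mapping row this yields an offset of -119.5 TB and a size of 64 TB, matching the table.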
2. Virtual memory mapping with 5-level page tables
Setting CONFIG_X86_5LEVEL=y enables the feature.
A kernel built with CONFIG_X86_5LEVEL=y can still run on 4-level hardware. In that case the additional page table level, p4d, is folded at runtime.
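Whether a given machine is using the fifth level can be checked from userspace. The sketch below assumes a Linux x86 system where the `flags` line of /proc/cpuinfo lists `la57` when 5-level paging is available; the exact semantics of that flag (hardware capability versus enabled in the running kernel) should be verified against your kernel version. Both function names are hypothetical.

```python
def has_la57(cpuinfo_text: str) -> bool:
    """Return True if the first 'flags' line in the given cpuinfo text lists la57."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return "la57" in value.split()
    return False


def running_cpu_has_la57(path: str = "/proc/cpuinfo") -> bool:
    """Check the running system; returns False if the file cannot be read."""
    try:
        with open(path) as handle:
            return has_la57(handle.read())
    except OSError:
        return False


if __name__ == "__main__":
    print("la57 reported:", running_cpu_has_la57())
```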
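On x86, 5-level paging enables a 56-bit userspace virtual address space. Not all userspace is ready to handle wide addresses. At least some JIT compilers are known to use the upper bits of pointers to encode their own information. That collides with valid pointers under 5-level paging and leads to crashes. To mitigate this, the kernel does not allocate virtual address space above 47 bits by default.
Userspace can, however, request allocation from the full address space by specifying a hint address above 47 bits (with or without MAP_FIXED).
If the hint address is above 47 bits but MAP_FIXED is not specified, the kernel tries to find an unmapped area at the specified address. If that address is already occupied, it looks for an unmapped area in the full address space rather than in the 47-bit window.
A high hint address affects only the allocation in question, not any future mmap() calls.
Specifying a high hint address on an older kernel, or on a machine without 5-level paging support, is safe. The hint is ignored and the kernel falls back to allocating from the 47-bit address space.
This approach makes it easy for an application's memory allocator to use the large address space without manually tracking the virtual address space it has allocated.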
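The hint mechanism described above can be tried from userspace. The sketch below calls the C library's `mmap` through `ctypes` because Python's own `mmap` module does not accept an address hint. It assumes a 64-bit Linux system; `HIGH_HINT` and `map_with_high_hint` are names invented for this example. On a kernel or machine without 5-level paging, the returned address is expected to fall back into the 47-bit window.

```python
import ctypes
import mmap
import os

# Load the C library of the running process (assumption: Linux, 64-bit).
libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_int64]
libc.munmap.restype = ctypes.c_int
libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

MAP_FAILED = ctypes.c_void_p(-1).value
HIGH_HINT = 1 << 47  # first address above the default 47-bit window


def map_with_high_hint(length: int = mmap.PAGESIZE) -> int:
    """Map anonymous memory with a hint above 47 bits and no MAP_FIXED."""
    addr = libc.mmap(HIGH_HINT, length,
                     mmap.PROT_READ | mmap.PROT_WRITE,
                     mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS, -1, 0)
    if addr is None or addr == MAP_FAILED:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return addr


if __name__ == "__main__":
    address = map_with_high_hint()
    print(hex(address), "above 47 bits:", address >= HIGH_HINT)
    libc.munmap(address, mmap.PAGESIZE)
```

Because MAP_FIXED is not set, the call should succeed either way; only the range the address comes from differs.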
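One important case to handle is the interaction with MPX. MPX (without the MAWA extension) cannot handle addresses above 47 bits, so the kernel must make sure that MPX cannot be enabled if a VMA already exists above the boundary, and must forbid creating such VMAs once MPX is enabled.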
========================================================================================================================
Start addr | Offset | End addr | Size | VM area description
========================================================================================================================
| | | |
0000000000000000 | 0 | 00ffffffffffffff | 64 PB | user-space virtual memory, different per mm
__________________|____________|__________________|_________|___________________________________________________________
| | | |
0100000000000000 | +64 PB | feffffffffffffff | ~16K PB | ... huge, still almost 64 bits wide hole of non-canonical
| | | | virtual memory addresses up to the -64 PB
| | | | starting offset of kernel mappings.
__________________|____________|__________________|_________|___________________________________________________________
|
| Kernel-space virtual memory, shared between all processes:
____________________________________________________________|___________________________________________________________
| | | |
ff00000000000000 | -64 PB | ff0fffffffffffff | 4 PB | ... guard hole, also reserved for hypervisor
ff10000000000000 | -60 PB | ff10ffffffffffff | 0.25 PB | LDT remap for PTI
ff11000000000000 | -59.75 PB | ff90ffffffffffff | 32 PB | direct mapping of all physical memory (page_offset_base)
ff91000000000000 | -27.75 PB | ff9fffffffffffff | 3.75 PB | ... unused hole
ffa0000000000000 | -24 PB | ffd1ffffffffffff | 12.5 PB | vmalloc/ioremap space (vmalloc_base)
ffd2000000000000 | -11.5 PB | ffd3ffffffffffff | 0.5 PB | ... unused hole
ffd4000000000000 | -11 PB | ffd5ffffffffffff | 0.5 PB | virtual memory map (vmemmap_base)
ffd6000000000000 | -10.5 PB | ffdeffffffffffff | 2.25 PB | ... unused hole
ffdf000000000000 | -8.25 PB | fffffbffffffffff | ~8 PB | KASAN shadow memory
__________________|____________|__________________|_________|____________________________________________________________
|
| Identical layout to the 47-bit one from here on:
____________________________________________________________|____________________________________________________________
| | | |
fffffc0000000000 | -4 TB | fffffdffffffffff | 2 TB | ... unused hole
| | | | vaddr_end for KASLR
fffffe0000000000 | -2 TB | fffffe7fffffffff | 0.5 TB | cpu_entry_area mapping
fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | ... unused hole
ffffff0000000000 | -1 TB | ffffff7fffffffff | 0.5 TB | %esp fixup stacks
ffffff8000000000 | -512 GB | ffffffeeffffffff | 444 GB | ... unused hole
ffffffef00000000 | -68 GB | fffffffeffffffff | 64 GB | EFI region mapping space
ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | ... unused hole
ffffffff80000000 | -2 GB | ffffffff9fffffff | 512 MB | kernel text mapping, mapped to physical address 0
ffffffff80000000 |-2048 MB | | |
ffffffffa0000000 |-1536 MB | fffffffffeffffff | 1520 MB | module mapping space
ffffffffff000000 | -16 MB | | |
FIXADDR_START | ~-11 MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
ffffffffff600000 | -10 MB | ffffffffff600fff | 4 kB | legacy vsyscall ABI
ffffffffffe00000 | -2 MB | ffffffffffffffff | 2 MB | ... unused hole
__________________|____________|__________________|_________|___________________________________________________________
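To make the extra translation level concrete, the sketch below splits a virtual address into per-level table indices. It assumes 4 KiB pages and 9 index bits per level, and uses the Linux level names pgd, p4d, pud, pmd and pte; with 4 levels the p4d level is simply absent. The function name is hypothetical.

```python
LEVEL_NAMES = ("pgd", "p4d", "pud", "pmd", "pte")


def split_virtual_address(addr: int, levels: int) -> dict:
    """Return the page offset and the table index used at each paging level."""
    if levels not in (4, 5):
        raise ValueError("levels must be 4 or 5")
    # With 4 levels there is no separate p4d index.
    names = LEVEL_NAMES if levels == 5 else ("pgd",) + LEVEL_NAMES[2:]
    fields = {"offset": addr & 0xFFF}  # low 12 bits: offset inside the page
    shift = 12
    for name in reversed(names):       # walk from pte up to pgd
        fields[name] = (addr >> shift) & 0x1FF
        shift += 9
    return fields


# Start of the direct mapping in each layout (taken from the tables).
print(split_virtual_address(0xFFFF888000000000, levels=4))
print(split_virtual_address(0xFF11000000000000, levels=5))
```

In both layouts the direct mapping begins at top-level index 273, with all lower indices zero.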
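The architecture defines 64-bit virtual addresses. Implementations can support fewer bits. Currently supported are 48-bit and 57-bit virtual addresses.
Bits 63 down to the most significant implemented bit are sign-extended. If addresses are interpreted as unsigned, this leaves a hole between userspace and kernel addresses.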
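The sign-extension rule can be expressed as a small check: an address is canonical when bit `va_bits - 1` and every bit above it are all zeros or all ones. The sketch below is a minimal illustration with a made-up function name; `va_bits` is 48 for 4-level paging and 57 for 5-level paging.

```python
def is_canonical(addr: int, va_bits: int) -> bool:
    """True if bits 63..(va_bits - 1) of a 64-bit address are all equal."""
    upper = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return upper in (0, all_ones)


print(is_canonical(0x00007FFFFFFFFFFF, 48))  # last userspace address, 4-level
print(is_canonical(0x0000800000000000, 48))  # first address inside the hole
print(is_canonical(0x0000800000000000, 57))  # same address under 5-level paging
print(is_canonical(0xFF00000000000000, 57))  # start of kernel space, 5-level
```

The address 0000800000000000 is non-canonical with 48-bit addresses but valid userspace memory with 57-bit addresses, which is why pointer-tagging tricks that rely on the upper bits break under 5-level paging.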
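The direct mapping covers all memory in the system up to the highest memory address (which means that in some cases it can also include PCI memory holes).
The vmalloc space is lazily synchronized into the different PML4/PML5 pages of the processes by the page fault handler, with init_top_pgt as the reference.
EFI runtime services are mapped in the efi_pgd PGD, in a 64 GB virtual memory window (this size is arbitrary and can be raised later if needed). These mappings are not part of any other kernel PGD and are only available during EFI runtime calls.
Note that if CONFIG_RANDOMIZE_MEMORY is enabled, the direct mapping of all physical memory, the vmalloc/ioremap space and the virtual memory map are randomized.
Their order is preserved, but their bases are offset early at boot time.
Be very careful with respect to KASLR when changing anything here. The KASLR address range must not overlap with any region except the KASAN shadow area, which is acceptable because KASAN disables KASLR.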
For both the 4-level and 5-level layouts, the STACKLEAK_POISON value lies in the last 2 MB hole.