Memory management is the heart of operating systems; it is crucial for both programming and system administration. In the next few posts I'll cover memory with an eye towards practical aspects, but without shying away from internals. While the concepts are generic, examples are mostly from Linux and Windows on 32-bit x86. This first post describes how programs are laid out in memory.
Virtual address space
Each process in a multi-tasking OS runs in its own memory sandbox. This sandbox is the virtual address space, which in 32-bit mode is always a 4GB block of memory addresses. These virtual addresses are mapped to physical memory by page tables, which are maintained by the operating system kernel and consulted by the processor. Each process has its own set of page tables, but there is a catch. Once virtual addresses are enabled, they apply to all software running in the machine, including the kernel itself. Thus a portion of the virtual address space must be reserved for the kernel:
[Figure: Kernel/User Memory Split]

This does not mean the kernel uses that much physical memory, only that it has that portion of address space available to map whatever physical memory it wishes. Kernel space is flagged in the page tables as exclusive to privileged code (ring 2 or lower; translator's note: if this is unclear, see "CPU Rings, Privilege, and Protection"), hence a page fault is triggered if user-mode programs try to touch it. In Linux, kernel space is constantly present and maps the same physical memory in all processes. Kernel code and data are always addressable, ready to handle interrupts or system calls at any time. By contrast, the mapping for the user-mode portion of the address space changes whenever a process switch happens:
[Figure: Process Switch Effects on Virtual Memory]

Blue regions represent virtual addresses that are mapped to physical memory, whereas white regions are unmapped. In the example above, Firefox has used far more of its virtual address space due to its legendary memory hunger. The distinct bands in the address space correspond to memory segments like the heap, stack, and so on. Keep in mind these segments are simply a range of memory addresses and have nothing to do with Intel-style segments. Anyway, here is the standard segment layout in a Linux process (translator's note: the figure shows the layout introduced in kernel 2.6.7, in which the mmap region grows top-down; the classic layout is the opposite. For the reasons and trade-offs, see section 4.3.2, "Process address space layout", in Professional Linux Kernel Architecture):
[Figure: Flexible Process Address Space Layout In Linux]

When computing was happy and safe and cuddly, the starting virtual addresses for the segments shown above were exactly the same for nearly every process in a machine. This made it easy to exploit security vulnerabilities remotely. An exploit often needs to reference absolute memory locations: an address on the stack, the address for a library function, etc. Remote attackers must choose this location blindly, counting on the fact that address spaces are all the same. When they are, people get pwned. Thus address space randomization has become popular. Linux randomizes the stack, memory mapping segment, and heap by adding offsets to their starting addresses. Unfortunately the 32-bit address space is pretty tight, leaving little room for randomization and hampering its effectiveness.
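One way to see randomization at work is to sample an address from each region and compare the values across runs. The sketch below is only an illustration: the struct and function names are made up for this example, and it assumes a platform where a function pointer can be converted to an integer (true on Linux, though not guaranteed by ISO C).

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record holding one sample address per region. */
struct layout_sample {
    uintptr_t stack;  /* address of a local variable */
    uintptr_t heap;   /* address returned by malloc() */
    uintptr_t code;   /* address of a function in the text segment */
};

/* Capture one address from the stack, the heap, and the program text. */
struct layout_sample sample_layout(void)
{
    int local = 0;
    void *block = malloc(16);
    struct layout_sample s;

    s.stack = (uintptr_t)&local;
    s.heap = (uintptr_t)block;            /* 0 if malloc() failed */
    s.code = (uintptr_t)&sample_layout;   /* assumed convertible, see above */
    free(block);
    return s;
}
```

Printing the three fields from main() and running the program a few times should show the stack and heap values changing between runs when randomization is enabled. The code address usually changes only if the program was built as a position-independent executable.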
Stack
The topmost segment in the process address space is the stack, which stores local variables and function parameters in most programming languages. Calling a method or function pushes a new stack frame onto the stack. The stack frame is destroyed when the function returns. This simple design, possible because the data obeys strict LIFO order, means that no complex data structure is needed to track stack contents - a simple pointer to the top of the stack will do. Pushing and popping are thus very fast and deterministic. Also, the constant reuse of stack regions tends to keep active stack memory in the CPU caches, speeding up access. Each thread in a process gets its own stack.
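The idea that every call gets its own frame can be made visible by looking at where a local variable lives at different call depths. This is a minimal sketch with a made-up function name; the exact addresses depend on the compiler and platform.

```c
#include <stdint.h>

/*
 * Return the address of a local variable in the innermost of `depth`
 * nested calls. Each call gets its own stack frame, so each level's
 * `marker` is a separate object.
 */
uintptr_t local_address_at_depth(int depth)
{
    volatile int marker = depth;          /* lives in this call's frame */
    uintptr_t here = (uintptr_t)&marker;

    if (depth <= 0)
        return here;

    uintptr_t deeper = local_address_at_depth(depth - 1);
    /* Read marker after the call so this frame stays alive across it. */
    return marker >= 0 ? deeper : here;
}
```

On x86 the stack grows toward lower addresses, so the value returned for a larger depth is normally lower than the address of a local in the caller.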
It is possible to exhaust the area mapping the stack by pushing more data than it can fit. This triggers a page fault that is handled in Linux by expand_stack(), which in turn calls acct_stack_growth() to check whether it's appropriate to grow the stack. If the stack size is below RLIMIT_STACK (usually 8MB), then normally the stack grows and the program continues merrily, unaware of what just happened. This is the normal mechanism whereby stack size adjusts to demand. However, if the maximum stack size has been reached, we have a stack overflow and the program receives a Segmentation Fault. While the mapped stack area expands to meet demand, it does not shrink back when the stack gets smaller. Like the federal budget, it only expands.
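The limit mentioned above can be read from inside a program with the POSIX getrlimit() call. A small sketch, with a helper name chosen just for this example:

```c
#include <sys/resource.h>

/*
 * Read the soft stack limit (RLIMIT_STACK) for the current process.
 * Returns 0 on success and stores the limit in *bytes; returns -1 on error.
 * The stored value may be RLIM_INFINITY when the stack is unlimited.
 */
int stack_soft_limit(rlim_t *bytes)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_STACK, &lim) != 0)
        return -1;
    *bytes = lim.rlim_cur;
    return 0;
}
```

From a shell, ulimit -s reports the same soft limit in kilobytes.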
Dynamic stack growth is the only situation in which access to an unmapped memory region, shown in white above, might be valid. Any other access to unmapped memory triggers a page fault that results in a Segmentation Fault. Some mapped areas are read-only, hence write attempts to these areas also lead to segfaults.
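The read-only case is easy to observe safely by letting a child process take the fault. The sketch below uses a made-up function name, forks a child that writes to a page mapped with PROT_READ only, and reports which signal killed it. It assumes a POSIX system with anonymous mappings.

```c
#define _DEFAULT_SOURCE   /* expose MAP_ANONYMOUS under strict -std= modes on glibc */
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that writes to a read-only page and report how it died.
 * Returns the terminating signal number, 0 if the child exited normally,
 * or -1 if the setup failed.
 */
int signal_from_readonly_write(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: map one page that may be read but not written. */
        long page = sysconf(_SC_PAGESIZE);
        volatile char *p = mmap(NULL, (size_t)page, PROT_READ,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            _exit(2);
        p[0] = 'x';   /* write to a read-only mapping: expected to fault */
        _exit(0);     /* reached only if the write did not fault */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

On Linux the expected return value is SIGSEGV; some other systems report SIGBUS for the same access.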
Memory mapping segment
Below the stack, we have the memory mapping segment. Here the kernel maps contents of files directly to memory. Any application can ask for such a mapping via the Linux mmap() system call or CreateFileMapping() / MapViewOfFile() in Windows. Memory mapping is a convenient and high-performance way to do file I/O, so it is used for loading dynamic libraries. It is also possible to create an anonymous memory mapping that does not correspond to any files, being used instead for program data. In Linux, if you request a large block of memory via malloc(), the C library will create such an anonymous mapping instead of using heap memory. 'Large' means larger than MMAP_THRESHOLD bytes, 128 kB by default and adjustable via mallopt().
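An anonymous mapping can also be requested directly, which is roughly what the C library does on your behalf for large malloc() requests. A minimal sketch follows; the two wrapper names are invented for this example, while mmap() and munmap() are the real POSIX calls.

```c
#define _DEFAULT_SOURCE   /* expose MAP_ANONYMOUS under strict -std= modes on glibc */
#include <stddef.h>
#include <sys/mman.h>

/* Map `length` bytes of zero-filled anonymous memory; NULL on failure. */
void *map_anonymous(size_t length)
{
    void *p = mmap(NULL, length, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Release a mapping created by map_anonymous(); returns 0 on success. */
int unmap_anonymous(void *p, size_t length)
{
    return munmap(p, length);
}
```

The file descriptor argument is -1 and the offset is 0 because no file backs the mapping.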
Heap
Speaking of the heap, it comes next in our plunge into address space. Like the stack, the heap provides runtime memory allocation; unlike the stack, it is meant for data that must outlive the function doing the allocation. Most languages provide heap management to programs. Satisfying memory requests is thus a joint affair between the language runtime and the kernel. In C, the interface to heap allocation is malloc() and friends, whereas in a garbage-collected language like C# the interface is the new keyword.
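The "must outlive the function" point is the practical reason to reach for malloc(). A short sketch, using a made-up helper name:

```c
#include <stdlib.h>
#include <string.h>

/*
 * Return a heap-allocated copy of `text`. The copy outlives this call,
 * unlike a local array, which would vanish with the stack frame.
 * The caller owns the result and must free() it. Returns NULL on failure.
 */
char *copy_to_heap(const char *text)
{
    size_t size = strlen(text) + 1;   /* include the terminating NUL */
    char *copy = malloc(size);

    if (copy != NULL)
        memcpy(copy, text, size);
    return copy;
}
```

Returning the address of a local char array instead would hand the caller a pointer into a stack frame that no longer exists.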
If there is enough space in the heap to satisfy a memory request, it can be handled by the language runtime without kernel involvement. Otherwise the heap is enlarged via the brk() system call to make room for the requested block. Heap management is complex, requiring sophisticated algorithms that strive for speed and efficient memory usage in the face of our programs' chaotic allocation patterns. The time needed to service a heap request can vary substantially. Real-time systems have special-purpose allocators to deal with this problem. Heaps also become fragmented, as shown below:
[Figure: Fragmented Heap]

BSS, data, and program text
Finally, we get to the lowest segments of memory: BSS, data, and program text. Both BSS and data store contents for static (global) variables in C. The difference is that BSS stores the contents of uninitialized static variables, whose values are not set by the programmer in source code. The BSS memory area is anonymous: it does not map any file. If you say static int cntActiveUsers, the contents of cntActiveUsers live in the BSS.
The data segment, on the other hand, holds the contents for static variables initialized in source code. This memory area is not anonymous. It maps the part of the program's binary image that contains the initial static values given in source code. So if you say static int cntWorkerBees = 10, the contents of cntWorkerBees live in the data segment and start out as 10. Even though the data segment maps a file, it is a private memory mapping, which means that updates to memory are not reflected in the underlying file. This must be the case, otherwise assignments to global variables would change your on-disk binary image. Inconceivable!
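Put together, the two example variables look like this in a source file. The accessor functions are not from the article; they are added here only so the values can be inspected from other code.

```c
static int cntActiveUsers;        /* uninitialized: placed in BSS, starts at 0 */
static int cntWorkerBees = 10;    /* initialized: placed in the data segment */

int active_users(void) { return cntActiveUsers; }
int worker_bees(void)  { return cntWorkerBees; }

/* Writes change this process's private copy only, not the binary on disk. */
void hire_worker_bee(void) { cntWorkerBees++; }
```

When this is compiled without optimization, running nm on the object file typically lists cntActiveUsers with type b (BSS) and cntWorkerBees with type d (data).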
The data example in the diagram is trickier because it uses a pointer. In that case, the contents of pointer gonzo - a 4-byte memory address - live in the data segment. The actual string it points to does not, however. The string lives in the text segment, which is read-only and stores all of your code in addition to tidbits like string literals. The text segment also maps your binary file in memory, but writes to this area earn your program a Segmentation Fault. This helps prevent pointer bugs, though not as effectively as avoiding C in the first place. Here's a diagram showing these segments and our example variables:
[Figure: ELF Binary Image Mapped Into Memory]

You can examine the memory areas in a Linux process by reading the file /proc/pid_of_process/maps. Keep in mind that a segment may contain many areas. For example, each memory mapped file normally has its own area in the mmap segment, and dynamic libraries have extra areas similar to BSS and data. The next post will clarify what 'area' really means. Also, sometimes people say "data segment" meaning all of data + bss + heap.
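A program can read its own maps file through /proc/self/maps, which Linux provides as a shortcut for the current process. The helper below is a sketch with an invented name: it counts the areas whose line contains a given substring, and it assumes each line fits in the buffer.

```c
#include <stdio.h>
#include <string.h>

/*
 * Count the lines in a maps file (one line per memory area) whose text
 * contains `needle`; pass "" to count every area.
 * Returns -1 if the file cannot be opened (for example, on non-Linux
 * systems). Assumes no line is longer than the buffer.
 */
int count_areas(const char *maps_path, const char *needle)
{
    FILE *f = fopen(maps_path, "r");
    char line[4096];
    int count = 0;

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        if (strstr(line, needle) != NULL)
            count++;
    }
    fclose(f);
    return count;
}
```

For instance, count_areas("/proc/self/maps", "[stack]") looks for the area the kernel labels as the main thread's stack, and passing "[heap]" does the same for the heap.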
Other notes
You can examine binary images using the nm and objdump commands to display symbols, their addresses, segments, and so on. Finally, the virtual address layout described above is the "flexible" layout in Linux, which has been the default for a few years. It assumes that we have a value for RLIMIT_STACK. When that's not the case, Linux reverts to the "classic" layout shown below (translator's note: as mentioned earlier, the kernel provides two virtual memory layouts. In the classic layout the stack and the memory mapping segment grow toward each other, whereas in the newer layout the memory mapping segment grows toward the heap):
[Figure: Classic Process Address Space Layout In Linux]

That's it for virtual address space layout. The next post discusses how the kernel keeps track of these memory areas. Coming up we'll look at memory mapping, how file reading and writing ties into all this and what memory usage figures mean.
For a more detailed summary, see the article "Linux虚拟地址空间分布" (Linux Virtual Address Space Layout).