
GoLang - Memory Management

Author: 帘外五更风 | Published 2020-02-02 20:11

    1. Introduction to tcmalloc (reference resources)

    Go's memory management closely resembles tcmalloc (thread-caching malloc), so let's first look at how tcmalloc works.

    1.1 Overview

    tcmalloc is a memory allocator released by Google. Other common allocators include glibc's ptmalloc and jemalloc (maintained by Facebook). Compared with ptmalloc, tcmalloc performs better, especially in highly concurrent scenarios.

    1.2 tcmalloc's allocation strategy

    tcmalloc serves allocations from two places: a global cache heap and a per-thread private cache. Small requests are served from the thread's private cache, and when the private cache runs low it refills a batch from the global cache. Large requests are served directly from the global cache. The boundary between small and large is 32K. A cache is organized as an array of singly linked lists, where each list links free blocks of one particular size.
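
    To make the idea concrete, here is a minimal, illustrative sketch of an "array of free lists, one per size class" allocator. The names and class sizes are invented for illustration; this is not tcmalloc's actual code.

    package main

    import "fmt"

    type block struct{ next *block }

    var (
        classSize = []uintptr{8, 16, 32, 48, 64 /* ... up to 32K */}
        freeLists = make([]*block, len(classSize)) // one free list per size class
    )

    // alloc pops a block from the smallest class that fits n bytes.
    // A real allocator would refill an empty list from the global cache and
    // send requests above the largest class to the global heap.
    func alloc(n uintptr) *block {
        for i, sz := range classSize {
            if n <= sz {
                b := freeLists[i]
                if b != nil {
                    freeLists[i] = b.next
                }
                return b // nil means "refill from the global cache first"
            }
        }
        return nil // larger than the largest class: go to the global heap
    }

    func main() {
        freeLists[2] = &block{}       // seed one 32-byte block for the demo
        fmt.Println(alloc(20) != nil) // true: a 20-byte request is served from the 32-byte class
    }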

    1.2.1 Small Object Allocation

    For small objects, tcmalloc by default defines 86 blocks of different sizes; the exact sizes are not spelled out in the documentation and have to be looked up in the source. The free list for each size is only initialized when it is first used, similar to lazy initialization.


    [Figure: Small Object Allocation]

    1.2.2 Big Object Allocation

    Requests larger than 32K are served from the global cache. The global cache is also organized as an array of singly linked lists; the array has 256 entries, corresponding to runs of 1 page, 2 pages, and so on (1 page = 4K).


    [Figure: Big Object Allocation]

    1.2.3 Span

    tcmalloc uses spans to manage pages: a span covers one or more contiguous pages. A span is in one of only three states: unallocated, allocated as a large object, or allocated and subdivided for small objects.


    [Figure: Span]

    1.3 Comparison with Go

    Go's allocator is not an exact copy of tcmalloc. The main differences:

    • The local cache is attached not to a process or OS thread but to a P (this requires a look at Go's goroutine scheduling model, covered below).
    • Go's GC is global and stops the world; caches are not collected thread by thread.
    • Span management is more efficient.

    2. Background

    2.1 Go's runtime scheduler

    Go's runtime is built around three basic concepts: G, M, and P.

    • G (Goroutine): what we usually call a coroutine, a lightweight user-level thread; each Goroutine's sched field stores its scheduling context.
    • M (Machine): a wrapper around an OS kernel thread; the entity that actually executes code.
    • P (Processor): the scheduling context that connects Gs and Ms; the number of Ps is set with GOMAXPROCS() and defaults to the number of CPU cores.
      Running a Goroutine requires all three parts together: G + P + M.
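
    As a quick, runnable illustration using only the standard library, GOMAXPROCS(0) reports the current number of Ps without changing it:

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        // GOMAXPROCS(0) queries the current value without modifying it.
        fmt.Println("Ps:  ", runtime.GOMAXPROCS(0)) // defaults to the number of CPU cores
        fmt.Println("CPUs:", runtime.NumCPU())
    }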

    2.2 Escape analysis

    For escape analysis, see GoLang-逃逸分析 (the companion article on escape analysis).

    3. Key data structures

    The key pieces are:

    • mcache: the per-P cache; think of it as the local cache.
    • mcentral: the global cache; mcache refills from mcentral when it runs out.
    • mheap: when mcentral runs out as well, more memory is requested from the operating system through mheap.
      Together they can be viewed as a multi-level memory allocator.

    3.1 mcache

    Every running Goroutine is bound to a P, and mcache is the per-P cache. The benefit of this arrangement is that allocation needs no locking. The mcache structure is shown below.

    //go version 1.10.8
    //file runtime/mcache.go
    
    // Per-thread (in Go, per-P) cache for small objects.
    // No locking needed because it is per-thread (per-P).
    //
    // mcaches are allocated from non-GC'd memory, so any heap pointers
    // must be specially handled.
    //
    //go:notinheap
    type mcache struct {
        // The following members are accessed on every malloc,
        // so they are grouped here for better caching.
        next_sample int32   // trigger heap sample after allocating this many bytes
        local_scan  uintptr // bytes of scannable heap allocated
    
        // Allocator cache for tiny objects w/o pointers.
        // See "Tiny allocator" comment in malloc.go.
    
        // tiny points to the beginning of the current tiny block, or
        // nil if there is no current tiny block.
        //
        // tiny is a heap pointer. Since mcache is in non-GC'd memory,
        // we handle it by clearing it in releaseAll during mark
        // termination.
        tiny             uintptr
        tinyoffset       uintptr
        local_tinyallocs uintptr // number of tiny allocs not counted in other stats
    
        // The rest is not accessed on every malloc.
    
        alloc [numSpanClasses]*mspan // spans to allocate from, indexed by spanClass
    
        stackcache [_NumStackOrders]stackfreelist
    
        // Local allocator stats, flushed during GC.
        local_nlookup    uintptr                  // number of pointer lookups
        local_largefree  uintptr                  // bytes freed for large objects (>maxsmallsize)
        local_nlargefree uintptr                  // number of frees for large objects (>maxsmallsize)
        local_nsmallfree [_NumSizeClasses]uintptr // number of frees for small objects (<=maxsmallsize)
    }
    

    For now we only need to look at alloc [numSpanClasses]*mspan, an array of pointers to mspan indexed by span class. There are 67 size classes (_NumSizeClasses = 67), and each class has a scan and a noscan variant, so the array holds numSpanClasses = 134 entries. When memory is requested, an appropriate element of alloc is picked for the object according to its size. The 67 class sizes are 0, 8 bytes, 16 bytes, ..., which differs slightly from tcmalloc.

    //go version 1.10.8
    //file runtime/sizeclasses.go
    
    var class_to_size = [_NumSizeClasses]uint16{0, 8, 16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 576, 640, 704, 768, 896, 1024, 1152, 1280, 1408, 1536, 1792, 2048, 2304, 2688, 3072, 3200, 3456, 4096, 4864, 5376, 6144, 6528, 6784, 6912, 8192, 9472, 9728, 10240, 10880, 12288, 13568, 14336, 16384, 18432, 19072, 20480, 21760, 24576, 27264, 28672, 32768}
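
    To make the table concrete, a naive (purely illustrative) way to find the class for an n-byte request is to take the first class whose size is at least n; the runtime avoids this linear scan with the size_to_class8 / size_to_class128 lookup tables shown in section 5.3. The helper below is hypothetical and assumes access to the class_to_size table quoted above.

    // Hypothetical helper, for illustration only.
    func sizeToClass(n uint16) int {
        for c := 1; c < len(class_to_size); c++ {
            if n <= class_to_size[c] {
                return c
            }
        }
        return -1 // larger than 32768 bytes: not a small object
    }

    // Example: sizeToClass(17) == 3, so a 17-byte object is served from the 32-byte class.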
    

    The alloc array above plays the role of the freelist array in a typical memory pool: normally each array element would be a linked list of blocks of one particular size. Here every element is an mspan, so the only possibility is that the mspan itself records the block size it hands out. Let's look at the mspan structure.

    3.2 mspan

    In tcmalloc, the span is the basic unit of memory management. Go's mspan structure is as follows.

    //go version 1.10.8
    //file runtime/mheap.go
    
    //go:notinheap
    type mspan struct {
        next *mspan     // next span in list, or nil if none
        prev *mspan     // previous span in list, or nil if none
        list *mSpanList // For debugging. TODO: Remove.
    
        startAddr uintptr // address of first byte of span aka s.base()
        npages    uintptr // number of pages in span
    
        manualFreeList gclinkptr // list of free objects in _MSpanManual spans
    
        // freeindex is the slot index between 0 and nelems at which to begin scanning
        // for the next free object in this span.
        // Each allocation scans allocBits starting at freeindex until it encounters a 0
        // indicating a free object. freeindex is then adjusted so that subsequent scans begin
        // just past the newly discovered free object.
        //
        // If freeindex == nelem, this span has no free objects.
        //
        // allocBits is a bitmap of objects in this span.
        // If n >= freeindex and allocBits[n/8] & (1<<(n%8)) is 0
        // then object n is free;
        // otherwise, object n is allocated. Bits starting at nelem are
        // undefined and should never be referenced.
        //
        // Object n starts at address n*elemsize + (start << pageShift).
        freeindex uintptr
        // TODO: Look up nelems from sizeclass and remove this field if it
        // helps performance.
        nelems uintptr // number of object in the span.
    
        // Cache of the allocBits at freeindex. allocCache is shifted
        // such that the lowest bit corresponds to the bit freeindex.
        // allocCache holds the complement of allocBits, thus allowing
        // ctz (count trailing zero) to use it directly.
        // allocCache may contain bits beyond s.nelems; the caller must ignore
        // these.
        allocCache uint64
    
        // allocBits and gcmarkBits hold pointers to a span's mark and
        // allocation bits. The pointers are 8 byte aligned.
        // There are three arenas where this data is held.
        // free: Dirty arenas that are no longer accessed
        //       and can be reused.
        // next: Holds information to be used in the next GC cycle.
        // current: Information being used during this GC cycle.
        // previous: Information being used during the last GC cycle.
        // A new GC cycle starts with the call to finishsweep_m.
        // finishsweep_m moves the previous arena to the free arena,
        // the current arena to the previous arena, and
        // the next arena to the current arena.
        // The next arena is populated as the spans request
        // memory to hold gcmarkBits for the next GC cycle as well
        // as allocBits for newly allocated spans.
        //
        // The pointer arithmetic is done "by hand" instead of using
        // arrays to avoid bounds checks along critical performance
        // paths.
        // The sweep will free the old allocBits and set allocBits to the
        // gcmarkBits. The gcmarkBits are replaced with a fresh zeroed
        // out memory.
        allocBits  *gcBits
        gcmarkBits *gcBits
    
        // sweep generation:
        // if sweepgen == h->sweepgen - 2, the span needs sweeping
        // if sweepgen == h->sweepgen - 1, the span is currently being swept
        // if sweepgen == h->sweepgen, the span is swept and ready to use
        // h->sweepgen is incremented by 2 after every GC
    
        sweepgen    uint32
        divMul      uint16     // for divide by elemsize - divMagic.mul
        baseMask    uint16     // if non-0, elemsize is a power of 2, & this will get object allocation base
        allocCount  uint16     // number of allocated objects
        spanclass   spanClass  // size class and noscan (uint8)
        incache     bool       // being used by an mcache
        state       mSpanState // mspaninuse etc
        needzero    uint8      // needs to be zeroed before allocation
        divShift    uint8      // for divide by elemsize - divMagic.shift
        divShift2   uint8      // for divide by elemsize - divMagic.shift2
        elemsize    uintptr    // computed from sizeclass or from npages
        unusedsince int64      // first time spotted by gc in mspanfree state
        npreleased  uintptr    // number of pages released to the os
        limit       uintptr    // end of data in span
        speciallock mutex      // guards specials list
        specials    *special   // linked list of special records sorted by offset.
    }
    

    From this structure we can see:

    • next, prev: link pointers; mspans are normally used as members of linked lists.
    • npages: an mspan always spans a whole number of pages.
    • spanclass (size class and noscan bit, uint8): encodes a size class between 0 and _NumSizeClasses, which answers the question raised above; for size class 3, for example, the mspan is divided into 32-byte blocks.
    • elemsize: computed from the size class or from npages. For size class 3, elemsize = 32 bytes. Allocations larger than 32 KB always get whole pages, so elemsize = page_size * npages.
    • nelems: the total number of blocks in the span.
    • freeindex: a value in 0 ~ nelems-1 marking how far allocation has progressed within the span.
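
    Putting these fields together (numbers assumed for illustration): a span with elemsize = 32 covering a single 8 KB page holds nelems = 8192/32 = 256 objects, and object n lives at s.base() + n*elemsize, exactly the formula in the freeindex comment above.

    // Illustrative only: the address of the n-th object in a span, following the
    // "Object n starts at address n*elemsize + (start << pageShift)" rule.
    func objectAddr(base, elemsize, n uintptr) uintptr {
        return base + n*elemsize
    }

    // objectAddr(s.base(), 32, 5) is the 6th 32-byte slot, i.e. base+160.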

    3.3 mcentral

    As mentioned earlier, when mcache runs out it refills from mcentral, so let's look at mcentral next.

    //go version 1.10.8
    //file runtime/mcentral.go
    
    // Central list of free objects of a given size.
    //
    //go:notinheap
    type mcentral struct {
        lock      mutex
        spanclass spanClass
        nonempty  mSpanList // list of spans with a free object, ie a nonempty free list
        empty     mSpanList // list of spans with no free objects (or cached in an mcache)
    
        // nmalloc is the cumulative count of objects allocated from
        // this mcentral, assuming all spans in mcaches are
        // fully-allocated. Written atomically, read under STW.
        nmalloc uint64
    }
    
    // mSpanList heads a linked list of spans.
    //
    //go:notinheap
    type mSpanList struct {
        first *mspan // first span in list, or nil if none
        last  *mspan // last span in list, or nil if none
    }
    

    Analysis of mcentral:

    • spanclass: mcentral also carries a spanclass field, so is there one mcentral per class? Yes.
    • lock: needed because multiple Ps compete for the same mcentral.
    • nonempty: a doubly linked list of mspans that still have free objects, i.e. the spans this mcentral can currently hand out.
    • empty: spans with no free objects left (or currently cached in an mcache); effectively a way to keep track of all spans.

    The next question: where does mcentral live? Although mcentral and mheap were presented above as two separate levels, both are global structures and can be defined together, and indeed mcentral is embedded inside mheap.

    3.4 mheap

    Go's mheap structure is defined as follows.

    //go version 1.10.8
    //file runtime/mheap.go
    
    // Main malloc heap.
    // The heap itself is the "free[]" and "large" arrays,
    // but all the other global data is here too.
    //
    // mheap must not be heap-allocated because it contains mSpanLists,
    // which must not be heap-allocated.
    //
    //go:notinheap
    type mheap struct {
        lock      mutex
        free      [_MaxMHeapList]mSpanList // free lists of given length up to _MaxMHeapList
        freelarge mTreap                   // free treap of length >= _MaxMHeapList
        busy      [_MaxMHeapList]mSpanList // busy lists of large spans of given length
        busylarge mSpanList                // busy lists of large spans length >= _MaxMHeapList
        sweepgen  uint32                   // sweep generation, see comment in mspan
        sweepdone uint32                   // all spans are swept
        sweepers  uint32                   // number of active sweepone calls
    
        // allspans is a slice of all mspans ever created. Each mspan
        // appears exactly once.
        //
        // The memory for allspans is manually managed and can be
        // reallocated and move as the heap grows.
        //
        // In general, allspans is protected by mheap_.lock, which
        // prevents concurrent access as well as freeing the backing
        // store. Accesses during STW might not hold the lock, but
        // must ensure that allocation cannot happen around the
        // access (since that may free the backing store).
        allspans []*mspan // all spans out there
    
        // spans is a lookup table to map virtual address page IDs to *mspan.
        // For allocated spans, their pages map to the span itself.
        // For free spans, only the lowest and highest pages map to the span itself.
        // Internal pages map to an arbitrary span.
        // For pages that have never been allocated, spans entries are nil.
        //
        // Modifications are protected by mheap.lock. Reads can be
        // performed without locking, but ONLY from indexes that are
        // known to contain in-use or stack spans. This means there
        // must not be a safe-point between establishing that an
        // address is live and looking it up in the spans array.
        //
        // This is backed by a reserved region of the address space so
        // it can grow without moving. The memory up to len(spans) is
        // mapped. cap(spans) indicates the total reserved memory.
        spans []*mspan
    
        // sweepSpans contains two mspan stacks: one of swept in-use
        // spans, and one of unswept in-use spans. These two trade
        // roles on each GC cycle. Since the sweepgen increases by 2
        // on each cycle, this means the swept spans are in
        // sweepSpans[sweepgen/2%2] and the unswept spans are in
        // sweepSpans[1-sweepgen/2%2]. Sweeping pops spans from the
        // unswept stack and pushes spans that are still in-use on the
        // swept stack. Likewise, allocating an in-use span pushes it
        // on the swept stack.
        sweepSpans [2]gcSweepBuf
    
        _ uint32 // align uint64 fields on 32-bit for atomics
    
        // Proportional sweep
        //
        // These parameters represent a linear function from heap_live
        // to page sweep count. The proportional sweep system works to
        // stay in the black by keeping the current page sweep count
        // above this line at the current heap_live.
        //
        // The line has slope sweepPagesPerByte and passes through a
        // basis point at (sweepHeapLiveBasis, pagesSweptBasis). At
        // any given time, the system is at (memstats.heap_live,
        // pagesSwept) in this space.
        //
        // It's important that the line pass through a point we
        // control rather than simply starting at a (0,0) origin
        // because that lets us adjust sweep pacing at any time while
        // accounting for current progress. If we could only adjust
        // the slope, it would create a discontinuity in debt if any
        // progress has already been made.
        pagesInUse         uint64  // pages of spans in stats _MSpanInUse; R/W with mheap.lock
        pagesSwept         uint64  // pages swept this cycle; updated atomically
        pagesSweptBasis    uint64  // pagesSwept to use as the origin of the sweep ratio; updated atomically
        sweepHeapLiveBasis uint64  // value of heap_live to use as the origin of sweep ratio; written with lock, read without
        sweepPagesPerByte  float64 // proportional sweep ratio; written with lock, read without
        // TODO(austin): pagesInUse should be a uintptr, but the 386
        // compiler can't 8-byte align fields.
    
        // Malloc stats.
        largealloc  uint64                  // bytes allocated for large objects
        nlargealloc uint64                  // number of large object allocations
        largefree   uint64                  // bytes freed for large objects (>maxsmallsize)
        nlargefree  uint64                  // number of frees for large objects (>maxsmallsize)
        nsmallfree  [_NumSizeClasses]uint64 // number of frees for small objects (<=maxsmallsize)
    
        // range of addresses we might see in the heap
        bitmap        uintptr // Points to one byte past the end of the bitmap
        bitmap_mapped uintptr
    
        // The arena_* fields indicate the addresses of the Go heap.
        //
        // The maximum range of the Go heap is
        // [arena_start, arena_start+_MaxMem+1).
        //
        // The range of the current Go heap is
        // [arena_start, arena_used). Parts of this range may not be
        // mapped, but the metadata structures are always mapped for
        // the full range.
        arena_start uintptr
        arena_used  uintptr // Set with setArenaUsed.
    
        // The heap is grown using a linear allocator that allocates
        // from the block [arena_alloc, arena_end). arena_alloc is
        // often, but *not always* equal to arena_used.
        arena_alloc uintptr
        arena_end   uintptr
    
        // arena_reserved indicates that the memory [arena_alloc,
        // arena_end) is reserved (e.g., mapped PROT_NONE). If this is
        // false, we have to be careful not to clobber existing
        // mappings here. If this is true, then we own the mapping
        // here and *must* clobber it to use it.
        arena_reserved bool
    
        _ uint32 // ensure 64-bit alignment
    
        // central free lists for small size classes.
        // the padding makes sure that the MCentrals are
        // spaced CacheLineSize bytes apart, so that each MCentral.lock
        // gets its own cache line.
        // central is indexed by spanClass.
        central [numSpanClasses]struct {
            mcentral mcentral
            pad      [sys.CacheLineSize - unsafe.Sizeof(mcentral{})%sys.CacheLineSize]byte
        }
    
        spanalloc             fixalloc // allocator for span*
        cachealloc            fixalloc // allocator for mcache*
        treapalloc            fixalloc // allocator for treapNodes* used by large objects
        specialfinalizeralloc fixalloc // allocator for specialfinalizer*
        specialprofilealloc   fixalloc // allocator for specialprofile*
        speciallock           mutex    // lock for special record allocators.
    
        unused *specialfinalizer // never set, just here to force the specialfinalizer type into DWARF
    }
    
    var mheap_ mheap
    

    mheap_ is a global variable, initialized at startup (in mallocinit()). Let's look at the mheap fields in detail.

    • allspans []*mspan: all spans are ultimately obtained through mheap_, and every mspan ever created is recorded in allspans. The lock field in the struct guards concurrent access. The comments also mention STW, which belongs to Go's GC and is not covered here.
    • central [numSpanClasses]...: the mcentral introduced earlier, one mcentral per span class. pad is byte padding that keeps each mcentral on its own cache line to avoid false sharing; see False Sharing - Wikipedia for details.
    • sweepgen, sweepdone: GC related. (Go's GC is mark & sweep; these fields track sweeping and are not discussed further here.)
    • free [_MaxMHeapList]mSpanList: an array of span lists. The list at index i holds free mspans consisting of i pages, for 1 ~ 127 (_MaxMHeapList - 1) pages; for example, free[3] is a linked list of mspans that are 3 pages long. free holds unallocated spans; busy is the corresponding in-use list.
    • freelarge: free spans of more than 127 pages, kept in a treap (mTreap) in this version; busylarge is its in-use counterpart.
    • spans []*mspan: maps arena page numbers to their mspan (see the sketch after this list).
    • arena_start, arena_end, arena_used: to explain these we need the arena first. The arena is the contiguous virtual address region Go allocates heap memory from; it is managed by mheap, and all heap allocations come out of it. How is availability tracked? Operating systems commonly either chain all free memory on a linked list or mark blocks in a bitmap. Combined with spans above, the memory layout looks like this:


      [Figure: Go memory layout]
    • spanalloc, cachealloc (fixalloc): fixalloc is a free-list allocator for fixed-size structures, used to allocate the mspan and mcache objects themselves.
    • The remaining fields are statistics and GC-related state, which we set aside for now.
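
    As referenced in the spans bullet above, here is an illustrative sketch (not the runtime's code) of how an arena address is resolved back to its mspan through the spans table; it assumes the mspan type from section 3.2 and the 8 KB page size (page shift 13).

    // Illustrative only: resolve an arena address to the mspan that owns it.
    func spanOf(addr, arenaStart uintptr, spans []*mspan) *mspan {
        pageID := (addr - arenaStart) >> 13 // page number within the arena
        return spans[pageID]
    }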

    4. Initialization

    During startup, the structures introduced above are initialized. Let's go straight to the initialization code: mallocinit().

    //go version 1.10.8
    //file runtime/malloc.go
    
    func mallocinit() {
        if class_to_size[_TinySizeClass] != _TinySize {
            throw("bad TinySizeClass")
        }
    
        testdefersizes()
    
        // Copy class sizes out for statistics table.
        for i := range class_to_size {
            memstats.by_size[i].size = uint32(class_to_size[i])
        }
    
        // Check physPageSize.
        if physPageSize == 0 {
            // The OS init code failed to fetch the physical page size.
            throw("failed to get system page size")
        }
        if physPageSize < minPhysPageSize {
            print("system page size (", physPageSize, ") is smaller than minimum page size (", minPhysPageSize, ")\n")
            throw("bad system page size")
        }
        if physPageSize&(physPageSize-1) != 0 {
            print("system page size (", physPageSize, ") must be a power of 2\n")
            throw("bad system page size")
        }
    
        // The auxiliary regions start at p and are laid out in the
        // following order: spans, bitmap, arena.
        var p, pSize uintptr
        var reserved bool
    
        // The spans array holds one *mspan per _PageSize of arena.
        var spansSize uintptr = (_MaxMem + 1) / _PageSize * sys.PtrSize
        spansSize = round(spansSize, _PageSize)
        // The bitmap holds 2 bits per word of arena.
        var bitmapSize uintptr = (_MaxMem + 1) / (sys.PtrSize * 8 / 2)
        bitmapSize = round(bitmapSize, _PageSize)
    
        // Set up the allocation arena, a contiguous area of memory where
        // allocated data will be found.
        if sys.PtrSize == 8 {
            // On a 64-bit machine, allocate from a single contiguous reservation.
            // 512 GB (MaxMem) should be big enough for now.
            //
            // The code will work with the reservation at any address, but ask
            // SysReserve to use 0x0000XXc000000000 if possible (XX=00...7f).
            // Allocating a 512 GB region takes away 39 bits, and the amd64
            // doesn't let us choose the top 17 bits, so that leaves the 9 bits
            // in the middle of 0x00c0 for us to choose. Choosing 0x00c0 means
            // that the valid memory addresses will begin 0x00c0, 0x00c1, ..., 0x00df.
            // In little-endian, that's c0 00, c1 00, ..., df 00. None of those are valid
            // UTF-8 sequences, and they are otherwise as far away from
            // ff (likely a common byte) as possible. If that fails, we try other 0xXXc0
            // addresses. An earlier attempt to use 0x11f8 caused out of memory errors
            // on OS X during thread allocations.  0x00c0 causes conflicts with
            // AddressSanitizer which reserves all memory up to 0x0100.
            // These choices are both for debuggability and to reduce the
            // odds of a conservative garbage collector (as is still used in gccgo)
            // not collecting memory because some non-pointer block of memory
            // had a bit pattern that matched a memory address.
            //
            // Actually we reserve 544 GB (because the bitmap ends up being 32 GB)
            // but it hardly matters: e0 00 is not valid UTF-8 either.
            //
            // If this fails we fall back to the 32 bit memory mechanism
            //
            // However, on arm64, we ignore all this advice above and slam the
            // allocation at 0x40 << 32 because when using 4k pages with 3-level
            // translation buffers, the user address space is limited to 39 bits
            // On darwin/arm64, the address space is even smaller.
            arenaSize := round(_MaxMem, _PageSize)
            pSize = bitmapSize + spansSize + arenaSize + _PageSize
            for i := 0; i <= 0x7f; i++ {
                switch {
                case GOARCH == "arm64" && GOOS == "darwin":
                    p = uintptr(i)<<40 | uintptrMask&(0x0013<<28)
                case GOARCH == "arm64":
                    p = uintptr(i)<<40 | uintptrMask&(0x0040<<32)
                default:
                    p = uintptr(i)<<40 | uintptrMask&(0x00c0<<32)
                }
                p = uintptr(sysReserve(unsafe.Pointer(p), pSize, &reserved))
                if p != 0 {
                    break
                }
            }
        }
    
        if p == 0 {
            // On a 32-bit machine, we can't typically get away
            // with a giant virtual address space reservation.
            // Instead we map the memory information bitmap
            // immediately after the data segment, large enough
            // to handle the entire 4GB address space (256 MB),
            // along with a reservation for an initial arena.
            // When that gets used up, we'll start asking the kernel
            // for any memory anywhere.
    
            // We want to start the arena low, but if we're linked
            // against C code, it's possible global constructors
            // have called malloc and adjusted the process' brk.
            // Query the brk so we can avoid trying to map the
            // arena over it (which will cause the kernel to put
            // the arena somewhere else, likely at a high
            // address).
            procBrk := sbrk0()
    
            // If we fail to allocate, try again with a smaller arena.
            // This is necessary on Android L where we share a process
            // with ART, which reserves virtual memory aggressively.
            // In the worst case, fall back to a 0-sized initial arena,
            // in the hope that subsequent reservations will succeed.
            arenaSizes := []uintptr{
                512 << 20,
                256 << 20,
                128 << 20,
                0,
            }
    
            for _, arenaSize := range arenaSizes {
                // SysReserve treats the address we ask for, end, as a hint,
                // not as an absolute requirement. If we ask for the end
                // of the data segment but the operating system requires
                // a little more space before we can start allocating, it will
                // give out a slightly higher pointer. Except QEMU, which
                // is buggy, as usual: it won't adjust the pointer upward.
                // So adjust it upward a little bit ourselves: 1/4 MB to get
                // away from the running binary image and then round up
                // to a MB boundary.
                p = round(firstmoduledata.end+(1<<18), 1<<20)
                pSize = bitmapSize + spansSize + arenaSize + _PageSize
                if p <= procBrk && procBrk < p+pSize {
                    // Move the start above the brk,
                    // leaving some room for future brk
                    // expansion.
                    p = round(procBrk+(1<<20), 1<<20)
                }
                p = uintptr(sysReserve(unsafe.Pointer(p), pSize, &reserved))
                if p != 0 {
                    break
                }
            }
            if p == 0 {
                throw("runtime: cannot reserve arena virtual address space")
            }
        }
    
        // PageSize can be larger than OS definition of page size,
        // so SysReserve can give us a PageSize-unaligned pointer.
        // To overcome this we ask for PageSize more and round up the pointer.
        p1 := round(p, _PageSize)
        pSize -= p1 - p
    
        spansStart := p1
        p1 += spansSize
        mheap_.bitmap = p1 + bitmapSize
        p1 += bitmapSize
        if sys.PtrSize == 4 {
            // Set arena_start such that we can accept memory
            // reservations located anywhere in the 4GB virtual space.
            mheap_.arena_start = 0
        } else {
            mheap_.arena_start = p1
        }
        mheap_.arena_end = p + pSize
        mheap_.arena_used = p1
        mheap_.arena_alloc = p1
        mheap_.arena_reserved = reserved
    
        if mheap_.arena_start&(_PageSize-1) != 0 {
            println("bad pagesize", hex(p), hex(p1), hex(spansSize), hex(bitmapSize), hex(_PageSize), "start", hex(mheap_.arena_start))
            throw("misrounded allocation in mallocinit")
        }
    
        // Initialize the rest of the allocator.
        mheap_.init(spansStart, spansSize)
        _g_ := getg()
        _g_.m.mcache = allocmcache()
    }
    

    4.1 The arena

    // The spans array holds one *mspan per _PageSize of arena.
    var spansSize uintptr = (_MaxMem + 1) / _PageSize * sys.PtrSize
    spansSize = round(spansSize, _PageSize)
    // The bitmap holds 2 bits per word of arena.
    var bitmapSize uintptr = (_MaxMem + 1) / (sys.PtrSize * 8 / 2)
    bitmapSize = round(bitmapSize, _PageSize)
    arenaSize := round(_MaxMem, _PageSize)
    
    // _MaxMem is the maximum heap arena size minus 1.
    //
    // On 32-bit, this is also the maximum heap pointer value,
    // since the arena starts at address 0.
    _MaxMem = 1<<_MHeapMap_TotalBits - 1        
    

    First, the variable _MaxMem (it is built from another constant that we will not list here). Roughly speaking, _MaxMem is the size reserved for the arena: 512 GB on 64-bit systems, and 32 GB on 64-bit Windows. round is an alignment helper that rounds up to a multiple of _PageSize; its implementation is a nice bit trick, shown below.

    // round n up to a multiple of a.  a must be a power of 2.
    func round(n, a uintptr) uintptr {
        return (n + a - 1) &^ (a - 1)
    }
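
    A quick usage example (same bit trick, wrapped in a runnable program) with the 8 KB page size used later:

    package main

    import "fmt"

    // round n up to a multiple of a, where a is a power of 2 (same helper as above).
    func round(n, a uintptr) uintptr { return (n + a - 1) &^ (a - 1) }

    func main() {
        fmt.Println(round(1, 8192))    // 8192
        fmt.Println(round(8192, 8192)) // 8192
        fmt.Println(round(8193, 8192)) // 16384
    }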
    

    The bitmap uses two bits per word (8 bytes) to describe its state, which works out to a 16 GB bitmap for a 512 GB arena. spans records, for every page of the arena, the mspan it belongs to: for an address x in the arena, the page number is page_num = (x - arena_start) / page_size, and spans[page_num] records the span containing x. With a 512 GB arena, the spans area takes 512 GB / 8 KB * 8 = 512 MB. Note that Go's memory manager uses an 8 KB virtual-memory page size.

    _PageSize = 1 << _PageShift
    _PageMask = _PageSize - 1
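
    A back-of-the-envelope check of those numbers (illustrative, not runtime code):

    package main

    import "fmt"

    func main() {
        const (
            arena    = uint64(512) << 30 // 512 GB reserved for the arena
            ptrSize  = 8                 // bytes per word and per *mspan entry
            pageSize = 8 << 10           // 8 KB runtime page size
        )
        bitmap := arena / (ptrSize * 8 / 2) // 2 bits per word     -> 16 GB
        spans := arena / pageSize * ptrSize // one *mspan per page -> 512 MB
        fmt.Println(bitmap>>30, "GB bitmap,", spans>>20, "MB spans")
    }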
    

    So this contiguous stretch of virtual memory is laid out (on 64-bit) as follows:


    [Figure: Go virtual memory layout: spans | bitmap | arena]

    4.2 Reserving the virtual address space

    The relevant code is the loop below.

    pSize = bitmapSize + spansSize + arenaSize + _PageSize
    for i := 0; i <= 0x7f; i++ {
        switch {
        case GOARCH == "arm64" && GOOS == "darwin":
            p = uintptr(i)<<40 | uintptrMask&(0x0013<<28)
        case GOARCH == "arm64":
            p = uintptr(i)<<40 | uintptrMask&(0x0040<<32)
        default:
            p = uintptr(i)<<40 | uintptrMask&(0x00c0<<32)
        }
        p = uintptr(sysReserve(unsafe.Pointer(p), pSize, &reserved))
        if p != 0 {
            break
        }
    }   
    

    At initialization, Go reserves one contiguous region of address space from the operating system, laid out as the spans + bitmap + arena areas above. p is the start address of that region, and the preferred value of p differs by platform. The reservation is made with a platform-specific system call: mmap on Unix-like systems (mmap asks the kernel for a new virtual address range, optionally starting at a given address), VirtualAlloc on Windows (similar to mmap).

    //gathered from the internet
    
    //bsd
    func sysReserve(v unsafe.Pointer, n uintptr, reserved *bool) unsafe.Pointer {
        if sys.PtrSize == 8 && uint64(n) > 1<<32 || sys.GoosNacl != 0 {
            *reserved = false
            return v
        }
    
        p := mmap(v, n, _PROT_NONE, _MAP_ANON|_MAP_PRIVATE, -1, 0)
        if uintptr(p) < 4096 {
            return nil
        }
        *reserved = true
        return p
    }
    
    //darwin
    func sysReserve(v unsafe.Pointer, n uintptr, reserved *bool) unsafe.Pointer {
        *reserved = true
        p := mmap(v, n, _PROT_NONE, _MAP_ANON|_MAP_PRIVATE, -1, 0)
        if uintptr(p) < 4096 {
            return nil
        }
        return p
    }
    
    //linux
    func sysReserve(v unsafe.Pointer, n uintptr, reserved *bool) unsafe.Pointer {
        ...
        p := mmap(v, n, _PROT_NONE, _MAP_ANON|_MAP_PRIVATE, -1, 0)
        if uintptr(p) < 4096 {
            return nil
        }
        *reserved = true
        return p
    }
    //windows
    func sysReserve(v unsafe.Pointer, n uintptr, reserved *bool) unsafe.Pointer {
        *reserved = true
        // v is just a hint.
        // First try at v.
        v = unsafe.Pointer(stdcall4(_VirtualAlloc, uintptr(v), n, _MEM_RESERVE, _PAGE_READWRITE))
        if v != nil {
            return v
        }
    
        // Next let the kernel choose the address.
        return unsafe.Pointer(stdcall4(_VirtualAlloc, 0, n, _MEM_RESERVE, _PAGE_READWRITE))
    }
    

    4.3 mheap initialization

    As we saw in the mheap structure, spans, bitmap and arena all live inside mheap. Once the address space has been reserved from the OS, the next step is to initialize mheap.

        // PageSize can be larger than OS definition of page size,
        // so SysReserve can give us a PageSize-unaligned pointer.
        // To overcome this we ask for PageSize more and round up the pointer.
        p1 := round(p, _PageSize)
        pSize -= p1 - p
    
        spansStart := p1
        p1 += spansSize
        mheap_.bitmap = p1 + bitmapSize
        p1 += bitmapSize
        if sys.PtrSize == 4 {
            // Set arena_start such that we can accept memory
            // reservations located anywhere in the 4GB virtual space.
            mheap_.arena_start = 0
        } else {
            mheap_.arena_start = p1
        }
        mheap_.arena_end = p + pSize
        mheap_.arena_used = p1
        mheap_.arena_alloc = p1
        mheap_.arena_reserved = reserved
    
        if mheap_.arena_start&(_PageSize-1) != 0 {
            println("bad pagesize", hex(p), hex(p1), hex(spansSize), hex(bitmapSize), hex(_PageSize), "start", hex(mheap_.arena_start))
            throw("misrounded allocation in mallocinit")
        }
    
        // Initialize the rest of the allocator.
        mheap_.init(spansStart, spansSize)
        _g_ := getg()
        _g_.m.mcache = allocmcache()
    

    p is the start address of the contiguous virtual range; it is aligned first, and then the arena, bitmap and spans addresses are set. mheap_.init() initializes the fixalloc allocators and related members, as well as the mcentrals.

    // Initialize the heap.
    func (h *mheap) init(spansStart, spansBytes uintptr) {
        h.treapalloc.init(unsafe.Sizeof(treapNode{}), nil, nil, &memstats.other_sys)
        h.spanalloc.init(unsafe.Sizeof(mspan{}), recordspan, unsafe.Pointer(h), &memstats.mspan_sys)
        h.cachealloc.init(unsafe.Sizeof(mcache{}), nil, nil, &memstats.mcache_sys)
        h.specialfinalizeralloc.init(unsafe.Sizeof(specialfinalizer{}), nil, nil, &memstats.other_sys)
        h.specialprofilealloc.init(unsafe.Sizeof(specialprofile{}), nil, nil, &memstats.other_sys)
    
        // Don't zero mspan allocations. Background sweeping can
        // inspect a span concurrently with allocating it, so it's
        // important that the span's sweepgen survive across freeing
        // and re-allocating a span to prevent background sweeping
        // from improperly cas'ing it from 0.
        //
        // This is safe because mspan contains no heap pointers.
        h.spanalloc.zero = false
    
        // h->mapcache needs no init
        for i := range h.free {
            h.free[i].init()
            h.busy[i].init()
        }
    
        h.busylarge.init()
        for i := range h.central {
            h.central[i].mcentral.init(spanClass(i))
        }
    
        sp := (*slice)(unsafe.Pointer(&h.spans))
        sp.array = unsafe.Pointer(spansStart)
        sp.len = 0
        sp.cap = int(spansBytes / sys.PtrSize)
    
        // Map metadata structures. But don't map race detector memory
        // since we're not actually growing the arena here (and TSAN
        // gets mad if you map 0 bytes).
        h.setArenaUsed(h.arena_used, false)
    }
    

    After mheap is initialized, the current thread, i.e. the M, is initialized as well.

    //get the current G
    _g_ := getg()
    // allocate an mcache for the M bound to this G
    _g_.m.mcache = allocmcache()
    

    4.4 Per-P mcache initialization

    Notice that nothing above initialized the per-P mcaches; that is because at this point the Ps themselves have not been created yet. Let's look at the bootstrap code.

    func schedinit() {
        ...
        mallocinit()
        ...
        
        if procs > _MaxGomaxprocs {
            procs = _MaxGomaxprocs
        }
        if procresize(procs) != nil {
            ...
        }
    }
    

    mallocinit() was covered above. The Ps are initialized in procresize(); below we only look at the memory-related part.

    //go version 1.10.8
    //file runtime/proc.go
    
    func procresize(nprocs int32) *p {
        ...
        // initialize new P's
        for i := int32(0); i < nprocs; i++ {
            pp := allp[i]
            if pp == nil {
                pp = new(p)
                pp.id = i
                pp.status = _Pgcstop
                pp.sudogcache = pp.sudogbuf[:0]
                for i := range pp.deferpool {
                    pp.deferpool[i] = pp.deferpoolbuf[i][:0]
                }
                atomicstorep(unsafe.Pointer(&allp[i]), unsafe.Pointer(pp))
            }
            // initialize this P's mcache
            if pp.mcache == nil {
                if old == 0 && i == 0 {
                    if getg().m.mcache == nil {
                        throw("missing mcache?")
                    }
                    // P[0] is assigned to the main goroutine (bootstrap)
                    pp.mcache = getg().m.mcache // bootstrap
                } else {
                    // Ps other than P[0] allocate their own mcache
                    pp.mcache = allocmcache()
                }
            }
            ...
        }
        ...
    }
    

    All Ps are stored in the global array allp. procresize() initializes the Ps that will be used and strips resources from any surplus Ps.

    5. Memory allocation

    Let's first outline the main flow of allocating memory for an object:

    1. object size > 32 KB: allocate directly from mheap.
    2. object size < 16 bytes (and containing no pointers): allocate with mcache's tiny allocator. (tiny is essentially just a pointer; more on it below.)
    3. 16 bytes <= object size <= 32 KB: allocate from the matching size class in mcache.
    4. If the span for that size class in mcache has no free blocks left, refill from mcentral.
    5. If mcentral has no free blocks either, get pages from mheap and split them into blocks.
    6. If mheap has no suitable span, reserve more memory from the operating system.
      Now let's look at the functions involved in allocating on the heap, that is, in the arena.
    package main
    
    func foo() *int {
        x := 1
        return &x
    }
    
    func main() {
        x := foo()
        println(*x)
    }
    

    According to the escape analysis introduced earlier, x in foo() is allocated on the heap. Save the code above as main.go and look at the assembly.

    $ go build -gcflags '-l' -o main main.go
    $ go tool objdump -s "main\.foo" main
    TEXT main.foo(SB) /Users/xxx/src/heap/main.go
      main.go:3             0x104aad0               65488b0c25a0080000      MOVQ GS:0x8a0, CX                       
      main.go:3             0x104aad9               483b6110                CMPQ 0x10(CX), SP                       
      main.go:3             0x104aadd               7639                    JBE 0x104ab18                           
      main.go:3             0x104aadf               4883ec18                SUBQ $0x18, SP                          
      main.go:3             0x104aae3               48896c2410              MOVQ BP, 0x10(SP)                       
      main.go:3             0x104aae8               488d6c2410              LEAQ 0x10(SP), BP                       
      main.go:4             0x104aaed               488d052c990000          LEAQ type.*+38976(SB), AX               
      main.go:4             0x104aaf4               48890424                MOVQ AX, 0(SP)                          
      main.go:4             0x104aaf8               e82304fcff              CALL runtime.newobject(SB)              
      main.go:4             0x104aafd               488b442408              MOVQ 0x8(SP), AX                        
      main.go:4             0x104ab02               48c70001000000          MOVQ $0x1, 0(AX)                        
      main.go:5             0x104ab09               4889442420              MOVQ AX, 0x20(SP)                       
      main.go:5             0x104ab0e               488b6c2410              MOVQ 0x10(SP), BP                       
      main.go:5             0x104ab13               4883c418                ADDQ $0x18, SP                          
      main.go:5             0x104ab17               c3                      RET                                     
      main.go:3             0x104ab18               e80389ffff              CALL runtime.morestack_noctxt(SB)       
      main.go:3             0x104ab1d               ebb1                    JMP main.foo(SB)                        
      :-1                   0x104ab1f               cc                      INT $0x3   
    

    Heap allocation goes through the runtime package's newobject function.

    //go version 1.10.8
    //file runtime/malloc.go
    
    // implementation of new builtin
    // compiler (both frontend and SSA backend) knows the signature
    // of this function
    func newobject(typ *_type) unsafe.Pointer {
        return mallocgc(typ.size, typ, true)
    }
    
    // Allocate an object of size bytes.
    // Small objects are allocated from the per-P cache's free lists.
    // Large objects (> 32 kB) are allocated straight from the heap.
    func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer {
        if gcphase == _GCmarktermination {
            throw("mallocgc called with gcphase == _GCmarktermination")
        }
    
        if size == 0 {
            return unsafe.Pointer(&zerobase)
        }
    
        if debug.sbrk != 0 {
            align := uintptr(16)
            if typ != nil {
                align = uintptr(typ.align)
            }
            return persistentalloc(size, align, &memstats.other_sys)
        }
    
        // assistG is the G to charge for this allocation, or nil if
        // GC is not currently active.
        var assistG *g
        if gcBlackenEnabled != 0 {
            // Charge the current user G for this allocation.
            assistG = getg()
            if assistG.m.curg != nil {
                assistG = assistG.m.curg
            }
            // Charge the allocation against the G. We'll account
            // for internal fragmentation at the end of mallocgc.
            assistG.gcAssistBytes -= int64(size)
    
            if assistG.gcAssistBytes < 0 {
                // This G is in debt. Assist the GC to correct
                // this before allocating. This must happen
                // before disabling preemption.
                gcAssistAlloc(assistG)
            }
        }
    
        // Set mp.mallocing to keep from being preempted by GC.
        mp := acquirem()
        if mp.mallocing != 0 {
            throw("malloc deadlock")
        }
        if mp.gsignal == getg() {
            throw("malloc during signal")
        }
        mp.mallocing = 1
    
        shouldhelpgc := false
        dataSize := size
        c := gomcache()
        var x unsafe.Pointer
        noscan := typ == nil || typ.kind&kindNoPointers != 0
        if size <= maxSmallSize {
            // object size <= 32K
            if noscan && size < maxTinySize {
            // tiny path: pointer-free objects smaller than 16 bytes
                // Tiny allocator.
                //
                // Tiny allocator combines several tiny allocation requests
                // into a single memory block. The resulting memory block
                // is freed when all subobjects are unreachable. The subobjects
                // must be noscan (don't have pointers), this ensures that
                // the amount of potentially wasted memory is bounded.
                //
                // Size of the memory block used for combining (maxTinySize) is tunable.
                // Current setting is 16 bytes, which relates to 2x worst case memory
                // wastage (when all but one subobjects are unreachable).
                // 8 bytes would result in no wastage at all, but provides less
                // opportunities for combining.
                // 32 bytes provides more opportunities for combining,
                // but can lead to 4x worst case wastage.
                // The best case winning is 8x regardless of block size.
                //
                // Objects obtained from tiny allocator must not be freed explicitly.
                // So when an object will be freed explicitly, we ensure that
                // its size >= maxTinySize.
                //
                // SetFinalizer has a special case for objects potentially coming
                // from tiny allocator, it such case it allows to set finalizers
                // for an inner byte of a memory block.
                //
                // The main targets of tiny allocator are small strings and
                // standalone escaping variables. On a json benchmark
                // the allocator reduces number of allocations by ~12% and
                // reduces heap size by ~20%.
                off := c.tinyoffset
                // Align tiny pointer for required (conservative) alignment.
                if size&7 == 0 {
                    off = round(off, 8)
                } else if size&3 == 0 {
                    off = round(off, 4)
                } else if size&1 == 0 {
                    off = round(off, 2)
                }
                if off+size <= maxTinySize && c.tiny != 0 {
                    // The object fits into existing tiny block.
                    x = unsafe.Pointer(c.tiny + off)
                    c.tinyoffset = off + size
                    c.local_tinyallocs++
                    mp.mallocing = 0
                    releasem(mp)
                    return x
                }
                // Allocate a new maxTinySize block.
                span := c.alloc[tinySpanClass]
                v := nextFreeFast(span)
                if v == 0 {
                    v, _, shouldhelpgc = c.nextFree(tinySpanClass)
                }
                x = unsafe.Pointer(v)
                (*[2]uint64)(x)[0] = 0
                (*[2]uint64)(x)[1] = 0
                // See if we need to replace the existing tiny block with the new one
                // based on amount of remaining free space.
                if size < c.tinyoffset || c.tiny == 0 {
                    c.tiny = uintptr(x)
                    c.tinyoffset = size
                }
                size = maxTinySize
            } else {
                var sizeclass uint8
                if size <= smallSizeMax-8 {
                    sizeclass = size_to_class8[(size+smallSizeDiv-1)/smallSizeDiv]
                } else {
                    sizeclass = size_to_class128[(size-smallSizeMax+largeSizeDiv-1)/largeSizeDiv]
                }
                size = uintptr(class_to_size[sizeclass])
                spc := makeSpanClass(sizeclass, noscan)
                span := c.alloc[spc]
                v := nextFreeFast(span)
                if v == 0 {
                    v, span, shouldhelpgc = c.nextFree(spc)
                }
                x = unsafe.Pointer(v)
                if needzero && span.needzero != 0 {
                    memclrNoHeapPointers(unsafe.Pointer(v), size)
                }
            }
        } else {
            //object size > 32K byte
            var s *mspan
            shouldhelpgc = true
            systemstack(func() {
                s = largeAlloc(size, needzero, noscan)
            })
            s.freeindex = 1
            s.allocCount = 1
            x = unsafe.Pointer(s.base())
            size = s.elemsize
        }
    
        var scanSize uintptr
        if !noscan {
            // If allocating a defer+arg block, now that we've picked a malloc size
            // large enough to hold everything, cut the "asked for" size down to
            // just the defer header, so that the GC bitmap will record the arg block
            // as containing nothing at all (as if it were unused space at the end of
            // a malloc block caused by size rounding).
            // The defer arg areas are scanned as part of scanstack.
            if typ == deferType {
                dataSize = unsafe.Sizeof(_defer{})
            }
            heapBitsSetType(uintptr(x), size, dataSize, typ)
            if dataSize > typ.size {
                // Array allocation. If there are any
                // pointers, GC has to scan to the last
                // element.
                if typ.ptrdata != 0 {
                    scanSize = dataSize - typ.size + typ.ptrdata
                }
            } else {
                scanSize = typ.ptrdata
            }
            c.local_scan += scanSize
        }
    
        // Ensure that the stores above that initialize x to
        // type-safe memory and set the heap bits occur before
        // the caller can make x observable to the garbage
        // collector. Otherwise, on weakly ordered machines,
        // the garbage collector could follow a pointer to x,
        // but see uninitialized memory or stale heap bits.
        publicationBarrier()
    
        // Allocate black during GC.
        // All slots hold nil so no scanning is needed.
        // This may be racing with GC so do it atomically if there can be
        // a race marking the bit.
        if gcphase != _GCoff {
            gcmarknewobject(uintptr(x), size, scanSize)
        }
    
        if raceenabled {
            racemalloc(x, size)
        }
    
        if msanenabled {
            msanmalloc(x, size)
        }
    
        mp.mallocing = 0
        releasem(mp)
    
        if debug.allocfreetrace != 0 {
            tracealloc(x, size, typ)
        }
    
        if rate := MemProfileRate; rate > 0 {
            if size < uintptr(rate) && int32(size) < c.next_sample {
                c.next_sample -= int32(size)
            } else {
                mp := acquirem()
                profilealloc(mp, x, size)
                releasem(mp)
            }
        }
    
        if assistG != nil {
            // Account for internal fragmentation in the assist
            // debt now that we know it.
            assistG.gcAssistBytes -= int64(size - dataSize)
        }
    
        if shouldhelpgc {
            if t := (gcTrigger{kind: gcTriggerHeap}); t.test() {
                gcStart(gcBackgroundMode, t)
            }
        }
    
        return x
    }
    

    The whole allocation path splits into three cases by object size: size < 16 bytes, 16 bytes <= size <= 32 KB, and size > 32 KB.
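
    The triage can be summarized by the sketch below (boundaries taken from the constants above, maxTinySize = 16 and maxSmallSize = 32768; the helper itself is hypothetical):

    // Hypothetical summary of mallocgc's size triage, for illustration only.
    func allocPath(size uintptr, noscan bool) string {
        switch {
        case size > 32768:
            return "large: whole pages straight from mheap"
        case noscan && size < 16:
            return "tiny: packed into the per-P 16-byte tiny block"
        default:
            return "small: served from mcache.alloc[spanClass]"
        }
    }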

    5.1 size < 16 bytes

    For blocks smaller than 16 bytes, mcache has a dedicated region called tiny; tiny is a pointer to the start of the current tiny block.

    func mallocgc(...) {
        ...
                off := c.tinyoffset
                // align the offset for the requested size
                if size&7 == 0 {
                    off = round(off, 8)
                } else if size&3 == 0 {
                    off = round(off, 4)
                } else if size&1 == 0 {
                    off = round(off, 2)
                }
                // allocate from the current tiny block
                if off+size <= maxTinySize && c.tiny != 0 {
                    // The object fits into existing tiny block.
                    x = unsafe.Pointer(c.tiny + off)
                    c.tinyoffset = off + size
                    c.local_tinyallocs++
                    mp.mallocing = 0
                    releasem(mp)
                    return x
                }
                // the tiny block is exhausted; grab a fresh 16-byte block
                span := c.alloc[tinySpanClass]
                v := nextFreeFast(span)
                if v == 0 {
                    v, _, shouldhelpgc = c.nextFree(tinySpanClass)
                }
                x = unsafe.Pointer(v)
                //zero the newly obtained block
                (*[2]uint64)(x)[0] = 0
                (*[2]uint64)(x)[1] = 0
                // See if we need to replace the existing tiny block with the new one
                // based on amount of remaining free space.
                // if the new block has space left over, keep it as the current tiny block; tinyoffset records how much has been used
                if size < c.tinyoffset || c.tiny == 0 {
                    c.tiny = uintptr(x)
                    c.tinyoffset = size
                }
                size = maxTinySize
    }
    

    As shown above, tinyoffset marks how far into the current tiny block allocation has progressed, and subsequent allocations are placed relative to it. The offset is first aligned for the requested size (for example, if size is a multiple of 8, tinyoffset is aligned to 8), and then the object is carved out. If the remaining space in tiny is not enough, a fresh 16-byte block is requested and the object is placed at its start; whatever is left over becomes the new tiny block, with tinyoffset recording how much has been used.
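
    A small simulation of that bookkeeping (illustrative only; the real allocator also refills the block from the tiny span class rather than tracking plain offsets):

    // Illustrative simulation of the tiny-offset bookkeeping described above.
    func tinySim(sizes []uintptr) (offsets []uintptr) {
        const maxTinySize = 16
        var off uintptr
        for _, size := range sizes {
            switch { // same alignment rules as mallocgc
            case size&7 == 0:
                off = (off + 7) &^ 7
            case size&3 == 0:
                off = (off + 3) &^ 3
            case size&1 == 0:
                off = (off + 1) &^ 1
            }
            if off+size > maxTinySize {
                off = 0 // a fresh 16-byte block would be fetched here
            }
            offsets = append(offsets, off)
            off += size
        }
        return
    }

    // tinySim([]uintptr{4, 2, 8}) returns offsets 0, 4, 8: all three objects
    // share a single 16-byte tiny block.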

    5.2 size > 32 KB

    Allocations larger than 32 KB skip mcache and mcentral entirely and are served by mheap.

    func mallocgc(...) {
        } else {
            var s *mspan
            shouldhelpgc = true
            systemstack(func() {
                s = largeAlloc(size, needzero)
            })
            s.freeindex = 1
            s.allocCount = 1
            x = unsafe.Pointer(s.base())
            size = s.elemsize
        }
        ...
    }
    
    func largeAlloc(size uintptr, needzero bool) *mspan {
        ...
        npages := size >> _PageShift
        if size&_PageMask != 0 {
            npages++
        }
        ...
        s := mheap_.alloc(npages, 0, true, needzero)
        if s == nil {
            throw("out of memory")
        }
        s.limit = s.base() + size
        heapBitsForSpan(s.base()).initSpan(s)
        return s
    }
    

    Allocations larger than 32 KB always occupy whole pages: the size is shifted right by _PageShift to get the page count, and if the low bits are non-zero (size & _PageMask != 0) the count is rounded up by one.
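
    For example, with 8 KB pages (_PageShift = 13):

    // Worked example of the page-count computation above.
    func pagesFor(size uintptr) uintptr {
        const pageShift, pageMask = 13, 8191 // 8 KB pages
        n := size >> pageShift
        if size&pageMask != 0 {
            n++ // partial page left over: round up
        }
        return n
    }

    // pagesFor(40000) == 5: a 40,000-byte object is rounded up to five 8 KB pages.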

    5.3 size between 16 bytes and 32 KB

    For sizes between 16 bytes and 32 KB, the allocator first computes the appropriate size class and then allocates from mcache.alloc[sizeclass]. If the span for that class in mcache cannot satisfy the request, mcache refills from mcentral and allocates again. After handing a span to mcache, mcentral checks whether it needs to replenish itself, and if so requests more memory from mheap.

    func mallocgc(...) {
            ...
            } else {
                var sizeclass uint8
                //compute the size class
                if size <= smallSizeMax-8 {
                    sizeclass = size_to_class8[(size+smallSizeDiv-1)/smallSizeDiv]
                } else {
                    sizeclass = size_to_class128[(size-smallSizeMax+largeSizeDiv-1)/largeSizeDiv]
                }
                size = uintptr(class_to_size[sizeclass])
                spc := makeSpanClass(sizeclass, noscan)
                span := c.alloc[spc]
                //allocate one object from the corresponding span
                v := nextFreeFast(span)
                if v == 0 {
                    v, span, shouldhelpgc = c.nextFree(spc)
                }
                x = unsafe.Pointer(v)
                if needzero && span.needzero != 0 {
                    memclrNoHeapPointers(unsafe.Pointer(v), size)
                }
              }
    }
    

    First, how the size class is computed. Two lookup tables are precomputed: size_to_class8 and size_to_class128. The i-th entry of size_to_class8 gives the size class for requests in the range ((i-1)*8, i*8] (smallSizeDiv = 8); size_to_class128 works the same way in steps of 128 bytes. Requests up to 1024 - 8 = 1016 bytes (smallSizeMax = 1024) use size_to_class8, larger ones use size_to_class128. The table starts 0, 1, 2, 3, 3, 4, 4, ... For example, to allocate 17 bytes (anything below 16 bytes goes through mcache.tiny), sizeclass = size_to_class8[(17+7)/8] = size_to_class8[3] = 3. Trading a little space for lookup speed this way clearly pays off.
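
    A small sketch of that lookup (the table contents are hypothetical beyond the first few entries quoted above):

    // Sketch of the small-size lookup, smallSizeDiv = 8. Indexing past the entries
    // shown here would need the full runtime table.
    var size_to_class8 = [...]uint8{0, 1, 2, 3, 3, 4, 4 /* ... */}

    func smallSizeClass(size uintptr) uint8 {
        return size_to_class8[(size+7)/8] // e.g. size 17 -> index 3 -> class 3
    }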
    Once the size class is known, the allocation comes from mcache.alloc[sizeclass]; note that this is a pointer to an mspan, and the function that actually hands out an object is nextFreeFast(), shown below.

    // nextFreeFast returns the next free object if one is quickly available.
    // Otherwise it returns 0.
    func nextFreeFast(s *mspan) gclinkptr {
        theBit := sys.Ctz64(s.allocCache) // Is there a free object in the allocCache?
        if theBit < 64 {
            result := s.freeindex + uintptr(theBit)
            if result < s.nelems {
                freeidx := result + 1
                if freeidx%64 == 0 && freeidx != s.nelems {
                    return 0
                }
                s.allocCache >>= uint(theBit + 1)
                s.freeindex = freeidx
                s.allocCount++
                return gclinkptr(result*s.elemsize + s.base())
            }
        }
        return 0
    }
    

    allocCache is a bitmap of block availability, with 1 meaning free; the object's address is then computed from the span's freeindex and elemsize.
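
    The allocCache trick in nextFreeFast can be reproduced with the standard library: allocCache holds the complement of allocBits starting at freeindex, so the lowest set bit marks the next free slot and a count-trailing-zeros instruction finds it directly.

    package main

    import (
        "fmt"
        "math/bits"
    )

    func main() {
        var allocCache uint64 = 0xb4 // binary 10110100: 1-bits mark free slots, relative to freeindex
        theBit := bits.TrailingZeros64(allocCache)
        fmt.Println("next free slot: freeindex +", theBit) // freeindex + 2
    }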
    If mcache.alloc[sizeclass] has run out of free blocks, mcache refills itself from mcentral.

    // nextFree returns the next free object from the cached span if one is available.
    // Otherwise it refills the cache with a span with an available object and
    // returns that object along with a flag indicating that this was a heavy
    // weight allocation. If it is a heavy weight allocation the caller must
    // determine whether a new GC cycle needs to be started or if the GC is active
    // whether this goroutine needs to assist the GC.
    func (c *mcache) nextFree(sizeclass uint8) (v gclinkptr, s *mspan, shouldhelpgc bool) {
        s = c.alloc[sizeclass]
        shouldhelpgc = false
        freeIndex := s.nextFreeIndex()
        if freeIndex == s.nelems {
            // The span is full.
            if uintptr(s.allocCount) != s.nelems {
                println("runtime: s.allocCount=", s.allocCount, "s.nelems=", s.nelems)
                throw("s.allocCount != s.nelems && freeIndex == s.nelems")
            }
            systemstack(func() {
            // here mcache refills its span from mcentral
                c.refill(int32(sizeclass))
            })
            shouldhelpgc = true
            s = c.alloc[sizeclass]
        // after refilling from mcentral, retry the allocation from mcache
            freeIndex = s.nextFreeIndex()
        }
    
        ...
    }
    
    // nextFreeIndex returns the index of the next free object in s at
    // or after s.freeindex.
    // There are hardware instructions that can be used to make this
    // faster if profiling warrants it.
    // this function somewhat duplicates nextFreeFast
    func (s *mspan) nextFreeIndex() uintptr {
        ...
    }
    

    mcache refills from mcentral; if mcentral itself runs short, it requests more from mheap.

    //go version 1.10.8
    
    func (c *mcache) refill(sizeclass int32) *mspan {
        ...
        // request a cached span from mcentral
        s = mheap_.central[sizeclass].mcentral.cacheSpan()
        ...
        return s
    }
    
    // Allocate a span to use in an MCache.
    func (c *mcentral) cacheSpan() *mspan {
        ...
        // Replenish central list if empty.
        s = c.grow()
    }
    
    func (c *mcentral) grow() *mspan {
        npages := uintptr(class_to_allocnpages[c.sizeclass])
        size := uintptr(class_to_size[c.sizeclass])
        n := (npages << _PageShift) / size
        
        //here mcentral requests pages from mheap
        s := mheap_.alloc(npages, c.sizeclass, false, true)
        ...
        return s
    }
    

    If mheap is also short of memory, it asks the OS. Continuing from mheap_.alloc() above:

    func (h *mheap) alloc(npage uintptr, sizeclass int32, large bool, needzero bool) *mspan {
        ...
        var s *mspan
        systemstack(func() {
            s = h.alloc_m(npage, sizeclass, large)
        })
        ...
    }
    
    func (h *mheap) alloc_m(npage uintptr, sizeclass int32, large bool) *mspan {
        ... 
        s := h.allocSpanLocked(npage)
        ...
    }
    
    func (h *mheap) allocSpanLocked(npage uintptr) *mspan {
        ...
        s = h.allocLarge(npage)
        if s == nil {
            if !h.grow(npage) {
                return nil
            }
            s = h.allocLarge(npage)
            if s == nil {
                return nil
            }
        }
        ...
    }
    
    func (h *mheap) grow(npage uintptr) bool {
        // Ask for a big chunk, to reduce the number of mappings
        // the operating system needs to track; also amortizes
        // the overhead of an operating system mapping.
        // Allocate a multiple of 64kB.
        npage = round(npage, (64<<10)/_PageSize)
        ask := npage << _PageShift
        if ask < _HeapAllocChunk {
            ask = _HeapAllocChunk
        }
    
        v := h.sysAlloc(ask)
        ...
    }
    

    That is the full call chain; in the end sysAlloc issues a system call (mmap or VirtualAlloc, much like the initialization path) to request memory from the operating system.

    6. Memory reclamation

    Only some simple aspects of memory reclamation are covered here.

    6.1 mcache reclamation

    Reclaiming an mcache has two parts; the first is returning the unused memory in alloc to the corresponding mcentrals.

    func freemcache(c *mcache) {
        systemstack(func() {
            c.releaseAll()
            ...
    
            lock(&mheap_.lock)
            purgecachedstats(c)
            mheap_.cachealloc.free(unsafe.Pointer(c))
            unlock(&mheap_.lock)
        })
    }
    
    func (c *mcache) releaseAll() {
        for i := 0; i < _NumSizeClasses; i++ {
            s := c.alloc[i]
            if s != &emptymspan {
                mheap_.central[i].mcentral.uncacheSpan(s)
                c.alloc[i] = &emptymspan
            }
        }
        // Clear tinyalloc pool.
        c.tiny = 0
        c.tinyoffset = 0
    }
    

    6.2 mcentral reclamation

    When an mspan no longer holds any allocated objects, it is returned to mheap.

    func (c *mcentral) freeSpan(s *mspan, preserve bool, wasempty bool) bool {
        ...
        lock(&c.lock)
        ...
        if s.allocCount != 0 {
            unlock(&c.lock)
            return false
        }
    
        c.nonempty.remove(s)
        unlock(&c.lock)
        mheap_.freeSpan(s, 0)
        return true
    }
    

    6.3 mheap

    mheap does not promptly return memory to the operating system (long-unused spans may eventually be released by the background scavenger); what it does do is housekeeping on spans, such as coalescing adjacent free spans.

    7. Summary

    I am still learning Go's memory management; most of the text above is organized from material found online, and the references have been listed. Some parts I have not fully understood yet, so if you spot problems, please discuss and correct them.
