HashMap源码分析（主要基于JDK1.8）

作者: 言淺 | 来源:发表于2019-12-09 20:22 被阅读0次

Java集合框架分析(四)HashMap分析
Java集合——HashMap
HashMap源码分析（主要基于JDK1.8）
HashMap,SparseArray比较
Java 集合框架_HashMap源码解析
二、HashMap、HashSet和Hashtable
三、TreeMap和TreeSet
JDK1.8 HashMap 源码分析之resize函数（一）
HashMap源码分析
【Java集合】源码分析之HashTable

什么是HashMap

HashMap是一个用于存储Key-Value键值对的集合，也是Java中使用频率最高的用于映射(键值对)处理的数据类型。

众所周知，HashMap访问速度快，遍历顺序不确定，线程不安全。这些特点都是由底层实现决定的，将在本文中进行分析。

在JDK1.8中，对HashMap的底层实现进行了优化，接下来本文将结合JDK1.7及JDK1.8的区别，对HashMap的底层实现原理做简单的分析（主要基于JDK1.8）。

JDK1.7、JDK1.8中，HashMap的区别简述

JDK1.7
1、hash计算，4次扰动
2、实现方式：链表 + 数组
3、链表：头插法（并发扩容时容易出现环形链表）
4、扩容时重新计算桶下标

JDK1.8
1、hash计算，1次扰动
2、实现方式：链表 + 数组 / 红黑树
3、链表：尾插法
4、扩容时只计算hash值新增的bit

存储结构

都知道JDK1.7中，HashMap的实现方式是：数组 + 链表，在JDK1.8中，引入了红黑树做优化。那么，数组 + 链表 / 红黑树，这个存储结构，是怎么实现的呢？让我们把目光放到源码中，

/* ---------------- Fields -------------- */

    /**
     * The table, initialized on first use, and resized as
     * necessary. When allocated, length is always a power of two.
     * (We also tolerate length zero in some operations to allow
     * bootstrapping mechanics that are currently not needed.)
     */
    transient Node<K,V>[] table;

字段定义的一开始，就是一个Node<K,V>类型的table数组。那么再看看Node<K,V>的定义：

    /**
     * Basic hash bin node, used for most entries.  (See below for
     * TreeNode subclass, and in LinkedHashMap for its Entry subclass.)
     */
    static class Node<K,V> implements Map.Entry<K,V> {
        final int hash;
        final K key;
        V value;
        Node<K,V> next;

        ......

    }

Node<K,V>是HashMap的一个内部类，它实现了Map.Entry接口，本质上就是一个映射（键值对）。Node<K,V>中，除了key和value字段之外，还定义了hash和next字段。next字段也是Node<K,V>类型的，这意味着，Node<K,V>这个类有能力组成链表。事实也正是如此，下图中，我画出了HashMap的实现图示，包含了链表和红黑树，红黑树的定义TreeNode<K,V>同样也是HashMap中的一个内部类。

存储结构图示

左侧的是table数组，每一个圆点都代表一个Node / TreeNode。

有个小的细节，在这里提及一下。当链表达到树化阀值（TREEIFY_THRESHOLD = 8）的时候，并不会直接树化这条链表，而是先判断table数组有没有达到最小的树化容量（MIN_TREEIFY_CAPACITY = 64），如果没达到的话，此时HashMap会选择扩容，而非树化。因此，实际使用过程中，要出现图示的结构，table数组的长度至少要有64。

HashMap初始化

简单的了解下初始化方法。在实际使用过程当中，最常用的就是不带参数的默认构造函数，和指定容量的构造函数。例：

        Map<String, String> map1 = new HashMap<>();

        Map<String, String> map2 = new HashMap<>(20);

        Map<String, String> map3 = new HashMap<>(65);

不带参数的默认构造函数，实际上只做了一件事情，就是设置负载因子loadFactor的值。在此不赘述。

    /**
     * Constructs an empty <tt>HashMap</tt> with the default initial capacity
     * (16) and the default load factor (0.75).
     */
    public HashMap() {
        this.loadFactor = DEFAULT_LOAD_FACTOR; // all other fields defaulted
    }

带指定容量的构造函数，通过代码，可以看到实际上是调用了这个方法：

/* ---------------- Public operations -------------- */

    /**
     * Constructs an empty <tt>HashMap</tt> with the specified initial
     * capacity and load factor.
     *
     * @param  initialCapacity the initial capacity
     * @param  loadFactor      the load factor
     * @throws IllegalArgumentException if the initial capacity is negative
     *         or the load factor is nonpositive
     */
    public HashMap(int initialCapacity, float loadFactor) {
        if (initialCapacity < 0)
            throw new IllegalArgumentException("Illegal initial capacity: " +
                                               initialCapacity);
        if (initialCapacity > MAXIMUM_CAPACITY)
            initialCapacity = MAXIMUM_CAPACITY;
        if (loadFactor <= 0 || Float.isNaN(loadFactor))
            throw new IllegalArgumentException("Illegal load factor: " +
                                               loadFactor);
        this.loadFactor = loadFactor;
        this.threshold = tableSizeFor(initialCapacity);
    }

可以看出，除了设置负载因子loadFactor的值外，还调用了tableSizeFor方法，将其返回值赋给threshold字段。

tableSizeFor这个方法，可以重点了解下。方法入参是容量的期望值，返回则是大于等于期望值的，距离这个值最近的2的次方值。

     /**
     * Returns a power of two size for the given target capacity.
     */
    static final int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

通过不断的与自身无符号右移的结果做按位或，得到一个低位全部都是1的数据。然后再加1，就可以得到2的次方值。

以期望容量65为例，经过一系列无符号右移及按位或操作后，得到了实际容量128。
（这个值是容量，也是table的长度。初始化的时候暂时把这个值赋值给了threshold，一般threshold的计算方式其实是capacity * load factor，由此可以得知，当table == null时，threshold的值可能为0，也可能为当前实际容量）

tableSizeFor方法计算图示

按位或( | )：按位运算，有一个是1，结果为1，否则为0；
无符号右移( >>> )：向右移动N位，高位用0补齐；

HashMap的put方法解析

要再深入一点了解HashMap的原理，那么可以先从put方法入手，put方法大致可以分为三步：

· 检查是否需要resize()
· 确定哈希桶数组索引位置
· 存放键值至对应链表/红黑树中

先来一个整体流程图：

put方法流程图

可以看到，流程的第一步是判断是否需要扩容，这一步的目的很明确。就是保证后边的存放步骤可以存放到一个正确的数组当中。像刚才我们提到的初始化方法，就肯定要进行扩容，因为构造函数中，并没有对table字段进行赋值，这样显然是无法使用的。这个时候扩容，就会将table默认初始化成一个长度为16的Node数组，当然，如果填了初始化容量，那么table就会被初始化成长度为tableSizeFor返回值的Node数组。

计算哈希桶数组索引

此时我们已经拥有了一个长度合适的table数组，那么接下来将要计算Key在哈希桶数组中的索引，那索引的计算公式是什么呢，看源码：

     /**
     * Associates the specified value with the specified key in this map.
     * If the map previously contained a mapping for the key, the old
     * value is replaced.
     *
     * @param key key with which the specified value is to be associated
     * @param value value to be associated with the specified key
     * @return the previous value associated with <tt>key</tt>, or
     *         <tt>null</tt> if there was no mapping for <tt>key</tt>.
     *         (A <tt>null</tt> return can also indicate that the map
     *         previously associated <tt>null</tt> with <tt>key</tt>.)
     */
    public V put(K key, V value) {
        return putVal(hash(key), key, value, false, true);
    }

    /**
     * Implements Map.put and related methods
     *
     * @param hash hash for key
     * @param key the key
     * @param value the value to put
     * @param onlyIfAbsent if true, don't change existing value
     * @param evict if false, the table is in creation mode.
     * @return previous value, or null if none
     */
    final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
                   boolean evict) {
        Node<K,V>[] tab; Node<K,V> p; int n, i;
        if ((tab = table) == null || (n = tab.length) == 0)
            n = (tab = resize()).length;
        if ((p = tab[i = (n - 1) & hash]) == null)
            tab[i] = newNode(hash, key, value, null);
        else {
            ......
        }
        ++modCount;
        if (++size > threshold)
            resize();
        afterNodeInsertion(evict);
        return null;
    }

    ......

    /**
     * Computes key.hashCode() and spreads (XORs) higher bits of hash
     * to lower.  Because the table uses power-of-two masking, sets of
     * hashes that vary only in bits above the current mask will
     * always collide. (Among known examples are sets of Float keys
     * holding consecutive whole numbers in small tables.)  So we
     * apply a transform that spreads the impact of higher bits
     * downward. There is a tradeoff between speed, utility, and
     * quality of bit-spreading. Because many common sets of hashes
     * are already reasonably distributed (so don't benefit from
     * spreading), and because we use trees to handle large sets of
     * collisions in bins, we just XOR some shifted bits in the
     * cheapest possible way to reduce systematic lossage, as well as
     * to incorporate impact of the highest bits that would otherwise
     * never be used in index calculations because of table bounds.
     */
    static final int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

关键代码在putVal方法的第5行，p = tab[i = (n - 1) & hash] 这里。JDK1.8中并没有将哈希桶数组的索引抽成方法来计算。但从代码中我们还是可以提取出，其计算公式为：

(n - 1) & hash

话不多说，上图：
key = "hello world"
key.hashCode() = 1794106052
（1794106052的二进制表示：0110 1010 1110 1111 1110 0010 1100 0100）

哈希桶数组索引计算图示

按位与( & )：按位运算，都是1，结果为1，否则为0；
位异或( ^ )：按位运算，相反为1，相同为0；

图示中是JDK1.8中，哈希桶数组索引的计算步骤拆解。其中h ^ (h >>> 16)这一步，是为了让hashCode的高位也参与到运算中来，增加hash值的随机性。JDK1.7和JDK1.8中，hash值的计算方式是不同的，因此，在JDK1.7和JDK1.8中，同样的数据所对应的桶下标是可能不一致的。（并没有什么关系，因为除了你我，别人可能都不太关心HashMap中的数据到底是怎么存放的😑）

从图示中可以看到，key在哈希桶数组的位置，取决与它自己的hash值和当前HashMap的实际容量，也就是table的数组长度。（图示中HashMap的当前容量是16，key的桶索引其实就是其hash值的低4位）

存放逻辑

哈希桶数组索引已经有了，那么接下来就是将键值对存放到对应的数组 / 红黑树中。从流程图中可以看到，接下来就是判断是否有产生了hash冲突。本文开头提及过，HashMap的访问速度非常快。在Hash算法合理的前提下，最理想的情况是所有的键值对都均匀的散落到table数组中，在不发生哈希冲突的情况下，只要能定位到哈希桶索引，就能定位到想获取的数据。所以这里判断了两次，第一次，判断tab[i]是否为null，如果为null，那么直接存放即可；如果不是，则进行第二次判断，判断tab[i]中已存放的key是否就是我们将要处理的key，如果是的话，说明也没有发生哈希冲突，此时直接覆盖即可。如果也不是，说明发生了哈希冲突，这时候，就要区分：

链表：遍历链表，查找是否已经存在该key，已存在则覆盖原有值；若不存在，则将key、value插入链表末尾（JDK1.8：尾插法）；如果插入该数据将使得链表长度达到树化阀值8，则进行树化；

红黑树：遍历树节点，若该key已存在则覆盖原有值，若不存在则插入树节点；（红黑树这块后边单独开个文章细讲）

扩容逻辑

将数据按逻辑存放好后，在putVal方法的末尾，进行了一次检查（++size > threshold），如果判断数据已经达到了threshold，则做一次扩容resize()。

已经了解到HashMap的核心字段table之后，扩容这个操作就很好理解了。在Java中，数组是不可以自动扩容的。因此，扩容的思路就是用一个容量大的数组，替代原先容量小的数组，并将数据迁移到新数组当中。

实际上HashMap也是这么做的，用一个新的数组替代原先容量小的数组，但是在数据迁移的过程中，并不是直接一个个数据重新计算hash值及桶下标，而是只计算了hash值新增的bit。设计得非常巧妙。

还记得上边说到的哈希桶数组索引的计算方式吗？

(n - 1) & hash

还是以key = "hello world"为例，请看：

扩容机制-1

同一个key，它的hash值是不变的，那么变量在哪里呢？自然就是n了，n是数组的长度。图中HashMap的容量从16扩容到32，唯一变化的就是(n - 1)，从低4位都是1，变成了低5位都是1。
而变化的这一位，很关键的，就等于原有的容量。

扩容机制-2

由上可以得知，扩容前后，数据所在的桶下标只有2种情况：
1、新增bit为0，桶下标不变；
2、新增bit为1，桶下标 = 原有桶下标 + 原容量；

这样，就可以将原先某个哈希桶中冲突的节点，分散到2个哈希桶中了。

同时，新增bit为0 / 1，这件事情可以认为是随机的，因此扩容过程，可以把原先hash冲突的节点均匀的分散到新的桶中。

至此，HashMap的put方法就简单的解析完了。源码写得很赞，可以学到挺多的。

非线程安全

HashMap并非线程安全类，在刚才的源码解析中，可以发现在put操作中，并没有加锁，也就是说任一时刻可以多个线程同时写HashMap。那么就会有一定的几率会出现数据丢失的情况。先简单的列举以下几个场景，如：

1、两个线程同时判断完 tab[i] == null, 准备newNode赋值给 tab[i] 时；
2、两个线程同时判断到链表末尾，(e = p.next) == null，准备newNode赋值给 p.next 时；
3、一个线程扩容中，已经将新的Node数组赋值给table，但未迁移完数据。此时另一个线程执行了put / remove操作，则 put / remove 操作可能无效；

……

还有其他非线程安全的可能，这里没有全部列举出来，只是提供思路供参考。

/**
     * Implements Map.put and related methods
     *
     * @param hash hash for key
     * @param key the key
     * @param value the value to put
     * @param onlyIfAbsent if true, don't change existing value
     * @param evict if false, the table is in creation mode.
     * @return previous value, or null if none
     */
    final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
                   boolean evict) {
        Node<K,V>[] tab; Node<K,V> p; int n, i;
        if ((tab = table) == null || (n = tab.length) == 0)
            n = (tab = resize()).length;
        if ((p = tab[i = (n - 1) & hash]) == null)
            tab[i] = newNode(hash, key, value, null);
        else {
            Node<K,V> e; K k;
            if (p.hash == hash &&
                ((k = p.key) == key || (key != null && key.equals(k))))
                e = p;
            else if (p instanceof TreeNode)
                e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
            else {
                for (int binCount = 0; ; ++binCount) {
                    if ((e = p.next) == null) {
                        p.next = newNode(hash, key, value, null);
                        if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                            treeifyBin(tab, hash);
                        break;
                    }
                    if (e.hash == hash &&
                        ((k = e.key) == key || (key != null && key.equals(k))))
                        break;
                    p = e;
                }
            }
            if (e != null) { // existing mapping for key
                V oldValue = e.value;
                if (!onlyIfAbsent || oldValue == null)
                    e.value = value;
                afterNodeAccess(e);
                return oldValue;
            }
        }
        ++modCount;
        if (++size > threshold)
            resize();
        afterNodeInsertion(evict);
        return null;
    }

当2个线程同时已经判断完了if ((p = tab[i = (n - 1) & hash]) == null)，准备执行写入数据时，先写入的数据将被后写入的数据覆盖。同理，如果2个线程同时已经判断到了链表的末尾 if ((e = p.next) == null)，准备执行 p.next = newNode(hash, key, value, null)写入数据时，后写入的数据将覆盖先写入的数据。（大前提都是，同时写入的数据，桶下标刚好相同，也就是产生了Hash冲突）

除此之外，当线程1正在扩容，已经将新数组赋值给table（已执行到table = newTab），但还未执行下边的数据迁移操作，此时线程2写入（或删除）数据，线程1再执行完对应的数据迁移操作，都会导致线程2写入的数据被覆盖（删除数据将不生效）

    /**
     * Initializes or doubles table size.  If null, allocates in
     * accord with initial capacity target held in field threshold.
     * Otherwise, because we are using power-of-two expansion, the
     * elements from each bin must either stay at same index, or move
     * with a power of two offset in the new table.
     *
     * @return the table
     */
    final Node<K,V>[] resize() {
        Node<K,V>[] oldTab = table;
        int oldCap = (oldTab == null) ? 0 : oldTab.length;
        int oldThr = threshold;
        int newCap, newThr = 0;
        
        ......

        threshold = newThr;
        @SuppressWarnings({"rawtypes","unchecked"})
            Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
        table = newTab;
        if (oldTab != null) {

           ......

        }
        return newTab;
    }

为了加深印象，这里准备了2个例子，供参考。

下边是我本地复现线程不安全的代码，感兴趣的话可以试试。这是基于JDK1.8的。

示例1

import java.util.HashMap;

public class HashMapDemoResize1 {

    private static HashMap<String, String> map = new HashMap<String, String>();
    public static void main(String[] args) {
        map.put("5", "C");

        new Thread("Thread1") {
            public void run() {
                map.put("16", "B");
                System.out.println("put record. key: 16, value: B. result: " + map);
            };
        }.start();
        new Thread("Thread2") {
            public void run() {
                map.put("27", "A");
                System.out.println("put record. key: 27, value: A. result: " + map);
            };
        }.start();

        System.out.println("HashMapDemoResize1");
    }

}

本地我是先让两个线程都同时跑到if ((e = p.next) == null)，然后让Thread2先把key = "27"的数据put到链表末尾，再执行Thread1的代码。可以看到，打印出来的结果中，key = "27"的数据被覆盖了。

HashMapDemoResize1

示例2

import java.util.HashMap;

public class HashMapDemoResize2 {

    private static HashMap<String, String> map = new HashMap<String, String>(4);

    public static void main(String[] args) {

        map.put("12", "A");
        map.put("23", "B");
        map.put("34", "C");
        System.out.println("");

        new Thread("Thread1") {
            public void run() {
                map.put("45", "D");
                System.out.println("put record. key: 36, value: D. result: " + map);
            };
        }.start();
        new Thread("Thread2") {
            public void run() {
                map.put("56", "E");
                System.out.println("put record. key: 48, value: E. result: " + map);
            };
        }.start();

        System.out.println("HashMapDemoResize2");


    }
    
}

这里初始容量设置为4的目的是，threshold = capacity * load factor，threshold为3，也就是当put到第4个数据的时候，需要扩容resize()。
我本地让Thread1先跑到resize()方法中table = newTab这一步，然后让Thread2将key = "56"的键值对put进去，再让Thread1执行数据迁移操作。
从结果中可以看到，key = "56"的键值对被覆盖了。

HashMapDemoResize2

看到这里，相信你已经知道为什么说HashMap是线程不安全的了吧。如果要求线程安全的话，建议使用ConcurrentHashMap；如果不要求线程安全，直接使用HashMap即可。

如果这篇文章对你有帮助，不妨点个赞再走吧~ ^_

Java集合框架分析(四)HashMap分析
本篇文章主要分析一下Java集合框架中的Map部分，HashMap，该源码分析基于JDK1.8，分析工具，Andr...
Java集合——HashMap
简介本篇介绍基于JDK1.8的HashMap的数据结构以及主要方法的源码。HashMap基于散列表，继承了Abs...
HashMap源码分析（主要基于JDK1.8）
什么是HashMap HashMap是一个用于存储Key-Value键值对的集合，也是Java中使用频率最高的用于...
HashMap,SparseArray比较
数据基于JDK1.8主要分析分析SparseArray，HashMap 查找效率，内存空间占用原理分析 Hash...
Java 集合框架_HashMap源码解析
今天终于分析HashMap的源码，其实它的主要算法在我的Java 集合框架_HashMap JDK1.8新算法这篇...
二、HashMap、HashSet和Hashtable
所有集合基于jdk1.8，对源码稍做调整。 HashMap 主要变量构造方法 put get HashSet H...
三、TreeMap和TreeSet
所有集合基于jdk1.8，对源码稍做调整。 HashMap 主要变量构造方法 put get TreeSet T...
JDK1.8 HashMap 源码分析之resize函数（一）
JDK1.8 HashMap 源码分析之resize函数（一）一、HashMap内部结构 static fina...
HashMap源码分析
HashMap源码分析本文对jdk1.8的HashMap做了分析首先看一下HashMap的继承图： 1.实现了...
【Java集合】源码分析之HashTable
注：本文所有源码基于JDK1.8 概述前面我们已经分析了HashMap，TreeMap这两种常用的Map类型，但...