数据结构-散列表（HashTable）

作者: GemShi | 来源:发表于2019-01-31 17:06 被阅读53次

数据结构-散列表（HashTable）
散列表hashtable
Hashtable源码分析
HashTable哈希/散列表
线性表--散列表(HashTable)
C#之数据结构（下）
Java中散列表、树所对应的容器类。HashMap如何解决has
数据结构与算法-散列表查找实现
Hashtable 的实现原理
Hashtable 的实现原理

本质

哈希表的底层数据结构是数组，数组中每一个元素存放的是键值对。

structure of hashtable

核心原理

f(key) = index

将key传入，通过某个算法f，计算出索引，如果索引冲突，再通过某个办法对索引进行+1或-1再算一遍，直到算到不冲突为止。

负载因子 = 总键值对数 / 哈希表总长度

负载因子用来衡量用来衡量哈希表的空满程度，一般，当负载因子为0.75～1的值就会进行自动扩容。

存取过程（以iOS方法缓存列表底层实现为例）

example：Class内部结构中有个方法缓存列表，当objc_msgSend进行到搜索方法缓存列表的步骤时，为使查找更高效，缓存列表的底层就是通过哈希表实现对调用过的方法的缓存。
1.存
（1）根据key计算出索引值
（2）如果该索引中没有元素，就存入键值对
（3）如果该索引中有元素，索引冲突，就对索引进行+1或-1进行存储
（4）如果当前已存在的所有索引都有元素，就进行哈希表的扩容
（5）设置新空间是旧空间的2倍
（6）重新分配内存
（7）重新设置掩码_mask = newCapacity-1
（8）会将旧内存释放掉，清空缓存

objc_cache.mm源码解析

static void cache_fill_nolock(Class cls, SEL sel, IMP imp, id receiver)
{
    cacheUpdateLock.assertLocked();
    // Never cache before +initialize is done
    if (!cls->isInitialized()) return;
    // Make sure the entry wasn't added to the cache by some other thread 
    // before we grabbed the cacheUpdateLock.
    if (cache_getImp(cls, sel)) return;
    cache_t *cache = getCache(cls);
    cache_key_t key = getKey(sel);
    // Use the cache as-is if it is less than 3/4 full
    mask_t newOccupied = cache->occupied() + 1;
    mask_t capacity = cache->capacity();
    if (cache->isConstantEmptyCache()) {
        // Cache is read-only. Replace it.
        cache->reallocate(capacity, capacity ?: INIT_CACHE_SIZE);
    }
    else if (newOccupied <= capacity / 4 * 3) {
        // Cache is less than 3/4 full. Use it as-is.
    }
    else {
        // Cache is too full. Expand it.
        cache->expand();
    }
    // Scan for the first unused slot and insert there.
    // There is guaranteed to be an empty slot because the 
    // minimum size is 4 and we resized at 3/4 full.
    bucket_t *bucket = cache->find(key, receiver);
    if (bucket->key() == 0) cache->incrementOccupied();
    //设置bucket中的key和imp
    bucket->set(key, imp);
}
//扩容
void cache_t::expand()
{
    cacheUpdateLock.assertLocked();
    
    uint32_t oldCapacity = capacity();
    //新的空间是旧的空间的2倍
    uint32_t newCapacity = oldCapacity ? oldCapacity*2 : INIT_CACHE_SIZE;
    if ((uint32_t)(mask_t)newCapacity != newCapacity) {
        // mask overflow - can't grow further
        // fixme this wastes one bit of mask
        newCapacity = oldCapacity;
    }
    //重新分配内存
    reallocate(oldCapacity, newCapacity);
}

void cache_t::reallocate(mask_t oldCapacity, mask_t newCapacity)
{
    bool freeOld = canBeFreed();
    bucket_t *oldBuckets = buckets();
    bucket_t *newBuckets = allocateBuckets(newCapacity);
    // Cache's old contents are not propagated. 
    // This is thought to save cache memory at the cost of extra cache fills.
    // fixme re-measure this
    assert(newCapacity > 0);
    assert((uintptr_t)(mask_t)(newCapacity-1) == newCapacity-1);
    //会重新设置最新的_mask = newCapacity - 1;
    setBucketsAndMask(newBuckets, newCapacity - 1);
    
    if (freeOld) {
        //会将旧的释放掉，清空缓存
        cache_collect_free(oldBuckets, oldCapacity);
        cache_collect(false);
    }
}

2.取
（1）通过key得到索引
（2）do-while循环，如果散列表元素的key恰好等于按位与掩码（&_mask）取出的索引，就直接返回
（3）如果不是，就将索引-1继续查找

objc_cache.mm源码解析

bucket_t * cache_t::find(cache_key_t k, id receiver)
{
    assert(k != 0);
    //返回buckets散列表
    bucket_t *b = buckets();
    mask_t m = mask();
    //得到索引
    mask_t begin = cache_hash(k, m);
    mask_t i = begin;
    do {
        //如果buket_t的索引的恰好等于通过&_mask取出的索引，就直接返回
        if (b[i].key() == 0  ||  b[i].key() == k) {
            return &b[i];
        }
    } while ((i = cache_next(i, m)) != begin);
    //如果不是，直接i-1，如果减到0还不行，就直接是_mask（相当于再从最后一位继续减）
    
    // hack
    Class cls = (Class)((uintptr_t)this - offsetof(objc_class, cache));
    cache_t::bad_cache(receiver, (SEL)k, cls);
}

static inline mask_t cache_hash(cache_key_t key, mask_t mask) 
{
    //按位与mask得到索引
    return (mask_t)(key & mask);
}

#if __arm__  ||  __x86_64__  ||  __i386__
// objc_msgSend has few registers available.
// Cache scan increments and wraps at special end-marking bucket.
#define CACHE_END_MARKER 1
static inline mask_t cache_next(mask_t i, mask_t mask) {
    return (i+1) & mask;
}
#elif __arm64__
// objc_msgSend has lots of registers available.
// Cache scan decrements. No end marker needed.
#define CACHE_END_MARKER 0
static inline mask_t cache_next(mask_t i, mask_t mask) {
    //arm64架构，索引-1
    return i ? i-1 : mask;
}

总结

Objective-C中的实现，优缺点并存，但相对于实际情况而言，还是优点大于缺点。因为对于方法缓存列表，一般不会有大量的数据，扩容或许是少数情况。此时，你可能会有疑问，当数据量少的时候，岂不是牺牲了内存？然而，哈希表就是采用了空间换时间，牺牲内存空间，换取存取效率的方法。所以，整体符合需求即可。
优点：
整体而言，提升了存取效率，时间复杂度为O(1)，无需遍历。
缺点：
当哈希表比较大时，如果扩容会导致瞬间效率降低。

不同编程语言，哈希表的实现也各有特点，但是本质和原理不变。

例如：
对于大量的数据或者数据库中的存取，Java和Redis也有自己的优化方法。
1.Java中，当哈希函数不合理导致链表过长时，会使用红黑树来保证插入和查找的效率。
2.Redis 使用增量式扩容方法优化了这个缺点，同时还有拉链法的实现(放在链表头部头插，因为新插入的调用概率会高)。