在讨论hashCode之前,我先提出几个问题,后面我会一一解答
- 什么是hashCode?
- hashCode有什么用?
- 为什么会发生哈希冲突,可以避免吗?
- 为什么重写equals()方法一定要重写hashCode()方法?
- hashCode在Java中是如何生成的?
在讨论上述问题之前,我们先看看java里面是如何对hashCode定义的,下面我摘抄一段java8 Object对hashCode()的一段注释
/**
* Returns a hash code value for the object. This method is
* supported for the benefit of hash tables such as those provided by
* {@link java.util.HashMap}.
* <p>
* The general contract of {@code hashCode} is:
* <ul>
* <li>Whenever it is invoked on the same object more than once during
* an execution of a Java application, the {@code hashCode} method
* must consistently return the same integer, provided no information
* used in {@code equals} comparisons on the object is modified.
* This integer need not remain consistent from one execution of an
* application to another execution of the same application.
* <li>If two objects are equal according to the {@code equals(Object)}
* method, then calling the {@code hashCode} method on each of
* the two objects must produce the same integer result.
* <li>It is <em>not</em> required that if two objects are unequal
* according to the {@link java.lang.Object#equals(java.lang.Object)}
* method, then calling the {@code hashCode} method on each of the
* two objects must produce distinct integer results. However, the
* programmer should be aware that producing distinct integer results
* for unequal objects may improve the performance of hash tables.
* </ul>
*/
我把他的意思大致归纳于以下几点
- 每个对象都能返回一个hashCode,hashCode是用来加速哈希表的
- 在同一java应用程序执行期间,对同一个对象调用超过一次时,如果equals()比较的内容没有发生变化,hashCode()的返回应该也不应该发生变化。(特别需要注意的是同一java程序,不同java程序可以允许hashCode发生变化)
- 根据equals()方法,两个对象相等,那么hashCode也必须要相等
- 根据equals()方法,两个对象不相等,那么hashCode可能相等,但是要尽可能的让他们不相等,这样可以提升哈希表的性能
什么是hashCode?
hashCode是用来加速哈希表的,在同一个java应用程序中,hashCode是对java对象的一个签名,相等的对象hashCode一定相等。
hashCode有什么用?
上面也已经提到了,是用来加速哈希表的,采取了空间换时间的方式。
这一篇博客不是说哈希表的,因此这里不做展开
大致思路是先通过hashcode去哈希表里面找到对应的哈希桶,然后调用equals()方法去判断哈希桶里面挂载的元素有没有相等的。
为什么相等的元素会出现在同一个哈希桶里面?
前文已经说过了,相等的对象hashcode一定相等,hashcode相等的,对象却不一定相等,因此需要再调用equals()方法做判断
为什么会发生哈希冲突,可以避免吗?
先来看一下java里面hashCode的方法签名
public native int hashCode();
可以看到hashCode本质上面来说是一个int,int的表示的数据范围是有限的,而对象是无限的,把一个无限的集合压缩进一个有限的集合中,发生哈希冲突就成了一个必然事件,哈希冲突是无法避免的。
当然,在我们的应用中,对象不可能是无穷无尽的,因此选择好的哈希算法是可以降低哈希冲突的概率的
为什么重写equals()方法一定要重写hashCode()方法?
这个问题实际上前面已经多次提到了,根据java规范,equals()相等的对象,hashCode()必须相等
重写equals()必然导致对象是否相等的结果发生变化,因此也就需要修改hashCode()方法
如果只修改equals()不修改hashCode()方法,就会导致使用哈希表的时候不能准确定位到哈希桶,导致哈希表工作异常
hashCode在Java中是如何生成的?
这个问题其实很复杂,我再摘抄一段java8 Object里面的注释
/**
* As much as is reasonably practical, the hashCode method defined by
* class {@code Object} does return distinct integers for distinct
* objects. (This is typically implemented by converting the internal
* address of the object into an integer, but this implementation
* technique is not required by the
* Java™ programming language.)
*/
这句话的意思是说,很多时候hashcode是对象内存地址的一个映射,但是java里面不是的。
那么java里面的hashCode是如何生成的呢?
实际上java默认的hashcode是由一套随机算法生成的,只与对象生成的顺序和线程有关。
下面我用代码证明下
public static void main(String[] args) {
for (int i = 0; i < 5; i++) {
Object o = new Object();
System.out.println("hashCode:" + o.hashCode());
System.out.println("address:" + VM.current().addressOf(o));
}
}
可以直接用Unsafe类拿到内存地址,不过操作比较繁琐,因此我用了别人封装的工具包
<dependency>
<groupId>org.openjdk.jol</groupId>
<artifactId>jol-core</artifactId>
<version>0.10</version>
</dependency>
反复执行上面代码,结果如下
第一次 | 第二次 | 第三次 |
---|---|---|
hashCode:531885035 address:31856529176 hashCode:705265961 address:31874395832 hashCode:428746855 address:31874396520 hashCode:317983781 address:31874397208 hashCode:987405879 address:31874397896 | hashCode:531885035 address:31856529344 hashCode:705265961 address:31874395856 hashCode:428746855 address:31874396544 hashCode:317983781 address:31874397232 hashCode:987405879 address:31874397920 | hashCode:531885035 address:31856529920 hashCode:705265961 address:31874407200 hashCode:428746855 address:31874407888 hashCode:317983781 address:31874408576 hashCode:987405879 address:31874409264 |
我们可以清楚的看到,相同的顺序hashCode是一样的,内存地址是不一样的,因此可以肯定java的hashCode和内存是无关的,看起来与创建顺序有关。
接下来我们用代码去查看发生哈希冲突时的情况
public static void main(String[] args) {
HashSet<Integer> set = new HashSet<Integer>();
int count = 0, num = 0;
while (true) {
count++;
Object o = new Object();
if (set.contains(o.hashCode())) {
num++;
System.out.println("count:" + count);
System.out.println("hashCode:" + o.hashCode());
if (num == 5) break;
}
set.add(o.hashCode());
}
}
重复运行上面代码
第一次 | 第二次 | 第三次 |
---|---|---|
count:105730 hashCode:2134400190 count:111177 hashCode:651156501 count:121838 hashCode:1867750575 count:145361 hashCode:2038112324 count:146294 hashCode:1164664992 | count:105730 hashCode:2134400190 count:111177 hashCode:651156501 count:121838 hashCode:1867750575 count:145361 hashCode:2038112324 count:146294 hashCode:1164664992 | count:105730 hashCode:2134400190 count:111177 hashCode:651156501 count:121838 hashCode:1867750575 count:145361 hashCode:2038112324 count:146294 hashCode:1164664992 |
我们可以看到,发生哈希冲突的位置是一样的,发生哈希冲突时的hashCode也是一样的,因此我们可以断定hashcode的生成与对象生成的顺序有关。
接下来我们用不同的线程去重复上述操作
public static void main(String[] args) throws InterruptedException {
new Thread(() ->{
HashSet<Integer> set = new HashSet<Integer>();
int count = 0, num = 0;
while (true) {
count++;
Object o = new Object();
if (set.contains(o.hashCode())) {
num++;
System.out.println("count:" + count);
System.out.println("hashCode:" + o.hashCode());
if (num == 5) break;
}
set.add(o.hashCode());
}
}).start();
Thread.sleep(1000L);
}
结果如下
第一次 | 第二次 | 第三次 |
---|---|---|
count:53869 hashCode:820543451 count:87947 hashCode:951498229 count:99859 hashCode:175788428 count:100940 hashCode:1979813773 count:123438 hashCode:916565907 | count:51579 hashCode:1573800482 count:124111 hashCode:1834341042 count:139761 hashCode:2021276643 count:142276 hashCode:1807321828 count:143515 hashCode:1522017140 | count:69186 hashCode:742597489 count:102967 hashCode:2084692455 count:144290 hashCode:424089733 count:176112 hashCode:53258565 count:178639 hashCode:1046986547 |
看起来没有任何规律了,这就需要去扒java的源码才能解答这个现象了
在Object中我们可以看到hashCode实际上是调用了IHashCode
static JNINativeMethod methods[] = {
{"hashCode", "()I", (void *)&JVM_IHashCode},
{"wait", "(J)V", (void *)&JVM_MonitorWait},
{"notify", "()V", (void *)&JVM_MonitorNotify},
{"notifyAll", "()V", (void *)&JVM_MonitorNotifyAll},
{"clone", "()Ljava/lang/Object;", (void *)&JVM_Clone},
};
然后在jvm.cpp找到了IHashCode实际调用的是FastHashCode
JVM_ENTRY(jint, JVM_IHashCode(JNIEnv* env, jobject handle))
JVMWrapper("JVM_IHashCode");
// as implemented in the classic virtual machine; return 0 if object is NULL
return handle == NULL ? 0 : ObjectSynchronizer::FastHashCode (THREAD, JNIHandles::resolve_non_null(handle)) ;
JVM_END
接着扒FastHashCode,FastHashCode在synchronizer.cpp中
intptr_t ObjectSynchronizer::FastHashCode (Thread * Self, oop obj) {
if (UseBiasedLocking) {
// NOTE: many places throughout the JVM do not expect a safepoint
// to be taken here, in particular most operations on perm gen
// objects. However, we only ever bias Java instances and all of
// the call sites of identity_hash that might revoke biases have
// been checked to make sure they can handle a safepoint. The
// added check of the bias pattern is to avoid useless calls to
// thread-local storage.
if (obj->mark()->has_bias_pattern()) {
// Box and unbox the raw reference just in case we cause a STW safepoint.
Handle hobj (Self, obj) ;
// Relaxing assertion for bug 6320749.
assert (Universe::verify_in_progress() ||
!SafepointSynchronize::is_at_safepoint(),
"biases should not be seen by VM thread here");
BiasedLocking::revoke_and_rebias(hobj, false, JavaThread::current());
obj = hobj() ;
assert(!obj->mark()->has_bias_pattern(), "biases should be revoked by now");
}
}
// hashCode() is a heap mutator ...
// Relaxing assertion for bug 6320749.
assert (Universe::verify_in_progress() ||
!SafepointSynchronize::is_at_safepoint(), "invariant") ;
assert (Universe::verify_in_progress() ||
Self->is_Java_thread() , "invariant") ;
assert (Universe::verify_in_progress() ||
((JavaThread *)Self)->thread_state() != _thread_blocked, "invariant") ;
ObjectMonitor* monitor = NULL;
markOop temp, test;
intptr_t hash;
markOop mark = ReadStableMark (obj);
// object should remain ineligible for biased locking
assert (!mark->has_bias_pattern(), "invariant") ;
//mark是对象头
if (mark->is_neutral()) {
hash = mark->hash(); // 取出hash值(实际上可以理解为缓存,有就直接返回,没有就生成一个新的)
if (hash) { // if it has hash, just return it
return hash;
}
hash = get_next_hash(Self, obj); // 这是生成hashcode的核心方法
temp = mark->copy_set_hash(hash); // merge the hash code into header
// use (machine word version) atomic operation to install the hash
test = (markOop) Atomic::cmpxchg_ptr(temp, obj->mark_addr(), mark);
if (test == mark) {
return hash;
}
// If atomic operation failed, we must inflate the header
// into heavy weight monitor. We could add more code here
// for fast path, but it does not worth the complexity.
} else if (mark->has_monitor()) {
monitor = mark->monitor();
temp = monitor->header();
assert (temp->is_neutral(), "invariant") ;
hash = temp->hash();
if (hash) {
return hash;
}
// Skip to the following code to reduce code size
} else if (Self->is_lock_owned((address)mark->locker())) {
temp = mark->displaced_mark_helper(); // this is a lightweight monitor owned
assert (temp->is_neutral(), "invariant") ;
hash = temp->hash(); // by current thread, check if the displaced
if (hash) { // header contains hash code
return hash;
}
// WARNING:
// The displaced header is strictly immutable.
// It can NOT be changed in ANY cases. So we have
// to inflate the header into heavyweight monitor
// even the current thread owns the lock. The reason
// is the BasicLock (stack slot) will be asynchronously
// read by other threads during the inflate() function.
// Any change to stack may not propagate to other threads
// correctly.
}
// Inflate the monitor to set hash code
monitor = ObjectSynchronizer::inflate(Self, obj);
// Load displaced header and check it has hash code
mark = monitor->header();
assert (mark->is_neutral(), "invariant") ;
hash = mark->hash();
if (hash == 0) {
hash = get_next_hash(Self, obj);
temp = mark->copy_set_hash(hash); // merge hash code into header
assert (temp->is_neutral(), "invariant") ;
test = (markOop) Atomic::cmpxchg_ptr(temp, monitor, mark);
if (test != mark) {
// The only update to the header in the monitor (outside GC)
// is install the hash code. If someone add new usage of
// displaced header, please update this code
hash = test->hash();
assert (test->is_neutral(), "invariant") ;
assert (hash != 0, "Trivial unexpected object/monitor header usage.");
}
}
// We finally get the hash
return hash;
}
get_next_hash
static inline intptr_t get_next_hash(Thread * Self, oop obj) {
intptr_t value = 0 ;
//随机获取一个
if (hashCode == 0) {
// This form uses an unguarded global Park-Miller RNG,
// so it's possible for two threads to race and generate the same RNG.
// On MP system we'll have lots of RW access to a global, so the
// mechanism induces lots of coherency traffic.
value = os::random() ;
} else
//根据内存地址计算一个
if (hashCode == 1) {
// This variation has the property of being stable (idempotent)
// between STW operations. This can be useful in some of the 1-0
// synchronization schemes.
intptr_t addrBits = cast_from_oop<intptr_t>(obj) >> 3 ;
value = addrBits ^ (addrBits >> 5) ^ GVars.stwRandom ;
} else
//返回1
if (hashCode == 2) {
value = 1 ; // for sensitivity testing
} else
//不太清楚什么意思 求大佬解释
if (hashCode == 3) {
value = ++GVars.hcSequence ;
} else
//直接返回内存地址
if (hashCode == 4) {
value = cast_from_oop<intptr_t>(obj) ;
}
//这是一种生成随机数的一种算法
else {
// Marsaglia's xor-shift scheme with thread-specific state
// This is probably the best overall implementation -- we'll
// likely make this the default in future releases.
unsigned t = Self->_hashStateX ;
t ^= (t << 11) ;
Self->_hashStateX = Self->_hashStateY ;
Self->_hashStateY = Self->_hashStateZ ;
Self->_hashStateZ = Self->_hashStateW ;
unsigned v = Self->_hashStateW ;
v = (v ^ (v >> 19)) ^ (t ^ (t >> 8)) ;
Self->_hashStateW = v ;
value = v ;
}
value &= markOopDesc::hash_mask;
if (value == 0) value = 0xBAD ;
assert (value != markOopDesc::no_hash, "invariant") ;
TEVENT (hashCode: GENERATE) ;
return value;
}
到了这里我们基本上就知道了hashcode是怎么生成的了,实际上在jdk1.8在是用的第五种生成方式,我们可以在Linux系统下输入:java -XX:+PrintFlagsFinal -version|grep hashCode命令查看
ok,接下来我们来分析一下第5种方式,看到这个代码我的第一反应是懵逼的,那个什么_hashStateX是个什么鬼?
我们来看一下thead.cpp里面是怎样定义的:
// thread-specific hashCode stream generator state - Marsaglia shift-xor form
_hashStateX = os::random() ;
_hashStateY = 842502087 ;
_hashStateZ = 0x8767 ; // (int)(3579807591LL & 0xffff) ;
_hashStateW = 273326509 ;
在thread里面定义了一个随机数,三个常数,通过这四个数根据上述的算法来生成hashcode。
具体原理请参考论文:Xorshift RNGs
因为在上述算法在,需要得到线程里面的一个随机数作为一个初始值,上述算法在前后具有因果关系,后面的结果是根据前面的结果推算而来的,因此对于相同的线程来说,在某种意义上,对象的hashcode的生成是和顺序有关的。
为什么主函数里面的hashcode的生成一直是有序的呢?因为主线程里面的是固定值。
可见hashCode在1.8中和内存地址是无关的,与所在线程(或者说生成的一个随机数)以及生成顺序有关
网友评论