class Student (Part 2): Memory Allocation

Author: 淡定小问题 | Published: 2020-08-09 18:12
A question-driven approach to learning

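Abstract: The class Student series works through one very simple piece of code, question by question, to deepen my own understanding of it.

As the title says, here is the very simple piece of code: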

class Student {
    int age;
    String name;

    static Student demo() {
        Student xm = new Student();
        xm.age += 10;
        xm.name = "小ming😊";
        return xm;
    }

    public static void main(String[] args) {
        Student.demo();
    }
}

1. While demo() runs, what memory is allocated, where, and how much?

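  1. A new stack frame is pushed onto the method stack (the thread's JVM stack)
  2. The stack frame contains:
    • The local variable table, which holds xm, a reference to the object on the heap: 8 bytes

      A reference is usually one machine word: 4 bytes on a 32-bit machine and 8 bytes on a 64-bit machine (the rest of this discussion assumes a 64-bit machine). A 64-bit address space covers 4G * 4G = 2^64 bytes = 16 EiB, far more memory than is normally in use, so some JVM implementations compress references and store them in less space. The rest of this discussion ignores this implementation-specific optimization.

    • The return address (8 bytes)

    • The operand stack

      • It exists only in stack-based virtual machines such as HotSpot and the usual desktop and server JVM implementations. Its maximum depth is fixed at compile time and varies with the code (x bytes)
      • Dalvik and ART on Android are register-based and do not have one
    • A reference to the runtime constant pool of the method's class (used to resolve constants and method references)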

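  3. One Student object is allocated on the heap. It consists of four parts:
    • Object header (pointer to the class, GC information, lock state and similar metadata): about 16 bytes
    • field age: int, a value type, 4 bytes
    • field name: String, a reference type, one machine word (8 bytes)
    • Alignment padding (objects are usually aligned to 4 or 8 bytes): the 28 bytes above need 4 bytes of padding to reach 32 under 8-byte alignment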
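  4. "小ming😊" in the constant pool
    • Java strings use UTF-16 (which is why a char is 16 bits), so "小ming" = 5 * 2 = 10 bytes
    • The emoji lies outside the Basic Multilingual Plane, so it takes two char values (a surrogate pair): 2 * 2 = 4 bytes (see the Java String Javadoc and the code point API)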
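The heap arithmetic in item 3 can be checked with a few lines of Java. This is only a back-of-the-envelope sketch: the 16-byte header, the 8-byte uncompressed reference and the 8-byte alignment are the assumptions made in this article, not values guaranteed by the JVM specification, and the class and constant names are made up for illustration.

```java
// Rough size of one Student instance under this article's assumptions:
// 64-bit JVM, references not compressed, 16-byte header, 8-byte alignment.
final class StudentSizeSketch {
    static final int HEADER_BYTES = 16;    // assumed object header size
    static final int INT_BYTES = 4;        // field: age
    static final int REFERENCE_BYTES = 8;  // field: name (assumed uncompressed)
    static final int ALIGNMENT = 8;        // assumed object alignment

    /** Rounds size up to the next multiple of alignment (a power of two). */
    static int alignUp(int size, int alignment) {
        return (size + alignment - 1) & -alignment;
    }

    /** Header plus fields, before padding. */
    static int rawSize() {
        return HEADER_BYTES + INT_BYTES + REFERENCE_BYTES;
    }

    public static void main(String[] args) {
        int raw = rawSize();
        int aligned = alignUp(raw, ALIGNMENT);
        // Prints: raw=28 aligned=32 padding=4
        System.out.println("raw=" + raw + " aligned=" + aligned
                + " padding=" + (aligned - raw));
    }
}
```

With compressed references enabled, the header and the name field would both be smaller, so the numbers on a real JVM may differ.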

Exercise: where does the 10 live?

2. How much memory does the string use, and in which encoding?

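As analyzed above, Java uses UTF-16. A character outside the Basic Multilingual Plane is a single code point stored as two char values (a surrogate pair).
Elsewhere, most systems now default to UTF-8. UTF-8 is a variable-length encoding: its first 128 characters are compatible with ASCII, and a Chinese character usually takes three bytes.
Character encoding and Unicode are a fairly complex topic and my understanding is shallow, so I will not go further here; interested readers can dig deeper.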
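To make the numbers concrete, here is a small sketch that measures the example string in both encodings. The class and method names are hypothetical. The string is written with Unicode escapes (U+5C0F for 小, the surrogate pair D83D DE0A for 😊) so the result does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

// Measures the example string from Student.demo() in UTF-16 and UTF-8.
final class StringEncodingSketch {
    // Same text as the literal in demo(), spelled with escapes.
    static final String NAME = "\u5C0Fming\uD83D\uDE0A";

    static int utf16Bytes(String s) {
        // UTF_16BE writes no byte order mark, so this is 2 bytes per char.
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(NAME.length());                          // 7 char values
        System.out.println(NAME.codePointCount(0, NAME.length()));  // 6 code points
        System.out.println(utf16Bytes(NAME));                       // 14 = 10 + 4
        System.out.println(utf8Bytes(NAME));                        // 11 = 3 + 4 + 4
    }
}
```

These are the sizes of the encoded characters only. The String object that holds them also has its own header and a backing array, which this sketch does not count.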

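3. What does the memory layout look like on the method stack? And on the heap?

Note: memory layout and memory size are related but distinct concepts. Layout is the richer of the two: it covers, for example, whether memory is contiguous or scattered and how different regions relate to each other. With the same memory consumption, different layouts can affect performance very differently.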

[Figure: memory layout]
[Figure: structure of a stack frame]

Extensions:

1. String's lazy, cached hashCode

    /** Cache the hash code for the string */
    private int hash; // Default to 0

    /**
     * Returns a hash code for this string. The hash code for a
     * {@code String} object is computed as
     * <blockquote><pre>
     * s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
     * </pre></blockquote>
     * using {@code int} arithmetic, where {@code s[i]} is the
     * <i>i</i>th character of the string, {@code n} is the length of
     * the string, and {@code ^} indicates exponentiation.
     * (The hash value of the empty string is zero.)
     *
     * @return  a hash code value for this object.
     */
    public int hashCode() {
        int h = hash;
        final int len = length();
        if (h == 0 && len > 0) {
            for (int i = 0; i < len; i++) {
                h = 31 * h + charAt(i);
            }
            hash = h;
        }
        return h;
    }

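One very interesting point: String's hashCode is quite different from the compute-on-every-call code we usually write. It uses an auxiliary member field as a cache and computes the value lazily.

I think this rests on a few considerations:
1. String objects are immutable, which is the precondition for caching hashCode
2. For very long strings, caching avoids a severe bad case (computing the hash takes time proportional to the string length, O(n))
3. Memory is relatively cheap here: the object header already takes 16 bytes, so one more int is negligible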
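The same pattern can be applied to a class of our own, as long as it is immutable. Below is a minimal sketch. FullName and its fields are invented for illustration and do not come from the JDK. Like the String code quoted above, it treats 0 as "not computed yet", so an object whose real hash is 0 is simply recomputed on every call.

```java
import java.util.Objects;

// Immutable value class with a String-style lazy, cached hashCode.
final class FullName {
    private final String first;
    private final String last;
    private int hash; // defaults to 0, meaning "not computed yet"

    FullName(String first, String last) {
        this.first = Objects.requireNonNull(first);
        this.last = Objects.requireNonNull(last);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof FullName)) return false;
        FullName other = (FullName) o;
        return first.equals(other.first) && last.equals(other.last);
    }

    @Override
    public int hashCode() {
        int h = hash;          // read the cache once into a local
        if (h == 0) {
            h = 31 * first.hashCode() + last.hashCode();
            hash = h;          // every thread computes the same value
        }
        return h;
    }

    public static void main(String[] args) {
        FullName a = new FullName("a", "b");
        System.out.println(a.hashCode()); // 3105 = 31 * 97 + 98
        System.out.println(a.hashCode()); // served from the cached field
    }
}
```

The cache is safe only because both fields are final. If a field could change after construction, the stored hash would go stale.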

2. Unicode, Character, String

String.java

/**
 * <p>A {@code String} represents a string in the UTF-16 format
 * in which <em>supplementary characters</em> are represented by <em>surrogate
 * pairs</em> (see the section <a href="Character.html#unicode">Unicode
 * Character Representations</a> in the {@code Character} class for
 * more information).
 * Index values refer to {@code char} code units, so a supplementary
 * character uses two positions in a {@code String}.
 * <p>The {@code String} class provides methods for dealing with
 * Unicode code points (i.e., characters), in addition to those for
 * dealing with Unicode code units (i.e., {@code char} values).
 */

Character.java

/**
 * <p><a name="BMP">The set of characters from U+0000 to U+FFFF</a> is
 * sometimes referred to as the <em>Basic Multilingual Plane (BMP)</em>.
 * <a name="supplementary">Characters</a> whose code points are greater
 * than U+FFFF are called <em>supplementary character</em>s.  The Java
 * platform uses the UTF-16 representation in {@code char} arrays and
 * in the {@code String} and {@code StringBuffer} classes. In
 * this representation, supplementary characters are represented as a pair
 * of {@code char} values, the first from the <em>high-surrogates</em>
 * range, (&#92;uD800-&#92;uDBFF), the second from the
 * <em>low-surrogates</em> range (&#92;uDC00-&#92;uDFFF).
 *
 * <p>A {@code char} value, therefore, represents Basic
 * Multilingual Plane (BMP) code points, including the surrogate
 * code points, or code units of the UTF-16 encoding. An
 * {@code int} value represents all Unicode code points,
 * including supplementary code points. The lower (least significant)
 * 21 bits of {@code int} are used to represent Unicode code
 * points and the upper (most significant) 11 bits must be zero.
 * Unless otherwise specified, the behavior with respect to
 * supplementary characters and surrogate {@code char} values is
 * as follows:
 *
 * <ul>
 * <li>The methods that only accept a {@code char} value cannot support
 * supplementary characters. They treat {@code char} values from the
 * surrogate ranges as undefined characters. For example,
 * {@code Character.isLetter('\u005CuD840')} returns {@code false}, even though
 * this specific value if followed by any low-surrogate value in a string
 * would represent a letter.
 *
 * <li>The methods that accept an {@code int} value support all
 * Unicode characters, including supplementary characters. For
 * example, {@code Character.isLetter(0x2F81A)} returns
 * {@code true} because the code point value represents a letter
 * (a CJK ideograph).
 */

Characters within the Basic Multilingual Plane (BMP) fit in a single char; characters outside it need two char values.

Further questions
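The Javadoc above can be verified directly with the static methods on Character. The sketch below uses U+5C0F (小, inside the BMP) and U+1F60A (😊, a supplementary character). The class and constant names are made up for illustration.

```java
// Checks how many char values a code point needs and how a
// supplementary code point splits into a surrogate pair.
final class CodePointSketch {
    static final int BMP_CODE_POINT = 0x5C0F;            // a CJK character in the BMP
    static final int SUPPLEMENTARY_CODE_POINT = 0x1F60A; // an emoji above U+FFFF

    public static void main(String[] args) {
        System.out.println(Character.charCount(BMP_CODE_POINT));           // 1
        System.out.println(Character.charCount(SUPPLEMENTARY_CODE_POINT)); // 2

        // Split the supplementary code point into its two UTF-16 code units.
        char[] units = Character.toChars(SUPPLEMENTARY_CODE_POINT);
        System.out.println(Integer.toHexString(units[0])); // d83d, a high surrogate
        System.out.println(Integer.toHexString(units[1])); // de0a, a low surrogate

        // Joining the pair gives back the original code point.
        int joined = Character.toCodePoint(units[0], units[1]);
        System.out.println(joined == SUPPLEMENTARY_CODE_POINT); // true
    }
}
```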

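  1. How do you correctly truncate text that contains emoji?
  2. Why align at all? Besides object alignment, where else is alignment used?
  3. Why do most of our own classes compute hashCode directly? In which scenarios does a String-style lazy cached hashCode fit?
  4. Why is String designed to be immutable? How can a custom class be made immutable? What are the benefits of immutable objects?
  5. What exactly is in a stack frame's local variable table? Is this in there?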
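As a starting point for question 1, one possible approach is to count code points instead of char values, so that a cut never falls between the two halves of a surrogate pair. This is a sketch, not a complete answer, and the truncate helper is hypothetical. It does not handle emoji built from several code points (skin tone modifiers, joined sequences), which need grapheme-aware segmentation.

```java
// Truncates by code points so a surrogate pair is never split.
final class TruncateSketch {
    /** Returns the first maxCodePoints code points of text, or text itself if it is shorter. */
    static String truncate(String text, int maxCodePoints) {
        if (maxCodePoints < 0) {
            throw new IllegalArgumentException("maxCodePoints must be >= 0");
        }
        if (text.codePointCount(0, text.length()) <= maxCodePoints) {
            return text;
        }
        // Convert the code point count into a char index.
        int end = text.offsetByCodePoints(0, maxCodePoints);
        return text.substring(0, end);
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE0Ab"; // 'a', an emoji, 'b': 4 char values, 3 code points
        System.out.println(s.substring(0, 2).length()); // 2: 'a' plus a lone high surrogate
        System.out.println(truncate(s, 2).length());    // 3: 'a' plus the whole emoji
    }
}
```

A plain substring(0, 2) on the same input cuts the emoji in half and leaves an unpaired surrogate, which usually shows up as a replacement character when displayed or encoded.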
