[Netflix Hollow Series] A Deep Dive into Hollow's Underlying Read/Write Engine

Author: 分布式与微服务 | Published: 2022-08-28 09:09

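Preface

As the opening post of this Hollow series, let me first introduce what Hollow is and what it can do. Hollow is a Java library and toolkit open-sourced by Netflix, designed to efficiently cache datasets that do not qualify as "big data". These datasets can be kept on local disk or in a store such as S3.

Hollow defines how these datasets are handled using a producer-consumer model: a Producer produces blobs, which are then consumed by Consumers. This gives enterprise-grade metadata management a reliable data-modeling capability.

There are very few Chinese-language articles about Hollow, only one or two in fact, and none of them analyzes Hollow's overall system and design structure in detail. That is why I felt it necessary to introduce Hollow here.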

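The problems we face

Architectural complexity

When a business system starts, it needs to load the cached data used in its computations into memory. That data is typically stored in MySQL or exposed by other services through APIs. At this stage our service follows a simple processing model.

That model is enough for a simple system, but as the system grows more complex it may turn into the model shown in the figure below.

The figure above lists only two DBs and four services. As the system grows further, these relationships become a tangled mess. It may be tolerable right after the initial integration, but as the system keeps iterating and people come and go, the whole system becomes hard to maintain, and the likely end result is tearing it down and rebuilding it.

Duplicated work

As the figure above shows, Service-A depends on two databases, DB-1 and DB-2, and Service-B depends on the same two DBs. Because the data model is the same and the loading logic is the same, Service-A and Service-B have to do the same work. If both services belong to one team, the work can probably be abstracted and encapsulated once. If they belong to different teams, the duplicated work may be unavoidable.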

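What is Hollow

Hollow is a toolkit designed for handling small cached datasets. These datasets may be metadata used by a system, such as actor names, city names, country codes, or geographic locations. Many pieces of such metadata make up the datasets behind different business models. The traditional ways to handle them are to put them in a data store or to serialize and distribute them, but both can suffer from reliability and latency problems.

Inside Netflix, Hollow was designed to replace Zeno, the earlier in-memory dataset framework. A Hollow dataset uses a compact, fixed-length, strongly typed encoding of the data. This encoding minimizes the dataset's footprint, and the encoded records are "packed" into reusable slabs of memory that are pooled on the JVM heap, so that GC behavior on busy servers is not affected.

The official Hollow documentation describes the starting point of Hollow's design as follows.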

Software engineers often encounter problems which require the dissemination of small or moderately sized data sets which don’t fit the label “big data”. To solve these problems, we often send the data to an RDBMS or nosql data store and query it at runtime, or serialize the data as json or xml, distribute it, and keep a local copy on each consumer.

Scaling each of these solutions presents different challenges. Sending the data to an RDBMS, nosql data store, or even a memcached cluster may allow your dataset to grow indefinitely large, but there are limitations on the latency and frequency with which you can interact with that dataset. Serializing and keeping a local copy (if in RAM) can allow many orders of magnitude lower latency and higher frequency access, but this approach has many scaling challenges:

The dataset size is limited by available RAM.

The full dataset may need to be re-downloaded each time it is updated.

Updating the dataset may require significant CPU resources or impact GC behavior.

Netflix, serving many billions of personalized requests each day, has a few use cases for which the latency of a remote datastore would be highly undesirable given the frequency with which those datasets are accessed.

There is an interview in which Drew Koszewnik, the lead author of Hollow, introduces Hollow. It is worth a look if you are interested.

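Hollow's design goals

From the very beginning, Hollow was designed around the following three goals:

Maximum development agility
Highly optimized performance and resource management
Extreme stability and reliability

Development agility

Development agility applies not only to the Consumer but also to the Producer; thinking of them as client and server may make this easier to follow. Hollow can generate a data-model API from a HollowSchema, exposed through HollowAPI. The dataset queried through a HollowAPI can easily be indexed for fast lookups. In addition, data can easily be merged and split on the Producer side and filtered on the Consumer side.

Hollow also ships with a rich set of UI tools, including a history tool, a diff tool, and a view tool.

Together these features substantially raise development efficiency and make development more agile.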

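Superior performance

There are generally two ways to refresh a cache: full replacement and incremental update. Hollow encapsulates the incremental-update logic, which makes incremental data easy to manage. As a result, the full dataset does not have to be retransmitted on every update, which improves performance noticeably. Without this, a common scenario is that client response times rise visibly whenever the server refreshes its cache.

In addition, Hollow applies pooling so that heap memory can be reused conveniently, which greatly helps GC.

The main performance factors Hollow considers are:

Heap footprint
Computational cost of access
GC impact of updates
Computational cost of updates
Network cost of updates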

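When Hollow is a good fit

The sections above outlined Hollow's design motivation and capabilities. This section describes the scenarios Hollow suits.

Hollow is not suited to datasets of every size. It fits business data whose full contents can comfortably be held in memory. If the data is large enough, keeping the entire dataset in memory is not feasible. Large data volumes may call for other solutions, such as divide and conquer or big-data processing.

Holding the full dataset in memory does not mean that memory only has to fit one full copy. It must be able to fit at least two, because when a "broken delta chain" occurs the consumer has to hold two full copies in memory so that it can switch to the latest dataset as quickly as possible. Whenever the full dataset is swapped, memory must be able to hold two full copies, which is something to keep in mind when using Hollow.

double cache

The double-cache problem is not unique to Hollow; any application that caches data has to consider it. The usual approach is to keep the old cached data in memory, load the new full dataset into a variable with the same structure as the old cache, and, once the new data is ready, switch a pointer so that the update takes effect quickly. For very large data volumes, another option is to take the machine out of rotation and reload the full dataset.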
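The pointer switch described above can be sketched in a few lines of plain Java. This is only an illustration of the general double-cache technique, not Hollow code: the class name DoubleBufferedCache and its methods are made up for this example, and it assumes Java 10 or later for Map.copyOf.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical illustration of a double-buffered cache; not part of Hollow.
class DoubleBufferedCache<K, V> {

    // Readers always see a fully built, immutable map.
    private final AtomicReference<Map<K, V>> current;

    DoubleBufferedCache(Map<K, V> initial) {
        this.current = new AtomicReference<>(Map.copyOf(initial));
    }

    V get(K key) {
        return current.get().get(key);
    }

    // Builds the new copy off to the side, then switches the reference.
    // While loader.get() runs, the old and the new copy are both in memory.
    void reload(Supplier<Map<K, V>> loader) {
        Map<K, V> next = Map.copyOf(loader.get());
        current.set(next); // the atomic "pointer" switch
    }
}
```

Readers never observe a half-loaded map, at the cost of briefly holding two full copies, which is exactly the memory requirement discussed above.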

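Next, let's look at Hollow's overall architecture, focusing on the hierarchical relationships among the classes in the Hollow packages. I will briefly describe each class's responsibility without going into depth.

Architecture

Hollow's source code is organized into two main layers: core and API.

In practice, however, I found that splitting Hollow into four layers (basic, core, API, and UI) makes the overall design easier to understand. The architecture is therefore divided into the four layers shown in the figure below, with the core interfaces and classes of each layer grouped by functional module.

One core Hollow concept is worth mentioning here: the timeline of a changing dataset can be broken into discrete data states, each of which is a complete snapshot of the data at a particular point in time. Hollow does not produce a data state at every instant. States are produced on a fixed cycle, and all changes that happen within a cycle are reflected in the snapshot produced for that cycle.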
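To make the idea of discrete data states more concrete, here is a minimal sketch in plain Java. It is a toy model under my own assumptions, not Hollow's blob format: DataStateTimeline, Delta, and the String-to-String "records" are all hypothetical. A delta only applies if it starts from the version currently held; otherwise the holder has to fall back to a full snapshot.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical model of versioned data states; it does not reflect Hollow's blob format.
class DataStateTimeline {

    // Describes how to move from one state version to the next.
    static final class Delta {
        final long fromVersion;
        final long toVersion;
        final Map<String, String> upserts;
        final Set<String> removals;

        Delta(long fromVersion, long toVersion,
              Map<String, String> upserts, Set<String> removals) {
            this.fromVersion = fromVersion;
            this.toVersion = toVersion;
            this.upserts = upserts;
            this.removals = removals;
        }
    }

    private long version;
    private Map<String, String> state;

    DataStateTimeline(long version, Map<String, String> snapshot) {
        loadSnapshot(version, snapshot);
    }

    // A snapshot replaces the whole state with a complete copy.
    void loadSnapshot(long newVersion, Map<String, String> snapshot) {
        this.version = newVersion;
        this.state = new HashMap<>(snapshot);
    }

    // Returns false when the delta does not start at the current version,
    // in which case the caller must load a snapshot instead.
    boolean applyDelta(Delta delta) {
        if (delta.fromVersion != version) {
            return false;
        }
        for (String key : delta.removals) {
            state.remove(key);
        }
        state.putAll(delta.upserts);
        version = delta.toVersion;
        return true;
    }

    long version() {
        return version;
    }

    String get(String key) {
        return state.get(key);
    }
}
```

The design point is that each version is a complete, self-consistent state, while the transitions between versions stay small.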

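Next I will walk through the structure of each layer.

Basic layer

HollowRecord

Every record in a Hollow dataset blob can be accessed through HollowRecord. HollowRecord is the base interface for accessing the data of a record of any type in a Hollow dataset.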

public interface HollowRecord {

    public int getOrdinal();

    public HollowSchema getSchema();

    public HollowTypeDataAccess getTypeDataAccess();

    public HollowRecordDelegate getDelegate();

}

HollowSchema

HollowSchema defines the model of Hollow data, so that Hollow does not have to depend on any concrete business model.

private final String name;

public HollowSchema(String name) {
    if (name == null || name.isEmpty()) {
        throw new IllegalArgumentException("Type name in Hollow Schema was " + (name == null ? "null" : "an empty string"));
    }
    this.name = name;
}

HollowAPI

HollowAPI wraps a HollowDataAccess. It is the parent class of every generated Hollow API class.

public HollowAPI(HollowDataAccess dataAccess) {
    this.dataAccess = dataAccess;
    this.typeAPIs = new ArrayList<HollowTypeAPI>();
}

FixedLengthData

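FixedLengthData is a key interface behind Hollow's pooling; it defines fixed-length data. Each record in Hollow begins with a fixed-length number of bits. Those bits are held in a FixedLengthData data structure, which can be backed by long arrays or ByteBuffers.

The comments on FixedLengthData give an example:

If an EncodedLongBuffer is queried for the 6-bit value starting at bit 7 in the following example range of bits:

0001000100100001101000010100101001111010101010010010101

the binary value 100100, or 36 in decimal, is returned.

There are therefore two ways to obtain an element value from the bit string at a given bit index:

Use getElementValue for values less than 59 bits in length.
Use getLargeElementValue, which is recommended for values of up to 64 bits in length.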
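The general technique behind this, packing fixed-width values into long words and reading them back with shifts and masks, can be sketched as follows. This is a simplified illustration and not Hollow's implementation: the class FixedWidthBits is hypothetical, and it assumes that bit 0 is the least-significant bit of the first long, which may differ from how Hollow lays out its buffers.

```java
// Hypothetical bit-packed storage backed by a long[]; not Hollow's FixedLengthData.
class FixedWidthBits {

    private final long[] words;

    FixedWidthBits(long totalBits) {
        this.words = new long[(int) ((totalBits + 63) >>> 6)];
    }

    // Writes the low bitsPerElement bits of value starting at bitIndex.
    void set(long bitIndex, int bitsPerElement, long value) {
        int word = (int) (bitIndex >>> 6);
        int offset = (int) (bitIndex & 63);
        long mask = mask(bitsPerElement);
        long bits = value & mask;

        words[word] = (words[word] & ~(mask << offset)) | (bits << offset);

        // Bits that do not fit in this word spill into the next one.
        int spill = offset + bitsPerElement - 64;
        if (spill > 0) {
            long highMask = mask(spill);
            words[word + 1] = (words[word + 1] & ~highMask)
                    | (bits >>> (bitsPerElement - spill));
        }
    }

    // Reads bitsPerElement bits starting at bitIndex.
    long get(long bitIndex, int bitsPerElement) {
        int word = (int) (bitIndex >>> 6);
        int offset = (int) (bitIndex & 63);
        long value = words[word] >>> offset;

        int spill = offset + bitsPerElement - 64;
        if (spill > 0) {
            // offset is at least 1 here, so the shift distance is in [1, 63].
            value |= words[word + 1] << (64 - offset);
        }
        return value & mask(bitsPerElement);
    }

    private static long mask(int bits) {
        return bits == 64 ? -1L : (1L << bits) - 1;
    }
}
```

Because every element of a given field has the same width, the position of element n is simply n times the width, so no per-record pointers or object headers are needed.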

VariableLengthData

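As the name suggests, VariableLengthData defines variable-length data. Conceptually it can be thought of as a single byte array or buffer of undefined length: when a byte is written at an index beyond the currently allocated array or buffer, it grows automatically.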

public interface VariableLengthData extends ByteData {

    /**
     * Load <i>length</i> bytes of data from the supplied {@code HollowBlobInput}
     *
     * @param in the {@code HollowBlobInput}
     * @param length the length of the data to load
     * @throws IOException if data could not be loaded
     */
    void loadFrom(HollowBlobInput in, long length) throws IOException;

    /**
     * Copy bytes from another {@code VariableLengthData} object.
     *
     * @param src the source {@code VariableLengthData}
     * @param srcPos position in source data to begin copying from
     * @param destPos position in destination data to begin copying to
     * @param length length of data to copy in bytes
     */
    void copy(ByteData src, long srcPos, long destPos, long length);

    /**
     * Copies data from the provided source into destination, guaranteeing that if the update is seen
     * by another thread, then all other writes prior to this call are also visible to that thread.
     *
     * @param src the source data
     * @param srcPos position in source data to begin copying from
     * @param destPos position in destination to begin copying to
     * @param length length of data to copy in bytes
     */
    void orderedCopy(VariableLengthData src, long srcPos, long destPos, long length);

    /**
     * Data size in bytes
     * @return size in bytes
     */
    long size();
}
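The "grows automatically" behavior can be illustrated with a small segmented byte array in plain Java. This is a sketch under my own assumptions rather than Hollow's implementation: GrowableSegmentedBytes and its methods are hypothetical names, and unwritten positions are simply read as zero.

```java
import java.util.Arrays;

// Hypothetical growable byte store made of fixed-size segments; not Hollow code.
class GrowableSegmentedBytes {

    private final int log2SegmentSize;
    private final int segmentMask;
    private byte[][] segments = new byte[2][];

    GrowableSegmentedBytes(int log2SegmentSize) {
        this.log2SegmentSize = log2SegmentSize;
        this.segmentMask = (1 << log2SegmentSize) - 1;
    }

    // Writing past the allocated range allocates the needed segment on demand.
    void set(long index, byte value) {
        int segment = (int) (index >>> log2SegmentSize);
        ensureSegment(segment);
        segments[segment][(int) (index & segmentMask)] = value;
    }

    // Positions that were never written read as zero.
    byte get(long index) {
        int segment = (int) (index >>> log2SegmentSize);
        if (segment >= segments.length || segments[segment] == null) {
            return 0;
        }
        return segments[segment][(int) (index & segmentMask)];
    }

    int allocatedSegments() {
        int count = 0;
        for (byte[] segment : segments) {
            if (segment != null) {
                count++;
            }
        }
        return count;
    }

    private void ensureSegment(int segment) {
        if (segment >= segments.length) {
            int newLength = Math.max(segment + 1, segments.length * 2);
            segments = Arrays.copyOf(segments, newLength);
        }
        if (segments[segment] == null) {
            segments[segment] = new byte[1 << log2SegmentSize];
        }
    }
}
```

Growing only adds segments; existing segments are never copied, which keeps large arrays cheap to extend.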

ArraySegmentRecycler

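ArraySegmentRecycler is another key interface behind Hollow's pooling. An ArraySegmentRecycler maintains a memory pool of arrays kept on the heap, and every array in the pool has a fixed length. When Hollow needs a long array or a byte array, it stitches pooled array segments together and uses them as a SegmentedByteArray or SegmentedLongArray.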

public interface ArraySegmentRecycler {

    public int getLog2OfByteSegmentSize();

    public int getLog2OfLongSegmentSize();

    public long[] getLongArray();

    public void recycleLongArray(long[] arr);

    public byte[] getByteArray();

    public void recycleByteArray(byte[] arr);

    public void swap();

}
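The pooling idea can be sketched with a tiny pool of fixed-size long[] segments. This is a simplified, single-threaded illustration and not one of Hollow's recyclers: LongSegmentPool, acquire, and release are hypothetical names, and the swap() step from the interface above is not modeled.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Hypothetical single-threaded pool of fixed-size long[] segments; not Hollow code.
class LongSegmentPool {

    private final int segmentLength;
    private final Deque<long[]> free = new ArrayDeque<>();
    private int allocations;

    LongSegmentPool(int log2SegmentSize) {
        this.segmentLength = 1 << log2SegmentSize;
    }

    // Reuses a pooled segment when one is available, otherwise allocates.
    long[] acquire() {
        long[] pooled = free.pollFirst();
        if (pooled != null) {
            return pooled;
        }
        allocations++;
        return new long[segmentLength];
    }

    // Clears the segment and returns it to the pool for later reuse.
    void release(long[] segment) {
        if (segment.length != segmentLength) {
            throw new IllegalArgumentException("segment has the wrong length");
        }
        Arrays.fill(segment, 0L);
        free.addFirst(segment);
    }

    // Number of arrays that actually had to be allocated on the heap.
    int allocations() {
        return allocations;
    }
}
```

Since released segments are handed out again instead of becoming garbage, long-lived arrays stay in place and the collector has less work to do during updates.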

HollowTypeAPI

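HollowTypeAPI provides methods for accessing the data in Hollow records without creating wrapper objects as handles. Instead, the ordinal can be used directly as the handle to the data. This is useful in tight loops, where the excess object creation incurred by using a Generated or Generic Hollow Object API would be prohibitively expensive.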

public abstract class HollowTypeAPI {

    protected final HollowAPI api;
    protected final HollowTypeDataAccess typeDataAccess;

    protected HollowTypeAPI(HollowAPI api, HollowTypeDataAccess typeDataAccess) {
        this.api = api;
        this.typeDataAccess = typeDataAccess;
    }

    public HollowAPI getAPI() {
        return api;
    }

    public HollowTypeDataAccess getTypeDataAccess() {
        return typeDataAccess;
    }

    public void setSamplingDirector(HollowSamplingDirector samplingDirector) {
        typeDataAccess.setSamplingDirector(samplingDirector);
    }

    public void setFieldSpecificSamplingDirector(HollowFilterConfig fieldSpec, HollowSamplingDirector director) {
        typeDataAccess.setFieldSpecificSamplingDirector(fieldSpec, director);
    }

    public void ignoreUpdateThreadForSampling(Thread t) {
        typeDataAccess.ignoreUpdateThreadForSampling(t);
    }

    public Collection<SampleResult> getAccessSampleResults() {
        return typeDataAccess.getSampler().getSampleResults();
    }

}
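The benefit of using the ordinal as a handle is easiest to see with a toy columnar layout. The following sketch is not Hollow's storage format; ActorColumns and its fields are hypothetical, and plain Java arrays stand in for Hollow's encoded data.

```java
// Hypothetical columnar layout addressed by ordinal; the names are not from Hollow.
class ActorColumns {

    private final int[] ids;
    private final String[] names;

    ActorColumns(int[] ids, String[] names) {
        if (ids.length != names.length) {
            throw new IllegalArgumentException("columns must have equal length");
        }
        this.ids = ids.clone();
        this.names = names.clone();
    }

    int size() {
        return ids.length;
    }

    // The ordinal is the only handle needed to read a field.
    int getId(int ordinal) {
        return ids[ordinal];
    }

    String getName(int ordinal) {
        return names[ordinal];
    }

    // Tight loop over all records: no per-record wrapper object is allocated.
    long sumOfIds() {
        long sum = 0;
        for (int ordinal = 0; ordinal < size(); ordinal++) {
            sum += getId(ordinal);
        }
        return sum;
    }
}
```

A wrapper-object API would allocate one small object per record in that loop; passing the int ordinal around avoids those allocations entirely.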

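Core layer

HollowWriteStateEngine

HollowWriteStateEngine is the class that carries the Producer's core functionality and provides the core data-writing capability. Its inheritance hierarchy is shown in the figure below. A HollowWriteStateEngine cycles back and forth between two phases:

Adding records
Writing the dataset state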
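The two-phase cycle can be pictured as a small state machine. The sketch below is only an illustration of that cycle under my own assumptions: TwoPhaseWriter and its method names are hypothetical, records are plain strings, and an ordinal is simply the position at which a record was added in the current cycle.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical two-phase writer; it only illustrates the add/write cycle, not Hollow's engine.
class TwoPhaseWriter {

    enum Phase { ADDING_RECORDS, WRITING_STATE }

    private final List<String> records = new ArrayList<>();
    private Phase phase = Phase.ADDING_RECORDS;

    // Phase 1: add a record and return its ordinal within this cycle.
    int add(String record) {
        if (phase != Phase.ADDING_RECORDS) {
            throw new IllegalStateException("call prepareForNextCycle() before adding records");
        }
        records.add(record);
        return records.size() - 1;
    }

    // Phase 2: freeze the cycle and hand out the state that would be written.
    List<String> writeState() {
        phase = Phase.WRITING_STATE;
        return new ArrayList<>(records);
    }

    // Switches back to phase 1 for the next cycle.
    void prepareForNextCycle() {
        records.clear();
        phase = Phase.ADDING_RECORDS;
    }

    Phase phase() {
        return phase;
    }
}
```

Keeping the two phases strictly separate means a state is never written while records are still being added to it.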

HollowReadStateEngine

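HollowReadStateEngine is the core class that lets a Consumer read data, and it provides powerful data-reading capabilities. Its dependencies are shown in the figure below.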

HollowTypeWriteState

HollowTypeWriteState contains all records of a specific type in a HollowWriteStateEngine and is the core implementation of the functionality for that type.

public HollowTypeWriteState(HollowSchema schema, int numShards) {
    this.schema = schema;
    this.ordinalMap = new ByteArrayOrdinalMap();
    this.serializedScratchSpace = new ThreadLocal<ByteDataArray>();
    this.currentCyclePopulated = new ThreadSafeBitSet();
    this.previousCyclePopulated = new ThreadSafeBitSet();
    this.numShards = numShards;

    if(numShards != -1 && ((numShards & (numShards - 1)) != 0 || numShards <= 0))
        throw new IllegalArgumentException("Number of shards must be a power of 2!  Check configuration for type " + schema.getName());
}

HollowTypeReadState

HollowTypeReadState contains all records of a specific type in a HollowReadStateEngine and is the core implementation of the functionality for that type.

public HollowTypeReadState(HollowReadStateEngine stateEngine, MemoryMode memoryMode, HollowSchema schema) {
    this.stateEngine = stateEngine;
    this.memoryMode = memoryMode;
    this.schema = schema;
    this.stateListeners = EMPTY_LISTENERS;
}

HollowDataAccess

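HollowDataAccess is the core interface through which a consumer works with a Hollow dataset. The most common type of HollowDataAccess is HollowReadStateEngine.

The access layer over Hollow data held in memory is implemented by HollowDataAccess.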

public interface HollowDataAccess extends HollowDataset {

    /**
     * @param typeName the type name
     * @return The handle to data for a specific type in this dataset.
     */
    HollowTypeDataAccess getTypeDataAccess(String typeName);

    /**
     * @param typeName The type name
     * @param ordinal optional parameter.  When known, may provide a more optimal data access implementation for traversal of historical data access.
     * @return The handle to data for a specific type in this dataset.
     */
    HollowTypeDataAccess getTypeDataAccess(String typeName, int ordinal);

    /**
     * @return The names of all types in this dataset
     */
    Collection<String> getAllTypes();

    @Override
    List<HollowSchema> getSchemas();

    @Override
    HollowSchema getSchema(String name);

    @Deprecated
    HollowObjectHashCodeFinder getHashCodeFinder();

    MissingDataHandler getMissingDataHandler();

    void resetSampling();

    boolean hasSampleResults();

}

HollowHashIndex

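Hollow's indexes are essential to both writing and searching Hollow datasets. HollowHashIndex is used to index non-primary-key data. This type of index can map multiple keys to a single matching record, and/or multiple records to a single key. Field definitions in a hash key may be hierarchical (traversing multiple record types) via dot notation.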

/**
     * Define a {@link HollowHashIndex}.
     *
     * @param stateEngine The state engine to index
     * @param type The query starts with the specified type
     * @param selectField The query will select records at this field (specify "" to select the specified type).
     * The selectField may span collection elements and/or map keys or values, which can result in multiple matches per record of the specified start type.
     * @param matchFields The query will match on the specified match fields.  The match fields may span collection elements and/or map keys or values.
     */
public HollowHashIndex(HollowReadStateEngine stateEngine, String type, String selectField, String... matchFields) {
    requireNonNull(type, "Hollow Hash Index creation failed because type was null");
    requireNonNull(stateEngine, "Hollow Hash Index creation on type [" + type
                   + "] failed because read state wasn't initialized");

    this.stateEngine = stateEngine;
    this.type = type;
    this.typeState = (HollowObjectTypeReadState) stateEngine.getTypeState(type);
    this.selectField = selectField;
    this.matchFields = matchFields;

    reindexHashIndex();
}
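The "multiple records per key" behavior can be sketched with an ordinary hash map from a match-field value to the ordinals of all matching records. This is a conceptual illustration only, not how HollowHashIndex is implemented: SimpleHashIndex and findMatches are hypothetical names, and a record's ordinal is assumed to be its position in the input list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical non-primary-key index: one key may match several record ordinals.
class SimpleHashIndex<R, K> {

    private final Map<K, List<Integer>> ordinalsByKey = new HashMap<>();

    // Indexes every record by the value its match field extracts.
    SimpleHashIndex(List<R> records, Function<R, K> matchField) {
        for (int ordinal = 0; ordinal < records.size(); ordinal++) {
            K key = matchField.apply(records.get(ordinal));
            ordinalsByKey.computeIfAbsent(key, k -> new ArrayList<>()).add(ordinal);
        }
    }

    // Returns the ordinals of all records whose match field equals the key.
    List<Integer> findMatches(K key) {
        List<Integer> matches = ordinalsByKey.get(key);
        return matches == null
                ? Collections.emptyList()
                : Collections.unmodifiableList(matches);
    }
}
```

For example, indexing city records by country code returns every city of a country from a single lookup, which a primary-key index (one key, one record) cannot express.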

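API layer

HollowProducer

HollowProducer is the Producer API that Hollow wraps for ease of use. If HollowProducer does not meet your project's needs, you can implement your own Producer on top of the HollowWriteStateEngine described above.

HollowProducer contains the following interfaces: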

Announcer
Publisher
Blob
ReadState
WriteState
Populator
VersionMinter

HollowConsumer

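HollowConsumer is the Consumer API that Hollow wraps for ease of use. If HollowConsumer does not meet your project's needs, you can implement your own Consumer on top of the HollowReadStateEngine described above.

HollowConsumer contains the following interfaces: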

AnnouncementWatcher
Blob
BlobRetriever
RefreshListener

Utils

HollowAPIGenerator

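HollowAPIGenerator is used to generate Java code that defines a HollowAPI implementation. The Java code is generated from a data model, which is itself defined by HollowSchemas. HollowAPIGenerator also produces convenience methods for traversing the dataset based on specific fields in the data model, including indexes and primary keys.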

protected HollowAPIGenerator(String apiClassname,
                             String packageName,
                             HollowDataset dataset,
                             Set<String> parameterizedTypes,
                             boolean parameterizeAllClassNames,
                             boolean useErgonomicShortcuts,
                             Path destinationPath) {
    this.apiClassname = apiClassname;
    this.packageName = packageName;
    this.dataset = dataset;
    this.hasCollectionsInDataSet = hasCollectionsInDataSet(dataset);
    this.parameterizedTypes = parameterizedTypes;
    this.parameterizeClassNames = parameterizeAllClassNames;
    this.ergonomicShortcuts = useErgonomicShortcuts ? new HollowErgonomicAPIShortcuts(dataset) : HollowErgonomicAPIShortcuts.NO_SHORTCUTS;

    if (destinationPath != null && packageName != null && !packageName.trim().isEmpty()) {
        Path packagePath = Paths.get(packageName.replace(".", File.separator));
        if (!destinationPath.toAbsolutePath().endsWith(packagePath)) {
            destinationPath = destinationPath.resolve(packagePath);
        }
    }
    this.destinationPath = destinationPath;
}

GenericHollowObject

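GenericHollowObject is the generic class for OBJECT-type records. Through HollowAPI it can be used to programmatically inspect a dataset blob without a custom-generated API, which makes it a convenient way to work with the data.

HollowSampler

HollowSampler is the interface responsible for sampling Hollow data. It has five concrete implementation classes.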

public interface HollowSampler {

    public void setSamplingDirector(HollowSamplingDirector director);

    public void setFieldSpecificSamplingDirector(HollowFilterConfig fieldSpec, HollowSamplingDirector director);

    public void setUpdateThread(Thread t);

    public boolean hasSampleResults();

    public Collection<SampleResult> getSampleResults();

    public void reset();

}

HollowTestRecord

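HollowTestRecord provides mock-style helper methods that Hollow uses for testing.

HollowPerformanceAPI

HollowPerformanceAPI provides the interface methods Hollow uses for performance analysis.

UI layer

HollowUIRouter

HollowUIRouter is responsible for resolving the routing rules of the UI layer.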

protected final String baseUrlPath;
protected final VelocityEngine velocityEngine;

public HollowUIRouter(String baseUrlPath) {
    if(!baseUrlPath.startsWith("/"))
        baseUrlPath = "/" + baseUrlPath;
    if(baseUrlPath.endsWith("/"))
        baseUrlPath = baseUrlPath.substring(0, baseUrlPath.length() - 1);

    this.baseUrlPath = baseUrlPath;
    this.velocityEngine = initVelocity();
}

HollowObjectView

HollowObjectView is the view class for Hollow's data presentation layer.

private final HollowDiffViewRow rootRow;
private final ExactRecordMatcher exactRecordMatcher;

private int totalVisibilityCount;

public HollowObjectView(HollowDiffViewRow rootRow, ExactRecordMatcher exactRecordMatcher) {
    this.rootRow = rootRow;
    this.exactRecordMatcher = exactRecordMatcher;
}

HollowDiffUI

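HollowDiffUI provides a UI for diffing Hollow data.

HollowHistoryUI

HollowHistoryUI provides querying and diffing of historical Hollow data.

HollowExplorerUI

HollowExplorerUI provides a UI for browsing Hollow data.

HollowJsonAdapter

HollowJsonAdapter is an adapter that populates a HollowWriteStateEngine from JSON-encoded data. The comments in the latest source show that some TODO work remains for HollowJsonAdapter; completing it would make the adapter friendlier to use.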

/// TODO: Would be nice to be able to take a HollowDataset here, if only producing FlatRecords,
///       instead of requiring a HollowWriteStateEngine
public HollowJsonAdapter(HollowWriteStateEngine stateEngine, String typeName) {
    super(typeName, "populate");
    this.stateEngine = stateEngine;
    this.hollowSchemas = new HashMap<String, HollowSchema>();
    this.canonicalObjectFieldMappings = new HashMap<String, ObjectFieldMapping>();
    this.passthroughDecoratedTypes = new HashSet<String>();

    for(HollowSchema schema : stateEngine.getSchemas()) {
        hollowSchemas.put(schema.getName(), schema);
        if(schema instanceof HollowObjectSchema)
            canonicalObjectFieldMappings.put(schema.getName(), new ObjectFieldMapping(schema.getName(), this));
    }

    // TODO: Special 'passthrough' processing.
    this.passthroughRecords = new ThreadLocal<PassthroughWriteRecords>();
}

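Summary

Hollow not only offers a rich set of API classes; thanks to its well-designed interfaces, anyone can extend Hollow by implementing the relevant interfaces.

When we need to build on Hollow or extend its existing functionality, we can implement our own core layer on top of the basic layer, implement an API layer better suited to our own system on top of the core layer, or build a more modern UI on top of the API layer.

This is also instructive for our own day-to-day coding: every system should have a clear layering. There may be only two layers, or five or six, or even more. Upper layers depend on lower layers, and lower layers are the foundation of upper layers. The seven-layer OSI network model is a good example: Application, Presentation, Session, Transport, Network, Data Link, and Physical.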