树-多路搜索树和B树, since 2022-06-12

作者: Mc杰夫 | 来源:发表于2022-06-12 05:31 被阅读0次

树-多路搜索树和B树, since 2022-06-12
有点乱
多路搜索树B树
数据结构经典算法复习
B+树和B树的区别
B+树与B树
B-Tree/多路搜索树
B树与B+树
【恋上数据结构与算法一】(十)B树
B-/B+树看 MySQL索引结构

(2022.06.12 Sun)
B-tree即B树，是多路搜索树的一种，常用于数据库的实现。

多路搜索树 Multiway Search tree

定义 $w$ 是有序树的一个节点，如果 $w$ 有d个子节点，则称其为d-node。多路搜索树 $T$ 是一个有序树，满足如下性质：

每个内部节点包含至少两个节点，即每个内部节点都是d-node，其中 $d\ge 2$
每个树 $T$ 的内部d-node $w$ 有子树 $c_1, \cdots, c_d$ ， $w$ 保存了d-1个k-v对有序集合 $(k_1, v_1), \cdots, (k_{d-1}, v_{d-1})$ ，其中 $k_1 \le k_2 \cdots \le k_{d-1}$
定义 $k_0 = - \infty, k_d = + \infty$ ，对 $w$ 的以 $c_i$ 子树节点中的 $(k, v)$ ， $i=1,\cdots, d$ ，有 $k_{i-1} \le k \le k_i$

一个d-node保存了 $d-1$ 个常规key。外部节点external node仅仅作为占位符，不保存任何数据，用None引用表示。根据上面定义，多路搜索树的k-v对的个数与外部节点数有如下关系：

一个有n个元素的多路搜索树有n+1个外部节点。

在多路搜索树中搜索

从树 $T$ 的根节点开始，在d-node $w$ ，比较搜索的key $k$ 和 $w$ 上的key $k_1, \cdots, k_{d-1}$ 。如果 $k=k_i$ ，则搜索完成。否则，我们在 $w$ 的子节点 $c_i$ 上搜索 $k_{i-1} < k < k_i$ 。如果到达外部节点，我们直到树 $T$ 中没有key $k$ ，搜索失败。

Multiway search insertion

多路搜索树的实现

对于非二叉树的一般树实现，节点的子节点保存为一个容器。
We discussed a linked data structure for representing a general tree.
This representation can also be used for a multiway search tree. When using a general tree to implement a multiway search tree, we must store at each node one or more key-value pairs associated with that node. That is, we need to store with $w$
a reference to some collection that stores the items for $w$ .
During a search for key $k$ in a multiway search tree, the primary operation needed when navigating a node is finding the smallest key at that node that is greater than or equal to $k$ . For this reason, it is natural to model the information at a node itself as a sorted map, allowing use of the find_ge(k) method. We say such
a map serves as a secondary data structure to support the primary data structure represented by the entire multiway search tree. This reasoning may at first seem like a circular argument, since we need a representation of a (secondary) ordered
map to represent a (primary) ordered map. We can avoid any circular dependence, however, by using the bootstrapping technique, where we use a simple solution to a problem to create a new, more advanced solution. In the context of a multiway search tree, a natural choice for the secondary structure at each node is the SortedTableMap. Because we want to determine the associated value in case of a match for key $k$ , and otherwise the
corresponding child ci such that $k_{i−1} < k < k_i$ , we recommend having each key $k_i$ in the secondary structure map to the pair $(v_i, c_i)$ . With such a realization of a
multiway search tree $T$ , processing a d-node $w$ while searching for an item of $T$ with key $k$ can be performed using a binary search operation in $O(\log d)$ time. Let
$d_{max}$ denote the maximum number of children of any node of $T$ , and let $h$ denote the height of $T$ . The search time in a multiway search tree is therefore $O(h \space \log d_{max})$ .
If $d_{max}$ is a constant, the running time for performing a search is $O(h)$ .
The primary efficiency goal for a multiway search tree is to keep the height as small as possible. We next discuss a strategy that caps $d_{max}$ at 4 while guaranteeing a height $h$ that is logarithmic in $n$ , the total number of items stored in the map.

(2, 4)-tree (2,4)树

(2,4)树是多路搜索树的一种，也叫2-3-4树或2-4树，满足如下两个属性：

尺寸属性：每个内部节点最多有4个子节点，最少2个，这也是该数据结构名字的来源
深度属性：所有外部节点有相同的高度

假定外部节点都是None。(2, 4)树保证了每个节点的二级数据结构足够小，同时保证了多路主树的平衡。

(2, 4)树的高度有如下定理：

Proposition：存储了n个元素的(2, 4)树的高度是 $O(\log n)$

证明该定理可以考虑两种阶段情况，即每个内部节点都只有2个子节点，和有4个子节点。当每个内部节点有2个子节点，则形成了二叉树，对于外部节点的个数与高度满足 $2^{h} = n+1$ ，或 $h=\log(n+1)$ ；每个内部节点有4个节点，则外部节点的高度与节点个数满足 $4^{h} = n+1$ ，或 $h=\frac{1}{2} \log (n+1)$ 。这两种极端情况使得真实的高度 $h$ 所在的范围是 $\frac{1}{2} \log(n+1) \le h \le\log (n+1)$ ，得证。

该定理表明尺寸和深度属性对于保持多路搜索树平衡来说已经足够。此外，这个定理也显示出对(2, 4)树做检索的时间复杂度为 $O(\log n)$ ，且二级存储结构的实现对于设计来说影响不大因为最大子节点数 $d_{max}$ 为常数。

插入和删除

执行对(2, 4)树 $T$ 的插入 $(k, v)$ 之前，首先执行检索。如果树中没有key $k$ ，则搜索终止于外部节点 $z$ 。设外部节点 $z$ 的父/母节点是 $w$ 。在 $w$ 中插入新元素，并在 $z$ 的左侧加入新的子节点 $y$ 。这种插入方法并没有破坏深度属性，因为我们在已经存在的外部节点中加入了新的外部节点，但却破坏了尺寸属性。节点 $w$ 之前是4-node，插入后成为5-node，使得 $T$ 不再是(2, 4)树。这种与尺寸属性的冲突称作在点 $w$ 的overflow。

设 $w$ 的子节点为 $c_1,\cdots,c_5$ ， $k_1,\cdots,k_4$ 是保存在 $w$ 的keys。为避免overflow的影响，在 $w$ 上执行拆分(split)操作如下：

用和代替，使得
- $w'$ 是3-node，子节点 $c_1, c_2, c_3$ ，存储keys $k_1$ 和 $k_2$
- $w''$ 是2-node，子节点 $c_4, c_5$ ，存储keys $k_4$
如果 $w$ 是 $T$ 的根节点，创建一个新的根节点 $u$ ，否则将 $u$ 设为 $w$ 的父/母节点
将key $k_3$ 插入 $u$ ，使其成为 $w'$ 和 $w''$ 的父/母节点，所以如果 $w$ 是 $u$ 的第 $i$ 个子节点，则 $w'$ 和 $w''$ 相应的成为 $u$ 的 $i$ 和 $i+1$ 个子节点。

对节点 $w$ 的拆分导致新的overflow在 $w$ 的父/母节点 $u$ ，可以在 $u$ 继续执行拆分。拆分操作或者消除overflow效果，或者将overflow传播到当前节点的父/母节点。

性能分析
$d_{max}$ 最多为4，初始对新key $k$ 的所搜在每一层的时间是 $O(1)$ ，因此总的时间是 $O(\log n)$ ，考虑到树的高度是 $O(\log n)$ 。对于单个节点插入新key的操作运行时间 $O(1)$ 。执行多次split操作的次数被树的高度限制，所以插入阶段的运行时间是 $O(\log n)$ 。(2, 4)树执行插入操作的总时间的是 $O(\log n)$ 。

split demo

insert demo case

insert from empty demo

删除：placeholder

(2, 4)树的性能分析

(2, 4)树的渐进性能(asymptotic performance)保证了多数情况下的log边界。

有 $n$ 个节点的(2, 4)树的高度是 $O(\log n)$
拆分、转换和融合操作的时间是 $O(1)$
检索、插入、删除需要访问 $O(\log n)$ 个节点

B-tree

B-tree/B树是一种(a, b)树， $d$ 阶B树满足 $a=\lceil d/2 \rceil$ 和 $b=d$ 。

Reference

1 Goodrich and etc., Data Structures and Algorithms in Python

树-多路搜索树和B树, since 2022-06-12
(2022.06.12 Sun)B-tree即B树，是多路搜索树的一种，常用于数据库的实现。多路搜索树 Mult...
有点乱
B树，多路搜索树。
多路搜索树B树
二叉搜索树可以看作是2路的搜索树，也就是每个节点只有2个子节点，如果将这种情况推广，也就是说每个节点可以有多个（≧...
数据结构经典算法复习
二叉搜索树, 平衡二叉树(AVL) 红黑树 B树(平衡多路搜索树) B+树(在B树上改造) 二叉搜索树...
B+树和B树的区别
B-树 B-树概述 B-树,这里的 B 表示 balance( 平衡的意思),B-树是一种多路自平衡的搜索树（B树...
B+树与B树
简介 B树主要来自二叉平衡树的扩展，即m叉平衡树，主要源于多路搜索 B+树主要来源于分块查找的扩展，既可以多路搜索...
B-Tree/多路搜索树
B-Tree/多路搜索树 B-Tree 又叫 B树/B-树，- 是连接符号，不是减号，不能读做 B减树，常用做文件...
B树与B+树
B树数据结构 B树示意图 B＋树的性质B+树是B树的变体，也是一种多路搜索树，其定义基本与B树同，除了：１.非叶子...
【恋上数据结构与算法一】(十)B树
B树(B-tree、B-树) ◼ B树是一种平衡的多路搜索树，多用于文件系统、数据库的实现 ◼ 仔细观察B树，有什...
B-/B+树看 MySQL索引结构
B-树 B-树,这里的 B 表示 balance( 平衡的意思),B-树是一种多路自平衡的搜索树。它类似普通的平衡...