美文网首页
数据类型练习题

数据类型练习题

作者: 从此不迷茫 | 来源:发表于2020-02-18 16:53 被阅读0次

    19.

    21.证明下式给出的集合差度满足度量公理。

    size(A,B)=size(A-B)+size(B-A)

    其中,A和B是集合,A-B是集合差。

    22.讨论如何将相关值从区间[-1,1]映射到区间[0,1]。注意,你所使用的变换类型可能取决于你的应用。因此,考虑两种应用:对时间序列聚类,给定一个时间序列预测另一个的性质。

    对于时间序列聚类,相对强的正相关时间序列放到一起。给定一个时间序列预测另一个时间序列,必须考虑强正或负相关。这种转换即sim=|corr|是合适的,这种进考虑预测大小而不是方向。

    23.给定一个在区间[0,1]取值的相似性度量,描述两种将该相似度变换成区间[0,∞]中的相异度方法。

    d=(1-s)/s

    d=-log s

    24.通常,邻近度定义在一对对象之间。

    (a)阐述两种定义一组对象之间邻近度的方法。

    Two examples are the following: (i) based on pairwise proximity, i.e., minimum pairwise similarity or maximum pairwise dissimilarity, or (ii) for points in Euclidean space compute a centroid (the mean of all the points—see Section 8.2) and then compute the sum or average of the distances of the points to the centroid.基于成对邻近,即最小成对相似性或最大成对相似性。对于欧几里德空间中的点,计算一个质心,然后计算点到质心的距离之和或平均值。

    (b).如何定义欧几里得空间中两个点集之间的距离?

    一种方法是计算两组点的质心之间的距离。

    (c).如何定义两个数据对象集之间的邻近度?(除邻近度定义在任意一对对象之间外,对数据对象不做任何假定。)

    One approach is to compute the average pairwise proximity of objects in one group of objects with those objects in the other group. Other approaches are to take the minimum or maximum proximity.一种方法是计算一组对象与另一组对象的平均成对接近度。其他方法是采取最小或最大接近。

    注意,集群的凝聚力与它们之间的一组对象的接近的概念有关,并且群集的分离与两组对象接近的概念有关。此外,两个聚类的邻近性是凝聚层次聚类中的一个重要概念。

    25.

    不幸的是,提示中有一个错误和缺乏清晰度。提示应措辞如下:提示:如果z是S的任意点,则三角形不等式d(x, y) ≤ d(x, z)+d(y, z), 应该写作 d(y, z) ≥ d(x, y)−d(x, z).

    另一个三角不等式的应用为:d(x, z) ≤ d(x, y)+d(y, z),即d(y, z) ≥ d(x, z)−d(x, y).如果任何一个不等式中得到的d(y,z)的下界比δ大,d(y,z)不需要计算。如果不等式d(y, z) ≤ d(y, x)+d(x, z)中计算的上边界d(y,z)不大于≯δ,则d(x,z)不需要计算。

    (b).If x = y then no calculations are necessary. As x becomes farther away, typically more distance calculations are needed.

    (c).Let x and y be the two points and let x∗ and y∗ be the points in S that are closest to the two points, respectively. If d(x∗, y∗)+2 ≤ β, then we can safely conclude d(x, y) ≤ β. Likewise, if d(x∗, y∗)−2 ≥ β, then we can safely conclude d(x, y) ≥ β. These formulas are derived by considering the cases where x and y are as far from x∗ and y∗ as possible and as far or close to each other as possible. 这些公式是通过考虑x和y尽可能远离x*和y*以及尽可能彼此远离或接近的情况而得出的。

    26.证明1减Jaccard相似度是两个数据对象x和y之间的一种距离度量,该度量满足d(x,y)=1-J(x,y)。

    1(a). Because J(x, y) ≤ 1, d(x, y) ≥ 0.

    1(b). Because J(x, x)=1, d(x, x)=0

    2. Because J(x, y) = J(y, x), d(x, y) = d(y, x)

    3. (Proof due to Jeffrey Ullman)

    minhash(x) is the index of first nonzero entry of x prob(minhash(x) = k) is the probability tha minhash(x) = k when x is randomly permuted. 是当x被随机排列时,minhash(x)=k的概率。

    Note that prob(minhash(x) = minhash(y)) = J(x, y) (minhash lemma)

    Therefore, d(x, y)=1−prob(minhash(x) = minhash(y)) = prob(minhash(x) = minhash(y)) We have to show that, prob(minhash(x) = minhash(z)) ≤ prob(minhash(x) = minhash(y)) +prob(minhash(y) = minhash(z)

    但是,请注意,无论何时minhash(x)=minhash(z),那么 minhash(x)=minhash(y)和minhash(y)=minhash(z)中的至少一个必须为true。

    27.证明定义为两个数据向量x和y之间夹角的距离度量满足度量公理d(x,y)=arccos(cos(x,y))。

    Note that angles are in the range 0 to 180.

    1(a). Because 0 ≤ cos(x, y) ≤ 1, d(x, y) ≥ 0.

    1(b). Because cos(x, x)=1, d(x, x) = arccos(1) = 0

    2. Because cos(x, y) = cos(y, x), d(x, y) = d(y, x)

    3. If the three vectors lie in a plane then it is obvious that the angle between x and z must be less than or equal to the sum of the angles between x and y and y and z.如果三个矢量位于一个平面上,那么很明显x和z之间的角度必须小于或等于x和y以及y和z之间的角度之和。 If y is the projection of y into the plane defined by x and z, then note that the angles between x and y and y and z are greater than those between x and y and y and z.如果y是y在x和z定义的平面上的投影,那么请注意x和y以及y和z之间的角度大于x和y以及y和z之间的角度。

    28.解释为什么计算两个属性之间的邻近度比计算两个对象之间的相似度简单。

    In general, an object can be a record whose fields (attributes) are of different types. To compute the overall similarity of two objects in this case, we need to decide how to compute the similarity for each attribute and then combine these similarities. This can be done straightforwardly by using Equations 2.15 or 2.16, but is still somewhat ad hoc, at least compared to proximity measures such as the Euclidean distance or correlation, which are mathematically well founded. In contrast, the values of an attribute are all of the same type, and thus, if another attribute is of the same type, then the computation of similarity is conceptually and computationally straightforward.

    通常,对象可以是字段(属性)属于不同类型的记录。在这种情况下,要计算两个对象的总体相似性,我们需要决定如何计算每个属性的相似性,然后将这些相似性组合起来。这可以通过使用方程2.15或2.16直接地完成,但仍然有点特别,至少与接近的度量,例如欧几里得距离或相关性相比,这些欧几里得距离或相关性在数学上是有充分根据的。相反,一个属性的值都是同一类型的,因此,如果另一个属性是同一类型的,那么相似度的计算在概念上和计算上都是直接的。

    相关文章

      网友评论

          本文标题:数据类型练习题

          本文链接:https://www.haomeiwen.com/subject/juagfhtx.html