Chapter 9: Categorical Data

Author: 陈易男 | Published: 2021-01-07 15:09


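Figure: overview of the chapter's structure

I. The cat object

Attributes of the cat object

A categorical Series exposes two pieces of information through its cat accessor: the categories themselves, which are stored as an Index, and whether the categories are ordered.

s.cat.categories  # the categories, as an Index
s.cat.ordered  # True if the categories are ordered
s.cat.codes  # integer code of each value: its position in s.cat.categories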
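To make the three attributes concrete, here is a minimal sketch on a small hand-made Series. The sample values are an assumption for illustration; the Series `s` used in the notes is not shown in the text.

```python
import pandas as pd

# Hypothetical sample data, not the Series used in the notes
s = pd.Series(['Freshman', 'Senior', 'Freshman', 'Junior'], dtype='category')

categories = s.cat.categories  # Index of the distinct categories, sorted by default
is_ordered = s.cat.ordered     # False unless an order has been declared
codes = s.cat.codes            # position of each value within `categories`

print(list(categories))
print(is_ordered)
print(codes.tolist())
```

Each code is simply the position of the value in `categories`, which is why the codes change whenever the categories are reordered.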

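Adding, removing, and modifying categories

s = s.cat.add_categories('Graduate')  # add a 'Graduate' category
s = s.cat.remove_categories('Freshman')  # remove 'Freshman'; values in that category become NaN
s = s.cat.set_categories(['Sophomore', 'PhD'])  # the new categories are 'Sophomore' and 'PhD'
s = s.cat.remove_unused_categories()  # drop 'PhD', which never occurs in the data
s = s.cat.rename_categories({'Sophomore': 'Second-year undergraduate'})  # rename a category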

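II. Ordered categories

Establishing an order

s.cat.as_ordered() marks the categories as ordered, using their current order, and as_unordered() turns an ordered categorical back into an unordered one. To impose a specific order, pass the full list of existing categories to reorder_categories together with ordered=True.

s = df.Grade.astype('category')
# the list must contain exactly the existing categories
s = s.cat.reorder_categories(['Freshman', 'Sophomore', 'Junior', 'Senior'],ordered=True)
s.head()
s.cat.as_unordered().head()  # same values, without the ordering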
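A self-contained version of the same idea, as a sketch: the `grades` Series below is a made-up stand-in for df.Grade, whose data is not included in these notes.

```python
import pandas as pd

# Hypothetical stand-in for df.Grade
grades = pd.Series(['Junior', 'Freshman', 'Senior', 'Sophomore'], dtype='category')

ordered = grades.cat.reorder_categories(
    ['Freshman', 'Sophomore', 'Junior', 'Senior'], ordered=True)
unordered = ordered.cat.as_unordered()

print(ordered.cat.ordered, unordered.cat.ordered)
print(ordered.min())  # min and max are defined once an order exists
```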

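Sorting and comparison

Ordered categories can be sorted with sort_index and sort_values.
They can also be compared with comparison operators. Note that in a greater-than or less-than comparison, the value compared against must be one of the categories; otherwise the comparison fails.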
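The sketch below illustrates both points on made-up data: sorting follows the declared category order rather than alphabetical order, and comparing against a value that is not a category raises a TypeError. The label 'Graduate' is an assumption chosen only because it is absent from the categories.

```python
import pandas as pd

levels = ['Freshman', 'Sophomore', 'Junior', 'Senior']
s = pd.Series(['Junior', 'Freshman', 'Senior', 'Sophomore'],
              dtype=pd.CategoricalDtype(levels, ordered=True))

sorted_values = s.sort_values().tolist()  # follows `levels`, not the alphabet
below_junior = (s < 'Junior').tolist()    # element-wise comparison with a category

try:
    s < 'Graduate'                        # 'Graduate' is not one of the categories
    comparison_failed = False
except TypeError:
    comparison_failed = True

print(sorted_values)
print(below_junior)
print(comparison_failed)
```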

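III. Interval categories

cut and qcut

cut: specify either the number of bins or a list of bin edges.

pd.cut(s, bins=2, right=False)  # two equal-width bins, closed on the left
pd.cut(s, bins=[-np.inf, 1.2, 1.8, 2.2, np.inf])  # explicit bin edges

qcut: specify either the number of equal-frequency bins or a list of quantiles to use as bin edges.

pd.qcut(s, q=3)
pd.qcut(s, q=[0, 0.2, 0.8, 1])

Both functions accept a labels argument to name the resulting intervals.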
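As a small illustration of the difference, the following sketch bins the same made-up values once by fixed edges and once by quantiles, naming the bins with labels. The data and the label names are assumptions.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6])  # hypothetical data

# cut: bin edges are given in the units of the data
by_width = pd.cut(s, bins=[0, 2, 4, 6], labels=['low', 'mid', 'high'])

# qcut: bin edges are given as quantiles of the data
by_quantile = pd.qcut(s, q=[0, 0.5, 1], labels=['lower half', 'upper half'])

print(by_width.tolist())
print(by_quantile.tolist())
```

The number of labels must equal the number of bins, that is, one fewer than the number of edges.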

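Constructing intervals

A single interval is built with pd.Interval by giving the left endpoint, the right endpoint, and which side is closed.

my_interval = pd.Interval(0, 1, 'right')  # the interval (0, 1]

An IntervalIndex can be built in several ways:
- by converting the result of cut or qcut
- from_breaks
- from_arrays
- from_tuples
- interval_range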

id_interval = pd.IntervalIndex(pd.cut(s, 3))

pd.IntervalIndex.from_breaks([1,3,6,10], closed='both')

pd.IntervalIndex.from_arrays(left = [1,3,6,10], right = [5,4,9,11], closed = 'neither')

pd.IntervalIndex.from_tuples([(1,5),(3,4),(6,9),(10,11)], closed='neither')

pd.interval_range(start=1,end=5,periods=8)
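The constructors above can be compared side by side. This is a sketch with shortened inputs (the specific endpoints are assumptions), showing that from_arrays and from_tuples describe the same intervals in two layouts.

```python
import pandas as pd

iv = pd.Interval(0, 1, closed='right')  # the single interval (0, 1]

breaks = pd.IntervalIndex.from_breaks([1, 3, 6, 10], closed='both')  # 3 adjacent intervals
arrays = pd.IntervalIndex.from_arrays(left=[1, 3], right=[5, 4], closed='neither')
tuples = pd.IntervalIndex.from_tuples([(1, 5), (3, 4)], closed='neither')
ranged = pd.interval_range(start=1, end=5, periods=8)  # 8 intervals of equal width

print(1 in iv, 0 in iv)
print(breaks)
print(arrays.equals(tuples))
print(ranged)
```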

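[Try it]

Both interval_range and date_range, from the next chapter on time series, are given three of the four elements of an arithmetic sequence, which determines the whole sequence. Recall how the first term, last term, number of terms, and common difference of an arithmetic sequence are related, and write down the identity linking the four parameters of interval_range.

(end - start) / freq == periods

Here periods counts the intervals, not the endpoints.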
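The identity can be checked numerically. In this sketch (the numbers are arbitrary assumptions), the index is built once from periods and once from freq, and the two results coincide.

```python
import pandas as pd

start, end, periods = 0, 8, 4

from_periods = pd.interval_range(start=start, end=end, periods=periods)
freq = int(from_periods[0].length)  # width of each interval
from_freq = pd.interval_range(start=start, end=end, freq=freq)

print(freq)
print((end - start) / freq == periods)
print(from_periods.equals(from_freq))
```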

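Interval attributes and methods

In the examples here, id_demo is an IntervalIndex.

overlaps: tests whether each interval intersects a given interval

id_demo.overlaps(pd.Interval(40,60))

contains: tests whether each interval contains a given value

id_demo.contains(4)

Attributes:
- left
- right
- mid
- length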

id_demo.left
id_demo.right
id_demo.mid
id_demo.length
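Since id_demo itself is not defined in these notes, the following sketch uses a small hypothetical IntervalIndex `idx` to show what the methods and attributes return.

```python
import pandas as pd

# Hypothetical index: (0, 2], (2, 6], (6, 10]
idx = pd.IntervalIndex.from_breaks([0, 2, 6, 10])

overlap_mask = idx.overlaps(pd.Interval(5, 7))  # one boolean per interval
contain_mask = idx.contains(2)                  # 2 lies only in (0, 2]

print(overlap_mask.tolist())
print(contain_mask.tolist())
print(idx.left.tolist(), idx.right.tolist())
print(idx.mid.tolist(), idx.length.tolist())
```

Both methods work element-wise and return a boolean array with one entry per interval.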

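IV. Exercises

Ex1: Counting categories that do not appear

Chapter 5 introduced the crosstab function, which by default tabulates how often each combination of values from two columns occurs:

df = pd.DataFrame({'A':['a','b','c','a'], 'B':['cat','cat','dog','cat']})
pd.crosstab(df.A, df.B)

In practice, some columns store categorical variables, and a column need not contain every one of its categories. To include the categories that never occur in the crosstab result as well, set the dropna parameter to False.
Implement a function my_crosstab with a dropna parameter that provides this behavior.

Approach: combine s1 and s2 into a DataFrame indexed by the values of s1, select the rows for each index value, and use == to count how many of their entries equal each category of s2.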

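def my_crosstab(s1, s2, dropna=True):
    # categories of s2 that actually occur in the data
    columns = s2.cat.categories[s2.cat.categories.isin(s2)]
    table = pd.concat([s1,s2], axis=1).set_index(s1.name)
    index = pd.Index(s1.unique())  # row labels: the distinct values of s1
    if dropna:
        _columns = columns
    else:
        _columns = s2.cat.categories  # also keep the categories that never occur
    ret = pd.DataFrame(index=index, columns=_columns, data=np.zeros((len(index), len(_columns))))
    ret = ret.rename_axis(index=s1.name, columns=s2.name).astype('int')
    for idx in index:
        content = table.loc[idx]
        for c in columns:
            ret.loc[idx, c] = (content == c).values.sum()
    return ret

# s2 must be categorical; s1 and s2 are assumed here to be the two columns of df
s1, s2 = df.A, df.B.astype('category')
my_crosstab(s1, s2, dropna=False)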
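One way to cross-check such a function is to build the same table from pd.crosstab and add the missing categories with reindex. This is a sketch of an alternative route, not the solution used in these notes, and the unused category 'sheep' is a hypothetical addition for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c', 'a'], 'B': ['cat', 'cat', 'dog', 'cat']})
# 'sheep' is a hypothetical category that never occurs in the data
s2 = df.B.astype('category').cat.add_categories('sheep')

counts = pd.crosstab(df.A, s2.astype(object))                   # observed combinations only
full = counts.reindex(columns=s2.cat.categories, fill_value=0)  # add unobserved categories

print(full)
```

Converting s2 back to object before calling crosstab keeps the first step independent of how crosstab treats categorical input, so the unobserved categories enter only through reindex.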


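Ex2: The diamonds dataset

Consider a dataset about diamonds in which carat, cut, clarity, and price denote the carat weight, cut quality, clarity, and price. A sample:

df = pd.read_csv('../data/diamonds.csv')
df.head(3)

1. Call nunique on df.cut stored as object dtype and as category dtype, and compare the performance.
2. Cut quality has five grades, from worst to best: Fair, Good, Very Good, Premium, Ideal. Clarity has eight grades, from worst to best: I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF. Sort the diamonds by cut quality from best to worst and, within the same cut quality, by clarity from worst to best.
3. Using two different methods, map the cut and clarity columns, ordered from best to worst, to the integers 0 to n-1, where n is the number of categories.
4. Bin the price per carat in two ways, by the quantiles q=[0.2, 0.4, 0.6, 0.8] and by the cut points [1000, 3500, 5500, 18000], into five categories Very Low, Low, Mid, High, Very High, and append the two resulting category Series to the original table in that order.
5. In the Series from question 4 obtained with the integer cut points, do all categories occur? If any category does not occur, remove it.
6. For the Series from question 4 obtained with quantile binning, find the left endpoint, right endpoint, and length of the interval that each sample falls in.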

Question 1: the timings show that the category dtype is slightly faster.

# performance measurement (%timeit is an IPython magic)
%timeit -n 100 df.cut.nunique()
cat = df.cut.astype('category')
%timeit -n 100 cat.nunique()
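Outside IPython, a similar comparison can be made with the standard timeit module. The sketch below uses synthetic data as a stand-in for df.cut (an assumption, since the diamonds file is not included here), so the absolute timings will differ from the ones measured in these notes.

```python
import timeit
import pandas as pd

# Synthetic stand-in for df.cut
grades = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
as_object = pd.Series(grades * 10_000, dtype=object)
as_category = as_object.astype('category')

t_object = timeit.timeit(as_object.nunique, number=20)
t_category = timeit.timeit(as_category.nunique, number=20)

print(f'object: {t_object:.4f}s, category: {t_category:.4f}s')
```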


Question 2: convert both columns to category dtype, use reorder_categories to make them ordered, and then sort.

df.cut = df.cut.astype('category').cat.reorder_categories(['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'], ordered=True)
df.clarity = df.clarity.astype('category').cat.reorder_categories(['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF'], ordered=True)
# cut from best to worst, clarity from worst to best
df.sort_values(['cut', 'clarity'], ascending=[False, True]).head()
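The sorting logic can be tried on a hypothetical four-row miniature of the table. Because such a small sample does not contain every grade, this sketch declares the order with pd.CategoricalDtype instead of reorder_categories, which requires the new list to match the existing categories exactly.

```python
import pandas as pd

# Hypothetical miniature of the diamonds table
df = pd.DataFrame({'cut': ['Good', 'Ideal', 'Ideal', 'Fair'],
                   'clarity': ['IF', 'VS1', 'I1', 'SI2']})

cut_order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
clarity_order = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']
df['cut'] = df['cut'].astype(pd.CategoricalDtype(cut_order, ordered=True))
df['clarity'] = df['clarity'].astype(pd.CategoricalDtype(clarity_order, ordered=True))

# cut from best to worst, clarity from worst to best
result = df.sort_values(['cut', 'clarity'], ascending=[False, True])
print(result)
```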


Question 3: use cat.codes or replace.

# reverse the category order so that code 0 is the best grade
df.cut = df.cut.cat.reorder_categories(df.cut.cat.categories[::-1])
df.clarity = df.clarity.cat.reorder_categories(df.clarity.cat.categories[::-1])
df.cut = df.cut.cat.codes  # method 1: use cat.codes
clarity_cat = df.clarity.cat.categories
df.clarity = df.clarity.replace(dict(zip(clarity_cat, np.arange(len(clarity_cat)))))  # method 2: map with replace
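Both routes should yield the same integers. The following sketch checks that on a few made-up values; note that it applies the dictionary with Series.map on plain strings rather than with replace on a categorical column, which is a substitution made here for simplicity.

```python
import pandas as pd

best_to_worst = ['Ideal', 'Premium', 'Very Good', 'Good', 'Fair']
cut = pd.Series(['Good', 'Ideal', 'Fair', 'Premium'])  # hypothetical values

# Method 1: category codes follow the order of the categories
codes = cut.astype(pd.CategoricalDtype(best_to_worst, ordered=True)).cat.codes

# Method 2: an explicit label-to-integer dictionary applied with map
mapping = {label: i for i, label in enumerate(best_to_worst)}
mapped = cut.map(mapping)

print(codes.tolist())
print(mapped.tolist())
```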

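Question 4: use qcut and cut. Note that the bin edges must be completed at both ends so that the data is split into exactly 5 intervals.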

pricePerCarat = df.price / df.carat
type1 = pd.qcut(pricePerCarat, q=[0, 0.2, 0.4, 0.6, 0.8, 1], labels=['Very Low', 'Low', 'Mid', 'High', 'Very High'])
type2 = pd.cut(pricePerCarat, bins=[0, 1000, 3500, 5500, 18000, np.inf], labels=['Very Low', 'Low', 'Mid', 'High', 'Very High'])
type1.name = 'type1'
type2.name = 'type2'
df = pd.concat([df, type1, type2], axis=1)
df.head()


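Question 5: comparing the number of categories with the number of unique values shows whether every category occurs. For the bins produced by cut, Very Low and Very High are missing; remove_categories removes the categories that do not occur.

print(df.type1.cat.categories.nunique() == df.type1.nunique())
print(df.type2.cat.categories.nunique() == df.type2.nunique())
cond = df.type2.cat.categories.isin(df.type2)  # True for the categories that occur
df.type2 = df.type2.cat.remove_categories(df.type2.cat.categories[~cond])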
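The same check can be reproduced on a few made-up prices per carat. This sketch also uses remove_unused_categories, which drops every category that does not occur in one call; the sample values are assumptions.

```python
import pandas as pd

labels = ['Very Low', 'Low', 'Mid', 'High', 'Very High']
values = pd.Series([1200, 3000, 4000, 6000])  # hypothetical prices per carat
binned = pd.cut(values, bins=[0, 1000, 3500, 5500, 18000, float('inf')], labels=labels)

# categories that never occur in the binned data
unused = binned.cat.categories[~binned.cat.categories.isin(binned)]
cleaned = binned.cat.remove_unused_categories()

print(list(unused))
print(list(cleaned.cat.categories))
```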


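Question 6: convert the binning result to an IntervalIndex with pd.IntervalIndex, then read the relevant attributes.

interval = pd.IntervalIndex(pd.qcut(pricePerCarat, q=[0, 0.2, 0.4, 0.6, 0.8, 1]))
interval.right  # right endpoints
interval.left  # left endpoints
interval.length  # interval lengths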
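To see the per-sample endpoints together, they can be gathered into one table. This is a sketch on four made-up values in place of pricePerCarat; note that qcut lowers the left edge of the first bin slightly so that the minimum value is included.

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0])  # hypothetical data
binned = pd.qcut(values, q=[0, 0.5, 1])   # two quantile bins split at the median 2.5
intervals = pd.IntervalIndex(binned)      # one interval per sample

endpoints = pd.DataFrame({'left': intervals.left,
                          'right': intervals.right,
                          'length': intervals.length})
print(endpoints)
```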
