Introductory Machine Learning Notes, Part 2: Basic pandas Operations

Author: 一只当归 | Published 2019-03-17 23:26

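pandas is a tool built on top of NumPy. A rough way to think about it: if a NumPy array behaves like a list, a pandas object behaves more like a dictionary, because its data can be looked up by label. pandas makes it efficient to work with large datasets because it ships with a large number of fast, convenient functions and methods for processing data.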

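Reading data

The commonly supported input formats are listed below.

[Image: table of pandas reader functions for the supported input formats]

The first two readers are the ones used most often, so the examples below use those two: read_table and read_csv.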

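import numpy as np
import pandas as pd

# read_table can read plain-text (.txt) files; its default separator is a tab,
# so a comma-separated file needs sep=','
iris_text = pd.read_table('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
                         sep = ',')
# The iris dataset is read straight from its URL in the UCI repository;
# the UCI Machine Learning Repository website has more details
iris_csv =  pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data')
# Note: the file has no header row, so the first record ends up as the column names
print(iris_text.head())
print('--------------------------------------')
print(iris_csv.head())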

   5.1  3.5  1.4  0.2  Iris-setosa
0  4.9  3.0  1.4  0.2  Iris-setosa
1  4.7  3.2  1.3  0.2  Iris-setosa
2  4.6  3.1  1.5  0.2  Iris-setosa
3  5.0  3.6  1.4  0.2  Iris-setosa
4  5.4  3.9  1.7  0.4  Iris-setosa
--------------------------------------
   5.1  3.5  1.4  0.2  Iris-setosa
0  4.9  3.0  1.4  0.2  Iris-setosa
1  4.7  3.2  1.3  0.2  Iris-setosa
2  4.6  3.1  1.5  0.2  Iris-setosa
3  5.0  3.6  1.4  0.2  Iris-setosa
4  5.4  3.9  1.7  0.4  Iris-setosa
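
In the output above, the first record (5.1, 3.5, 1.4, 0.2, Iris-setosa) has been used as the header, because the file has no header row. The sketch below shows one way to avoid that with header=None and names=. It reads two records from an in-memory string instead of the URL, and the column names in cols are my own choice for illustration, not something defined by the file.

```python
import io
import pandas as pd

# The first two records of the file, in the same comma-separated, header-less layout
raw = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"

# Hypothetical column names chosen for this sketch
cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

# Default behaviour: the first record is consumed as the header
default_df = pd.read_csv(io.StringIO(raw))

# header=None keeps every record as data; names= supplies the column labels
named_df = pd.read_csv(io.StringIO(raw), header=None, names=cols)

print(default_df.shape)            # one data row left
print(named_df.shape)              # both records kept as data
print(named_df.columns.tolist())
```

The rest of these notes keep the original column labels ('5.1', '3.5', ...) so that the printed outputs still match.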

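Data structures

pandas has two main data structures: Series and DataFrame. A Series is similar to a one-dimensional NumPy array, while a DataFrame is the more heavily used two-dimensional tabular structure. These notes focus mainly on DataFrame.
Series
A Series is a one-dimensional array, similar to a one-dimensional NumPy array. Both are close to Python's built-in list; the difference is that the elements of a list can each have a different type, whereas an array or a Series has a single dtype for all of its elements (mixed values fall back to the generic object dtype). The single dtype is what allows more efficient memory use and faster computation. In pandas every value is paired with an index entry (a row label), and a DataFrame additionally has column labels. A one-dimensional Series has no columns.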

# Create a Series; it gets a default integer index
test_series1 = pd.Series([1,2,3])
# Specify the index explicitly
test_series2  = pd.Series([1,2,3],index = ['a','b','c'])
# For comparison: enumerate pairs every value with its position
test3 = list(enumerate([1,2,3]))
print(test_series1)
print('-----------------------------')
print(test_series2 )
print('-----------------------------')
print(test3)

0    1
1    2
2    3
dtype: int64
-----------------------------
a    1
b    2
c    3
dtype: int64
-----------------------------
[(0, 1), (1, 2), (2, 3)]
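
To make the point about data types concrete, here is a small sketch with made-up values. It shows how a Series settles on one dtype for all of its elements, and how a labelled Series can be read like a dictionary.

```python
import pandas as pd

mixed_list = [1, 'two', 3.0]            # a list keeps each element's own type

ints = pd.Series([1, 2, 3])             # all integers -> an integer dtype
floats = pd.Series([1, 2.5, 3])         # integers are upcast -> float64
mixed = pd.Series(mixed_list)           # no common numeric type -> object

print(ints.dtype, floats.dtype, mixed.dtype)

# With an explicit index, values are looked up by label, like dictionary keys
labelled = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
print(labelled['b'])
print('c' in labelled)                  # membership is tested against the labels
```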

DataFrame

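A DataFrame is a two-dimensional tabular data structure; it can be thought of as a container of Series. Like a NumPy array, a DataFrame lets you pull out data by index, but it offers more ways of indexing: besides the default row and column numbers, it can also be indexed by labels. Row labels are called index, and column labels are called columns.

The iris dataset read in above is in fact a DataFrame.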

# Create a date index; periods sets its length
dates = pd.date_range('20180516',periods=6)
# Build a two-dimensional DataFrame that uses the dates as its index
df = pd.DataFrame(np.random.randint(6,size=(6,4)),index=dates,
                columns=['a','b','c','d'])
print(df)

            a  b  c  d
2018-05-16  2  3  2  2
2018-05-17  2  0  0  2
2018-05-18  1  1  1  0
2018-05-19  1  5  5  4
2018-05-20  1  4  2  0
2018-05-21  5  4  3  1

A DataFrame can also be created from a dictionary, since a DataFrame itself behaves much like a dictionary of columns.

df = pd.DataFrame({'a':1,'b':'hello python','c':np.arange(2),
                    'd':['o','k'],'e':['你','好']})
print(df)

   a             b  c  d  e
0  1  hello python  0  o  你
1  1  hello python  1  k  好
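
A short sketch of what "behaves like a dictionary" means in practice. The frame is a cut-down version of the one above, and the column name 'f' is made up for the example.

```python
import numpy as np
import pandas as pd

# Scalars ('a', 'b') are repeated to match the length of the array in 'c'
demo = pd.DataFrame({'a': 1, 'b': 'hello python', 'c': np.arange(2)})

print(list(demo.keys()))       # the column labels act as the dictionary keys
print('b' in demo)             # membership is tested against the column labels
print(demo['c'].tolist())      # looking up a key returns that column as a Series

demo['f'] = demo['c'] * 10     # assigning to a new key adds a column
print(demo.columns.tolist())
```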

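A DataFrame has many attributes and methods. Some commonly used ones:

dtypes: the data type of each column.
index: the row labels (the index).
columns: the column labels.
values: the data inside the frame as an array, without the header or the index.
info(): prints a summary of the frame: non-null counts (so you can spot NaN values), data types and memory usage.
describe(): summary statistics for each column, such as minimum and maximum, mean and median; by default only numeric columns are included.
transpose(): transposes the frame; T does the same.
sort_index(): sorts by the row or column labels.
sort_values(): sorts by the data values.
cov(): returns the covariance matrix.
corr(): returns the correlation matrix.
value_counts(): counts how often each distinct value occurs.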
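
Before moving on to the iris data, here is a quick sketch of several of these on a tiny made-up frame (the names demo, x, y and the row labels are placeholders for illustration).

```python
import pandas as pd

demo = pd.DataFrame({'x': [3, 1, 2], 'y': [0.5, 1.5, 2.5]},
                    index=['r2', 'r0', 'r1'])

print(demo.dtypes)              # one dtype per column
print(demo.index.tolist())      # row labels
print(demo.columns.tolist())    # column labels
print(demo.values.shape)        # the underlying array: 3 rows, 2 columns
print(demo.T.shape)             # transposed: 2 rows, 3 columns

by_label = demo.sort_index()                            # rows ordered by label
by_value = demo.sort_values(by='x', ascending=False)    # rows ordered by x, largest first
print(by_label)
print(by_value)
```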

The examples below use the iris dataset again.
Viewing summary information

# View summary information
print(iris_csv.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 5 columns):
5.1            149 non-null float64
3.5            149 non-null float64
1.4            149 non-null float64
0.2            149 non-null float64
Iris-setosa    149 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB
None

Viewing summary statistics

# View summary statistics
print(iris_csv.describe())
# min(), mean() and similar methods can also be called directly; usage is the same as in NumPy

              5.1         3.5         1.4         0.2
count  149.000000  149.000000  149.000000  149.000000
mean     5.848322    3.051007    3.774497    1.205369
std      0.828594    0.433499    1.759651    0.761292
min      4.300000    2.000000    1.000000    0.100000
25%      5.100000    2.800000    1.600000    0.300000
50%      5.800000    3.000000    4.400000    1.300000
75%      6.400000    3.300000    5.100000    1.800000
max      7.900000    4.400000    6.900000    2.500000

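Sorting

# Sorting can be done by index or by value
# Sort by index: ascending by default; pass ascending=False for descending order
# Here: sort by row label in descending order
print(iris_csv.sort_index(axis=0,ascending=False)[:5])

print('----------------------------------------')

# Sort by value: ascending by default; pass ascending=False for descending order
# Here: sort by the values in column '1.4' in ascending order (the default)
print(iris_csv.sort_values(by="1.4")[:5])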

     5.1  3.5  1.4  0.2     Iris-setosa
148  5.9  3.0  5.1  1.8  Iris-virginica
147  6.2  3.4  5.4  2.3  Iris-virginica
146  6.5  3.0  5.2  2.0  Iris-virginica
145  6.3  2.5  5.0  1.9  Iris-virginica
144  6.7  3.0  5.2  2.3  Iris-virginica
----------------------------------------
    5.1  3.5  1.4  0.2  Iris-setosa
21  4.6  3.6  1.0  0.2  Iris-setosa
12  4.3  3.0  1.1  0.1  Iris-setosa
34  5.0  3.2  1.2  0.2  Iris-setosa
13  5.8  4.0  1.2  0.2  Iris-setosa
37  4.4  3.0  1.3  0.2  Iris-setosa

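Covariance matrix

# Covariance and correlation matrices are very important in analysis
# In pandas 2.0 and later, pass numeric_only=True to cov() and corr() so the text column is skipped
print(iris_csv.cov())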



          5.1       3.5       1.4       0.2
5.1  0.686568 -0.037279  1.270362  0.515347
3.5 -0.037279  0.187921 -0.316731 -0.115749
1.4  1.270362 -0.316731  3.096372  1.289124
0.2  0.515347 -0.115749  1.289124  0.579566

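Correlation matrix
The correlation matrix is just as important. The element in row i, column j is the correlation coefficient between column i and column j of the original data. The larger its absolute value, the stronger the linear relationship; a value of 0 means the two columns are not linearly correlated.

print(iris_csv.corr())   # larger absolute value = stronger correlation; 0 = uncorrelated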

          5.1       3.5       1.4       0.2
5.1  1.000000 -0.103784  0.871283  0.816971
3.5 -0.103784  1.000000 -0.415218 -0.350733
1.4  0.871283 -0.415218  1.000000  0.962314
0.2  0.816971 -0.350733  0.962314  1.000000
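
The two matrices are closely related: each correlation coefficient is the covariance divided by the product of the two standard deviations. A minimal sketch that checks this on made-up numbers (the frame demo and its columns p, q, r are placeholders, not the iris data):

```python
import numpy as np
import pandas as pd

# Made-up numeric data, used only to illustrate the relationship
demo = pd.DataFrame({'p': [1.0, 2.0, 3.0, 4.0],
                     'q': [2.1, 3.9, 6.2, 7.8],
                     'r': [4.0, 1.0, 3.0, 2.0]})

cov = demo.cov()
std = demo.std()     # sample standard deviation (ddof=1), the same convention cov() uses

# corr_ij = cov_ij / (std_i * std_j)
manual = cov / np.outer(std, std)

print(manual.round(6))
print(demo.corr().round(6))
```

Both printed matrices should agree, and the diagonal is 1 because every column is perfectly correlated with itself.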

value_counts()

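# value_counts() shows the distribution of a column at a glance;
# pass ascending=True to list the counts from smallest to largest,
# or bins=... to group numeric values into intervals automatically
iris_csv['Iris-setosa'].value_counts()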

Iris-versicolor    50
Iris-virginica     50
Iris-setosa        49
Name: Iris-setosa, dtype: int64

Sorting the counts in ascending order

iris_csv['3.5'].value_counts(ascending = True)

4.0     1
2.0     1
4.4     1
4.1     1
4.2     1
3.9     2
3.7     3
2.2     3
2.4     3
3.6     3
2.3     4
2.6     5
3.5     5
3.3     6
3.8     6
2.5     8
2.7     9
2.9    10
3.1    12
3.4    12
3.2    13
2.8    14
3.0    26
Name: 3.5, dtype: int64

Setting bins

# With bins=5 the values are grouped into five equal-width intervals; worth considering when a column has many distinct values
iris_csv['3.5'].value_counts(ascending = True,bins = 5)

(3.92, 4.4]       4
(1.997, 2.48]    11
(3.44, 3.92]     19
(2.48, 2.96]     46
(2.96, 3.44]     69
Name: 3.5, dtype: int64
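
To see how the intervals are formed, here is a sketch on a handful of made-up values (the Series widths is a placeholder). The range from the minimum to the maximum is split into equal-width bins, and sort_index() lists them from the lowest interval to the highest instead of by count.

```python
import pandas as pd

# Made-up values ranging from 2.0 to 4.4
widths = pd.Series([2.0, 2.2, 2.9, 3.0, 3.0, 3.1, 3.4, 3.8, 4.4])

# Three equal-width bins of 0.8 each, listed in interval order
binned = widths.value_counts(bins=3).sort_index()
print(binned)

print(binned.sum())   # every value falls into exactly one bin
```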

Selecting data

Start with some basic selection and slicing.

# Selecting a single column returns a Series
Iris_setosa = iris_csv['Iris-setosa']
print(Iris_setosa[:5])

print('--------------------------------')

# Several columns can be selected at once by passing a list of labels
df_iris1 = iris_csv[['1.4','Iris-setosa']]
print(df_iris1[:5])

print('--------------------------------')

# Plain slicing also works, but it only slices rows
df_iris2 = iris_csv[:2]
print(df_iris2)

0    Iris-setosa
1    Iris-setosa
2    Iris-setosa
3    Iris-setosa
4    Iris-setosa
Name: Iris-setosa, dtype: object
--------------------------------
   1.4  Iris-setosa
0  1.4  Iris-setosa
1  1.3  Iris-setosa
2  1.5  Iris-setosa
3  1.4  Iris-setosa
4  1.7  Iris-setosa
--------------------------------
   5.1  3.5  1.4  0.2  Iris-setosa
0  4.9  3.0  1.4  0.2  Iris-setosa
1  4.7  3.2  1.3  0.2  Iris-setosa

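Advanced indexing

loc[]: label-based selection. A DataFrame row can be addressed in two ways: by its explicit row label, or by its implicit row number. loc selects rows by label and can be combined with column labels to pick out specific cells.
iloc[]: position-based selection. iloc selects rows by their implicit integer position (0, 1, 2, ...) and can be combined with integer column positions to pick out specific cells.
ix[]: mixed selection. loc accepts only explicit labels and iloc only integer positions; ix allowed the two to be mixed.

Note: ix is deprecated (it was removed in pandas 1.0), so it is not used here.

All of that sounds complicated, but it boils down to two sentences:

loc locates by label
iloc locates by position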
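
The difference is easiest to see when the labels are integers that do not match the positions. A small sketch on a made-up frame (demo and its column v are placeholders):

```python
import pandas as pd

# Made-up frame whose integer labels deliberately differ from the positions
demo = pd.DataFrame({'v': ['first', 'second', 'third']}, index=[10, 20, 30])

print(demo.loc[10, 'v'])      # row with label 10
print(demo.iloc[0, 0])        # row at position 0: the same cell

# Slices differ too: loc includes the end label, iloc excludes the end position
by_label = demo.loc[10:20, 'v'].tolist()
by_position = demo.iloc[0:1, 0].tolist()
print(by_label)
print(by_position)
```

Asking for demo.loc[0] here would raise a KeyError, since there is no row labelled 0.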

# From here on we switch to the Titanic dataset
df = pd.read_csv('titanic.csv')
print(df.info())



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None

iloc
Indexing works much like it does in NumPy.

print(df.iloc[0])

PassengerId                          1
Survived                             0
Pclass                               3
Name           Braund, Mr. Owen Harris
Sex                               male
Age                                 22
SibSp                                1
Parch                                0
Ticket                       A/5 21171
Fare                              7.25
Cabin                              NaN
Embarked                             S
Name: 0, dtype: object

Slicing

print(df.iloc[0:2])



                                                    PassengerId  Survived  \
Name                                                                        
Braund, Mr. Owen Harris                                       1         0   
Cumings, Mrs. John Bradley (Florence Briggs Tha...            2         1   

                                                    Pclass     Sex   Age  \
Name                                                                       
Braund, Mr. Owen Harris                                  3    male  22.0   
Cumings, Mrs. John Bradley (Florence Briggs Tha...       1  female  38.0   

                                                    SibSp  Parch     Ticket  \
Name                                                                          
Braund, Mr. Owen Harris                                 1      0  A/5 21171   
Cumings, Mrs. John Bradley (Florence Briggs Tha...      1      0   PC 17599   

                                                       Fare Cabin Embarked  
Name                                                                        
Braund, Mr. Owen Harris                              7.2500   NaN        S  
Cumings, Mrs. John Bradley (Florence Briggs Tha...  71.2833   C85        C

Slicing rows and columns

print(df.iloc[0:4,0:4])

                                                    PassengerId  Survived  \
Name                                                                        
Braund, Mr. Owen Harris                                       1         0   
Cumings, Mrs. John Bradley (Florence Briggs Tha...            2         1   
Heikkinen, Miss. Laina                                        3         1   
Futrelle, Mrs. Jacques Heath (Lily May Peel)                  4         1   

                                                    Pclass     Sex  
Name                                                                
Braund, Mr. Owen Harris                                  3    male  
Cumings, Mrs. John Bradley (Florence Briggs Tha...       1  female  
Heikkinen, Miss. Laina                                   3  female  
Futrelle, Mrs. Jacques Heath (Lily May Peel)             1  female

loc
Used for label-based indexing.

df = df.set_index('Name')         # make Name the index so that loc can look rows up by name
df.loc['Braund, Mr. Owen Harris']

PassengerId            1
Survived               0
Pclass                 3
Sex                 male
Age                   22
SibSp                  1
Parch                  0
Ticket         A/5 21171
Fare                7.25
Cabin                NaN
Embarked               S
Name: Braund, Mr. Owen Harris, dtype: object

Specifying both a row label and a column label

df.loc['Braund, Mr. Owen Harris','Sex']  # row label, then column label

'male'

Slicing

df.loc['Braund, Mr. Owen Harris':'Heikkinen, Miss. Laina',:]    # label slices work too; both endpoints are included

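Boolean indexing

This works the same way as the NumPy operations covered earlier, so a single example will do.
First, build a boolean Series:

print(df['Sex'] == 'female') # a boolean Series, one value per row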



Name
Braund, Mr. Owen Harris                                      False
Cumings, Mrs. John Bradley (Florence Briggs Thayer)           True
Heikkinen, Miss. Laina                                        True
Futrelle, Mrs. Jacques Heath (Lily May Peel)                  True
Allen, Mr. William Henry                                     False
Moran, Mr. James                                             False
McCarthy, Mr. Timothy J                                      False
Palsson, Master. Gosta Leonard                               False
Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)             True
Nasser, Mrs. Nicholas (Adele Achem)                           True
Sandstrom, Miss. Marguerite Rut                               True
Bonnell, Miss. Elizabeth                                      True
Saundercock, Mr. William Henry                               False
Andersson, Mr. Anders Johan                                  False
Vestrom, Miss. Hulda Amanda Adolfina                          True
Hewlett, Mrs. (Mary D Kingcome)                               True
Rice, Master. Eugene                                         False
Williams, Mr. Charles Eugene                                 False
Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)       True
Masselmani, Mrs. Fatima                                       True
Fynney, Mr. Joseph J                                         False
Beesley, Mr. Lawrence                                        False
McGowan, Miss. Anna "Annie"                                   True
Sloper, Mr. William Thompson                                 False
Palsson, Miss. Torborg Danira                                 True
Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson)     True
Emir, Mr. Farred Chehab                                      False
Fortune, Mr. Charles Alexander                               False
O'Dwyer, Miss. Ellen "Nellie"                                 True
Todoroff, Mr. Lalio                                          False
                                                             ...  
Giles, Mr. Frederick Edward                                  False
Swift, Mrs. Frederick Joel (Margaret Welles Barron)           True
Sage, Miss. Dorothy Edith "Dolly"                             True
Gill, Mr. John William                                       False
Bystrom, Mrs. (Karolina)                                      True
Duran y More, Miss. Asuncion                                  True
Roebling, Mr. Washington Augustus II                         False
van Melkebeke, Mr. Philemon                                  False
Johnson, Master. Harold Theodor                              False
Balkic, Mr. Cerin                                            False
Beckwith, Mrs. Richard Leonard (Sallie Monypeny)              True
Carlsson, Mr. Frans Olof                                     False
Vander Cruyssen, Mr. Victor                                  False
Abelson, Mrs. Samuel (Hannah Wizosky)                         True
Najib, Miss. Adele Kiamie "Jane"                              True
Gustafsson, Mr. Alfred Ossian                                False
Petroff, Mr. Nedelio                                         False
Laleff, Mr. Kristo                                           False
Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)                 True
Shelley, Mrs. William (Imanita Parrish Hall)                  True
Markun, Mr. Johann                                           False
Dahlberg, Miss. Gerda Ulrika                                  True
Banfield, Mr. Frederick James                                False
Sutehall, Mr. Henry Jr                                       False
Rice, Mrs. William (Margaret Norton)                          True
Montvila, Rev. Juozas                                        False
Graham, Miss. Margaret Edith                                  True
Johnston, Miss. Catherine Helen "Carrie"                      True
Behr, Mr. Karl Howell                                        False
Dooley, Mr. Patrick                                          False
Name: Sex, Length: 891, dtype: bool

Then use it to select the matching rows

df[df['Sex'] == 'female'][:5]


Selecting a specific column from the result

df.loc[df['Sex'] == 'male','Age'][:5]      # ages of the male passengers

Name
Braund, Mr. Owen Harris           22.0
Allen, Mr. William Henry          35.0
Moran, Mr. James                   NaN
McCarthy, Mr. Timothy J           54.0
Palsson, Master. Gosta Leonard     2.0
Name: Age, dtype: float64
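
Conditions can also be combined. The sketch below uses a tiny made-up frame whose column names mirror the Titanic columns used above; the rows themselves are invented for the example.

```python
import pandas as pd

# Made-up passengers; only the column names come from the Titanic data
people = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male'],
                       'Age': [22.0, 38.0, 26.0, 54.0],
                       'Pclass': [3, 1, 3, 1]})

# Each comparison needs its own parentheses; & is "and", | is "or", ~ is "not"
older_women = people[(people['Sex'] == 'female') & (people['Age'] > 30)]
first_or_old = people[(people['Pclass'] == 1) | (people['Age'] > 50)]
not_male_ages = people.loc[~(people['Sex'] == 'male'), 'Age']

print(older_women)
print(first_or_old)
print(not_male_ages)
```

Python's own and / or do not work here, because they try to reduce a whole Series to a single True or False.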

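There is also the where operation.

where(cond, other=nan): only the first two parameters are covered here; see the official pandas documentation for the rest. cond is the boolean condition that is passed in (for example a boolean DataFrame of the same shape). Values where cond is True are kept, and other specifies what the values that fail the condition are replaced with.

Without the other parameter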

df.where(df>2)[:5]


With the other parameter

df.where(df>2,0)[:5]

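
Since the outputs of the two calls above are not shown, here is a sketch of the same two calls on a small, purely numeric, made-up frame (nums is a placeholder, not the Titanic data):

```python
import pandas as pd

nums = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 1, 5]})   # made-up numeric data

kept = nums.where(nums > 2)         # cells that fail the condition become NaN
filled = nums.where(nums > 2, 0)    # cells that fail the condition become 0

print(kept)
print(filled)
```

The shape of the result always matches the original frame; where masks values instead of dropping rows.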

There is also the query operation,
which filters the rows of a DataFrame using a boolean expression over its columns.

df1 = pd.DataFrame(np.random.randn(5, 2), columns=list('ab'))
print(df1.query('a > b'))   # equivalent to df1[df1.a > df1.b]

          a         b
1  1.164991 -1.354645
3  0.993299  0.188703
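
Because the example above uses random numbers, its output changes on every run. A reproducible sketch with fixed, made-up values follows; the names scores and threshold are placeholders. It also shows that a local Python variable can be referenced inside the expression with @.

```python
import pandas as pd

scores = pd.DataFrame({'a': [1, 5, 3], 'b': [2, 4, 3]})   # made-up data
threshold = 2                                             # hypothetical cutoff

via_query = scores.query('a > b')
via_mask = scores[scores['a'] > scores['b']]
print(via_query.equals(via_mask))    # both approaches select the same rows

# Conditions can be combined, and @name refers to a local Python variable
above = scores.query('a >= b and a > @threshold')
print(above)
```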

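Combining data

This section covers concat and merge, two operations that come up often in feature engineering.

concat([df1, df2, ...], axis=...): concatenates DataFrames along the rows (axis=0) or along the columns (axis=1).
merge(df1, df2, on='', how=''): joins two DataFrames on one or more keys. on names the key column (a single name or a list of names); how selects the join type: 'left', 'right', 'outer' (full outer join) or 'inner', which is the default.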

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                    'A': ['A0', 'A1', 'A2', 'A3'], 
                    'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                    'C': ['C0', 'C1', 'C2', 'C3'], 
                    'D': ['D0', 'D1', 'D2', 'D3']})
print(left)
print(right)

    A   B key
0  A0  B0  K0
1  A1  B1  K1
2  A2  B2  K2
3  A3  B3  K3
    C   D key
0  C0  D0  K0
1  C1  D1  K1
2  C2  D2  K2
3  C3  D3  K3

concat

res = pd.concat([left,right],axis = 1)
print(res)

    A   B key   C   D key
0  A0  B0  K0  C0  D0  K0
1  A1  B1  K1  C1  D1  K1
2  A2  B2  K2  C2  D2  K2
3  A3  B3  K3  C3  D3  K3
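
The example above concatenates side by side (axis=1). For completeness, a sketch of stacking rows with axis=0, using two small made-up frames (top and bottom are placeholders):

```python
import pandas as pd

top = pd.DataFrame({'key': ['K0', 'K1'], 'A': ['A0', 'A1']})       # made-up
bottom = pd.DataFrame({'key': ['K2', 'K3'], 'A': ['A2', 'A3']})    # made-up

# axis=0 stacks the rows; the original row labels are kept, so they repeat
stacked = pd.concat([top, bottom], axis=0)
print(stacked.index.tolist())

# ignore_index=True renumbers the rows from 0
renumbered = pd.concat([top, bottom], axis=0, ignore_index=True)
print(renumbered.index.tolist())
```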

merge

res = pd.merge(left, right, on = 'key')
print(res)

    A   B key   C   D
0  A0  B0  K0  C0  D0
1  A1  B1  K1  C1  D1
2  A2  B2  K2  C2  D2
3  A3  B3  K3  C3  D3
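
In the example above both frames contain exactly the same keys, so every join type would give the same result. The how parameter only matters when the keys differ, as in this sketch with two made-up frames (here named l_df and r_df to keep them apart from left and right above):

```python
import pandas as pd

# Made-up frames: K0 exists only on the left, K3 only on the right
l_df = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
r_df = pd.DataFrame({'key': ['K1', 'K2', 'K3'], 'C': ['C1', 'C2', 'C3']})

keys_by_how = {}
for how in ['inner', 'left', 'right', 'outer']:
    res = pd.merge(l_df, r_df, on='key', how=how)
    keys_by_how[how] = res['key'].tolist()
    print(how, keys_by_how[how])

# With no how argument, merge performs an inner join
default_keys = pd.merge(l_df, r_df, on='key')['key'].tolist()
print(default_keys)
```

Rows that have no partner on the other side get NaN in the columns that come from that side.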

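There is also the join method, which adds the columns of another DataFrame by key. The DataFrame being joined in must have an index that corresponds to the key column of the calling DataFrame.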

right.set_index('key',inplace = True)
res = left.join(right, on = 'key')
print(res)

    A   B key   C   D
0  A0  B0  K0  C0  D0
1  A1  B1  K1  C1  D1
2  A2  B2  K2  C2  D2
3  A3  B3  K3  C3  D3

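These are the pandas methods I consider the most basic and the most commonly used. If anything is unclear, be sure to check the API documentation and then practice it yourself.
The next post covers some more advanced usage.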
