Learning pandas, Part 5

Author: 蓝剑狼 | Published 2018-08-12 21:19

Pandas data structures: DataFrame indexing

A DataFrame has both a row index and a column index. It can be viewed as a dict of Series that all share one index.
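To make the "dict of Series" view concrete, here is a minimal sketch. The data and the names `idx`, `col_a` and `col_b` are made up for illustration and are not part of the original notes.

```python
import pandas as pd

# Hypothetical data: two Series that share the same index labels.
idx = ['one', 'two', 'three']
col_a = pd.Series([1.0, 2.0, 3.0], index=idx)
col_b = pd.Series([10.0, 20.0, 30.0], index=idx)

# Building a DataFrame from a dict of Series: the dict keys become column labels,
# and the shared Series index becomes the row index.
df = pd.DataFrame({'a': col_a, 'b': col_b})

print(df.index.tolist())     # row labels
print(df.columns.tolist())   # column labels

# Pulling one column back out gives a Series on the same row index.
print(type(df['a']))
print(df['a'].index.equals(df.index))
```

Looking a column up by its key, as with a dict, is exactly what `df['a']` does in the sections that follow.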

Selecting columns / selecting rows / slicing / boolean filtering

# Selecting rows and columns

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,
                   index = ['one','two','three'],
                   columns = ['a','b','c','d'])
print("1".center(40,'*'))
print(df)

data1 = df['a']
data2 = df[['a','c']]
print("2".center(40,'*'))
print(data1,'\n',type(data1))
print("3".center(40,'*'))
print(data2,'\n',type(data2))
# Selecting by column name: a single name returns a Series, a list of names returns a DataFrame

data3 = df.loc['one']
data4 = df.loc[['one','two']]
print("4".center(40,'*'))
print(data3,'\n',type(data3))
print("5".center(40,'*'))
print(data4,'\n',type(data4))
# Selecting rows by index label: a single label returns a Series, a list of labels returns a DataFrame
# Output (the values are random, so they differ on every run)
*******************1********************
               a          b          c          d
one    44.252039   9.367770  52.196322  54.905461
two    98.031746  81.958438   9.527486  82.967234
three  83.547295  23.754233   8.578580  51.565959
*******************2********************
one      44.252039
two      98.031746
three    83.547295
Name: a, dtype: float64 
 <class 'pandas.core.series.Series'>
*******************3********************
               a          c
one    44.252039  52.196322
two    98.031746   9.527486
three  83.547295   8.578580 
 <class 'pandas.core.frame.DataFrame'>
*******************4********************
a    44.252039
b     9.367770
c    52.196322
d    54.905461
Name: one, dtype: float64 
 <class 'pandas.core.series.Series'>
*******************5********************
             a          b          c          d
one  44.252039   9.367770  52.196322  54.905461
two  98.031746  81.958438   9.527486  82.967234 
 <class 'pandas.core.frame.DataFrame'>
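The Series-versus-DataFrame rule depends on whether a single label or a list is passed, not on how many items are selected. A small sketch with made-up data (the variable names are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]},
                  index=['one', 'two', 'three'])

one_col = df['a']             # single label -> Series
one_col_df = df[['a']]        # list holding one label -> DataFrame with one column
one_row = df.loc['one']       # single label -> Series
one_row_df = df.loc[['one']]  # list holding one label -> DataFrame with one row

for obj in (one_col, one_col_df, one_row, one_row_df):
    print(type(obj).__name__, obj.shape)
```

Wrapping a single label in a list is a handy way to keep the result two-dimensional.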

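# df[] - selecting columns
# Mainly used to select columns; a slice inside [] selects rows instead

df = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,
                   index = ['one','two','three'],
                  columns = ['a','b','c','d'])
print("1".center(40,'*'))
print(df)

data1 = df['a']
data2 = df[['b','c']]  # try data2 = df[['b','c','e']]: the missing column 'e' raises KeyError
print("2".center(40,'*'))
print(data1)
print("3".center(40,'*'))
print(data2)
# df[] selects columns by default; put column names inside [] (so columns are usually given
# explicit names instead of the default integer labels, which are easy to confuse with the index)
# A single column is a Series and prints as a Series
# Several columns are a DataFrame and print as a DataFrame

data3 = df[:1]
# data3 = df[0]      # KeyError: there is no column named 0
# data3 = df['one']  # KeyError: 'one' is a row label, not a column
print("4".center(40,'*'))
print(data3,'\n',type(data3))
# A slice inside df[] selects rows; a single integer such as df[0] is looked up as a column name,
# so it cannot select a row
# The result is a DataFrame even when only one row is selected
# df[] cannot select a single row by its index label (df['one'])

# Key point: df[col] is generally used to select columns; put column names inside []
# Output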
*******************1********************
               a          b          c          d
one    35.520221  64.848157   5.812035  11.379704
two     4.063157  96.357304  89.484171  87.545658
three  98.608574  75.918864  31.055319  22.553646
*******************2********************
one      35.520221
two       4.063157
three    98.608574
Name: a, dtype: float64
*******************3********************
               b          c
one    64.848157   5.812035
two    96.357304  89.484171
three  75.918864  31.055319
*******************4********************
             a          b         c          d
one  35.520221  64.848157  5.812035  11.379704 
 <class 'pandas.core.frame.DataFrame'>
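The failing cases that are commented out above can be checked without stopping the script. The helper `try_select` below is a made-up name for this sketch, and the data is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]},
                  index=['one', 'two', 'three'])

def try_select(key):
    """Return df[key], or the string 'KeyError' if the lookup fails."""
    try:
        return df[key]
    except KeyError:
        return 'KeyError'

print(try_select(0))            # 0 is looked up as a column name
print(try_select('one'))        # 'one' is a row label, not a column name
print(try_select(['a', 'c']))   # a list containing a missing column name

# Slices are the exception: inside df[] they select rows.
print(df[:1])                   # positional slice
print(df['one':'two'])          # label slice, which includes the end label
```

Because `df[]` changes meaning depending on what is passed in, `loc` and `iloc` (covered next) are the less ambiguous way to select rows.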

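# df.loc[] - selecting rows by index label

df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   index = ['one','two','three','four'],
                   columns = ['a','b','c','d'])
df2 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   columns = ['a','b','c','d'])
print("1".center(40,'*'))
print(df1)
print("2".center(40,'*'))
print(df2)


data1 = df1.loc['one']
data2 = df2.loc[1]
print("3".center(40,'*'))
print(data1)
print("4".center(40,'*'))
print(data2)
print('单标签索引\n-----')  # prints the label "single-label indexing"
# Indexing with a single label returns a Series

# The original notes used df1.loc[['two','three','five']]. Older pandas returned a NaN row
# for the missing label 'five'; since pandas 1.0, .loc raises KeyError instead.
# reindex gives the same NaN-row result in any version.
data3 = df1.reindex(['two','three','five'])
data4 = df2.loc[[3,2,1]]
print("5".center(40,'*'))
print(data3)
print("6".center(40,'*'))
print(data4)
print('多标签索引\n-----')  # prints the label "multi-label indexing"
# Indexing with a list of labels returns a DataFrame
# The labels can be given in any order

data5 = df1.loc['one':'three']
data6 = df2.loc[1:3]
print("7".center(40,'*'))
print(data5)
print("8".center(40,'*'))
print(data6)
print('切片索引')  # prints the label "slice indexing"
# Label slices are supported
# The end label is included

# Key point: df.loc[label] selects rows by index label; it works with both a custom index
# and the default integer index
# Output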
*******************1********************
               a          b          c          d
one    77.451739  30.571703  64.803881  40.301115
two    60.018758  82.042454  96.573686  15.759030
three  60.696044  74.762078   0.316767  34.972233
four   43.905231  72.191279  93.145908  63.543389
*******************2********************
           a          b          c          d
0  63.075883  43.567530  15.135401  82.973379
1  59.717961  75.257481  99.692578  67.124844
2  79.859617  53.077732   3.832747  38.450555
3  99.624674   9.681684  64.472354  28.139052
*******************3********************
a    77.451739
b    30.571703
c    64.803881
d    40.301115
Name: one, dtype: float64
*******************4********************
a    59.717961
b    75.257481
c    99.692578
d    67.124844
Name: 1, dtype: float64
单标签索引
-----
*******************5********************
               a          b          c          d
two    60.018758  82.042454  96.573686  15.759030
three  60.696044  74.762078   0.316767  34.972233
five         NaN        NaN        NaN        NaN
*******************6********************
           a          b          c          d
3  99.624674   9.681684  64.472354  28.139052
2  79.859617  53.077732   3.832747  38.450555
1  59.717961  75.257481  99.692578  67.124844
多标签索引
-----
*******************7********************
               a          b          c          d
one    77.451739  30.571703  64.803881  40.301115
two    60.018758  82.042454  96.573686  15.759030
three  60.696044  74.762078   0.316767  34.972233
*******************8********************
           a          b          c          d
1  59.717961  75.257481  99.692578  67.124844
2  79.859617  53.077732   3.832747  38.450555
3  99.624674   9.681684  64.472354  28.139052
切片索引
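The NaN row for 'five' in output 5 depends on the pandas version. The sketch below, with made-up data, shows how a missing label can be handled explicitly; `missing_raises` is an illustrative variable, and the KeyError behavior is what I understand pandas 1.0 and later to do.

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]},
                  index=['one', 'two', 'three'])

# Label slices include both endpoints.
print(df.loc['one':'two'])

# A list containing a label that is not in the index.
# Recent pandas raises KeyError here; very old versions returned a NaN row.
try:
    df.loc[['two', 'five']]
    missing_raises = False
except KeyError:
    missing_raises = True
print(missing_raises)

# reindex is the explicit way to ask for a NaN row for a missing label.
filled = df.reindex(['two', 'five'])
print(filled)
```

If missing labels should simply be skipped rather than filled, one option is to filter the list first, for example with `df.index.intersection(labels)`.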

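# df.iloc[] - selecting rows by integer position (from 0 to length-1 along the axis)
# Works like list indexing: positions follow the row order of the DataFrame, counting from 0

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   index = ['one','two','three','four'],
                   columns = ['a','b','c','d'])
print("1".center(40,'*'))
print(df)

print("2".center(40,'*'))
print(df.iloc[0])
print("3".center(40,'*'))
print(df.iloc[-1])
# print(df.iloc[4])  # IndexError: position 4 is out of bounds for 4 rows
print('单位置索引\n-----')  # prints the label "single-position indexing"
# Indexing with a single position returns a Series
# iloc takes positions, not labels, and a single position beyond the number of rows raises IndexError
print("4".center(40,'*'))
print(df.iloc[[0,2]])
print("5".center(40,'*'))
print(df.iloc[[3,2,1]])
print('多位置索引\n-----')  # prints the label "multi-position indexing"
# Indexing with a list of positions returns a DataFrame
# The positions can be given in any order
print("6".center(40,'*'))
print(df.iloc[1:3])
print("7".center(40,'*'))
print(df.iloc[::2])
print('切片索引')  # prints the label "slice indexing"
# Positional slices are supported
# The end position is excluded
# Output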
*******************1********************
               a          b          c          d
one    50.862519  83.335659  37.581797   9.538039
two    50.433386  32.915710  49.080627  51.744583
three  50.805240  30.260431  17.442769  72.023780
four   72.270772  96.909730  13.421593  96.353553
*******************2********************
a    50.862519
b    83.335659
c    37.581797
d     9.538039
Name: one, dtype: float64
*******************3********************
a    72.270772
b    96.909730
c    13.421593
d    96.353553
Name: four, dtype: float64
单位置索引
-----
*******************4********************
               a          b          c          d
one    50.862519  83.335659  37.581797   9.538039
three  50.805240  30.260431  17.442769  72.023780
*******************5********************
               a          b          c          d
four   72.270772  96.909730  13.421593  96.353553
three  50.805240  30.260431  17.442769  72.023780
two    50.433386  32.915710  49.080627  51.744583
多位置索引
-----
*******************6********************
               a          b          c          d
two    50.433386  32.915710  49.080627  51.744583
three  50.805240  30.260431  17.442769  72.023780
*******************7********************
               a          b          c          d
one    50.862519  83.335659  37.581797   9.538039
three  50.805240  30.260431  17.442769  72.023780
切片索引
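The main trap when switching between `loc` and `iloc` is the slice endpoint. A small sketch with made-up data compares the two and also checks the out-of-bounds behavior; `single_oob` and `clipped` are names chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4]},
                  index=['one', 'two', 'three', 'four'])

by_pos = df.iloc[1:3]              # positions 1 and 2; position 3 is excluded
by_label = df.loc['two':'three']   # labels 'two' through 'three', end included
print(by_pos.equals(by_label))

# A single out-of-range position raises IndexError ...
try:
    df.iloc[4]
    single_oob = 'no error'
except IndexError:
    single_oob = 'IndexError'
print(single_oob)

# ... but an out-of-range slice is clipped to the available rows, as with a Python list.
clipped = df.iloc[2:10]
print(clipped.index.tolist())
```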

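# Boolean indexing
# Works on the same principle as for a Series

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   index = ['one','two','three','four'],
                   columns = ['a','b','c','d'])
print("1".center(40,'*'))
print(df)
print('------')

b1 = df < 20
print("2".center(40,'*'))
print(b1,type(b1))
print("3".center(40,'*'))
print(df[b1])  # can also be written as df[df < 20]
# With no column selected, the comparison is applied to every value
# The result keeps the full shape: values where the mask is True are kept, the rest become NaN

b2 = df['a'] > 50
print("4".center(40,'*'))
print(b2,type(b2))
print("5".center(40,'*'))
print(df[b2])  # can also be written as df[df['a'] > 50]
# Condition on a single column
# The result keeps only the rows where the condition is True, with all of their columns

b3 = df[['a','b']] > 50
print("6".center(40,'*'))
print(b3,type(b3))
print("7".center(40,'*'))
print(df[b3])  # can also be written as df[df[['a','b']] > 50]
# Condition on several columns
# The result keeps the full shape: True cells are kept; False cells and columns missing from the mask become NaN

b4 = df.loc[['one','three']] < 50
print("8".center(40,'*'))
print(b4,type(b4))
print("9".center(40,'*'))
print(df[b4])  # can also be written as df[df.loc[['one','three']] < 50]
# Condition on several rows
# The result keeps the full shape: True cells are kept; False cells and rows missing from the mask become NaN
# Output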
*******************1********************
               a          b          c          d
one    26.434806  34.117824  19.677857  46.504626
two    86.658355  81.060611  51.598853  44.611262
three  68.528156  98.356481  83.397588  82.896992
four   38.606510  41.794018  60.998152  78.224933
------
*******************2********************
           a      b      c      d
one    False  False   True  False
two    False  False  False  False
three  False  False  False  False
four   False  False  False  False <class 'pandas.core.frame.DataFrame'>
*******************3********************
        a   b          c   d
one   NaN NaN  19.677857 NaN
two   NaN NaN        NaN NaN
three NaN NaN        NaN NaN
four  NaN NaN        NaN NaN
*******************4********************
one      False
two       True
three     True
four     False
Name: a, dtype: bool <class 'pandas.core.series.Series'>
*******************5********************
               a          b          c          d
two    86.658355  81.060611  51.598853  44.611262
three  68.528156  98.356481  83.397588  82.896992
*******************6********************
           a      b
one    False  False
two     True   True
three   True   True
four   False  False <class 'pandas.core.frame.DataFrame'>
*******************7********************
               a          b   c   d
one          NaN        NaN NaN NaN
two    86.658355  81.060611 NaN NaN
three  68.528156  98.356481 NaN NaN
four         NaN        NaN NaN NaN
*******************8********************
           a      b      c      d
one     True   True   True   True
three  False  False  False  False <class 'pandas.core.frame.DataFrame'>
*******************9********************
               a          b          c          d
one    26.434806  34.117824  19.677857  46.504626
two          NaN        NaN        NaN        NaN
three        NaN        NaN        NaN        NaN
four         NaN        NaN        NaN        NaN
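The two outcomes above (whole rows dropped versus cells replaced by NaN) depend on the type of the mask. A minimal sketch with made-up data; `row_mask`, `cell_mask` and `masked` are names chosen here:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 60, 70], 'b': [80, 20, 90]},
                  index=['one', 'two', 'three'])

# A boolean Series is matched against the row index: rows marked False are dropped.
row_mask = df['a'] > 50
print(df[row_mask])

# A boolean DataFrame is matched cell by cell: the shape is kept, False cells become NaN.
cell_mask = df > 50
masked = df[cell_mask]
print(masked)

# With a boolean DataFrame, df[mask] gives the same result as df.where(mask).
print(masked.equals(df.where(cell_mask)))
```

When the intent is to drop rows, build the mask from a column (a Series); when the intent is to blank out values in place, build it from the whole frame.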

# Combined indexing: for example, selecting rows and columns together
# Select columns first, then rows: pick the fields of interest, then limit how many records are kept

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   index = ['one','two','three','four'],
                   columns = ['a','b','c','d'])
print("1".center(40,'*'))
print(df)
print('------')
print("2".center(40,'*'))
print(df['a'].loc[['one','three']])  # rows 'one' and 'three' of column a
print("3".center(40,'*'))
print(df[['b','c','d']].iloc[::2])   # rows 'one' and 'three' of columns b, c, d
print("4".center(40,'*'))
print(df[df['a'] < 50].iloc[:2])   # at most the first two rows that satisfy the condition
# Output
*******************1********************
               a          b          c          d
one    90.650107  65.405366  78.994304  67.269502
two    68.413380  60.022026  40.080027  26.599064
three  69.483353  99.762443  17.153750  43.870798
four   46.855326  89.543055  65.151681  90.754392
------
*******************2********************
one      90.650107
three    69.483353
Name: a, dtype: float64
*******************3********************
               b          c          d
one    65.405366  78.994304  67.269502
three  99.762443  17.153750  43.870798
*******************4********************
              a          b          c          d
four  46.855326  89.543055  65.151681  90.754392
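The chained selections above can also be written as a single `loc` or `iloc` call that takes a row selector and a column selector together. This is a sketch with made-up integer data, not the author's code; it only checks that both spellings return the same result.

```python
import pandas as pd

df = pd.DataFrame({'a': [90, 68, 69, 46],
                   'b': [65, 60, 99, 89],
                   'c': [78, 40, 17, 65]},
                  index=['one', 'two', 'three', 'four'])

# Column first, then rows (chained) vs. rows and column in one loc call.
chained = df['a'].loc[['one', 'three']]
direct = df.loc[['one', 'three'], 'a']
print(chained.equals(direct))

# The same idea by position: every second row, columns at positions 1 and 2.
chained_pos = df[['b', 'c']].iloc[::2]
direct_pos = df.iloc[::2, [1, 2]]
print(chained_pos.equals(direct_pos))

# A boolean row mask and a column list can be combined in loc as well.
filtered = df.loc[df['a'] < 50, ['a', 'b']]
print(filtered)
```

The single-call form matters most when assigning values: an assignment through a chained selection may write to a temporary copy, whereas `df.loc[rows, cols] = value` acts on `df` itself.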
