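Pandas Data Structures: DataFrame Indexing
A DataFrame has both a row index and a column index. It can be viewed as a dict of Series that all share one index.
Selecting columns / selecting rows / slicing / boolean indexing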
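To make the "dict of Series sharing one index" picture concrete, here is a minimal sketch. The Series names and values are invented for illustration only.

```python
import pandas as pd

# Two Series with the same index labels (values are made up for illustration)
a = pd.Series([1.0, 2.0, 3.0], index=['one', 'two', 'three'])
b = pd.Series([10.0, 20.0, 30.0], index=['one', 'two', 'three'])

# A dict of Series becomes a DataFrame:
# dict keys -> column labels, shared Series index -> row labels
df = pd.DataFrame({'a': a, 'b': b})

print(df.index.tolist())    # row labels
print(df.columns.tolist())  # column labels

# Pulling a column back out gives a Series with the original values
print(df['a'].tolist() == a.tolist())
```

Each column can be read back as a Series, which is why selecting a single column in the sections below returns a Series rather than a DataFrame.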
# Selecting rows and columns
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,
index = ['one','two','three'],
columns = ['a','b','c','d'])
print("1".center(40,'*'))
print(df)
data1 = df['a']
data2 = df[['a','c']]
print("2".center(40,'*'))
print(data1,'\n',type(data1))
print("3".center(40,'*'))
print(data2,'\n',type(data2))
# Select columns by name: a single column returns a Series, several columns return a DataFrame
data3 = df.loc['one']
data4 = df.loc[['one','two']]
print("4".center(40,'*'))
print(data3,'\n',type(data3))
print("5".center(40,'*'))
print(data4,'\n',type(data4))
# Select rows by index label: a single row returns a Series, several rows return a DataFrame
# Output
*******************1********************
a b c d
one 44.252039 9.367770 52.196322 54.905461
two 98.031746 81.958438 9.527486 82.967234
three 83.547295 23.754233 8.578580 51.565959
*******************2********************
one 44.252039
two 98.031746
three 83.547295
Name: a, dtype: float64
<class 'pandas.core.series.Series'>
*******************3********************
a c
one 44.252039 52.196322
two 98.031746 9.527486
three 83.547295 8.578580
<class 'pandas.core.frame.DataFrame'>
*******************4********************
a 44.252039
b 9.367770
c 52.196322
d 54.905461
Name: one, dtype: float64
<class 'pandas.core.series.Series'>
*******************5********************
a b c d
one 44.252039 9.367770 52.196322 54.905461
two 98.031746 81.958438 9.527486 82.967234
<class 'pandas.core.frame.DataFrame'>
# df[] - selecting columns
# Mainly used to select columns; it can also select rows, but only with a slice
df = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,
index = ['one','two','three'],
columns = ['a','b','c','d'])
print("1".center(40,'*'))
print(df)
data1 = df['a']
data2 = df[['b','c']] # try data2 = df[['b','c','e']]: a column name that does not exist raises KeyError
print("2".center(40,'*'))
print(data1)
print("3".center(40,'*'))
print(data2)
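# df[] selects columns by default; put column names inside [] (this is why columns are usually given explicit names instead of the default integer labels, which are easy to confuse with the index)
# Selecting one column returns a Series, and print shows it in Series format
# Selecting several columns returns a DataFrame, and print shows it in DataFrame format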
data3 = df[:1]
# data3 = df[0]
# data3 = df['one']
print("4".center(40,'*'))
print(data3,'\n',type(data3))
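# When df[] is given an integer slice it selects rows; only slices work, a single integer (df[0]) does not
# The result is a DataFrame even when only one row is selected
# df[] cannot select a row by a single index label (df['one'])
# Key takeaway: df[col] is generally used to select columns; put column names inside []
# Output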
*******************1********************
a b c d
one 35.520221 64.848157 5.812035 11.379704
two 4.063157 96.357304 89.484171 87.545658
three 98.608574 75.918864 31.055319 22.553646
*******************2********************
one 35.520221
two 4.063157
three 98.608574
Name: a, dtype: float64
*******************3********************
b c
one 64.848157 5.812035
two 96.357304 89.484171
three 75.918864 31.055319
*******************4********************
a b c d
one 35.520221 64.848157 5.812035 11.379704
<class 'pandas.core.frame.DataFrame'>
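The two commented-out lines above (df[0] and df['one']) fail because a single key inside df[] is looked up among the column labels. The following sketch shows this with a small frame; the integer values and the helper list `failed` are illustrative assumptions, not part of the original notes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=['one', 'two', 'three'],
                  columns=['a', 'b', 'c', 'd'])

# A single key inside df[...] is searched for among the COLUMN labels,
# so a row label or a bare integer raises KeyError for this frame.
failed = []
for key in ('one', 0):
    try:
        df[key]
    except KeyError:
        failed.append(key)
print('KeyError for:', failed)

# A slice is the exception: it is applied to the rows.
first_row = df[:1]  # still a DataFrame, even though it holds one row
print(type(first_row).__name__, first_row.shape)
```

For selecting a single row, df.loc[label] or df.iloc[position], covered in the next sections, are the explicit choices.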
# df.loc[] - selecting rows by index label
df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
index = ['one','two','three','four'],
columns = ['a','b','c','d'])
df2 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
columns = ['a','b','c','d'])
print("1".center(40,'*'))
print(df1)
print("2".center(40,'*'))
print(df2)
data1 = df1.loc['one']
data2 = df2.loc[1]
print("3".center(40,'*'))
print(data1)
print("4".center(40,'*'))
print(data2)
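print('Single-label indexing\n-----')
# A single label returns a Series
# Note: since pandas 1.0, df1.loc[['two','three','five']] raises KeyError because 'five' is not in the index.
# reindex() accepts missing labels and fills them with NaN, which reproduces the output shown below.
data3 = df1.reindex(['two','three','five'])
data4 = df2.loc[[3,2,1]]
print("5".center(40,'*'))
print(data3)
print("6".center(40,'*'))
print(data4)
print('Multi-label indexing\n-----')
# A list of labels returns a DataFrame; with .loc every label in the list must exist in the index
# The rows come back in the order the labels are listed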
data5 = df1.loc['one':'three']
data6 = df2.loc[1:3]
print("7".center(40,'*'))
print(data5)
print("8".center(40,'*'))
print(data6)
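print('Slice indexing')
# .loc also accepts a slice of labels
# The end label is included
# Key takeaway: df.loc[label] selects rows by index label, and works with both a custom index and the default integer index
# Output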
*******************1********************
a b c d
one 77.451739 30.571703 64.803881 40.301115
two 60.018758 82.042454 96.573686 15.759030
three 60.696044 74.762078 0.316767 34.972233
four 43.905231 72.191279 93.145908 63.543389
*******************2********************
a b c d
0 63.075883 43.567530 15.135401 82.973379
1 59.717961 75.257481 99.692578 67.124844
2 79.859617 53.077732 3.832747 38.450555
3 99.624674 9.681684 64.472354 28.139052
*******************3********************
a 77.451739
b 30.571703
c 64.803881
d 40.301115
Name: one, dtype: float64
*******************4********************
a 59.717961
b 75.257481
c 99.692578
d 67.124844
Name: 1, dtype: float64
Single-label indexing
-----
*******************5********************
a b c d
two 60.018758 82.042454 96.573686 15.759030
three 60.696044 74.762078 0.316767 34.972233
five NaN NaN NaN NaN
*******************6********************
a b c d
3 99.624674 9.681684 64.472354 28.139052
2 79.859617 53.077732 3.832747 38.450555
1 59.717961 75.257481 99.692578 67.124844
Multi-label indexing
-----
*******************7********************
a b c d
one 77.451739 30.571703 64.803881 40.301115
two 60.018758 82.042454 96.573686 15.759030
three 60.696044 74.762078 0.316767 34.972233
*******************8********************
a b c d
1 59.717961 75.257481 99.692578 67.124844
2 79.859617 53.077732 3.832747 38.450555
3 99.624674 9.681684 64.472354 28.139052
Slice indexing
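The NaN row for 'five' in output 5 depends on the pandas version. The sketch below, which assumes pandas 1.0 or later, contrasts .loc with reindex() for a missing label and also checks that a label slice includes its end point. The flag name `missing_raises` is only illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(16).reshape(4, 4),
                   index=['one', 'two', 'three', 'four'],
                   columns=['a', 'b', 'c', 'd'])

# On pandas >= 1.0, .loc rejects a list that contains a missing label
try:
    df1.loc[['two', 'three', 'five']]
    missing_raises = False
except KeyError:
    missing_raises = True
print('KeyError raised:', missing_raises)

# reindex() accepts missing labels and fills the new row with NaN
filled = df1.reindex(['two', 'three', 'five'])
print(filled)

# A label slice includes BOTH end points
print(df1.loc['one':'three'].index.tolist())
```

Note that reindex() converts the integer columns to float here, because NaN cannot be stored in an integer column.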
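# df.iloc[] - selecting rows by integer position (from 0 to length-1 along the axis)
# Works like list indexing: positions follow the row order of the DataFrame, counting from 0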
df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
index = ['one','two','three','four'],
columns = ['a','b','c','d'])
print("1".center(40,'*'))
print(df)
print("2".center(40,'*'))
print(df.iloc[0])
print("3".center(40,'*'))
print(df.iloc[-1])
#print(df.iloc[4])
print('Single-position indexing\n-----')
# Single-position indexing
# A position beyond the number of rows (such as df.iloc[4] here) raises IndexError
print("4".center(40,'*'))
print(df.iloc[[0,2]])
print("5".center(40,'*'))
print(df.iloc[[3,2,1]])
print('Multi-position indexing\n-----')
# Multi-position indexing
# The rows come back in the order the positions are listed
print("6".center(40,'*'))
print(df.iloc[1:3])
print("7".center(40,'*'))
print(df.iloc[::2])
print('Slice indexing')
# Slice indexing
# The end position is not included
# Output
*******************1********************
a b c d
one 50.862519 83.335659 37.581797 9.538039
two 50.433386 32.915710 49.080627 51.744583
three 50.805240 30.260431 17.442769 72.023780
four 72.270772 96.909730 13.421593 96.353553
*******************2********************
a 50.862519
b 83.335659
c 37.581797
d 9.538039
Name: one, dtype: float64
*******************3********************
a 72.270772
b 96.909730
c 13.421593
d 96.353553
Name: four, dtype: float64
Single-position indexing
-----
*******************4********************
a b c d
one 50.862519 83.335659 37.581797 9.538039
three 50.805240 30.260431 17.442769 72.023780
*******************5********************
a b c d
four 72.270772 96.909730 13.421593 96.353553
three 50.805240 30.260431 17.442769 72.023780
two 50.433386 32.915710 49.080627 51.744583
Multi-position indexing
-----
*******************6********************
a b c d
two 50.433386 32.915710 49.080627 51.744583
three 50.805240 30.260431 17.442769 72.023780
*******************7********************
a b c d
one 50.862519 83.335659 37.581797 9.538039
three 50.805240 30.260431 17.442769 72.023780
Slice indexing
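The commented-out df.iloc[4] and the "end not included" rule can be checked directly. This is a small sketch on an integer-valued frame; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=['one', 'two', 'three', 'four'],
                  columns=['a', 'b', 'c', 'd'])

# A 4-row frame has positions 0..3 (or -4..-1); position 4 is out of bounds
try:
    df.iloc[4]
    out_of_bounds = False
except IndexError:
    out_of_bounds = True
print('IndexError raised:', out_of_bounds)

# Slices behave like Python list slices: an end beyond the data is tolerated
tail = df.iloc[2:10]
print(tail.index.tolist())

# Same rows, two spellings: the positional slice excludes its end,
# the label slice includes it
by_position = df.iloc[1:3]
by_label = df.loc['two':'three']
print(by_position.equals(by_label))
```

The last comparison is the practical difference to remember between .iloc and .loc slices.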
# Boolean indexing
# Same principle as for a Series
df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
index = ['one','two','three','four'],
columns = ['a','b','c','d'])
print("1".center(40,'*'))
print(df)
print('------')
b1 = df < 20
print("2".center(40,'*'))
print(b1,type(b1))
print("3".center(40,'*'))
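print(df[b1]) # can also be written as df[df < 20]
# Comparing the whole DataFrame tests every value in it
# The result keeps the full shape: cells where the mask is True keep their value, cells where it is False become NaN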
b2 = df['a'] > 50
print("4".center(40,'*'))
print(b2,type(b2))
print("5".center(40,'*'))
print(df[b2]) # can also be written as df[df['a'] > 50]
# Condition on a single column
# The result keeps only the rows where the condition is True, with all of their columns
b3 = df[['a','b']] > 50
print("6".center(40,'*'))
print(b3,type(b3))
print("7".center(40,'*'))
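print(df[b3]) # can also be written as df[df[['a','b']] > 50]
# Condition on several columns
# The result keeps the full shape: True cells keep their value; False cells and columns missing from the mask become NaN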
b4 = df.loc[['one','three']] < 50
print("8".center(40,'*'))
print(b4,type(b4))
print("9".center(40,'*'))
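print(df[b4]) # can also be written as df[df.loc[['one','three']] < 50]
# Condition on several rows
# The result keeps the full shape: True cells keep their value; False cells and rows missing from the mask become NaN
# Output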
*******************1********************
a b c d
one 26.434806 34.117824 19.677857 46.504626
two 86.658355 81.060611 51.598853 44.611262
three 68.528156 98.356481 83.397588 82.896992
four 38.606510 41.794018 60.998152 78.224933
------
*******************2********************
a b c d
one False False True False
two False False False False
three False False False False
four False False False False <class 'pandas.core.frame.DataFrame'>
*******************3********************
a b c d
one NaN NaN 19.677857 NaN
two NaN NaN NaN NaN
three NaN NaN NaN NaN
four NaN NaN NaN NaN
*******************4********************
one False
two True
three True
four False
Name: a, dtype: bool <class 'pandas.core.series.Series'>
*******************5********************
a b c d
two 86.658355 81.060611 51.598853 44.611262
three 68.528156 98.356481 83.397588 82.896992
*******************6********************
a b
one False False
two True True
three True True
four False False <class 'pandas.core.frame.DataFrame'>
*******************7********************
a b c d
one NaN NaN NaN NaN
two 86.658355 81.060611 NaN NaN
three 68.528156 98.356481 NaN NaN
four NaN NaN NaN NaN
*******************8********************
a b c d
one True True True True
three False False False False <class 'pandas.core.frame.DataFrame'>
*******************9********************
a b c d
one 26.434806 34.117824 19.677857 46.504626
two NaN NaN NaN NaN
three NaN NaN NaN NaN
four NaN NaN NaN NaN
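The two behaviours above (a Series mask filters rows, a DataFrame mask blanks out cells) can be compared side by side. The sketch below uses a small fixed frame with made-up values so the result is easy to check by eye. It also shows combining two conditions with &, which is not in the original notes.

```python
import pandas as pd

# Small fixed frame (values are made up for illustration)
df = pd.DataFrame({'a': [26, 86, 68, 38],
                   'b': [34, 81, 98, 41]},
                  index=['one', 'two', 'three', 'four'])

# Series mask -> row filter.
# Combine conditions with & (and) or | (or), each wrapped in parentheses.
rows = df[(df['a'] > 50) & (df['b'] < 90)]
print(rows)

# DataFrame mask -> same shape as df, cells that fail the test become NaN.
# This gives the same result as DataFrame.where().
masked = df[df > 50]
print(masked)
print(masked.equals(df.where(df > 50)))
```

The parentheses matter: & binds more tightly than the comparison operators, so leaving them out changes the meaning of the expression.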
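# Chained indexing: for example, selecting rows and columns together
# Select columns first, then rows. For a dataset this means choosing the fields first, then the records to keep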
df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
index = ['one','two','three','four'],
columns = ['a','b','c','d'])
print("1".center(40,'*'))
print(df)
print('------')
print("2".center(40,'*'))
print(df['a'].loc[['one','three']]) # rows one and three of column a
print("3".center(40,'*'))
print(df[['b','c','d']].iloc[::2]) # rows one and three of columns b, c, d
print("4".center(40,'*'))
print(df[df['a'] < 50].iloc[:2]) # the first two rows (at most) among those where a < 50
# Output
*******************1********************
a b c d
one 90.650107 65.405366 78.994304 67.269502
two 68.413380 60.022026 40.080027 26.599064
three 69.483353 99.762443 17.153750 43.870798
four 46.855326 89.543055 65.151681 90.754392
------
*******************2********************
one 90.650107
three 69.483353
Name: a, dtype: float64
*******************3********************
b c d
one 65.405366 78.994304 67.269502
three 99.762443 17.153750 43.870798
*******************4********************
a b c d
four 46.855326 89.543055 65.151681 90.754392
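The chained selections above can also be written as a single .loc or .iloc call that takes a row selector and a column selector together. This is an alternative spelling, not the approach used in the notes. The sketch below checks on an integer-valued frame that both spellings return the same data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=['one', 'two', 'three', 'four'],
                  columns=['a', 'b', 'c', 'd'])

# Chained: columns first, then every second row
chained = df[['b', 'c', 'd']].iloc[::2]

# One step, by position: rows ::2, columns at positions 1..3
one_step_pos = df.iloc[::2, 1:4]

# One step, by label: explicit row labels and column labels
one_step_label = df.loc[['one', 'three'], ['b', 'c', 'd']]

print(chained.equals(one_step_pos), chained.equals(one_step_label))

# A boolean row mask and a column list can be combined the same way
subset = df.loc[df['a'] < 8, ['a', 'd']]
print(subset)
```

For reading data the two styles are interchangeable. For assigning values, the single-call form is the safer choice, because a chained selection may operate on a temporary copy.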