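pandas basics

Author: 且行歌 | Published: 2018-02-06 16:49

Let's get started!

pandas is mainly used for data analysis, or more precisely for the analysis of numerical data. Where Python goes beyond Excel and SPSS is in its ability to handle very large datasets.

pandas data structures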

import pandas as pd

Series

obj = pd.Series([4,7,-5,3])  # create a Series object
obj
0    4
1    7
2   -5
3    3
dtype: int64
obj.index  # the index
RangeIndex(start=0, stop=4, step=1)
obj.values  # the values
array([ 4,  7, -5,  3])
obj2 = pd.Series([4,7,-5,3], index = ['d','b','a','c'])  # explicit index
obj2
d    4
b    7
a   -5
c    3
dtype: int64
obj2.index  # show the index
Index(['d', 'b', 'a', 'c'], dtype='object')

Indexing

obj2['a']
-5
obj2['d'] = 6  # assign through the index
obj2  # the original Series is modified
d    6
b    7
a   -5
c    3
dtype: int64
obj2[['c','a','d']]  # select several labels with a list (double brackets)
c    3
a   -5
d    6
dtype: int64

Comparison and simple arithmetic

obj2[obj2 > 0]  # boolean filtering
d    6
b    7
c    3
dtype: int64
obj2 * 2  # arithmetic
d    12
b    14
a   -10
c     6
dtype: int64
import numpy as np
np.exp(obj2)  # applied element-wise
d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64
'b' in obj2  # membership test against the index labels

True

Converting from other data types

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}  # a dict
obj3 = pd.Series(sdata)  # convert to a Series
obj3
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
states = ['California', 'Ohio', 'Oregon', 'Texas']  # specify the index
obj4 = pd.Series(sdata, index = states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

Detecting missing data

pd.isnull(obj4)
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
pd.notnull(obj4)
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
obj4.isnull()  # equivalent method form
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

Arithmetic operations

obj3 + obj4
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

Naming

obj4.name = 'population'  # name of the Series
obj4.index.name = 'state'  # name of the index
obj4
state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

Replacing the index labels

obj.index
RangeIndex(start=0, stop=4, step=1)
obj.index =  ['Bob', 'Steve', 'Jeff', 'Ryan']
obj
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

DataFrame

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

frame = pd.DataFrame(data)  # create a DataFrame
frame

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>pop</th>
<th>state</th>
<th>year</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1.5</td>
<td>Ohio</td>
<td>2000</td>
</tr>
<tr>
<th>1</th>
<td>1.7</td>
<td>Ohio</td>
<td>2001</td>
</tr>
<tr>
<th>2</th>
<td>3.6</td>
<td>Ohio</td>
<td>2002</td>
</tr>
<tr>
<th>3</th>
<td>2.4</td>
<td>Nevada</td>
<td>2001</td>
</tr>
<tr>
<th>4</th>
<td>2.9</td>
<td>Nevada</td>
<td>2002</td>
</tr>
<tr>
<th>5</th>
<td>3.2</td>
<td>Nevada</td>
<td>2003</td>
</tr>
</tbody>
</table>
</div>

head: select the first five rows

frame.head()

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>pop</th>
<th>state</th>
<th>year</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1.5</td>
<td>Ohio</td>
<td>2000</td>
</tr>
<tr>
<th>1</th>
<td>1.7</td>
<td>Ohio</td>
<td>2001</td>
</tr>
<tr>
<th>2</th>
<td>3.6</td>
<td>Ohio</td>
<td>2002</td>
</tr>
<tr>
<th>3</th>
<td>2.4</td>
<td>Nevada</td>
<td>2001</td>
</tr>
<tr>
<th>4</th>
<td>2.9</td>
<td>Nevada</td>
<td>2002</td>
</tr>
</tbody>
</table>
</div>

Specifying the columns

pd.DataFrame(data,columns = ['year','state','pop'])

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>year</th>
<th>state</th>
<th>pop</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2000</td>
<td>Ohio</td>
<td>1.5</td>
</tr>
<tr>
<th>1</th>
<td>2001</td>
<td>Ohio</td>
<td>1.7</td>
</tr>
<tr>
<th>2</th>
<td>2002</td>
<td>Ohio</td>
<td>3.6</td>
</tr>
<tr>
<th>3</th>
<td>2001</td>
<td>Nevada</td>
<td>2.4</td>
</tr>
<tr>
<th>4</th>
<td>2002</td>
<td>Nevada</td>
<td>2.9</td>
</tr>
<tr>
<th>5</th>
<td>2003</td>
<td>Nevada</td>
<td>3.2</td>
</tr>
</tbody>
</table>
</div>

Specifying the row index

frame2 = pd.DataFrame(data,
                      index=['one', 'two', 'three', 'four',
                             'five', 'six'])
frame2

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>pop</th>
<th>state</th>
<th>year</th>
</tr>
</thead>
<tbody>
<tr>
<th>one</th>
<td>1.5</td>
<td>Ohio</td>
<td>2000</td>
</tr>
<tr>
<th>two</th>
<td>1.7</td>
<td>Ohio</td>
<td>2001</td>
</tr>
<tr>
<th>three</th>
<td>3.6</td>
<td>Ohio</td>
<td>2002</td>
</tr>
<tr>
<th>four</th>
<td>2.4</td>
<td>Nevada</td>
<td>2001</td>
</tr>
<tr>
<th>five</th>
<td>2.9</td>
<td>Nevada</td>
<td>2002</td>
</tr>
<tr>
<th>six</th>
<td>3.2</td>
<td>Nevada</td>
<td>2003</td>
</tr>
</tbody>
</table>
</div>

Caution: a requested column that does not exist in the data is filled with NaN

frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four',
                             'five', 'six'])

frame2

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>year</th>
<th>state</th>
<th>pop</th>
<th>debt</th>
</tr>
</thead>
<tbody>
<tr>
<th>one</th>
<td>2000</td>
<td>Ohio</td>
<td>1.5</td>
<td>NaN</td>
</tr>
<tr>
<th>two</th>
<td>2001</td>
<td>Ohio</td>
<td>1.7</td>
<td>NaN</td>
</tr>
<tr>
<th>three</th>
<td>2002</td>
<td>Ohio</td>
<td>3.6</td>
<td>NaN</td>
</tr>
<tr>
<th>four</th>
<td>2001</td>
<td>Nevada</td>
<td>2.4</td>
<td>NaN</td>
</tr>
<tr>
<th>five</th>
<td>2002</td>
<td>Nevada</td>
<td>2.9</td>
<td>NaN</td>
</tr>
<tr>
<th>six</th>
<td>2003</td>
<td>Nevada</td>
<td>3.2</td>
<td>NaN</td>
</tr>
</tbody>
</table>
</div>

frame2['debt'] = 16.5  # assign a scalar to the whole column
frame2

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>year</th>
<th>state</th>
<th>pop</th>
<th>debt</th>
</tr>
</thead>
<tbody>
<tr>
<th>one</th>
<td>2000</td>
<td>Ohio</td>
<td>1.5</td>
<td>16.5</td>
</tr>
<tr>
<th>two</th>
<td>2001</td>
<td>Ohio</td>
<td>1.7</td>
<td>16.5</td>
</tr>
<tr>
<th>three</th>
<td>2002</td>
<td>Ohio</td>
<td>3.6</td>
<td>16.5</td>
</tr>
<tr>
<th>four</th>
<td>2001</td>
<td>Nevada</td>
<td>2.4</td>
<td>16.5</td>
</tr>
<tr>
<th>five</th>
<td>2002</td>
<td>Nevada</td>
<td>2.9</td>
<td>16.5</td>
</tr>
<tr>
<th>six</th>
<td>2003</td>
<td>Nevada</td>
<td>3.2</td>
<td>16.5</td>
</tr>
</tbody>
</table>
</div>

frame2.debt = np.arange(6.)  # assign an array (its length must match the number of rows)
frame2

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>year</th>
<th>state</th>
<th>pop</th>
<th>debt</th>
</tr>
</thead>
<tbody>
<tr>
<th>one</th>
<td>2000</td>
<td>Ohio</td>
<td>1.5</td>
<td>0.0</td>
</tr>
<tr>
<th>two</th>
<td>2001</td>
<td>Ohio</td>
<td>1.7</td>
<td>1.0</td>
</tr>
<tr>
<th>three</th>
<td>2002</td>
<td>Ohio</td>
<td>3.6</td>
<td>2.0</td>
</tr>
<tr>
<th>four</th>
<td>2001</td>
<td>Nevada</td>
<td>2.4</td>
<td>3.0</td>
</tr>
<tr>
<th>five</th>
<td>2002</td>
<td>Nevada</td>
<td>2.9</td>
<td>4.0</td>
</tr>
<tr>
<th>six</th>
<td>2003</td>
<td>Nevada</td>
<td>3.2</td>
<td>5.0</td>
</tr>
</tbody>
</table>
</div>

frame2.columns  # column labels
Index(['year', 'state', 'pop', 'debt'], dtype='object')
frame2.index  # row labels
Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')

Selecting a column

frame2['state']
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object
frame.year  # attribute access, equivalent to frame['year']
0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64

Selecting a row

frame2.loc['three']
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

Assigning a Series (aligned on the index)

val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2.debt = val 
frame2

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>year</th>
<th>state</th>
<th>pop</th>
<th>debt</th>
</tr>
</thead>
<tbody>
<tr>
<th>one</th>
<td>2000</td>
<td>Ohio</td>
<td>1.5</td>
<td>NaN</td>
</tr>
<tr>
<th>two</th>
<td>2001</td>
<td>Ohio</td>
<td>1.7</td>
<td>-1.2</td>
</tr>
<tr>
<th>three</th>
<td>2002</td>
<td>Ohio</td>
<td>3.6</td>
<td>NaN</td>
</tr>
<tr>
<th>four</th>
<td>2001</td>
<td>Nevada</td>
<td>2.4</td>
<td>-1.5</td>
</tr>
<tr>
<th>five</th>
<td>2002</td>
<td>Nevada</td>
<td>2.9</td>
<td>-1.7</td>
</tr>
<tr>
<th>six</th>
<td>2003</td>
<td>Nevada</td>
<td>3.2</td>
<td>NaN</td>
</tr>
</tbody>
</table>
</div>
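
The assignment above matches values by index label, not by position: rows whose label is missing from the Series get NaN, and Series labels that are not in the DataFrame are ignored. A minimal sketch of that behavior, using small illustrative objects (`df` and `vals` are names made up for this example, not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Illustrative data, not from the original notebook
df = pd.DataFrame({"pop": [1.5, 1.7, 3.6]}, index=["one", "two", "three"])
vals = pd.Series([-1.2, -1.7], index=["two", "four"])  # "four" is not a row of df

df["debt"] = vals  # aligned by label, not by position

print(df)
```

Only row "two" receives a value here; "one" and "three" become NaN, and the entry labeled "four" is dropped because no such row exists.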

Deleting a column

frame2['eastern'] = frame2.state == 'Ohio'  # boolean column; a new column must be created with the [''] syntax
frame2

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>year</th>
<th>state</th>
<th>pop</th>
<th>debt</th>
<th>eastern</th>
</tr>
</thead>
<tbody>
<tr>
<th>one</th>
<td>2000</td>
<td>Ohio</td>
<td>1.5</td>
<td>NaN</td>
<td>True</td>
</tr>
<tr>
<th>two</th>
<td>2001</td>
<td>Ohio</td>
<td>1.7</td>
<td>-1.2</td>
<td>True</td>
</tr>
<tr>
<th>three</th>
<td>2002</td>
<td>Ohio</td>
<td>3.6</td>
<td>NaN</td>
<td>True</td>
</tr>
<tr>
<th>four</th>
<td>2001</td>
<td>Nevada</td>
<td>2.4</td>
<td>-1.5</td>
<td>False</td>
</tr>
<tr>
<th>five</th>
<td>2002</td>
<td>Nevada</td>
<td>2.9</td>
<td>-1.7</td>
<td>False</td>
</tr>
<tr>
<th>six</th>
<td>2003</td>
<td>Nevada</td>
<td>3.2</td>
<td>NaN</td>
<td>False</td>
</tr>
</tbody>
</table>
</div>

del frame2['eastern']
frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')

T: transposing rows and columns

pop = {'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)
frame3

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Nevada</th>
<th>Ohio</th>
</tr>
</thead>
<tbody>
<tr>
<th>2000</th>
<td>NaN</td>
<td>1.5</td>
</tr>
<tr>
<th>2001</th>
<td>2.4</td>
<td>1.7</td>
</tr>
<tr>
<th>2002</th>
<td>2.9</td>
<td>3.6</td>
</tr>
</tbody>
</table>
</div>

frame3.T

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>2000</th>
<th>2001</th>
<th>2002</th>
</tr>
</thead>
<tbody>
<tr>
<th>Nevada</th>
<td>NaN</td>
<td>2.4</td>
<td>2.9</td>
</tr>
<tr>
<th>Ohio</th>
<td>1.5</td>
<td>1.7</td>
<td>3.6</td>
</tr>
</tbody>
</table>
</div>

Rows that do not exist in the data are filled with NaN

pd.DataFrame(pop,index = [2001,2002,2003])

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Nevada</th>
<th>Ohio</th>
</tr>
</thead>
<tbody>
<tr>
<th>2001</th>
<td>2.4</td>
<td>1.7</td>
</tr>
<tr>
<th>2002</th>
<td>2.9</td>
<td>3.6</td>
</tr>
<tr>
<th>2003</th>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
</div>

Nested data (a dict of Series)

pdata = {'Ohio': frame3['Ohio'][:-1], 'Nevada': frame3['Nevada'][:2]}

pd.DataFrame(pdata)

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Nevada</th>
<th>Ohio</th>
</tr>
</thead>
<tbody>
<tr>
<th>2000</th>
<td>NaN</td>
<td>1.5</td>
</tr>
<tr>
<th>2001</th>
<td>2.4</td>
<td>1.7</td>
</tr>
</tbody>
</table>
</div>

Index and column names

frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>state</th>
<th>Nevada</th>
<th>Ohio</th>
</tr>
<tr>
<th>year</th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>2000</th>
<td>NaN</td>
<td>1.5</td>
</tr>
<tr>
<th>2001</th>
<td>2.4</td>
<td>1.7</td>
</tr>
<tr>
<th>2002</th>
<td>2.9</td>
<td>3.6</td>
</tr>
</tbody>
</table>
</div>

values is a two-dimensional ndarray

frame3.values
array([[nan, 1.5],
       [2.4, 1.7],
       [2.9, 3.6]])
frame2.values  # the dtype is chosen to accommodate all columns
array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7],
       [2003, 'Nevada', 3.2, nan]], dtype=object)

Index objects

obj = pd.Series(range(3),index = ['a','b','c'])
index = obj.index
index
Index(['a', 'b', 'c'], dtype='object')
index[1:]
Index(['b', 'c'], dtype='object')
index[1] = 'd'  # raises: an Index is immutable
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-83-0fb5613748dc> in <module>()
----> 1 index[1] = 'd'  # raises: an Index is immutable

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexes/base.py in __setitem__(self, key, value)
   1722 
   1723     def __setitem__(self, key, value):
-> 1724         raise TypeError("Index does not support mutable operations")
   1725 
   1726     def __getitem__(self, key):

TypeError: Index does not support mutable operations
labels = pd.Index(np.arange(3))
labels  # build an Index object
Int64Index([0, 1, 2], dtype='int64')
obj2  = pd.Series([1.5,-2.5,0],index = labels)  # use it as the index
obj2
0    1.5
1   -2.5
2    0.0
dtype: float64
obj2.index is labels  # identity check
True

Column names

frame3.columns
Index(['Nevada', 'Ohio'], dtype='object', name='state')
'Ohio' in frame3.columns
True

An Index can contain duplicate labels

dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
dup_labels
Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

Other methods

Method         Description
append         Concatenate with additional Index objects, producing a new Index
difference     Compute set difference as an Index
intersection   Compute set intersection
union          Compute set union
isin           Compute boolean array indicating whether each value is contained in the passed collection
delete         Compute new Index with element at index i deleted
drop           Compute new Index by deleting passed values
insert         Compute new Index by inserting element at index i
is_monotonic   Returns True if each element is greater than or equal to the previous element
is_unique      Returns True if the Index has no duplicate values
unique         Compute the array of unique values in the Index
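
A few of the set-like methods listed above can be tried on two small Index objects. This is a minimal sketch; the names `left` and `right` are made up for illustration:

```python
import pandas as pd

# Two illustrative Index objects (hypothetical names)
left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])

both = left.intersection(right)     # labels present in both
either = left.union(right)          # labels present in at least one
only_left = left.difference(right)  # labels present only in `left`
mask = left.isin(["a", "c"])        # boolean NumPy array, one entry per label

print(both, either, only_left, mask, sep="\n")
```

Each call returns a new object, and `left` and `right` themselves are not modified, which is consistent with an Index being immutable.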

Essential functionality

Reindexing

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
obj2 = obj.reindex(['a','b','c','d','e'])
obj2
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

Filling values while reindexing

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3
0      blue
2    purple
4    yellow
dtype: object
obj3.reindex(range(6),method = 'ffill')  # forward fill
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
import numpy as np
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),index=['a', 'c', 'd'],columns=['Ohio', 'Texas', 'California'])
frame

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Ohio</th>
<th>Texas</th>
<th>California</th>
</tr>
</thead>
<tbody>
<tr>
<th>a</th>
<td>0</td>
<td>1</td>
<td>2</td>
</tr>
<tr>
<th>c</th>
<td>3</td>
<td>4</td>
<td>5</td>
</tr>
<tr>
<th>d</th>
<td>6</td>
<td>7</td>
<td>8</td>
</tr>
</tbody>
</table>
</div>

frame2 = frame.reindex(['a','b','c','d'])
frame2

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Ohio</th>
<th>Texas</th>
<th>California</th>
</tr>
</thead>
<tbody>
<tr>
<th>a</th>
<td>0.0</td>
<td>1.0</td>
<td>2.0</td>
</tr>
<tr>
<th>b</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>c</th>
<td>3.0</td>
<td>4.0</td>
<td>5.0</td>
</tr>
<tr>
<th>d</th>
<td>6.0</td>
<td>7.0</td>
<td>8.0</td>
</tr>
</tbody>
</table>
</div>

Reindexing DataFrame columns

states = ['Texas', 'Utah', 'California']
frame.reindex(columns = states)

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Texas</th>
<th>Utah</th>
<th>California</th>
</tr>
</thead>
<tbody>
<tr>
<th>a</th>
<td>1</td>
<td>NaN</td>
<td>2</td>
</tr>
<tr>
<th>c</th>
<td>4</td>
<td>NaN</td>
<td>5</td>
</tr>
<tr>
<th>d</th>
<td>7</td>
<td>NaN</td>
<td>8</td>
</tr>
</tbody>
</table>
</div>

frame.loc[['a','b','c','d'],states]  # row labels + column labels

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning: 
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  """Entry point for launching an IPython kernel.

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Texas</th>
<th>Utah</th>
<th>California</th>
</tr>
</thead>
<tbody>
<tr>
<th>a</th>
<td>1.0</td>
<td>NaN</td>
<td>2.0</td>
</tr>
<tr>
<th>b</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>c</th>
<td>4.0</td>
<td>NaN</td>
<td>5.0</td>
</tr>
<tr>
<th>d</th>
<td>7.0</td>
<td>NaN</td>
<td>8.0</td>
</tr>
</tbody>
</table>
</div>
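
The FutureWarning shown above is worth taking seriously: in later pandas releases, passing a list with a missing label to .loc raises KeyError instead of inserting NaN rows. A minimal sketch of the alternative the warning itself recommends, reindex, reusing the frame from this section (the variable names `result` and `loc_raised` are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])
states = ["Texas", "Utah", "California"]

# reindex accepts labels that do not exist and fills them with NaN
result = frame.reindex(index=["a", "b", "c", "d"], columns=states)
print(result)

# .loc with a missing label: a warning in old versions, KeyError in newer ones
try:
    frame.loc[["a", "b", "c", "d"], states]
    loc_raised = False
except KeyError:
    loc_raised = True
print("loc raised KeyError:", loc_raised)
```

Whether the try block raises depends on the installed pandas version, so the sketch only reports it rather than relying on it.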

Argument     Description
index        New sequence to use as index. Can be Index instance or any other sequence-like Python data structure. An Index will be used exactly as is without any copying.
method       Interpolation (fill) method; 'ffill' fills forward, while 'bfill' fills backward.
fill_value   Substitute value to use when introducing missing data by reindexing.
limit        When forward- or backfilling, maximum size gap (in number of elements) to fill.
tolerance    When forward- or backfilling, maximum size gap (in absolute numeric distance) to fill for inexact matches.
level        Match simple Index on level of MultiIndex; otherwise select subset of.
copy         If True, always copy underlying data even if new index is equivalent to old index; if False, do not copy the data when the indexes are equivalent.
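
To make the `method`, `fill_value` and `limit` arguments above more concrete, here is a minimal sketch. The color Series mirrors the one used earlier; the numeric Series `sparse` and the other variable names are made up for illustration:

```python
import pandas as pd

obj = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

# fill_value: use a constant for labels introduced by reindexing
filled_const = obj.reindex(range(6), fill_value="missing")

# method='bfill': take the next valid value instead of the previous one
backfilled = obj.reindex(range(6), method="bfill")

# limit: fill at most one consecutive new label after each existing one
sparse = pd.Series([1.0, 2.0], index=[0, 4])  # hypothetical example data
limited = sparse.reindex(range(6), method="ffill", limit=1)

print(filled_const, backfilled, limited, sep="\n\n")
```

With backfilling, label 5 stays NaN because nothing follows it, and with `limit=1` labels 2 and 3 stay NaN because they are more than one step past the last known value.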

Dropping entries

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
new_obj = obj.drop('c')  # drop the entry labeled 'c'
new_obj
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
data = pd.DataFrame(np.arange(16).reshape((4, 4)), index=['Ohio', 'Colorado', 'Utah', 'New York'], columns=['one', 'two', 'three', 'four'])
data

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>one</th>
<th>two</th>
<th>three</th>
<th>four</th>
</tr>
</thead>
<tbody>
<tr>
<th>Ohio</th>
<td>0</td>
<td>1</td>
<td>2</td>
<td>3</td>
</tr>
<tr>
<th>Colorado</th>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
</tr>
<tr>
<th>Utah</th>
<td>8</td>
<td>9</td>
<td>10</td>
<td>11</td>
</tr>
<tr>
<th>New York</th>
<td>12</td>
<td>13</td>
<td>14</td>
<td>15</td>
</tr>
</tbody>
</table>
</div>

data.drop(['Colorado','Ohio'])  # rows are dropped by default

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>one</th>
<th>two</th>
<th>three</th>
<th>four</th>
</tr>
</thead>
<tbody>
<tr>
<th>Utah</th>
<td>8</td>
<td>9</td>
<td>10</td>
<td>11</td>
</tr>
<tr>
<th>New York</th>
<td>12</td>
<td>13</td>
<td>14</td>
<td>15</td>
</tr>
</tbody>
</table>
</div>

data.drop('two',axis = 1)  # pass axis=1 explicitly to drop a column

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>one</th>
<th>three</th>
<th>four</th>
</tr>
</thead>
<tbody>
<tr>
<th>Ohio</th>
<td>0</td>
<td>2</td>
<td>3</td>
</tr>
<tr>
<th>Colorado</th>
<td>4</td>
<td>6</td>
<td>7</td>
</tr>
<tr>
<th>Utah</th>
<td>8</td>
<td>10</td>
<td>11</td>
</tr>
<tr>
<th>New York</th>
<td>12</td>
<td>14</td>
<td>15</td>
</tr>
</tbody>
</table>
</div>

data.drop(['two','four'],axis = 'columns')  # equivalent form

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>one</th>
<th>three</th>
</tr>
</thead>
<tbody>
<tr>
<th>Ohio</th>
<td>0</td>
<td>2</td>
</tr>
<tr>
<th>Colorado</th>
<td>4</td>
<td>6</td>
</tr>
<tr>
<th>Utah</th>
<td>8</td>
<td>10</td>
</tr>
<tr>
<th>New York</th>
<td>12</td>
<td>14</td>
</tr>
</tbody>
</table>
</div>

Modifying the original object in place

obj.drop('c',inplace = True)
obj
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
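
The difference between the two forms of drop can be checked directly. A minimal sketch, assuming the same Series as above (the flag names `still_there` and `gone_now` are illustrative):

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=["a", "b", "c", "d", "e"])

new_obj = obj.drop("c")     # returns a new Series
still_there = "c" in obj    # the original is untouched

result = obj.drop("c", inplace=True)  # modifies obj itself and returns None
gone_now = "c" not in obj

print(still_there, gone_now, result)
```

Because the in-place form returns None, writing `obj = obj.drop('c', inplace=True)` would discard the data; it is usually safer to pick one style and keep to it.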

Indexing, selection, and filtering

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj
a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64
obj['b']  # by label
1.0
obj[1]  # by position
1.0
obj[['b','a','d']]  # several labels
b    1.0
a    0.0
d    3.0
dtype: float64
obj[2:4]  # positional slice
c    2.0
d    3.0
dtype: float64
obj[[1,3]]  # several positions

b    1.0
d    3.0
dtype: float64
obj[obj < 2]  # boolean filtering
a    0.0
b    1.0
dtype: float64
obj['b':'c']  # label slice (the end label is included)
b    1.0
c    2.0
dtype: float64
obj['b':'c'] = 5  # assign to the slice
obj
a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

DataFrame

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
data

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>one</th>
<th>two</th>
<th>three</th>
<th>four</th>
</tr>
</thead>
<tbody>
<tr>
<th>Ohio</th>
<td>0</td>
<td>1</td>
<td>2</td>
<td>3</td>
</tr>
<tr>
<th>Colorado</th>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
</tr>
<tr>
<th>Utah</th>
<td>8</td>
<td>9</td>
<td>10</td>
<td>11</td>
</tr>
<tr>
<th>New York</th>
<td>12</td>
<td>13</td>
<td>14</td>
<td>15</td>
</tr>
</tbody>
</table>
</div>

data['two']  # select a column
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64
data[['three','one']]  # several columns

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>three</th>
<th>one</th>
</tr>
</thead>
<tbody>
<tr>
<th>Ohio</th>
<td>2</td>
<td>0</td>
</tr>
<tr>
<th>Colorado</th>
<td>6</td>
<td>4</td>
</tr>
<tr>
<th>Utah</th>
<td>10</td>
<td>8</td>
</tr>
<tr>
<th>New York</th>
<td>14</td>
<td>12</td>
</tr>
</tbody>
</table>
</div>

data[:2]  # slice rows

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>one</th>
<th>two</th>
<th>three</th>
<th>four</th>
</tr>
</thead>
<tbody>
<tr>
<th>Ohio</th>
<td>0</td>
<td>1</td>
<td>2</td>
<td>3</td>
</tr>
<tr>
<th>Colorado</th>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
</tr>
</tbody>
</table>
</div>

data[data['three'] > 5]  # filter rows by a condition

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>one</th>
<th>two</th>
<th>three</th>
<th>four</th>
</tr>
</thead>
<tbody>
<tr>
<th>Colorado</th>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
</tr>
<tr>
<th>Utah</th>
<td>8</td>
<td>9</td>
<td>10</td>
<td>11</td>
</tr>
<tr>
<th>New York</th>
<td>12</td>
<td>13</td>
<td>14</td>
<td>15</td>
</tr>
</tbody>
</table>
</div>

data < 5

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>one</th>
<th>two</th>
<th>three</th>
<th>four</th>
</tr>
</thead>
<tbody>
<tr>
<th>Ohio</th>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
</tr>
<tr>
<th>Colorado</th>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
</tr>
<tr>
<th>Utah</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
</tr>
<tr>
<th>New York</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
</tr>
</tbody>
</table>
</div>

data[data < 5] = 0  # assign wherever the condition holds
data

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>one</th>
<th>two</th>
<th>three</th>
<th>four</th>
</tr>
</thead>
<tbody>
<tr>
<th>Ohio</th>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<th>Colorado</th>
<td>0</td>
<td>5</td>
<td>6</td>
<td>7</td>
</tr>
<tr>
<th>Utah</th>
<td>8</td>
<td>9</td>
<td>10</td>
<td>11</td>
</tr>
<tr>
<th>New York</th>
<td>12</td>
<td>13</td>
<td>14</td>
<td>15</td>
</tr>
</tbody>
</table>
</div>

loc and iloc

data.loc['Colorado', ['two', 'three']]  # select a row and columns by label
two      5
three    6
Name: Colorado, dtype: int64
data.iloc[2, [3, 0, 1]]  # select a row and columns by position
four    11
one      8
two      9
Name: Utah, dtype: int64
data.iloc[2]  # the row at position 2 (the third row)
one       8
two       9
three    10
four     11
Name: Utah, dtype: int64
data.iloc[[1,2],[3,0,1]]  # several rows and several columns

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>four</th>
<th>one</th>
<th>two</th>
</tr>
</thead>
<tbody>
<tr>
<th>Colorado</th>
<td>7</td>
<td>0</td>
<td>5</td>
</tr>
<tr>
<th>Utah</th>
<td>11</td>
<td>8</td>
<td>9</td>
</tr>
</tbody>
</table>
</div>

data.loc[:'Utah','two']  # loc takes labels, iloc takes integer positions
Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64
data.iloc[:,:3][data.three > 5]  # ':' selects everything; a row filter is then applied

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>one</th>
<th>two</th>
<th>three</th>
</tr>
</thead>
<tbody>
<tr>
<th>Colorado</th>
<td>0</td>
<td>5</td>
<td>6</td>
</tr>
<tr>
<th>Utah</th>
<td>8</td>
<td>9</td>
<td>10</td>
</tr>
<tr>
<th>New York</th>
<td>12</td>
<td>13</td>
<td>14</td>
</tr>
</tbody>
</table>
</div>

Type                           Notes
df[val]                        Select single column or sequence of columns from the DataFrame; special case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion)
df.loc[val]                    Selects single row or subset of rows from the DataFrame by label
df.loc[:, val]                 Selects single column or subset of columns by label
df.loc[val1, val2]             Select both rows and columns by label
df.iloc[where]                 Selects single row or subset of rows from the DataFrame by integer position
df.iloc[:, where]              Selects single column or subset of columns by integer position
df.iloc[where_i, where_j]      Select both rows and columns by integer position
df.at[label_i, label_j]        Select a single scalar value by row and column label
df.iat[i, j]                   Select a single scalar value by row and column position (integers)
reindex method                 Select either rows or columns by labels
get_value, set_value methods   Select single value by row and column label
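
The scalar accessors in the table, at and iat, are not demonstrated in the examples above. A minimal sketch using the same 4x4 frame in its original, unmodified form (the variable names `by_label` and `by_position` are illustrative):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

by_label = data.at["Utah", "three"]  # scalar lookup by row and column label
by_position = data.iat[2, 2]         # the same cell, by integer position

data.at["Utah", "three"] = 100       # scalar assignment by label

print(by_label, by_position, data.loc["Utah", "three"])
```

As far as I know, get_value and set_value were deprecated and later removed in newer pandas releases, so at and iat are the safer choice for single-cell access.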

Integer indexes

ser = pd.Series(np.arange(3.))
ser[-1]  # fails: with an integer index, -1 is looked up as a label
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-62-179f8d80e478> in <module>()
      1 ser = pd.Series(np.arange(3.))
----> 2 ser[-1]  # fails: with an integer index, -1 is looked up as a label

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/series.py in __getitem__(self, key)
    621         key = com._apply_if_callable(key, self)
    622         try:
--> 623             result = self.index.get_value(self, key)
    624 
    625             if not is_scalar(result):

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
   2558         try:
   2559             return self._engine.get_value(s, k,
-> 2560                                           tz=getattr(series.dtype, 'tz', None))
   2561         except KeyError as e1:
   2562             if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: -1
ser
0    0.0
1    1.0
2    2.0
dtype: float64
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
ser2[-1]  # works with a non-integer index: the integer falls back to a position (ser2.iloc[-1] is the explicit form)
2.0
ser[:1]
0    0.0
dtype: float64
ser.loc[:1]
0    0.0
1    1.0
dtype: float64
ser.iloc[:1]  # note the difference: [] and iloc slice by position (end excluded), loc slices by label (end included)
0    0.0
dtype: float64
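
The three slices above can be checked side by side. This is a minimal sketch using the same `ser`; the names `by_label`, `by_position` and `raised` are made up for illustration.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))  # default integer index 0, 1, 2

# .iloc is always positional, so negative positions work.
last = ser.iloc[-1]

# .loc slices by label and includes the end label.
by_label = ser.loc[:1]

# .iloc slices by position and excludes the end, like a Python list.
by_position = ser.iloc[:1]

# Plain [] with an integer key on an integer index is a label lookup.
try:
    ser[-1]
    raised = False
except KeyError:
    raised = True

print(last, len(by_label), len(by_position), raised)
```

When the intent is positional, writing `.iloc` explicitly avoids depending on what kind of index the Series happens to have.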

Arithmetic and data alignment

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s1
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64
s2
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64
s1 + s2
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64
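
The result above follows from label alignment: the output index is the union of both indexes, and a label present on only one side has nothing to add to. A short sketch with the same two Series (the names `only_one_side` and `shared` are illustrative):

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

total = s1 + s2

# The result index is the union of both indexes.
print(total.index.tolist())

# Labels found on only one side come out as NaN.
only_one_side = total[total.isna()].index.tolist()
print(only_one_side)

# Restrict to the shared labels when NaN rows are not wanted.
shared = total.dropna()
print(shared)
```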
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'), index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>b</th>
<th>c</th>
<th>d</th>
</tr>
</thead>
<tbody>
<tr>
<th>Ohio</th>
<td>0.0</td>
<td>1.0</td>
<td>2.0</td>
</tr>
<tr>
<th>Texas</th>
<td>3.0</td>
<td>4.0</td>
<td>5.0</td>
</tr>
<tr>
<th>Colorado</th>
<td>6.0</td>
<td>7.0</td>
<td>8.0</td>
</tr>
</tbody>
</table>
</div>

df2

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>b</th>
<th>d</th>
<th>e</th>
</tr>
</thead>
<tbody>
<tr>
<th>Utah</th>
<td>0.0</td>
<td>1.0</td>
<td>2.0</td>
</tr>
<tr>
<th>Ohio</th>
<td>3.0</td>
<td>4.0</td>
<td>5.0</td>
</tr>
<tr>
<th>Texas</th>
<td>6.0</td>
<td>7.0</td>
<td>8.0</td>
</tr>
<tr>
<th>Oregon</th>
<td>9.0</td>
<td>10.0</td>
<td>11.0</td>
</tr>
</tbody>
</table>
</div>

df1 + df2  # row or column labels not present in both frames yield NaN

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>b</th>
<th>c</th>
<th>d</th>
<th>e</th>
</tr>
</thead>
<tbody>
<tr>
<th>Colorado</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>Ohio</th>
<td>3.0</td>
<td>NaN</td>
<td>6.0</td>
<td>NaN</td>
</tr>
<tr>
<th>Oregon</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>Texas</th>
<td>9.0</td>
<td>NaN</td>
<td>12.0</td>
<td>NaN</td>
</tr>
<tr>
<th>Utah</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
</div>

Arithmetic with fill values

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

df1

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>a</th>
<th>b</th>
<th>c</th>
<th>d</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0.0</td>
<td>1.0</td>
<td>2.0</td>
<td>3.0</td>
</tr>
<tr>
<th>1</th>
<td>4.0</td>
<td>5.0</td>
<td>6.0</td>
<td>7.0</td>
</tr>
<tr>
<th>2</th>
<td>8.0</td>
<td>9.0</td>
<td>10.0</td>
<td>11.0</td>
</tr>
</tbody>
</table>
</div>

df2

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>a</th>
<th>b</th>
<th>c</th>
<th>d</th>
<th>e</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0.0</td>
<td>1.0</td>
<td>2.0</td>
<td>3.0</td>
<td>4.0</td>
</tr>
<tr>
<th>1</th>
<td>5.0</td>
<td>6.0</td>
<td>7.0</td>
<td>8.0</td>
<td>9.0</td>
</tr>
<tr>
<th>2</th>
<td>10.0</td>
<td>11.0</td>
<td>12.0</td>
<td>13.0</td>
<td>14.0</td>
</tr>
<tr>
<th>3</th>
<td>15.0</td>
<td>16.0</td>
<td>17.0</td>
<td>18.0</td>
<td>19.0</td>
</tr>
</tbody>
</table>
</div>

df1 + df2

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>a</th>
<th>b</th>
<th>c</th>
<th>d</th>
<th>e</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0.0</td>
<td>2.0</td>
<td>4.0</td>
<td>6.0</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>9.0</td>
<td>11.0</td>
<td>13.0</td>
<td>15.0</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>18.0</td>
<td>20.0</td>
<td>22.0</td>
<td>24.0</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
</div>

df1.add(df2, fill_value=0)  # a value missing on one side is treated as 0 in the operation

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>a</th>
<th>b</th>
<th>c</th>
<th>d</th>
<th>e</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0.0</td>
<td>2.0</td>
<td>4.0</td>
<td>6.0</td>
<td>4.0</td>
</tr>
<tr>
<th>1</th>
<td>9.0</td>
<td>11.0</td>
<td>13.0</td>
<td>15.0</td>
<td>9.0</td>
</tr>
<tr>
<th>2</th>
<td>18.0</td>
<td>20.0</td>
<td>22.0</td>
<td>24.0</td>
<td>14.0</td>
</tr>
<tr>
<th>3</th>
<td>15.0</td>
<td>16.0</td>
<td>17.0</td>
<td>18.0</td>
<td>19.0</td>
</tr>
</tbody>
</table>
</div>
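
One detail of `fill_value` is easy to miss: it only helps when a cell is missing on one side. A cell that exists in neither frame stays NaN. The sketch below reuses the state-labelled frames from the alignment example, renamed `left` and `right` here (hypothetical names) to avoid clashing with the current `df1` and `df2`.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                    index=['Ohio', 'Texas', 'Colorado'])
right = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

filled = left.add(right, fill_value=0)

# Present on both sides: ordinary addition.
print(filled.loc['Ohio', 'b'])

# Present only in `left`: the missing right-hand value counts as 0.
print(filled.loc['Colorado', 'b'])

# Present in neither frame: still NaN.
print(filled.loc['Colorado', 'e'])
```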

1 / df1  # applied element-wise

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>a</th>
<th>b</th>
<th>c</th>
<th>d</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>inf</td>
<td>1.000000</td>
<td>0.500000</td>
<td>0.333333</td>
</tr>
<tr>
<th>1</th>
<td>0.250000</td>
<td>0.200000</td>
<td>0.166667</td>
<td>0.142857</td>
</tr>
<tr>
<th>2</th>
<td>0.125000</td>
<td>0.111111</td>
<td>0.100000</td>
<td>0.090909</td>
</tr>
</tbody>
</table>
</div>

df1.rdiv(1)  # equivalent form

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>a</th>
<th>b</th>
<th>c</th>
<th>d</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>inf</td>
<td>1.000000</td>
<td>0.500000</td>
<td>0.333333</td>
</tr>
<tr>
<th>1</th>
<td>0.250000</td>
<td>0.200000</td>
<td>0.166667</td>
<td>0.142857</td>
</tr>
<tr>
<th>2</th>
<td>0.125000</td>
<td>0.111111</td>
<td>0.100000</td>
<td>0.090909</td>
</tr>
</tbody>
</table>
</div>

df1.reindex(columns=df2.columns, fill_value=0)  # reindexing can also supply a fill value for the new column

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>a</th>
<th>b</th>
<th>c</th>
<th>d</th>
<th>e</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0.0</td>
<td>1.0</td>
<td>2.0</td>
<td>3.0</td>
<td>0</td>
</tr>
<tr>
<th>1</th>
<td>4.0</td>
<td>5.0</td>
<td>6.0</td>
<td>7.0</td>
<td>0</td>
</tr>
<tr>
<th>2</th>
<td>8.0</td>
<td>9.0</td>
<td>10.0</td>
<td>11.0</td>
<td>0</td>
</tr>
</tbody>
</table>
</div>

Arithmetic methods (the r-prefixed version swaps the operands):
add, radd (+)
sub, rsub (-)
div, rdiv (/)
floordiv, rfloordiv (//)
mul, rmul (*)
pow, rpow (**)
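
As a quick check of the list above, each method can be compared with its operator form. A minimal sketch on a small made-up frame; `checks` is just an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1., 7.).reshape((2, 3)), columns=list('abc'))

# Each method matches its operator; the r-prefixed one swaps the operands.
checks = {
    'rsub': df.rsub(1).equals(1 - df),
    'rdiv': df.rdiv(1).equals(1 / df),
    'floordiv': df.floordiv(2).equals(df // 2),
    'mul': df.mul(3).equals(df * 3),
    'pow': df.pow(2).equals(df ** 2),
}
print(checks)
```

The method forms matter mostly because they accept extra arguments such as `fill_value` and `axis`, which the bare operators cannot take.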

Operations between DataFrame and Series

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
 columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]
frame

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>b</th>
<th>d</th>
<th>e</th>
</tr>
</thead>
<tbody>
<tr>
<th>Utah</th>
<td>0.0</td>
<td>1.0</td>
<td>2.0</td>
</tr>
<tr>
<th>Ohio</th>
<td>3.0</td>
<td>4.0</td>
<td>5.0</td>
</tr>
<tr>
<th>Texas</th>
<td>6.0</td>
<td>7.0</td>
<td>8.0</td>
</tr>
<tr>
<th>Oregon</th>
<td>9.0</td>
<td>10.0</td>
<td>11.0</td>
</tr>
</tbody>
</table>
</div>

series
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64
frame - series  # the Series is matched on column labels and subtracted from every row

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>b</th>
<th>d</th>
<th>e</th>
</tr>
</thead>
<tbody>
<tr>
<th>Utah</th>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>Ohio</th>
<td>3.0</td>
<td>3.0</td>
<td>3.0</td>
</tr>
<tr>
<th>Texas</th>
<td>6.0</td>
<td>6.0</td>
<td>6.0</td>
</tr>
<tr>
<th>Oregon</th>
<td>9.0</td>
<td>9.0</td>
<td>9.0</td>
</tr>
</tbody>
</table>
</div>

series2 = pd.Series(range(3),index = ['b','e','f'])
frame + series2  # labels missing from either side produce NaN columns

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>b</th>
<th>d</th>
<th>e</th>
<th>f</th>
</tr>
</thead>
<tbody>
<tr>
<th>Utah</th>
<td>0.0</td>
<td>NaN</td>
<td>3.0</td>
<td>NaN</td>
</tr>
<tr>
<th>Ohio</th>
<td>3.0</td>
<td>NaN</td>
<td>6.0</td>
<td>NaN</td>
</tr>
<tr>
<th>Texas</th>
<td>6.0</td>
<td>NaN</td>
<td>9.0</td>
<td>NaN</td>
</tr>
<tr>
<th>Oregon</th>
<td>9.0</td>
<td>NaN</td>
<td>12.0</td>
<td>NaN</td>
</tr>
</tbody>
</table>
</div>

Specifying the axis to match on
series3 = frame['d']
frame

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>b</th>
<th>d</th>
<th>e</th>
</tr>
</thead>
<tbody>
<tr>
<th>Utah</th>
<td>0.0</td>
<td>1.0</td>
<td>2.0</td>
</tr>
<tr>
<th>Ohio</th>
<td>3.0</td>
<td>4.0</td>
<td>5.0</td>
</tr>
<tr>
<th>Texas</th>
<td>6.0</td>
<td>7.0</td>
<td>8.0</td>
</tr>
<tr>
<th>Oregon</th>
<td>9.0</td>
<td>10.0</td>
<td>11.0</td>
</tr>
</tbody>
</table>
</div>

series3
Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64
frame.sub(series3, axis=0)  # axis=0: match on the row index and broadcast across the columns

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>b</th>
<th>d</th>
<th>e</th>
</tr>
</thead>
<tbody>
<tr>
<th>Utah</th>
<td>-1.0</td>
<td>0.0</td>
<td>1.0</td>
</tr>
<tr>
<th>Ohio</th>
<td>-1.0</td>
<td>0.0</td>
<td>1.0</td>
</tr>
<tr>
<th>Texas</th>
<td>-1.0</td>
<td>0.0</td>
<td>1.0</td>
</tr>
<tr>
<th>Oregon</th>
<td>-1.0</td>
<td>0.0</td>
<td>1.0</td>
</tr>
</tbody>
</table>
</div>
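
A common use of the `axis` argument is centering data. The sketch below assumes the same `arange` frame as above and subtracts row means and column means; `row_centered` and `col_centered` are illustrative names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Subtract each row's mean from that row: the Series is indexed by row
# label, so it must be matched on axis=0.
row_centered = frame.sub(frame.mean(axis=1), axis=0)

# Subtract each column's mean: the Series is indexed by column label,
# which is what the plain operator matches on by default.
col_centered = frame - frame.mean()

print(row_centered)
print(col_centered)
```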

Function application and mapping

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>b</th>
<th>d</th>
<th>e</th>
</tr>
</thead>
<tbody>
<tr>
<th>Utah</th>
<td>-0.636008</td>
<td>1.531034</td>
<td>0.417312</td>
</tr>
<tr>
<th>Ohio</th>
<td>0.490817</td>
<td>-1.060737</td>
<td>0.454573</td>
</tr>
<tr>
<th>Texas</th>
<td>0.315152</td>
<td>-0.123696</td>
<td>1.613796</td>
</tr>
<tr>
<th>Oregon</th>
<td>1.031102</td>
<td>0.578078</td>
<td>-0.269054</td>
</tr>
</tbody>
</table>
</div>

np.abs(frame)  # absolute value, element-wise

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>b</th>
<th>d</th>
<th>e</th>
</tr>
</thead>
<tbody>
<tr>
<th>Utah</th>
<td>0.636008</td>
<td>1.531034</td>
<td>0.417312</td>
</tr>
<tr>
<th>Ohio</th>
<td>0.490817</td>
<td>1.060737</td>
<td>0.454573</td>
</tr>
<tr>
<th>Texas</th>
<td>0.315152</td>
<td>0.123696</td>
<td>1.613796</td>
</tr>
<tr>
<th>Oregon</th>
<td>1.031102</td>
<td>0.578078</td>
<td>0.269054</td>
</tr>
</tbody>
</table>
</div>

The apply method
f = lambda x: x.max() - x.min()  # lambda defines an anonymous function
frame.apply(f)  # default axis=0: f is called once per column
b    1.667110
d    2.591771
e    1.882850
dtype: float64
frame.apply(f, axis=1)  # axis=1: f is called once per row
Utah      2.167042
Ohio      1.551555
Texas     1.737492
Oregon    1.300156
dtype: float64
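
The direction of `apply` is easy to mix up, so here is a sketch that checks it against plain reductions. The seed and the name `spread` are assumptions made for the example; the random values differ from the ones printed above.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # fixed seed so the run is repeatable
frame = pd.DataFrame(rng.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

spread = lambda x: x.max() - x.min()

per_column = frame.apply(spread)        # one result per column label
per_row = frame.apply(spread, axis=1)   # one result per row label

# The same numbers from vectorized reductions, without calling apply.
per_column_vec = frame.max() - frame.min()
per_row_vec = frame.max(axis=1) - frame.min(axis=1)

print(per_column)
print(per_row)
```

For simple statistics like this the built-in reductions are usually the better choice; `apply` earns its place when the function has no vectorized equivalent.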
More advanced usage
def f(x):
    return pd.Series([x.min(),x.max()],index = ['min','max'])
frame.apply(f)  # the function may return a Series; how far this goes depends on the function you write

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>b</th>
<th>d</th>
<th>e</th>
</tr>
</thead>
<tbody>
<tr>
<th>min</th>
<td>-0.636008</td>
<td>-1.060737</td>
<td>-0.269054</td>
</tr>
<tr>
<th>max</th>
<td>1.031102</td>
<td>1.531034</td>
<td>1.613796</td>
</tr>
</tbody>
</table>
</div>

format = lambda x : '%.2f' % x 

frame.applymap(format)  # applied to every element (newer pandas releases offer DataFrame.map for the same purpose)

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>b</th>
<th>d</th>
<th>e</th>
</tr>
</thead>
<tbody>
<tr>
<th>Utah</th>
<td>-0.64</td>
<td>1.53</td>
<td>0.42</td>
</tr>
<tr>
<th>Ohio</th>
<td>0.49</td>
<td>-1.06</td>
<td>0.45</td>
</tr>
<tr>
<th>Texas</th>
<td>0.32</td>
<td>-0.12</td>
<td>1.61</td>
</tr>
<tr>
<th>Oregon</th>
<td>1.03</td>
<td>0.58</td>
<td>-0.27</td>
</tr>
</tbody>
</table>
</div>

frame.e.map(format)  # Series.map applies the function to each element of one column

Utah       0.42
Ohio       0.45
Texas      1.61
Oregon    -0.27
Name: e, dtype: object
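
Element-wise formatting can also be written so that it does not depend on which pandas release provides `applymap`. This is a sketch under that assumption: it maps each column with `Series.map`, uses two rows copied from the frame above, and names the helper `two_decimals` (a made-up name) rather than shadowing the built-in `format`.

```python
import pandas as pd

frame = pd.DataFrame({'b': [-0.636008, 0.490817], 'e': [0.417312, 0.454573]},
                     index=['Utah', 'Ohio'])

def two_decimals(x):
    """Format one number with two decimal places."""
    return '%.2f' % x

# Series.map works element by element on one column.
e_text = frame['e'].map(two_decimals)

# Mapping every column in turn gives the element-wise result for the frame.
all_text = frame.apply(lambda col: col.map(two_decimals))

print(e_text)
print(all_text)
```

Note that the result holds strings, so it is suitable for display but no longer for arithmetic.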

Sorting

obj = pd.Series(range(4),index = ['d','a','b','c'])
obj.sort_index()
a    1
b    2
c    3
d    0
dtype: int64
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),index=['three', 'one'],columns=['d', 'a', 'b', 'c'])
frame.sort_index()

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>d</th>
<th>a</th>
<th>b</th>
<th>c</th>
</tr>
</thead>
<tbody>
<tr>
<th>one</th>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
</tr>
<tr>
<th>three</th>
<td>0</td>
<td>1</td>
<td>2</td>
<td>3</td>
</tr>
</tbody>
</table>
</div>

frame.sort_index(axis=1)  # sort by column labels instead of row labels

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>a</th>
<th>b</th>
<th>c</th>
<th>d</th>
</tr>
</thead>
<tbody>
<tr>
<th>three</th>
<td>1</td>
<td>2</td>
<td>3</td>
<td>0</td>
</tr>
<tr>
<th>one</th>
<td>5</td>
<td>6</td>
<td>7</td>
<td>4</td>
</tr>
</tbody>
</table>
</div>

frame.sort_index(axis=1, ascending=False)  # descending order

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>d</th>
<th>c</th>
<th>b</th>
<th>a</th>
</tr>
</thead>
<tbody>
<tr>
<th>three</th>
<td>0</td>
<td>3</td>
<td>2</td>
<td>1</td>
</tr>
<tr>
<th>one</th>
<td>4</td>
<td>7</td>
<td>6</td>
<td>5</td>
</tr>
</tbody>
</table>
</div>

Sorting by value
obj = pd.Series([4,7,-3,2])
obj.sort_values()
2   -3
3    2
0    4
1    7
dtype: int64
obj = pd.Series([4,np.nan,7,np.nan,-3,2])
obj.sort_values()  # missing values are placed at the end
4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64
DataFrame
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>a</th>
<th>b</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0</td>
<td>4</td>
</tr>
<tr>
<th>1</th>
<td>1</td>
<td>7</td>
</tr>
<tr>
<th>2</th>
<td>0</td>
<td>-3</td>
</tr>
<tr>
<th>3</th>
<td>1</td>
<td>2</td>
</tr>
</tbody>
</table>
</div>

frame.sort_values(by='b')  # sort rows by the values in column b

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>a</th>
<th>b</th>
</tr>
</thead>
<tbody>
<tr>
<th>2</th>
<td>0</td>
<td>-3</td>
</tr>
<tr>
<th>3</th>
<td>1</td>
<td>2</td>
</tr>
<tr>
<th>0</th>
<td>0</td>
<td>4</td>
</tr>
<tr>
<th>1</th>
<td>1</td>
<td>7</td>
</tr>
</tbody>
</table>
</div>

frame.sort_values(by=['a', 'b'])  # with several columns, rows are sorted by them in the order given

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>a</th>
<th>b</th>
</tr>
</thead>
<tbody>
<tr>
<th>2</th>
<td>0</td>
<td>-3</td>
</tr>
<tr>
<th>0</th>
<td>0</td>
<td>4</td>
</tr>
<tr>
<th>3</th>
<td>1</td>
<td>2</td>
</tr>
<tr>
<th>1</th>
<td>1</td>
<td>7</td>
</tr>
</tbody>
</table>
</div>
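
Each sort column can have its own direction. A minimal sketch on the same frame, passing a list to `ascending`; the name `mixed` is illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort by a ascending, then by b descending within each value of a.
mixed = frame.sort_values(by=['a', 'b'], ascending=[True, False])
print(mixed)
```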

Ranking
obj = pd.Series([7,-5,7,4,2,0,4])
obj.rank()
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64
obj.rank(method='first')  # break ties by order of appearance
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64
obj.rank(ascending=False, method='max')  # descending; tied values all get the highest rank of their group
0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],'c': [-2, 5, 8, -2.5]})

frame

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>a</th>
<th>b</th>
<th>c</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0</td>
<td>4.3</td>
<td>-2.0</td>
</tr>
<tr>
<th>1</th>
<td>1</td>
<td>7.0</td>
<td>5.0</td>
</tr>
<tr>
<th>2</th>
<td>0</td>
<td>-3.0</td>
<td>8.0</td>
</tr>
<tr>
<th>3</th>
<td>1</td>
<td>2.0</td>
<td>-2.5</td>
</tr>
</tbody>
</table>
</div>

frame.rank(axis=1)  # axis=1: rank across the columns within each row

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>a</th>
<th>b</th>
<th>c</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2.0</td>
<td>3.0</td>
<td>1.0</td>
</tr>
<tr>
<th>1</th>
<td>1.0</td>
<td>3.0</td>
<td>2.0</td>
</tr>
<tr>
<th>2</th>
<td>2.0</td>
<td>1.0</td>
<td>3.0</td>
</tr>
<tr>
<th>3</th>
<td>2.0</td>
<td>3.0</td>
<td>1.0</td>
</tr>
</tbody>
</table>
</div>

Tie-breaking methods for rank:
Method Description
'average' Default: assign the average rank to each entry in the equal group
'min' Use the minimum rank for the whole group
'max' Use the maximum rank for the whole group
'first' Assign ranks in the order the values appear in the data
'dense' Like method='min', but ranks always increase by 1 in between groups rather than the number of equal elements in a group
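
The methods in the table are easiest to compare on one Series. The sketch below ranks the same `obj` used earlier with every method and collects the results in one frame; the `ranks` frame and its column layout are assumptions made for the example.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

methods = ['average', 'min', 'max', 'first', 'dense']

# One column of ranks per tie-breaking method.
ranks = pd.DataFrame({m: obj.rank(method=m) for m in methods})

# Put the original values first for easier reading.
ranks.insert(0, 'value', obj)
print(ranks)
```

The two 7s and the two 4s are the tied groups, so those are the rows where the columns disagree.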

Axis indexes with duplicate labels

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj
a    0
a    1
b    2
b    3
c    4
dtype: int64
Checking whether the index is unique
obj.index.is_unique
False
obj.a  # label lookup: a duplicated label returns a Series, a unique one returns a scalar
a    0
a    1
dtype: int64
obj.c
4
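
Because a duplicated label changes the return type, it is often worth resolving duplicates before indexing. Two possible approaches are sketched below, assuming either "keep the first row" or "sum the rows" is what the data calls for; the variable names are illustrative.

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

# A duplicated label selects every matching row; a unique one gives a scalar.
several = obj['a']
single = obj['c']

# Option 1: keep only the first row for each label.
first_only = obj[~obj.index.duplicated(keep='first')]

# Option 2: combine rows that share a label, here by summing them.
summed = obj.groupby(level=0).sum()

print(first_only)
print(summed)
```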
DataFrame
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])

df

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
<th>2</th>
</tr>
</thead>
<tbody>
<tr>
<th>a</th>
<td>-0.534059</td>
<td>-0.465903</td>
<td>0.440969</td>
</tr>
<tr>
<th>a</th>
<td>-0.251819</td>
<td>-0.324293</td>
<td>-0.034794</td>
</tr>
<tr>
<th>b</th>
<td>-0.840377</td>
<td>0.590484</td>
<td>-1.700600</td>
</tr>
<tr>
<th>b</th>
<td>-1.271153</td>
<td>0.897543</td>
<td>1.486386</td>
</tr>
</tbody>
</table>
</div>

df.loc['b']  # select rows by label; a duplicated label returns all matching rows

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
<th>2</th>
</tr>
</thead>
<tbody>
<tr>
<th>b</th>
<td>-0.840377</td>
<td>0.590484</td>
<td>-1.700600</td>
</tr>
<tr>
<th>b</th>
<td>-1.271153</td>
<td>0.897543</td>
<td>1.486386</td>
</tr>
</tbody>
</table>
</div>

Descriptive statistics

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],[np.nan, np.nan], [0.75, -1.3]],
index=['a', 'b', 'c', 'd'],
 columns=['one', 'two'])
df

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>one</th>
<th>two</th>
</tr>
</thead>
<tbody>
<tr>
<th>a</th>
<td>1.40</td>
<td>NaN</td>
</tr>
<tr>
<th>b</th>
<td>7.10</td>
<td>-4.5</td>
</tr>
<tr>
<th>c</th>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>d</th>
<td>0.75</td>
<td>-1.3</td>
</tr>
</tbody>
</table>
</div>

df.sum()  # column sums
one    9.25
two   -5.80
dtype: float64
df.sum(axis=1)  # axis=1: sum across the columns, giving one total per row
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
df.mean(axis=1, skipna=False)
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
df.mean(axis=1, skipna=True)  # NaN is skipped (the default); a row that is entirely NaN still yields NaN
a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64
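
In the outputs above, the all-NaN row `c` sums to 0 but has a mean of NaN. If a NaN sum is preferred for such rows, the `min_count` argument of `sum` can express that. The sketch assumes a pandas version that has `min_count` (0.22 or later, as far as I know).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

# Default: NaN is skipped, and a row with no values at all sums to 0.
default_sum = df.sum(axis=1)

# min_count=1 asks for at least one real value, otherwise the result is NaN.
strict_sum = df.sum(axis=1, min_count=1)

# The mean of an all-NaN row is NaN either way.
row_mean = df.mean(axis=1)

print(default_sum)
print(strict_sum)
```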
Index labels of the extreme values
df.idxmax()  # label of the maximum in each column
one    b
two    d
dtype: object
df.idxmin()  # label of the minimum in each column
one    d
two    b
dtype: object
Other
df.cumsum()  # cumulative sum down each column

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>one</th>
<th>two</th>
</tr>
</thead>
<tbody>
<tr>
<th>a</th>
<td>1.40</td>
<td>NaN</td>
</tr>
<tr>
<th>b</th>
<td>8.50</td>
<td>-4.5</td>
</tr>
<tr>
<th>c</th>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>d</th>
<td>9.25</td>
<td>-5.8</td>
</tr>
</tbody>
</table>
</div>

Summary statistics with describe

df.describe()

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>one</th>
<th>two</th>
</tr>
</thead>
<tbody>
<tr>
<th>count</th>
<td>3.000000</td>
<td>2.000000</td>
</tr>
<tr>
<th>mean</th>
<td>3.083333</td>
<td>-2.900000</td>
</tr>
<tr>
<th>std</th>
<td>3.493685</td>
<td>2.262742</td>
</tr>
<tr>
<th>min</th>
<td>0.750000</td>
<td>-4.500000</td>
</tr>
<tr>
<th>25%</th>
<td>1.075000</td>
<td>-3.700000</td>
</tr>
<tr>
<th>50%</th>
<td>1.400000</td>
<td>-2.900000</td>
</tr>
<tr>
<th>75%</th>
<td>4.250000</td>
<td>-2.100000</td>
</tr>
<tr>
<th>max</th>
<td>7.100000</td>
<td>-1.300000</td>
</tr>
</tbody>
</table>
</div>

Non-numeric data
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj.describe()
count     16
unique     3
top        a
freq       8
dtype: object

Descriptive and summary statistics methods
Method Description
count Number of non-NA values
describe Compute set of summary statistics for Series or each DataFrame column
min, max Compute minimum and maximum values
argmin, argmax Compute index locations (integers) at which minimum or maximum value obtained, respectively
idxmin, idxmax Compute index labels at which minimum or maximum value obtained, respectively
quantile Compute sample quantile ranging from 0 to 1
sum Sum of values
mean Mean of values
median Arithmetic median (50% quantile) of values
mad Mean absolute deviation from mean value (no longer available from pandas 2.0 on)
prod Product of all values
var Sample variance of values
std Sample standard deviation of values
skew Sample skewness (third moment) of values
kurt Sample kurtosis (fourth moment) of values
cumsum Cumulative sum of values
cummin, cummax Cumulative minimum or maximum of values, respectively
cumprod Cumulative product of values
diff Compute first arithmetic difference (useful for time series)
pct_change Compute percent changes

Correlation and covariance

df.corr()

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>one</th>
<th>two</th>
</tr>
</thead>
<tbody>
<tr>
<th>one</th>
<td>1.0</td>
<td>-1.0</td>
</tr>
<tr>
<th>two</th>
<td>-1.0</td>
<td>1.0</td>
</tr>
</tbody>
</table>
</div>

df['one'].corr(df['two'])
-1.0
df.cov()  # covariance matrix

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>one</th>
<th>two</th>
</tr>
</thead>
<tbody>
<tr>
<th>one</th>
<td>12.205833</td>
<td>-10.16</td>
</tr>
<tr>
<th>two</th>
<td>-10.160000</td>
<td>5.12</td>
</tr>
</tbody>
</table>
</div>

df.corrwith(df.one)  # correlation of every column with one given Series
one    1.0
two   -1.0
dtype: float64
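
The correlation of exactly -1.0 is worth a second look. Correlation is computed only over rows where both columns have a value, and this `df` has just two such rows, so the result can only be +1 or -1. The sketch below checks that against NumPy; `complete`, `r_pandas` and `r_numpy` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

# Rows b and d are the only ones with a value in both columns.
complete = df.dropna()

# pandas drops the incomplete pairs on its own.
r_pandas = df['one'].corr(df['two'])

# NumPy needs the incomplete rows removed first.
r_numpy = np.corrcoef(complete['one'].values, complete['two'].values)[0, 1]

print(len(complete), r_pandas, r_numpy)
```

With so few complete pairs the figure says nothing about the relationship between the columns, which is a good reason to check the pair count before reading a correlation.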

Unique values and value counts

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

Series


uniques = obj.unique()
uniques
array(['c', 'a', 'd', 'b'], dtype=object)
uniques.sort()  # sorts the array in place
uniques
array(['a', 'b', 'c', 'd'], dtype=object)
Counting
obj.value_counts()
c    3
a    3
b    2
d    1
dtype: int64
pd.value_counts(obj.values, sort=False)  # sort=False skips the sort by count, so the row order may vary
b    2
a    3
c    3
d    1
dtype: int64
obj
0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object
Membership test
mask = obj.isin(['b','c'])
mask
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
obj[mask]  # keep only the rows where the mask is True
0    c
5    b
6    b
7    c
8    c
dtype: object
Mapping values to index positions
to_match = pd.Series(['c','a','b','b','c','a'])
u_v = pd.Series(['c','b','a'])
pd.Index(u_v).get_indexer(to_match)
array([0, 2, 1, 1, 0, 2])
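
`get_indexer` also has to handle values that are not in the index at all; it marks them with -1. A short sketch with a made-up lookup list containing one unknown value, `'z'`:

```python
import pandas as pd

unique_vals = pd.Index(['c', 'b', 'a'])

# Position of each value inside unique_vals; -1 means "not found".
positions = unique_vals.get_indexer(['c', 'a', 'z', 'b'])

# Boolean mask of the values that were actually found.
found = positions >= 0

print(positions)
print(found)
```

Since -1 is also a valid position in NumPy (the last element), it is safer to filter with a mask like `found` before using `positions` to index another array.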

Method Description
isin Compute boolean array indicating whether each Series value is contained in the passed sequence of values
match Compute integer indices for each value in an array into another array of distinct values; helpful for data alignment and join-type operations (the example above uses pd.Index(...).get_indexer for this)
unique Compute array of unique values in a Series, returned in the order observed
value_counts Return a Series containing unique values as its index and frequencies as its values, ordered by count in descending order

Value counts across several columns
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
 'Qu2': [2, 3, 1, 2, 3],
 'Qu3': [1, 5, 2, 4, 4]})
data

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Qu1</th>
<th>Qu2</th>
<th>Qu3</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1</td>
<td>2</td>
<td>1</td>
</tr>
<tr>
<th>1</th>
<td>3</td>
<td>3</td>
<td>5</td>
</tr>
<tr>
<th>2</th>
<td>4</td>
<td>1</td>
<td>2</td>
</tr>
<tr>
<th>3</th>
<td>3</td>
<td>2</td>
<td>4</td>
</tr>
<tr>
<th>4</th>
<td>4</td>
<td>3</td>
<td>4</td>
</tr>
</tbody>
</table>
</div>

result = data.apply(pd.value_counts).fillna(0)
result

<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Qu1</th>
<th>Qu2</th>
<th>Qu3</th>
</tr>
</thead>
<tbody>
<tr>
<th>1</th>
<td>1.0</td>
<td>1.0</td>
<td>1.0</td>
</tr>
<tr>
<th>2</th>
<td>0.0</td>
<td>2.0</td>
<td>1.0</td>
</tr>
<tr>
<th>3</th>
<td>2.0</td>
<td>2.0</td>
<td>0.0</td>
</tr>
<tr>
<th>4</th>
<td>2.0</td>
<td>0.0</td>
<td>2.0</td>
</tr>
<tr>
<th>5</th>
<td>0.0</td>
<td>0.0</td>
<td>1.0</td>
</tr>
</tbody>
</table>
</div>
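
The same table can be built by calling the `value_counts` method on each column, which avoids relying on the top-level `pd.value_counts` function (newer pandas releases steer users toward the method form, as far as I know). A sketch with integer counts and an explicitly sorted index; `counts` is an illustrative name.

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# Count each column separately; the answers become the row index.
counts = data.apply(lambda col: col.value_counts())

# An answer absent from a column shows up as NaN; turn that into a 0 count.
result = counts.fillna(0).astype(int).sort_index()
print(result)
```

Each column of `result` sums to the number of rows in `data`, which is a handy sanity check.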


    Title: pandas basic

    Link: https://www.haomeiwen.com/subject/rywgzxtx.html