1 数据选取操作
1.1 isin和is not in 的使用和操作
按照pandas作者的说法,pandas可以实现几乎所有的类似sql的操作,这其中当然包括sql中的in和not in。
在pandas中(包括Dataframe和Series)in和not in的实现方法分别是obj.isin([list] or {set} or (tuple))和~obj.isin([list]),例如:
import pandas as pd
data = pd.DataFrame(
{"name": ["Tom", "Mary", "Jerry", "Mark"],
"age":[18, 22, 19, 31],
"gender":["M", "F", "M", "M"]})
data.age.isin((18, 22, 19))
~data.age.isin([18, 22, 19])
data[data.age.isin([18, 22])]
1.2 filter 和 select
pandas中的filter和select用来选择符合条件的行或者列
1.2.1 filter的使用
def filter(self, items=None, like=None, regex=None, axis=None):
"""
使用list、正则表达式或者like语法来选择行或者列
参数
----------
items : list-like
索引list、set、tuple或者其他list-like类型
like : string
Keep info axis where "arg in col == True"
regex : string (regular expression)
正则表达式
axis : int or None
默认情况下为0
# 使用方法 以上例的data为例
data.filter(items=[0,1], axis=0)
data.filter(items=[0,1], axis=0)
data.filter(like="ge", axis=1)
data.filter(regex="[a-z]_\d")
1.2.2 select的使用
def select(self, crit, axis=0):
"""
Return data corresponding to axis labels matching criteria
Parameters
----------
crit : function
To be called on each index (label). Should return True or False
axis : int 默认情况下axis=0 即按行选择
Returns
-------
selection : type of caller
"""
# 使用方法如下
import re
# 选择含有age的列
data.select(lambda x: bool(x) if "age" in x else False, axis=1)
# 按照正则表达式选择列
data.select(lambda x: bool(re.match("^[a-z]*_\d$", x)), axis=1)
1.3 多条件选取
“&”、“|”分别表示“与”和“或”
只需使用 & 或 | 连接多个条件就可以实现多条件的选取了
示例代码如下:
```
data[data.age.isin([18, 22, 19])&~data.gender.isin(["M"])]
newseries = df.loc[(df['BBB'] < 25) & (df['CCC'] >= -40), 'AAA']
```
1.4 ix、loc与iloc
在pandas中ix、loc和iloc都用于选取数据,其中ix和loc用于pandas的索引选取,即根据数据中的行索引和列索引的名字选取,iloc是使用类似列表的顺序索引进行数据选取的
2 DataFrame 中重复项的处理
3 DataFrame的排序
3.1 pandas中排序的方法
pandas中的排序方法有三个sort_values()、sort()
和sort_index()三个,其中sort()和sort_index()都已经不推荐使用。
def sort_values(self, by, axis=0, ascending=True, inplace=False,
kind='quicksort', na_position='last'):
"""
Parameters
----------
by : string 或list
键的名字,当为list时是多值排序
axis : int 默认情况下axis=0 即按行选择
ascending : boolean or list, default True
是否按照升序排序 list用于多值排序
inplace : bool
是否原地操作
kind : {'quicksort', 'mergesort', 'heapsort'}, optional
排序方法
na_position : {'first', 'last'} (optional, default='last')
'first' puts NaNs at the beginning
'last' puts NaNs at the end
"""
pass
data = pd.DataFrame(
{"name": ["Tom", "Mary", "Jerry", "Mark"],
"age":[18, 22, 19, 31],
"gender":["M", "F", "M", "M"]})
data["age_2"] = [32, 32, 35, 15]
data.ix[0, "age"] = 25
data.sort_values(by="age")
3.2 多值排序
data.sort(by=["age", "age_2"])
# age gender name age_2
# 2 19 M Jerry 35
# 1 22 F Mary 32
# 0 25 M Tom 32
# 3 31 M Mark 15
data.sort_values(by=["age_2", "age"], ascending=[True, False])
age gender name age_2
# 3 31.0 M Mark 15
# 4 NaN F Cris 18
# 0 25.0 M Tom 32
# 1 22.0 F Mary 32
# 2 19.0 M Jerry 35
3.3 正序与倒序
# 正序与倒序排列通过**ascending**参数控制
data.sort_values(by=["age_2", "age"], ascending=[True, False])
age gender name age_2
3.4 na值的处理
# 参考1
4 在DataFrame中添加行
4.1 concate
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6]],columns=['a','b','c'])
#>>df
# a b c
# 0 1 2 3
# 1 4 5 6
#先将要添加进去的记录转置后连接在一起
t = pd.concat([pd.DataFrame([7,8,9]).T,pd.DataFrame([10,11,12]).T])
#然后修改columns 使得和df的columns一致
t.columns = df.columns
#最后把两个DataFrame合并并且忽略index
df = pd.concat([df,t],ignore_index=True)
#>>df
# a b c
# 0 1 2 3
# 1 4 5 6
# 2 7 8 9
# 3 10 11 12
4.2 loc
df.loc[i]={'a':1,'b':2}
df.loc[df.shape[0]+1] = {'ds':strToDate('2017-07-21'),'y':0}
5 删除列
5.1 del关键字删除列
# 只能使用del data["column_name"] 不能使用 del data.column_name
del data["age_2"]
5.2 pop删除(内部使用del实现)
data.pop("age")
5.3 drop删除
obj.drop("index_name")
6 NaN的声明
a = pd.NaT
np.NaN
7 重新索引与修改列名
7.1 rename和rename_axis
参考pandas.DataFrame.reindex 和 pandas.DataFrame.rename
DataFrame.rename(index=None, columns=None, **kwargs)
通过函数或字典修改DataFrame的index或columns. Function或者dict的值必须是一对一的(1-to-1). 没有包含在 dict或Series中的列或者索引保持不变。 多余的labels不会抛出异常。Alternatively, change Series.name with a scalar value (Series only).
Parameters:
index, columns : scalar, list-like, dict-like or function,optional
Scalar or list-like 仅仅用于修改 the Series.name attribute,
在DataFrame和 Panel中会抛出异常. 修改DataFrame的columns和index
需要使用 dict-like or functions
are transformations to apply to that axis’ values
copy : boolean, default True
Also copy underlying data
inplace : boolean, default False
是否复制一个DataFrame并返回. 如果为True,参数copy的值将会被忽略.
level : int or level name, default None
In case of a MultiIndex, only rename labels in the specified level.
Returns:
renamed : DataFrame (new object)
>>> s = pd.Series([1, 2, 3])
>>> s
0 1
1 2
2 3
dtype: int64
>>> s.rename("my_name") # scalar, changes Series.name
0 1
1 2
2 3
Name: my_name, dtype: int64
>>> s.rename(lambda x: x ** 2) # function, changes labels
0 1
1 2
4 3
dtype: int64
>>> s.rename({1: 3, 2: 5}) # mapping, changes labels
0 1
3 2
5 3
dtype: int64
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df.rename(2)
Traceback (most recent call last):
...
TypeError: 'int' object is not callable
>>> df.rename(index=str, columns={"A": "a", "B": "c"})
a c
0 1 4
1 2 5
2 3 6
>>> df.rename(index=str, columns={"A": "a", "C": "c"})
a B
0 1 4
1 2 5
2 3 6
7.2 reindex和reindex_axis
DataFrame.reindex(index=None, columns=None, **kwargs)
Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False
Parameters:
index, columns : array-like, optional (can be specified in order, or as
keywords) New labels / index to conform to. Preferably an Index object to avoid duplicating data
method : {None, ‘backfill’/’bfill’, ‘pad’/’ffill’, ‘nearest’}, optional
s.rename("my_name") # scalar, changes Series.name
0 1
1 2
2 3
Name: my_name, dtype: int64
>>> s.rename(lambda x: x ** 2) # function, changes labels
0 1
1 2
4 3
dtype: int64
>>> s.rename({1: 3, 2: 5}) # mapping, changes labels
0 1
3 2
5 3
dtype: int64
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df.rename(2)
Traceback (most recent call last):
...
TypeError: 'int' object is not callable
>>> df.rename(index=str, columns={"A": "a", "B": "c"})
a c
0 1 4
1 2 5
2 3 6
>>> df.rename(index=str, columns={"A": "a", "C": "c"})
a B
0 1 4
1 2 5
2 3 6
```
### 7.2 reindex和reindex_axis
参考pandas.DataFrame.reindex
> DataFrame.reindex(index=None, columns=None, kwargs)
> > Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False
>
> >Parameters:**
>>> index, columns : array-like, optional (can be specified in order, or as
keywords) New labels / index to conform to. Preferably an Index object to avoid duplicating data
>>> method : {None, ‘backfill’/’bfill’, ‘pad’/’ffill’, ‘nearest’}, optional
>>>>
method to use for filling holes in reindexed DataFrame. Please note: this is only applicable to DataFrames/Series with a monotonically increasing/decreasing index.
- default: don’t fill gaps
- pad / ffill: propagate last valid observation forward to next valid
- backfill / bfill: use next valid observation to fill gap
- nearest: use nearest valid observations to fill gap
copy : boolean, default True
Return a new object, even if the passed indexes are the same
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
fill_value : scalar, default np.NaN
Value to use for missing values. Defaults to NaN, but can be any “compatible” value
limit : int, default None
Maximum number of consecutive elements to forward or backward fill
tolerance : optional
Maximum distance between original and new labels for inexact matches. The values of the index at the matching locations most satisfy the equation abs(index[indexer] - target) <= tolerance.
New in version 0.17.0.
Returns:
reindexed : DataFrame
7.3 column和index
import pandas as pd
df = pd.DataFrame({'$a': [1], '$b': [1], '$c': [1], '$d': [1], '$e': [1]})
# 暴力
df.columns = ['a', 'b', 'c', 'd', 'e']
# 修改
df.columns = df.columns.str.strip('$')
# 修改
df.columns = df.columns.map(lambda x:x[1:])
# 暴力(好处:也可只修改特定的列)
df.rename(columns=('$a': 'a', '$b': 'b', '$c': 'c', '$d': 'd', '$e': 'e'}, inplace=True)
# 修改
df.rename(columns=lambda x:x.replace('$',''), inplace=True)
data.rename(columns={"age_2": "age_3"}, inplace=True)
8 Group By操作
8.1帮助文档
pd.DataFrame.groupby(self, by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)
Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.
Parameters
----------
by : mapping function / list of functions, dict, Series, or tuple /
list of column names.
Called on each element of the object index to determine the groups.
If a dict or Series is passed, the Series or dict VALUES will be
used to determine the groups
axis : int, default 0
level : int, level name, or sequence of such, default None
If the axis is a MultiIndex (hierarchical), group by a particular
level or levels
as_index : boolean, default True
For aggregated output, return object with group labels as the
index. Only relevant for DataFrame input. as_index=False is
effectively "SQL-style" grouped output
sort : boolean, default True
Sort group keys. Get better performance by turning this off.
Note this does not influence the order of observations within each
group. groupby preserves the order of rows within each group.
group_keys : boolean, default True
When calling apply, add group keys to index to identify pieces
squeeze : boolean, default False
reduce the dimensionality of the return type if possible,
otherwise return a consistent type
8.2 基本操作
最常见的操作是对一个和多个建进行group by
# 操作方法
df.groupby(["key1", "key2"])
8.3 聚合操作
在pandas中可以实现很多的聚合操作,如mean、median、sum、avg、表准差等,各种操作引诱尽有,实在不能满足自己的情况下可以使用apply函数实现自己的聚合操作
```
# 操作方法
group.sum()
group.sum()["key1"] # 选取聚合结果的某一列
group.sum()[["key1", "key2"]] # 选取聚合结果的多列
# 将聚合结果的索引转换为列 使用reset_index 函数
res = group.sum()[["key1", "key2"]]
res.reset_index()
```
8.4 将索引groupby索引转换为列
在DataFrame中有一个函数是reset_index, 使用方法见8.3节
8.5 groupby结果的块操作(chunk)
可以多group的结果进行循环操作,循w环中的每一项是一个tuple,形如(索引, 数据)
```
group = data.groupby(["key1", "key2"])
for item in group:
print(item[0])
print(item[1])
```
网友评论