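Indexing
Index by row label (.ix was removed in pandas 1.0; use .loc for labels): data.loc['row_name']
Index by row position: data.iloc[2]
Boolean mask for rows whose column contains any of several substrings ('|' is regex alternation): df_data['分类'].str.contains('婚|同居|抚养|赡养')
Select specific columns: data = data.loc[:, ['account', '首贷申请时间']]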
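The four lookups above can be tried on a small frame. This is only a sketch: the row labels (r1..r3) and the English column values are made up for illustration and are not from the original data.

```python
import pandas as pd

# Hypothetical data; labels and values are invented for this example.
data = pd.DataFrame(
    {"account": ["001", "002", "003"],
     "category": ["marriage", "loan", "custody"]},
    index=["r1", "r2", "r3"],
)

row_by_label = data.loc["r2"]      # label-based lookup
row_by_position = data.iloc[2]     # position-based lookup (third row)

# str.contains treats the pattern as a regex by default, so "|" means "or".
mask = data["category"].str.contains("marriage|custody")
filtered = data[mask]              # apply the mask to keep matching rows

subset = data.loc[:, ["account"]]  # keep only the listed columns
```

Note that str.contains only returns the mask; indexing the frame with it (data[mask]) is what actually filters the rows.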
Grouping and sorting
In [84]: df = pd.DataFrame({'key1':['a','a','b','b','a'],
...: 'key2':['one','two','one','two','one'],
...: 'data1':np.random.randn(5),
...: 'data2':np.random.randn(5)})
In [85]: df
Out[85]:
      data1     data2 key1 key2
0  1.579140  0.428876    a  one
1  0.494486  0.397206    a  two
2 -0.445459 -1.447018    b  one
3  1.114477 -1.539330    b  two
4  0.899226 -2.082411    a  one
# Group, then sort within each group by multiple keys; each element of ascending sets the direction for the corresponding sort key
In [86]: sort_func = lambda x: x.sort_values(['data1', 'data2'], ascending=[1, 0])
In [87]: dfgs = df.groupby(['key1', 'key2']).apply(sort_func)
In [88]: dfgs
Out[88]:
                data1     data2 key1 key2
key1 key2
a    one  4  0.899226 -2.082411    a  one
          0  1.579140  0.428876    a  one
     two  1  0.494486  0.397206    a  two
b    one  2 -0.445459 -1.447018    b  one
     two  3  1.114477 -1.539330    b  two
# After grouping and sorting, keep only the first n rows of each group
In [89]: sort_func = lambda x: x.sort_values(['data1', 'data2'], ascending=[1, 0]).head(1)
In [90]: dfgsh = df.groupby(['key1', 'key2']).apply(sort_func)
In [91]: dfgsh
Out[91]:
                data1     data2 key1 key2
key1 key2
a    one  4  0.899226 -2.082411    a  one
     two  1  0.494486  0.397206    a  two
b    one  2 -0.445459 -1.447018    b  one
     two  3  1.114477 -1.539330    b  two
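A possible alternative to groupby(...).apply(...) for the "first n per group" case is to sort the whole frame once and then call groupby(...).head(n). This is a sketch with fixed numbers in place of the random data above, and it is not the method used in the session; the result keeps the original flat index instead of adding the group keys as index levels.

```python
import pandas as pd

# Fixed values stand in for np.random.randn so the result is reproducible.
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.5, 0.4, -0.4, 1.1, 0.8],
    "data2": [0.4, 0.3, -1.4, -1.5, -2.0],
})

# Sort once (data1 ascending, data2 descending), then take the first row
# of every (key1, key2) group in that sorted order.
top1 = (
    df.sort_values(["data1", "data2"], ascending=[True, False])
      .groupby(["key1", "key2"])
      .head(1)
)
```

Here row 0 is dropped because row 4 has the smaller data1 within the (a, one) group.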
Setting column dtypes when reading data
data = pd.read_excel(host_file, dtype={'emergency_contact_mobile(紧急联系人1)': str,
'emergency_contact_mobile2nd(紧急联系人2)': str,
'account': str})
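Why the dtype mapping matters: without it, numeric-looking identifiers are parsed as integers and lose their leading zeros. The sketch below uses pd.read_csv on an in-memory string as a stand-in for read_excel so that it runs without a file; the sample values and the amount column are made up.

```python
import io

import pandas as pd

# In-memory stand-in for a real file; the values are invented.
raw = "account,amount\n00123,10\n00456,20\n"

# Default parsing infers integers, so "00123" becomes 123.
as_default = pd.read_csv(io.StringIO(raw))

# Forcing str keeps the identifier exactly as written.
as_str = pd.read_csv(io.StringIO(raw), dtype={"account": str})
```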
Renaming columns
data.rename(columns={'emergency_contact_mobile(紧急联系人1)': 'emergency1',
'emergency_contact_mobile2nd(紧急联系人2)': 'emergency2',
'首贷申请时间': 'first_time'},
inplace=True)
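A small sketch of the same call without inplace=True, which returns a new frame and leaves the original untouched. The column names here are illustrative; keys in the mapping that do not match any column are ignored by default.

```python
import pandas as pd

data = pd.DataFrame({"addtime": ["2020-01-01"], "account": ["001"]})

# rename returns a new DataFrame; "no_such_column" matches nothing and is
# skipped silently because errors="ignore" is the default.
renamed = data.rename(columns={"addtime": "first_time", "no_such_column": "x"})
```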
Converting a DataFrame to a list of dicts (one dict per row)
data = data.to_dict('records')
Reference: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_dict.html
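To make the shape of the result concrete, here is a sketch comparing the 'records' orientation with 'list' on a two-row frame; the values are made up.

```python
import pandas as pd

data = pd.DataFrame({
    "account": ["001", "002"],
    "first_time": ["2020-01-01", "2020-02-01"],
})

records = data.to_dict("records")  # list with one dict per row
by_column = data.to_dict("list")   # dict mapping column name to list of values
```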
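Removing NaN, inf, etc.
# Drop rows that contain NaN or ±inf in any column (axis must be passed by keyword in recent pandas)
df[~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)]
# Keep only rows where the '问题' column is not null
df_data = df_data[df_data['问题'].notnull()]
# Replace all missing values with empty strings, in place
df_data.fillna('', inplace=True)
# Drop rows that contain any missing value, in place
data.dropna(inplace=True)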
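One more way to drop rows with NaN or infinite values, sketched on made-up numbers: convert the infinities to NaN first, then let dropna handle everything. This is an alternative to the isin-based filter, not a replacement the original notes prescribe.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, np.nan, np.inf, 4.0],
    "y": [1.0, 2.0, 3.0, -np.inf],
})

# Turn +inf and -inf into NaN, then drop every row that has any NaN.
clean = df.replace([np.inf, -np.inf], np.nan).dropna()
```

Only the first row survives, since every other row has a NaN or an infinity in at least one column.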
Some examples
An example of reading a file
def parse_application_file(application_file):
    # Read a tab-separated txt file and parse the 'account' column as str
    data = pd.read_table(application_file, sep='\t', encoding='utf-8',
                         engine='python', dtype={'account': str})
    data = data.loc[:, ['account', 'addtime']]
    data.rename(columns={'addtime': 'first_time'}, inplace=True)  # rename the column
    data = data.set_index('account')  # use the account column as the index
    print(data)
    return data
# For converting between index and columns, see https://www.cnblogs.com/hhh5460/p/7067928.html
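The index/column conversion mentioned above comes down to set_index and reset_index. A minimal sketch, with made-up values and the same column names as the function above:

```python
import pandas as pd

data = pd.DataFrame({
    "account": ["001", "002"],
    "first_time": ["2020-01-01", "2020-02-01"],
})

indexed = data.set_index("account")  # column -> index
restored = indexed.reset_index()     # index -> column again
```

Passing drop=True to reset_index discards the old index instead of turning it back into a column.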