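WeChat official account: 尤而小屋
Author: Peter
Editor: Peter
Hi everyone, I'm Peter~
This article shows how to filter text data in Pandas with three string methods:
- contains: the string contains a given substring or pattern
- startswith: the string starts with a given prefix
- endswith: the string ends with a given suffix
Sample data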
import pandas as pd
import numpy as np
df = pd.DataFrame({
"name":["xiao ming","Xiao zhang",np.nan,"sun quan","guan yu"],
"age":["22","19","20","34","39"],
"sex":["male","Female","female","Female","male"],
"address":["广东省深圳市","浙江省杭州市","江苏省苏州市","福建省泉州市","广东省广州市"]
})
df
df.dtypes  # check the column dtypes
name object
age object
sex object
address object
dtype: object
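The sample data has four notable characteristics:
- name: contains a missing value (np.nan), and "Xiao" and "xiao" differ in case
- age: should be numeric, but is stored as strings (object dtype)
- sex: also mixes upper and lower case ("Female" vs "female")
- address: regular, well-formed values
Data type conversion
We convert the age column from strings to a numeric type: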
df["age"] = df["age"].astype(float)
df
The resulting data looks the same as the original, but the column dtypes show the difference:
df.dtypes
name object
age float64
sex object
address object
dtype: object
The age column is now float64.
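As an aside, astype(float) assumes that every value parses as a number. A minimal sketch of a more tolerant conversion, assuming a hypothetical `ages` Series with one malformed entry (not part of the sample data above), uses pd.to_numeric with errors="coerce":

```python
import pandas as pd

# Hypothetical data: one entry cannot be parsed as a number.
ages = pd.Series(["22", "19", "unknown", "34"], dtype=object)

# errors="coerce" turns unparseable values into NaN instead of raising.
converted = pd.to_numeric(ages, errors="coerce")

print(converted.dtype)         # dtype after conversion
print(converted.isna().sum())  # number of values that could not be parsed
```

Whether coercing bad values to NaN is acceptable depends on the data; for the clean sample data in this article, astype(float) is enough.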
contains
contains is a string method on Series, accessed through .str. Its basic syntax is:
Series.str.contains(
pat,
case=True,
flags=0,
na=None,
regex=True
)
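- pat: the substring or regular expression to search for
- case: whether matching is case-sensitive (True by default)
- flags: regex flags passed through to the re module, e.g. re.IGNORECASE to ignore case
- na: optional scalar; the fill value for missing values in the original data. By default, numpy.nan is used for object dtype and pandas.NA for StringDtype
- regex: boolean; if True, pat is treated as a regular expression; if False, as a literal string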
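To make the regex parameter concrete, here is a small sketch. The Series `s` and its values are hypothetical, chosen only because "." is a regex metacharacter:

```python
import pandas as pd

# Hypothetical values chosen to show the difference; "." is a regex metacharacter.
s = pd.Series(["a.b", "axb", "ab"])

# regex=True (the default): "." matches any single character.
as_regex = s.str.contains("a.b")

# regex=False: "." must be a literal dot.
as_literal = s.str.contains("a.b", regex=False)

print(as_regex.tolist())
print(as_literal.tolist())
```

When the search text comes from user input or from data, regex=False is usually the safer choice unless pattern matching is actually wanted.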
Default behavior
# Example 1: find names that contain "xiao"
df["name"].str.contains("xiao")
0 True
1 False
2 NaN
3 False
4 False
Name: name, dtype: object
When the column contains missing values, pass the na parameter:
Handling missing values
# Example 2: using the na parameter
df[df["name"].str.contains("xiao",na=False)]
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>广东省深圳市</td>
</tr>
</tbody>
</table>
</div>
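Without na, using the result as a boolean mask raises an error in the pandas version used here (a ValueError, because the mask contains NaN):
df[df["name"].str.contains("xiao")]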
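The na value decides what happens to rows with a missing name. The following sketch rebuilds a reduced copy of the sample data (only the name and age columns) and compares na=False with na=True:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["xiao ming", "Xiao zhang", np.nan, "sun quan", "guan yu"],
    "age": [22.0, 19.0, 20.0, 34.0, 39.0],
})

# na=False: rows with a missing name are treated as "no match" and dropped.
matched = df[df["name"].str.contains("xiao", na=False)]

# na=True: rows with a missing name are treated as a match and kept.
matched_or_missing = df[df["name"].str.contains("xiao", na=True)]

print(matched.index.tolist())
print(matched_or_missing.index.tolist())
```

na=False is the usual choice for filtering; na=True can be handy when rows with missing values should be kept for later review.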
Ignoring case
# Example 3: using the case parameter
df["name"].str.contains("xiao",case=False)
0 True
1 True
2 NaN
3 False
4 False
Name: name, dtype: object
With case=False, letter case is ignored, so there are two True values: both the xiao row and the Xiao row are selected:
df[df["name"].str.contains("xiao",case=False, na=False)]
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>广东省深圳市</td>
</tr>
<tr>
<th>1</th>
<td>Xiao zhang</td>
<td>19.0</td>
<td>Female</td>
<td>浙江省杭州市</td>
</tr>
</tbody>
</table>
</div>
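The flags parameter offers a second route to the same result. A minimal sketch, using only the name values from the sample data, compares case=False with flags=re.IGNORECASE:

```python
import re

import numpy as np
import pandas as pd

names = pd.Series(["xiao ming", "Xiao zhang", np.nan, "sun quan", "guan yu"])

# Two ways to ignore case; na=False keeps the mask free of missing values.
by_case = names.str.contains("xiao", case=False, na=False)
by_flag = names.str.contains("xiao", flags=re.IGNORECASE, na=False)

print(by_case.tolist())
print(by_case.tolist() == by_flag.tolist())
```

case=False is shorter; flags is useful when other re flags need to be combined with it.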
Ignoring case and missing values
# Example 4: ignore case and missing values
df[df["sex"].str.contains("f",case=False, na=False)]
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>1</th>
<td>Xiao zhang</td>
<td>19.0</td>
<td>Female</td>
<td>浙江省杭州市</td>
</tr>
<tr>
<th>2</th>
<td>NaN</td>
<td>20.0</td>
<td>female</td>
<td>江苏省苏州市</td>
</tr>
<tr>
<th>3</th>
<td>sun quan</td>
<td>34.0</td>
<td>Female</td>
<td>福建省泉州市</td>
</tr>
</tbody>
</table>
</div>
Using regular expressions
# Example 5: using a regular expression
df["address"].str.contains("^广")
0 True
1 False
2 False
3 False
4 True
Name: address, dtype: bool
Here ^ anchors the match at the start of the string, so this selects addresses that start with 广:
df[df["address"].str.contains("^广")]
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>广东省深圳市</td>
</tr>
<tr>
<th>4</th>
<td>guan yu</td>
<td>39.0</td>
<td>male</td>
<td>广东省广州市</td>
</tr>
</tbody>
</table>
</div>
In a regular expression, $ anchors the match at the end of the string. The following selects addresses that end with 市:
df[df["address"].str.contains("市$")]
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>广东省深圳市</td>
</tr>
<tr>
<th>1</th>
<td>Xiao zhang</td>
<td>19.0</td>
<td>Female</td>
<td>浙江省杭州市</td>
</tr>
<tr>
<th>2</th>
<td>NaN</td>
<td>20.0</td>
<td>female</td>
<td>江苏省苏州市</td>
</tr>
<tr>
<th>3</th>
<td>sun quan</td>
<td>34.0</td>
<td>Female</td>
<td>福建省泉州市</td>
</tr>
<tr>
<th>4</th>
<td>guan yu</td>
<td>39.0</td>
<td>male</td>
<td>广东省广州市</td>
</tr>
</tbody>
</table>
</div>
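In the following regular expression, the character class [深苏泉] matches any one of the characters 深, 苏, or 泉, so rows whose address contains any of them are selected: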
df[df["address"].str.contains("[深苏泉]")]
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>广东省深圳市</td>
</tr>
<tr>
<th>2</th>
<td>NaN</td>
<td>20.0</td>
<td>female</td>
<td>江苏省苏州市</td>
</tr>
<tr>
<th>3</th>
<td>sun quan</td>
<td>34.0</td>
<td>Female</td>
<td>福建省泉州市</td>
</tr>
</tbody>
</table>
</div>
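A character class only matches single characters. For multi-character keywords, one possible approach is to join them with "|" (alternation). The `cities` list below is a hypothetical example, and re.escape is applied in case a keyword contains regex metacharacters:

```python
import re

import pandas as pd

addresses = pd.Series(
    ["广东省深圳市", "浙江省杭州市", "江苏省苏州市", "福建省泉州市", "广东省广州市"]
)

# Hypothetical list of keywords to search for.
cities = ["深圳", "苏州", "泉州"]

# re.escape guards against regex metacharacters; "|" means "or".
pattern = "|".join(re.escape(city) for city in cities)

mask = addresses.str.contains(pattern)
print(addresses[mask].index.tolist())
```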
startswith
The syntax of startswith is simpler:
Series.str.startswith(pat, na=None)
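- pat: a string (character sequence) to match at the start; note that regular expressions are not accepted
- na: the value to use for missing values; na=False makes missing values count as no match
The pat parameter
Takes a literal string; regular expressions are not accepted.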
df["address"].str.startswith("广")
0 True
1 False
2 False
3 False
4 True
Name: address, dtype: bool
df[df["address"].str.startswith("广")]
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>广东省深圳市</td>
</tr>
<tr>
<th>4</th>
<td>guan yu</td>
<td>39.0</td>
<td>male</td>
<td>广东省广州市</td>
</tr>
</tbody>
</table>
</div>
This gives the same result as a regular expression anchored at the start of the string:
df[df["address"].str.contains("^广")]
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>广东省深圳市</td>
</tr>
<tr>
<th>4</th>
<td>guan yu</td>
<td>39.0</td>
<td>male</td>
<td>广东省广州市</td>
</tr>
</tbody>
</table>
</div>
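The two forms agree here because 广 has no special meaning in a regular expression. The sketch below uses a hypothetical Series whose prefix contains a "." to show where they can diverge, and how re.escape brings them back in line:

```python
import re

import pandas as pd

# Hypothetical values; the prefix "1." contains the regex metacharacter ".".
s = pd.Series(["1.5kg", "125g", "1.5 m"])
prefix = "1."

literal = s.str.startswith(prefix)                 # prefix is taken literally
naive = s.str.contains("^" + prefix)               # "." matches any character
escaped = s.str.contains("^" + re.escape(prefix))  # "." is escaped to a literal dot

print(literal.tolist())
print(naive.tolist())
print(escaped.tolist())
```

For a fixed prefix, startswith avoids the escaping question entirely.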
Case sensitivity
startswith is always case-sensitive (it has no case parameter):
df[df["sex"].str.startswith("f")]
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>2</th>
<td>NaN</td>
<td>20.0</td>
<td>female</td>
<td>江苏省苏州市</td>
</tr>
</tbody>
</table>
</div>
df[df["sex"].str.startswith("F")]
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>1</th>
<td>Xiao zhang</td>
<td>19.0</td>
<td>Female</td>
<td>浙江省杭州市</td>
</tr>
<tr>
<th>3</th>
<td>sun quan</td>
<td>34.0</td>
<td>Female</td>
<td>福建省泉州市</td>
</tr>
</tbody>
</table>
</div>
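Since startswith has no case parameter, one common workaround is to normalize the case before testing the prefix. A minimal sketch using the sex values from the sample data:

```python
import pandas as pd

sex = pd.Series(["male", "Female", "female", "Female", "male"])

# Normalize to lower case first, then test the prefix.
mask = sex.str.lower().str.startswith("f")

print(mask.tolist())
```

The original column is left unchanged; only the temporary lower-cased copy is used to build the mask.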
Handling missing values
df["name"].str.startswith("xiao")
0 True
1 False
2 NaN
3 False
4 False
Name: name, dtype: object
df[df["name"].str.startswith("xiao",na=False)]
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>广东省深圳市</td>
</tr>
</tbody>
</table>
</div>
endswith
endswith selects strings that end with a given suffix. The syntax is:
Series.str.endswith(pat, na=None)
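- pat: a string (character sequence) to match at the end; note that regular expressions are not accepted
- na: the value to use for missing values; na=False makes missing values count as no match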
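Because startswith and endswith both return boolean masks, they can be combined with & and |. A small sketch using the address values from the sample data; the suffix 州市 is chosen only for illustration:

```python
import pandas as pd

addresses = pd.Series(
    ["广东省深圳市", "浙江省杭州市", "江苏省苏州市", "福建省泉州市", "广东省广州市"]
)

# Build one mask per condition, then combine them element-wise.
starts = addresses.str.startswith("广", na=False)
ends = addresses.str.endswith("州市", na=False)

print(addresses[starts & ends].tolist())
```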
The pat parameter
# Ends with 市
df[df["address"].str.endswith("市")]
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>广东省深圳市</td>
</tr>
<tr>
<th>1</th>
<td>Xiao zhang</td>
<td>19.0</td>
<td>Female</td>
<td>浙江省杭州市</td>
</tr>
<tr>
<th>2</th>
<td>NaN</td>
<td>20.0</td>
<td>female</td>
<td>江苏省苏州市</td>
</tr>
<tr>
<th>3</th>
<td>sun quan</td>
<td>34.0</td>
<td>Female</td>
<td>福建省泉州市</td>
</tr>
<tr>
<th>4</th>
<td>guan yu</td>
<td>39.0</td>
<td>male</td>
<td>广东省广州市</td>
</tr>
</tbody>
</table>
</div>
# The regex equivalent, using contains
df[df["address"].str.contains("市$")]
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>广东省深圳市</td>
</tr>
<tr>
<th>1</th>
<td>Xiao zhang</td>
<td>19.0</td>
<td>Female</td>
<td>浙江省杭州市</td>
</tr>
<tr>
<th>2</th>
<td>NaN</td>
<td>20.0</td>
<td>female</td>
<td>江苏省苏州市</td>
</tr>
<tr>
<th>3</th>
<td>sun quan</td>
<td>34.0</td>
<td>Female</td>
<td>福建省泉州市</td>
</tr>
<tr>
<th>4</th>
<td>guan yu</td>
<td>39.0</td>
<td>male</td>
<td>广东省广州市</td>
</tr>
</tbody>
</table>
</div>
Handling missing values
df["name"].str.endswith("g")
0 True
1 True
2 NaN
3 False
4 False
Name: name, dtype: object
df[df["name"].str.endswith("g",na=False)]
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>广东省深圳市</td>
</tr>
<tr>
<th>1</th>
<td>Xiao zhang</td>
<td>19.0</td>
<td>Female</td>
<td>浙江省杭州市</td>
</tr>
</tbody>
</table>
</div>
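# Without the na parameter, this raises an error
df[df["name"].str.endswith("g")]
The cause is simple: the name column contains a missing value, so the mask contains NaN. Passing the na parameter solves it.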
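An alternative to na, sketched here on a reduced copy of the sample data (name and age columns only), is to drop the rows with a missing name before filtering, so the mask never contains NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["xiao ming", "Xiao zhang", np.nan, "sun quan", "guan yu"],
    "age": [22.0, 19.0, 20.0, 34.0, 39.0],
})

# Drop rows whose name is missing, then filter on the remaining rows.
named = df.dropna(subset=["name"])
result = named[named["name"].str.endswith("g")]

print(result.index.tolist())
```

This gives the same rows as na=False; which form reads better depends on whether the rows with missing names are needed later.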