Three Secrets of Text Processing in pandas

Author: 皮皮大 | Published: 2022-03-05 22:33

    Public account: 尤而小屋
    Author: Peter
    Editor: Peter

    Hi everyone, I'm Peter.

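    This article shows how to filter text data with three string methods from pandas:

    • contains: the string contains a given pattern
    • startswith: the string starts with a given prefix
    • endswith: the string ends with a given suffix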

    Sample data

    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame({
        "name":["xiao ming","Xiao zhang",np.nan,"sun quan","guan yu"],
        "age":["22","19","20","34","39"],
        "sex":["male","Female","female","Female","male"],
        "address":["广东省深圳市","浙江省杭州市","江苏省苏州市","福建省泉州市","广东省广州市"]
    })
    
    df
    
    df.dtypes  # check the dtype of each column
    
    name       object
    age        object
    sex        object
    address    object
    dtype: object
    

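    The sample data has four notable features:

    1. name: contains a missing value (np.nan), and Xiao and xiao differ in case
    2. age: should be numeric, but is stored as strings (object dtype)
    3. sex: also mixes upper and lower case (F and f)
    4. address: regular values with no irregularities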

    Converting data types

    We convert the age column from strings to a numeric type:

    df["age"] = df["age"].astype(float)
    df
    

    The resulting data looks almost the same as the original (age now displays as 22.0 instead of 22); the difference is clear once we check the column dtypes:

    df.dtypes
    
    name        object
    age        float64  
    sex         object
    address     object
    dtype: object
    

    The age column is now float64, a 64-bit floating-point type.
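
If a column might contain values that cannot be parsed as numbers, astype(float) raises an error. One possible alternative is pd.to_numeric with errors="coerce", sketched below. The messy Series is made up for illustration and is not part of the sample data.

```python
import pandas as pd

# The age values from the sample data, stored as strings
age = pd.Series(["22", "19", "20", "34", "39"])
as_float = age.astype(float)  # works because every value is a valid number

# Hypothetical column with one unparseable entry (not from the article's data)
messy = pd.Series(["22", "n/a", "20"])

# errors="coerce" turns unparseable values into NaN instead of raising
coerced = pd.to_numeric(messy, errors="coerce")
print(coerced)
```

Whether silently coercing bad values to NaN is acceptable depends on the data; raising an error is often the safer default.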

    contains

    contains is a method available on a Series through the .str accessor. Its signature is:

    Series.str.contains(
        pat, 
        case=True, 
        flags=0, 
        na=None, 
        regex=True
    )
    
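    • pat: the character sequence or regular expression to search for
    • case: whether matching is case-sensitive (True by default)
    • flags: regex flags passed through to the re module, e.g. re.IGNORECASE to ignore case
    • na: optional scalar; the value to use for missing entries. If omitted, the default depends on the dtype: numpy.nan for object dtype, pandas.NA for StringDtype
    • regex: boolean; if True, pat is treated as a regular expression; if False, as a literal string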
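
As a rough illustration of the flags and regex parameters described above, the sketch below uses small stand-in Series (the variable names are chosen for this example only). It shows that, for a simple pattern like this one, flags=re.IGNORECASE and case=False give the same result, and that regex=False changes the meaning of a dot.

```python
import re

import numpy as np
import pandas as pd

names = pd.Series(["xiao ming", "Xiao zhang", np.nan, "sun quan", "guan yu"])

# Two ways to ignore case: a regex flag, or case=False
by_flag = names.str.contains("xiao", flags=re.IGNORECASE, na=False)
by_case = names.str.contains("xiao", case=False, na=False)

# Hypothetical values used only to show the effect of regex=False
dotted = pd.Series(["a.b", "axb"])
literal = dotted.str.contains(".", regex=False)   # "." must appear literally
as_regex = dotted.str.contains(".", regex=True)   # "." matches any character

print(by_flag.tolist(), literal.tolist(), as_regex.tolist())
```

Setting regex=False is a reasonable choice when the search text comes from user input and may contain regex metacharacters.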

    Default behavior

    # Example 1: find names containing xiao
    
    df["name"].str.contains("xiao")
    
    0     True
    1    False
    2      NaN
    3    False
    4    False
    Name: name, dtype: object
    

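    When the column contains missing values, pass the na parameter so that the result can be used as a boolean mask:

    Handling missing values

    # Example 2: using the na parameter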
    
    df[df["name"].str.contains("xiao",na=False)]
    

            name   age   sex  address
    0  xiao ming  22.0  male   广东省深圳市

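    Without the na parameter the filter raises an error, because a mask that contains NaN cannot be used for boolean indexing:

    df[df["name"].str.contains("xiao")]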
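
The value passed to na decides which side missing entries fall on. A minimal sketch, using a one-column copy of the sample data, that compares na=False with na=True:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"name": ["xiao ming", "Xiao zhang", np.nan, "sun quan", "guan yu"]}
)

# na=False: the row with a missing name is treated as a non-match
excluded = df[df["name"].str.contains("xiao", na=False)]

# na=True: the row with a missing name is treated as a match and kept
included = df[df["name"].str.contains("xiao", na=True)]

print(excluded.index.tolist(), included.index.tolist())
```

na=False is the usual choice for filtering; na=True can be handy when rows with missing text should be kept for manual review.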

    Ignoring case

    # Example 3: using the case parameter
    
    df["name"].str.contains("xiao",case=False)
    
    0     True
    1     True
    2      NaN
    3    False
    4    False
    Name: name, dtype: object
    

    With case=False the match ignores case, so two values are True: both xiao and Xiao match. Using the mask to filter:

    df[df["name"].str.contains("xiao",case=False, na=False)]
    

             name   age     sex  address
    0   xiao ming  22.0    male   广东省深圳市
    1  Xiao zhang  19.0  Female   浙江省杭州市

    Ignoring case and missing values

    # Example 4: ignore case and treat missing values as non-matches
    df[df["sex"].str.contains("f",case=False, na=False)]
    

             name   age     sex  address
    1  Xiao zhang  19.0  Female   浙江省杭州市
    2         NaN  20.0  female   江苏省苏州市
    3    sun quan  34.0  Female   福建省泉州市
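
One thing to keep in mind with contains on a column like sex: a substring test for "male" also matches "female". The sketch below shows the difference, using str.fullmatch as one possible way to require the whole value to match (the Series is a standalone copy of the sex column).

```python
import pandas as pd

sex = pd.Series(["male", "Female", "female", "Female", "male"])

# Substring test: "male" is contained in "female", so every row matches
loose = sex.str.contains("male", case=False)

# Whole-value test: only values that are exactly "male" (ignoring case) match
exact = sex.str.fullmatch("male", case=False)

print(loose.tolist())
print(exact.tolist())
```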

    Using regular expressions

    # Example 5: using a regular expression
    
    df["address"].str.contains("^广")
    
    0     True
    1    False
    2    False
    3    False
    4     True
    Name: address, dtype: bool
    

    Here ^ anchors the match at the start of the string, so the pattern selects values that start with 广:

    df[df["address"].str.contains("^广")]
    

            name   age   sex  address
    0  xiao ming  22.0  male   广东省深圳市
    4    guan yu  39.0  male   广东省广州市

    In a regular expression, $ anchors the match at the end of the string. The following selects values that end with 市:

    df[df["address"].str.contains("市$")]
    

             name   age     sex  address
    0   xiao ming  22.0    male   广东省深圳市
    1  Xiao zhang  19.0  Female   浙江省杭州市
    2         NaN  20.0  female   江苏省苏州市
    3    sun quan  34.0  Female   福建省泉州市
    4     guan yu  39.0    male   广东省广州市

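    In the next example, the character class [深苏泉] matches any one of the characters 深, 苏 or 泉, so rows whose address contains at least one of them are selected: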

    df[df["address"].str.contains("[深苏泉]")]
    

            name   age     sex  address
    0  xiao ming  22.0    male   广东省深圳市
    2        NaN  20.0  female   江苏省苏州市
    3   sun quan  34.0  Female   福建省泉州市
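
A character class only works for single characters. For alternatives that are longer than one character, regex alternation with | is one option. The sketch below, on a standalone copy of the address column, first checks that the two forms agree for single characters and then matches two city names (the choice of 深圳 and 杭州 is just for illustration).

```python
import pandas as pd

address = pd.Series(
    ["广东省深圳市", "浙江省杭州市", "江苏省苏州市", "福建省泉州市", "广东省广州市"]
)

# Character class and alternation are equivalent for single characters
by_class = address.str.contains("[深苏泉]")
by_alternation = address.str.contains("深|苏|泉")

# Alternation also handles multi-character alternatives
cities = address.str.contains("深圳|杭州")

print(by_class.tolist())
print(cities.tolist())
```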

    startswith

    The signature of startswith is simpler:

    Series.str.startswith(pat, na=None)
    
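    • pat: the prefix string to test for (it may be longer than one character); regular expressions are not accepted
    • na: the value to use for missing entries; na=False makes them evaluate to False

    The pat parameter

    Pass a literal string; regular expressions are not accepted: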

    df["address"].str.startswith("广")
    
    0     True
    1    False
    2    False
    3    False
    4     True
    Name: address, dtype: bool
    
    df[df["address"].str.startswith("广")]
    

            name   age   sex  address
    0  xiao ming  22.0  male   广东省深圳市
    4    guan yu  39.0  male   广东省广州市

    This gives the same result as contains with a regular expression anchored at the start of the string:

    df[df["address"].str.contains("^广")]
    

            name   age   sex  address
    0  xiao ming  22.0  male   广东省深圳市
    4    guan yu  39.0  male   广东省广州市
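
To make the "no regular expressions" point concrete, here is a small sketch on a standalone copy of the address column. It checks that the prefix test and the anchored regex agree, and that startswith treats ^ as an ordinary character.

```python
import pandas as pd

address = pd.Series(
    ["广东省深圳市", "浙江省杭州市", "江苏省苏州市", "福建省泉州市", "广东省广州市"]
)

by_prefix = address.str.startswith("广")   # literal prefix test
by_regex = address.str.contains("^广")     # regex anchored at the start

# startswith does not interpret "^": it looks for the two characters "^广"
literal_caret = address.str.startswith("^广")

print(by_prefix.tolist(), literal_caret.tolist())
```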

    Case sensitivity

    startswith is case-sensitive and, unlike contains, has no case parameter:

    df[df["sex"].str.startswith("f")]
    

       name   age     sex  address
    2   NaN  20.0  female   江苏省苏州市

    df[df["sex"].str.startswith("F")]
    

             name   age     sex  address
    1  Xiao zhang  19.0  Female   浙江省杭州市
    3    sun quan  34.0  Female   福建省泉州市
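
Since startswith cannot ignore case on its own, one common workaround is to normalize the case first. A minimal sketch, assuming lowercasing is acceptable for the comparison (the Series is a standalone copy of the sex column):

```python
import pandas as pd

sex = pd.Series(["male", "Female", "female", "Female", "male"])

# Case-sensitive: only the lowercase "female" matches
strict = sex.str.startswith("f")

# Lowercase first, then test the prefix: "Female" and "female" both match
relaxed = sex.str.lower().str.startswith("f")

print(strict.tolist())
print(relaxed.tolist())
```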

    Handling missing values

    df["name"].str.startswith("xiao")
    
    0     True
    1    False
    2      NaN
    3    False
    4    False
    Name: name, dtype: object
    
    df[df["name"].str.startswith("xiao",na=False)]
    

            name   age   sex  address
    0  xiao ming  22.0  male   广东省深圳市

    endswith

    endswith tests whether each string ends with a given suffix. Its signature is:

    Series.str.endswith(pat, na=None)
    
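    • pat: the suffix string to test for (it may be longer than one character); regular expressions are not accepted
    • na: the value to use for missing entries; na=False makes them evaluate to False

    The pat parameter

    # values ending with 市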
    
    df[df["address"].str.endswith("市")]
    

             name   age     sex  address
    0   xiao ming  22.0    male   广东省深圳市
    1  Xiao zhang  19.0  Female   浙江省杭州市
    2         NaN  20.0  female   江苏省苏州市
    3    sun quan  34.0  Female   福建省泉州市
    4     guan yu  39.0    male   广东省广州市

    # the regex equivalent, using contains
    
    df[df["address"].str.contains("市$")]
    

             name   age     sex  address
    0   xiao ming  22.0    male   广东省深圳市
    1  Xiao zhang  19.0  Female   浙江省杭州市
    2         NaN  20.0  female   江苏省苏州市
    3    sun quan  34.0  Female   福建省泉州市
    4     guan yu  39.0    male   广东省广州市

    Handling missing values

    df["name"].str.endswith("g")
    
    0     True
    1     True
    2      NaN
    3    False
    4    False
    Name: name, dtype: object
    
    df[df["name"].str.endswith("g",na=False)]
    

             name   age     sex  address
    0   xiao ming  22.0    male   广东省深圳市
    1  Xiao zhang  19.0  Female   浙江省杭州市

    # without the na parameter, this raises an error
    df[df["name"].str.endswith("g")]

    The cause of the error is clear: the name column contains a missing value, so the mask contains NaN. Passing the na parameter resolves it.
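
The three methods can also be combined into a single mask with the & operator. The sketch below is one possible combination on the sample data, not something taken from the walkthrough above: it keeps rows whose name contains xiao (ignoring case and missing values) and whose address starts with 广 and ends with 市.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["xiao ming", "Xiao zhang", np.nan, "sun quan", "guan yu"],
    "address": ["广东省深圳市", "浙江省杭州市", "江苏省苏州市", "福建省泉州市", "广东省广州市"],
})

# Each condition is a boolean Series; & combines them row by row
mask = (
    df["name"].str.contains("xiao", case=False, na=False)
    & df["address"].str.startswith("广")
    & df["address"].str.endswith("市")
)

result = df[mask]
print(result)
```

Passing na=False on the name condition keeps the combined mask strictly boolean, so it can be used for indexing directly.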

          Article title: Three Secrets of Text Processing in pandas

          Article link: https://www.haomeiwen.com/subject/skxirrtx.html