Web scraping: cleaning HTML tags out of scraped data

Author: format_b1d8 | Published 2021-02-23 09:28

    When scraping data, the strings you pull down often look like this:

    <span><strong>文章标题</strong></span>2432<br>文章内容<br>
    

    Writing a regex to clean this up case by case is a waste of time, and there is a
    ready-made tool for the job: w3lib. w3lib is one of Scrapy's underlying libraries
    for processing HTML, and it is very handy. Here's an example:

    from w3lib.html import remove_tags

    s = '<span><strong>文章标题</strong></span>2432<br>文章内容<br>'
    s1 = remove_tags(s)  # strips all tags, keeps the text between them
    print(s1)
    
    # output: 文章标题2432文章内容
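
    For comparison, the do-it-yourself regex route is workable on a trivial string
    like this one, but a crude pattern breaks on real pages (a stray < inside an
    attribute value or a script body defeats it), which is exactly why a ready-made
    tool is worth reaching for. A minimal sketch of the naive approach:

    import re

    s = '<span><strong>文章标题</strong></span>2432<br>文章内容<br>'
    # crude approach: delete anything that looks like a tag
    print(re.sub(r'<[^>]+>', '', s))  # output: 文章标题2432文章内容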
    

    Pretty handy, right? On top of that, w3lib provides several highly flexible
    functions for string cleaning:

    >>> import w3lib.html
    >>> doc = '<div><p><b>This is a link:</b> <a href="http://www.example.com">example</a></p></div>'

    1. Keep only the tags you specify, removing everything else:
    >>> w3lib.html.remove_tags(doc, keep=('div',))
    '<div>This is a link: example</div>'

    2. Remove only the tags you specify:
    >>> w3lib.html.remove_tags(doc, which_ones=('a','b'))
    '<div><p>This is a link: example</p></div>'

    3. Note that which_ones and keep are mutually exclusive:
    >>> w3lib.html.remove_tags(doc, which_ones=('a',), keep=('p',))
    AssertionError: which_ones and keep can not be given at the same time

    4. Remove tags together with their content:
    >>> w3lib.html.remove_tags_with_content(doc, which_ones=('b',))
    '<div><p> <a href="http://www.example.com">example</a></p></div>'

    5. Replace HTML entities with the characters they stand for:
    >>> w3lib.html.replace_entities(b'Price: &pound;100')
    'Price: \xa3100'
    >>> print(w3lib.html.replace_entities(b'Price: &pound;100'))
    Price: £100

    6. Replace tags with a given string (the empty string by default):
    >>> w3lib.html.replace_tags('This text contains <a>some tag</a>')  # no replacement string given
    'This text contains some tag'
    >>> w3lib.html.replace_tags('<p>Je ne parle pas <b>fran\xe7ais</b></p>', ' -- ', 'latin-1')  # explicit replacement string
    ' -- Je ne parle pas  -- fran\xe7ais --  -- '
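
    Putting a few of these calls together: below is a minimal sketch of a cleaning
    pipeline for a scraped snippet. The clean_html helper and the sample string are
    made up for illustration; only the w3lib functions themselves come from the
    examples above.

    import w3lib.html

    def clean_html(raw):
        # drop <script>/<style> blocks together with their content,
        # strip the remaining tags, then decode HTML entities
        no_scripts = w3lib.html.remove_tags_with_content(raw, which_ones=('script', 'style'))
        no_tags = w3lib.html.remove_tags(no_scripts)
        return w3lib.html.replace_entities(no_tags)

    raw = '<div><script>alert(1)</script><p>Price: &pound;100</p></div>'
    print(clean_html(raw))  # output: Price: £100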
    

    OK, to wrap up the article, here are two demos: one builds a headers dict and one
    builds a cookies dict from raw strings. The code isn't polished, but I hope it helps.

    import re

    def get_headers(raw):  # build a headers dict from a multi-line "Name: value" string
        rule = r"^\s*(.*?): (.*?)$"  # tolerate leading whitespace on each line
        pairs = re.findall(rule, raw, re.M)
        return {name: value for name, value in pairs}

    def get_cookies(raw):  # build a cookies dict from a single-line Cookie header value
        cookies = {}
        for item in raw.split("; "):
            name, value = item.split("=", 1)
            cookies[name] = value
        return cookies
    str1 = "BIDUPSID=6b51b98be47c49814ef72383f1f1caba; PSTM=1584239703; BAIDUID=6b51b98be47c49814ef72383f1f1caba:FG=1; BD_UPN=12314753; sug=3; sugstore=0; ORIGIN=0; bdime=0; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; H_PS_645EC=5ac6k9yR%2FX7%2BUiOSqHiNykeqTR%2F3jZRyJKjiO3aCMOta6xEQADQ56%2F5MXjs; COOKIE_SESSION=50_2_7_0_13_33_0_3_3_4_0_13_66045_1_15_3_1588209306_1588209294_1588209291%7C9%23150672_124_1588209291%7C9; BD_HOME=1; H_PS_PSSID=30968_1456_31169_21094_31421_31342_31270_31464_31229_30824_26350_31163_31472"
    cookies = get_cookies(str1)  # build the cookies dict
    print(cookies)
    headers = ''' 
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
    Accept-Encoding: gzip, deflate, br
    Accept-Language: zh-CN,zh;q=0.9
    Cache-Control: max-age=0
    Connection: keep-alive
    Cookie: BIDUPSID=C47DF21E074A16099BEF8690090B9383; PSTM=1612069966; BAIDUID=0698AE4551B7945FC148930C7E345686:FG=1; BD_UPN=12314753; __yjs_duid=1_ebe1a109087c379bd51efce62f1d479b1612399647102; sug=3; sugstore=0; ORIGIN=0; bdime=0; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; Hm_lvt_aec699bb6442ba076c8981c6dc490771=1612938973,1612948703,1613614880,1613984999; BAIDUID_BFESS=3FD0D09F6FE4606330CA1004CC29A253:FG=1; BD_HOME=1; H_PS_PSSID=33425_33512_33581_33272_31254_33461_33585_26350_33267; delPer=0; BD_CK_SAM=1; PSINO=1; COOKIE_SESSION=893_0_5_2_15_7_0_1_5_3_10_1_907_0_27_0_1614042131_0_1614042104%7C9%23694503_11_1613614966%7C4; H_PS_645EC=296ax7tOf7qn12N2GGGcXG101gzpS%2FzPFgPIsy8Z6pIEP8HKWznC3uf4qXY; BA_HECTOR=8k81040124ag0la07b1g38lvn0r
    Host: www.baidu.com
    Referer: https://www.baidu.com/
    Sec-Fetch-Dest: document
    Sec-Fetch-Mode: navigate
    Sec-Fetch-Site: same-origin
    Sec-Fetch-User: ?1
    Upgrade-Insecure-Requests: 1
    User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36
    '''
    result = get_headers(headers)  # build the headers dict
    print(result)
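
    As a usage note, the two dicts drop straight into a request library. A minimal
    sketch assuming the requests package is installed; since the pasted header block
    already carries its own Cookie line, it is popped here so the explicit cookies
    dict wins:

    import requests

    headers_dict = get_headers(headers)
    headers_dict.pop("Cookie", None)  # avoid sending a stale Cookie header alongside the cookies dict
    response = requests.get("https://www.baidu.com/", headers=headers_dict, cookies=get_cookies(str1))
    print(response.status_code)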
    
