parse_baidu_m_news

Author: 是东东 | Published: 2022-09-01 00:22
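    import json

    from lxml import etree

    # `response` is assumed to be a requests-style response object for the page being parsed.
    text = response.content.decode('utf-8')
    tree = etree.HTML(text)
    # The result data is embedded as JSON in <script id="atom-data-..."> elements.
    script = ''.join(tree.xpath('//script[contains(@id,"atom-data-")]/text()'))
    print(script)  # debug output: inspect the raw JSON
    # If no matching script is found, fall back to an empty object so json.loads does not raise.
    oo = json.loads(script) if script.strip() else {}
    details = (oo.get('data') or {}).get('list') or []
    for detail in details:
        rank = detail.get('index')
        # Try the known URL fields in order; `params` may be missing or null.
        url = detail.get('titleurl') or detail.get('url') or (detail.get('params') or {}).get('originUrl')
        img_url = detail.get('img') or detail.get('imgsrcurl')
        # Default to '' so the HTML parsing and replace() calls below never see None.
        title = detail.get('title') or ''
        desc = detail.get('abstract') or ''
        # Highlighted terms are wrapped in <em>; collect them as keywords.
        # etree.HTML does not yield a usable tree for an empty string, so skip empty fields.
        keywords1 = etree.HTML(title).xpath('//em/text()') if title.strip() else []
        keywords2 = etree.HTML(desc).xpath('//em/text()') if desc.strip() else []
        # Merge and de-duplicate (set() does not preserve order).
        keyword = list(set(keywords1 + keywords2))
        # Strip the highlight tags to get plain text.
        title = title.replace('<em>', '').replace('</em>', '')
        desc = desc.replace('<em>', '').replace('</em>', '')
        press_time = detail.get('posttime')
        subsitename = detail.get('subsitename')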
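
The `<em>` handling in the loop above could also be factored into a small helper. What follows is a minimal sketch, not the author's code: `split_highlight` and `merge_keywords` are hypothetical names, the helper uses a regular expression rather than `lxml`, and it assumes highlights appear only as plain `<em>...</em>` pairs with no attributes or nesting. It tolerates `None` and keeps keywords in first-seen order, which `list(set(...))` does not guarantee.

```python
import re

# Matches a plain <em>...</em> pair; assumes no attributes and no nesting.
EM_RE = re.compile(r'<em>(.*?)</em>', re.S)


def split_highlight(text):
    """Return (plain_text, keywords) for a string with <em> highlights.

    Hypothetical helper: None or empty input yields ('', []).
    """
    if not text:
        return '', []
    # dict.fromkeys drops duplicates while keeping first-seen order.
    keywords = list(dict.fromkeys(EM_RE.findall(text)))
    plain = text.replace('<em>', '').replace('</em>', '')
    return plain, keywords


def merge_keywords(*keyword_lists):
    """Merge keyword lists, dropping duplicates but keeping order."""
    merged = []
    for keywords in keyword_lists:
        merged.extend(keywords)
    return list(dict.fromkeys(merged))


if __name__ == '__main__':
    # Made-up sample strings, not real Baidu data.
    title, kw1 = split_highlight('<em>Python</em> news for <em>Python</em> users')
    desc, kw2 = split_highlight('A digest about <em>lxml</em> and <em>Python</em>')
    print(title)
    print(desc)
    print(merge_keywords(kw1, kw2))
```

Inside the loop, this would replace the two `etree.HTML(...)` calls and the four `replace` calls with `title, keywords1 = split_highlight(detail.get('title'))` and the same for `abstract`.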
    
