【Python Web Scraping】Week 3 Exercise (14)

Author: Doggy米 | Published 2018-01-12 14:24

    I. Extracting text from div tags
    Using XPath, extract and print the text values "第一个div" ("first div") and
    "第二个div" ("second div") from the div tags in the xpath.html file from the course video.
    II. Extracting text from ul tags
    Using XPath, extract and print the text values "流程" ("process"), "xpath学习"
    ("xpath study"), and "流程2" ("process 2") from the ul tags in xpath.html.
    III. Filtering tags
    Using XPath, extract and print the text and hyperlinks of the first three
    a tags under the first div in xpath.html.
    IV. Combining the requests module with lxml & XPath
    Building on the previous lesson's requests material, extract the text and hyperlinks
    of the navigation bar of the Yangguang Movie site (阳光电影网) in structured form.
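The course's xpath.html file is not reproduced in the post, so as a sketch, a hypothetical file with a similar structure (the tag layout below is assumed, not the actual course file) shows what the first two extractions return:

```python
from lxml import etree

# Hypothetical stand-in for the course's xpath.html -- the real file may differ
sample_html = """
<div>第一个div
  <ul>流程
    <li><a href="/link1">xpath学习</a></li>
    <li><a href="/link2">流程2</a></li>
  </ul>
</div>
<div>第二个div</div>
"""

selector = etree.HTML(sample_html)

# Task I: text nodes sitting directly under each div (whitespace stripped here)
div_texts = [t.strip() for t in selector.xpath("//div/text()") if t.strip()]
print(div_texts)  # ['第一个div', '第二个div']

# Task II: text nodes sitting directly under each ul
ul_texts = [t.strip() for t in selector.xpath("//ul/text()") if t.strip()]
print(ul_texts)  # ['流程']
```

Note that `//div/text()` only returns text nodes that are *direct* children of the div; text inside nested tags (the a elements here) is not included, which is why the solution below cleans out the stray whitespace nodes with a helper.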

    import requests
    from lxml import etree


    def clean_data(element_result):
        # Remove the spaces and line breaks left over in extracted text nodes
        return str(element_result).replace(" ", "").replace("\n", "").replace("\r", "")
    
    
    def print_data(elements):
        for element in elements:
            data = clean_data(element)
            if len(data):
                print(data)
    
    
    with open("xpath.html", "r", encoding="utf-8") as html_file:
        html_str = html_file.read()
    
    selector = etree.HTML(html_str)
    div_elements = selector.xpath("//div/text()")
    print_data(div_elements)
    
    ul_elements = selector.xpath("//ul/text()")
    print_data(ul_elements)
    
    # Note: //div[1] matches every div that is the first div child of its parent;
    # use (//div)[1] to select the first div in the whole document
    filter_elements = selector.xpath("//div[1]//a[position()<4]/@href|//div[1]//a[position()<4]/text()")
    print_data(filter_elements)
    
    url = "http://www.ygdy8.com/"
    header_str = '''
    Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
    Accept-Encoding:gzip, deflate
    Accept-Language:zh-CN,zh;q=0.8
    Cache-Control:max-age=0
    Cookie:37cs_pidx=1; 37cs_user=37cs96544059545; UM_distinctid=160e80f56031c9-0c9b01c124c227-6d1b117c-1fa400-160e80f5607f4; CNZZDATA5783118=cnzz_eid%3D2025418817-1515716500-null%26ntime%3D1515716500; 37cs_show=69; cscpvrich4016_fidx=1
    Host:www.ygdy8.com
    If-Modified-Since:Thu, 11 Jan 2018 15:12:16 GMT
    If-None-Match:"0c8cb90ee8ad31:54c"
    Proxy-Connection:keep-alive
    Referer:https://www.google.co.uk/
    Upgrade-Insecure-Requests:1
    User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36
    '''
    header_list = header_str.strip().split('\n')
    # Split on the first ':' only, so values that contain colons (URLs, dates) stay intact
    headers_dict = {line.split(':', 1)[0].strip(): line.split(':', 1)[1].strip() for line in header_list}
    # headers must be passed by keyword; the second positional argument of get() is params
    req = requests.get(url, headers=headers_dict)
    req.encoding = "gb2312"
    selector = etree.HTML(req.text)
    print(req.text)  # debug output: dump the full page source
    data_elements = selector.xpath("//div[@id = 'menu']//a/@href|//div[@id = 'menu']//a/text()")
    print_data(data_elements)
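A note on the header parsing above: splitting each line on every ':' would truncate header values that themselves contain colons, such as `Referer:https://www.google.co.uk/` or `If-Modified-Since:Thu, 11 Jan 2018 15:12:16 GMT`. Splitting on the first ':' only keeps those values whole:

```python
line = "Referer:https://www.google.co.uk/"

# Naive split on every ':' loses everything after the second colon
naive = line.split(':')
print(naive[1])  # 'https' -- the URL is mangled

# maxsplit=1 keeps the value intact
key, value = line.split(':', 1)
print(key, value)  # Referer https://www.google.co.uk/
```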
    

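The union XPath used for task IV returns hrefs and texts interleaved in document order, which makes them awkward to pair up afterwards. An alternative sketch (using a made-up menu snippet, since the site's real markup isn't shown here) iterates over the a elements themselves so each link's text and href stay together:

```python
from lxml import etree

# Hypothetical menu markup standing in for the site's navigation bar
menu_html = """
<div id="menu">
  <a href="/index.html">首页</a>
  <a href="/gndy.html">最新影片</a>
</div>
"""

selector = etree.HTML(menu_html)
# Select the a elements once, then read text and href off each element
links = [(a.text, a.get("href")) for a in selector.xpath("//div[@id='menu']//a")]
for text, href in links:
    print(text, href)
```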
        Original title: 【Python Web Scraping】Week 3 Exercise (14)

        Original link: https://www.haomeiwen.com/subject/irwxoxtx.html