2019-06-24—XPATH

Author: ElfACCC | Published: 2019-06-24 16:29

    ProxyHandler: using a proxy IP
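A minimal sketch of how urllib's ProxyHandler is usually wired up. The proxy address 127.0.0.1:8080 and the httpbin.org URL are placeholder assumptions, not taken from these notes. The opener is only built here; no request is sent.

```python
from urllib import request

# Placeholder proxy address (an assumption); replace it with a real proxy.
PROXY = {"http": "http://127.0.0.1:8080"}

# ProxyHandler takes a dict mapping URL scheme to proxy URL.
handler = request.ProxyHandler(PROXY)

# build_opener returns an OpenerDirector that routes requests through the handler.
opener = request.build_opener(handler)

# Sending a request through the proxy would then look like this:
# response = opener.open("http://httpbin.org/ip")
# print(response.read().decode("utf-8"))
```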

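    Difference between response.text and response.content

    response.content is the raw response body as bytes; response.text is the same body decoded to str using the encoding that requests guesses.
    Decoding: decode turns bytes into str. Bytes must be decoded to a string before they are written to a text file.
    Encoding: encode turns str into bytes.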
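The round trip between str and bytes can be sketched as follows. The sample string and the temporary file name are illustrative assumptions.

```python
import os
import tempfile

text = "电影"                       # a str

data = text.encode("utf-8")         # encode: str -> bytes
restored = data.decode("utf-8")     # decode: bytes -> str

print(type(data).__name__, type(restored).__name__)

# A file opened in text mode accepts str only, so bytes are decoded first.
path = os.path.join(tempfile.gettempdir(), "decode_demo.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(restored)
```

With requests, response.content plays the role of `data` and response.text the role of `restored`.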


    Sending a POST request
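A small sketch of a form POST with requests. The endpoint and form fields are hypothetical. The request is prepared but not sent, so the body and headers can be inspected offline.

```python
import requests

# Hypothetical endpoint and form fields, used only for illustration.
url = "https://httpbin.org/post"
form = {"username": "demo", "password": "secret"}

# Build the request without sending it, to see what would go over the wire.
prepared = requests.Request("POST", url, data=form).prepare()
print(prepared.method)
print(prepared.headers["Content-Type"])
print(prepared.body)

# Sending it for real would be:
# response = requests.post(url, data=form)
# print(response.text)
```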

    Using a proxy
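With requests, a proxy is passed through the `proxies` argument. This is a sketch only: the proxy address is a placeholder assumption and `fetch_via_proxy` is a hypothetical helper name.

```python
import requests

# Placeholder proxy (an assumption); requests expects a scheme -> proxy URL mapping.
proxies = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}


def fetch_via_proxy(url, proxies, timeout=5):
    """Fetch url through the given proxies and return the body as str."""
    response = requests.get(url, proxies=proxies, timeout=timeout)
    return response.text


# Example call (needs a working proxy, so it is left commented out):
# print(fetch_via_proxy("http://httpbin.org/ip", proxies))
```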

    cookie

    session
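A sketch of how a requests Session carries cookies from one request to the next. The cookie name, value, and URL are hypothetical; in practice the cookie would usually be set by a login response. The request is prepared but not sent.

```python
import requests

session = requests.Session()

# Cookies stored on the session are attached to later requests automatically.
session.cookies.set("token", "abc123")   # hypothetical cookie, for illustration

# Prepare a request through the session to inspect the Cookie header offline.
req = requests.Request("GET", "https://example.com/profile")
prepared = session.prepare_request(req)
print(prepared.headers.get("Cookie"))
```

A plain `requests.get` call starts with an empty cookie jar each time, which is why a Session is the usual choice after logging in.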

    Untrusted certificates

    With verify set to False, requests no longer raises an error for the untrusted certificate.
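A minimal sketch of skipping certificate verification, assuming the setting meant above is the `verify` parameter of requests. The host name in the commented call is hypothetical.

```python
import requests
import urllib3

# Skipping verification makes requests emit InsecureRequestWarning; silence it explicitly.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

session = requests.Session()
session.verify = False   # requests made through this session skip certificate checks

# A one-off request can pass the flag directly instead:
# response = requests.get("https://self-signed.example.com/", verify=False)
# print(response.status_code)
```

This trades safety for convenience, so it is best kept to sites you already trust.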


    XPath and lxml
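A short sketch of basic XPath queries with lxml. The HTML is hand-written for illustration and is not real Douban markup.

```python
from lxml import etree

# A small hand-written page (an assumption for illustration).
text = """
<html><body>
  <ul class="lists">
    <li data-title="Movie A" data-score="8.1"><a href="/a"><img src="a.jpg"/></a></li>
    <li data-title="Movie B" data-score="7.4"><a href="/b"><img src="b.jpg"/></a></li>
  </ul>
</body></html>
"""

html = etree.HTML(text)                       # parse a str into an element tree

lis = html.xpath("//ul[@class='lists']/li")   # elements selected with a predicate
titles = html.xpath("//li/@data-title")       # attribute values
first_link = html.xpath("//li[1]/a/@href")    # XPath positions start at 1

print(len(lis), titles, first_link)
```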

    Scraping the movies now showing on Douban

    Full code

    import requests
    from lxml import etree

    # Browser-like headers, so the request looks like it comes from a browser.
    headers = {
        'Referer': 'https://movie.douban.com/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
    }

    url = 'https://movie.douban.com/cinema/nowplaying/hangzhou/'

    response = requests.get(url, headers=headers)
    text = response.content.decode('utf-8')  # bytes -> str

    html = etree.HTML(text)
    # Take the first ul whose class is 'lists', then its direct li children.
    ul = html.xpath("//ul[@class='lists']")[0]
    lis = ul.xpath("./li")
    movies = []

    for li in lis:
        # xpath() always returns a list, so [0] picks out the string.
        title = li.xpath("@data-title")[0]
        score = li.xpath("@data-score")[0]
        region = li.xpath("@data-region")[0]
        director = li.xpath("@data-director")[0]
        actors = li.xpath("@data-actors")[0]
        # The img is nested deeper inside the li, hence .// rather than ./
        poster = li.xpath(".//img/@src")[0]
        movie = {
            'title': title,
            'score': score,
            'region': region,
            'director': director,
            'actors': actors,
            'poster': poster
        }

        movies.append(movie)

    print(movies)
    

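    Where I went wrong: dictionary keys must be quoted. To select the direct children of the current element, use ./ and not .// (which matches descendants at any depth). xpath always returns a list, so append [0] to get the string.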
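The note above on ./ versus .// and on list results can be seen in a small sketch. The HTML fragment is hand-written for illustration.

```python
from lxml import etree

# Hand-written fragment (an assumption): the img sits inside a div, not directly in the li.
html = etree.HTML("""
<ul class="lists">
  <li data-title="Movie A"><div><img src="a.jpg"/></div></li>
</ul>
""")

li = html.xpath("//ul[@class='lists']/li")[0]

direct = li.xpath("./img/@src")     # img as a direct child of li: no match here
nested = li.xpath(".//img/@src")    # img anywhere below li: matches
title = li.xpath("@data-title")     # still a list, even for a single attribute

print(direct, nested, title, title[0])
```

Because an empty result is an empty list, indexing with [0] raises IndexError when nothing matches, so a wrong path often shows up as that error rather than as an XPath error.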


    Dianying Tiantang (dytt8.net) scraper

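    Optimized version:
    map with a lambda (an anonymous function) applies that function to every element of the list passed after it.

    Global variables are named in uppercase.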
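A sketch of map with a lambda, using an uppercase module-level constant. The relative paths are hypothetical stand-ins for hrefs scraped from a list page.

```python
MAINNET = "https://www.dytt8.net"   # module-level constant, hence uppercase

# Hypothetical relative links, standing in for scraped hrefs.
paths = ["/html/a.html", "/html/b.html"]

# map applies the lambda to each element and returns a lazy iterator.
detail_urls = map(lambda path: MAINNET + path, paths)
detail_urls = list(detail_urls)     # materialise it so it can be indexed or reused

# The equivalent list comprehension:
same_urls = [MAINNET + path for path in paths]

print(detail_urls)
```

The iterator returned by map can only be consumed once, which matters if the URLs are looped over more than one time.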

    Getting the detail page


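    Errors I hit with xpath

    1. lxml.etree.XPathEvalError: Unfinished literal
      The class value was written incorrectly (a quote was left unclosed).
    2. Python Xpath: lxml.etree.XPathEvalError: Invalid predicate
      The class predicate was missing its closing ].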
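Both mistakes can be triggered on purpose with a tiny document. This is a sketch on hand-written HTML; the expressions are illustrative and each one is broken in exactly one way.

```python
from lxml import etree

html = etree.HTML("<div class='title_all'><font>demo</font></div>")

broken = [
    "//div[@class='title_all]",    # closing quote missing
    "//div[@class='title_all'",    # closing ] missing
]

errors = []
for expr in broken:
    try:
        html.xpath(expr)
    except etree.XPathEvalError as exc:
        # Collect the message so the two failures can be compared.
        errors.append(str(exc))

print(errors)
```

Checking that quotes and brackets are balanced in the expression is usually the quickest way to clear either error.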

    Incomplete code

    import requests
    from lxml import etree

    HEADERS = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
    }

    MAINNET = 'https://www.dytt8.net'


    def get_detail_url(url):
        """Return the absolute detail-page URLs found on one list page."""
        response = requests.get(url, headers=HEADERS)
        text = response.text
        html = etree.HTML(text)
        detail_urls = html.xpath("//div[@class='co_content8']//table//a/@href")
        # The hrefs are relative, so prefix each one with the site root.
        detail_urls = map(lambda path: MAINNET + path, detail_urls)
        return detail_urls


    def handle_detail_url(detail_url):
        """Extract the title and download link from one detail page."""
        movie = {}
        response = requests.get(detail_url, headers=HEADERS)
        text = response.content.decode('gbk')  # bytes -> str, decoded as GBK
        html = etree.HTML(text)
        title = html.xpath("//div[@class='title_all']//font[@color='#07519a']/text()")[0]
        movie['title'] = title
        download = html.xpath("//td[@bgcolor='#fdfddf']/a/@href")[0]
        movie['download'] = download
        return movie


    def spider():
        base_url = "https://www.dytt8.net/html/gndy/dyzz/list_23_{}.html"
        movies = []
        for x in range(1, 3):
            url = base_url.format(x)
            detail_urls = get_detail_url(url)
            for detail_url in detail_urls:
                movie = handle_detail_url(detail_url)
                movies.append(movie)
                print(movie)
                break  # debugging: stop after the first movie
            break  # debugging: stop after the first list page
        return movies


    if __name__ == "__main__":
        spider()
    

    A few things are still odd: I cannot get the thunder link, and the URLs I get from the href will not open either.

    Link to this article: https://www.haomeiwen.com/subject/khenqctx.html