2019-11-14 Scraping Famous Quotes

Author: 一只失去梦想的程序猿 | Published 2019-11-14 19:29

Target site: https://www.geyanw.com/

First, fetch the URLs of the main categories:

    import requests
    from lxml import html

    url = 'https://www.geyanw.com/'
    response = requests.get(url).text
    selector = html.fromstring(response)
    # Grab the href of each top-level category from the left-hand panel
    myArr = selector.xpath('//*[@id="p_left"]/div/dl/dt/strong/a/@href')
    print(len(myArr))
    for detail in myArr:
        arrUrl = url + detail
        print(arrUrl)
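The category paths are joined to the base URL by plain string concatenation; if a href happens to be site-relative (starting with `/`), that produces a doubled slash. `urllib.parse.urljoin` handles both forms cleanly. A small sketch (the example hrefs are hypothetical, not taken from the site):

```python
from urllib.parse import urljoin

base = 'https://www.geyanw.com/'
# Hypothetical category hrefs in the two shapes a site might return
for href in ['lizhimingyan/', '/renshenggeyan/']:
    # urljoin resolves either form to a single clean absolute URL
    print(urljoin(base, href))
```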
    

This yields nine category URLs. Handle each category separately:

    def getDetailUrl(arrUrl):
        response = requests.get(arrUrl).text
        selector = html.fromstring(response)
        # The third pager item links to page 2, e.g. 'list_2.html'
        page_two = selector.xpath('//*[@id="p_left"]/div/ul[2]/li[3]/a/@href')[0]
        print(page_two)
        page = 1
        while True:
            # Strip the trailing '2.html' (6 characters) and substitute the page number
            detailUrl = arrUrl + page_two[:-6] + '%s.html' % page
            print(detailUrl)
            page += 1
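The `page_two[:-6]` slice relies on the page-2 link ending in `2.html` (six characters); stripping those leaves the list prefix, and the page number is substituted back in. A quick check with a sample href (the filename shape is an assumption about the site's pager):

```python
# Assumed shape of the pager href, e.g. 'list_2.html'
page_two = 'list_2.html'
prefix = page_two[:-6]  # drops the 6-char suffix '2.html', leaving 'list_'
for page in range(1, 4):
    print(prefix + '%s.html' % page)
```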
    

Inside the loop, fetch each page and stop once a page returns no articles:

            response = requests.get(detailUrl).text
            selector = html.fromstring(response)
            # Article links on the current page; an empty list means we ran past the last page
            detailList = selector.xpath('//*[@id="p_left"]/div/ul[1]/li/h2/a/@href')
            print(len(detailList))
            if len(detailList) == 0:
                break
    

Extract the quotes from each detail page:

            for articleUrl in detailList:
                print(articleUrl)
                response = requests.get(url + articleUrl)
                # The site serves GB2312-encoded pages; set the encoding before reading .text
                response.encoding = 'gb2312'
                selector = html.fromstring(response.text)
                # Each quote is a <p> inside the article body
                P_element = selector.xpath('//*[@id="p_left"]/div[1]/div[4]/p')
                print(len(P_element))
                for p in P_element:
                    print(p.text)
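Setting `response.encoding = 'gb2312'` before reading `.text` matters: the server sends GB2312 bytes, and decoding them with the wrong codec would garble the Chinese text. A standalone round trip with Python's built-in codec (the sample string is my own, not from the site):

```python
# Encode a sample quote to GB2312 bytes, as the server would transmit them
raw = '名人名言'.encode('gb2312')
print(raw)
# Decoding with the matching codec recovers the original text
print(raw.decode('gb2312'))
```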
    

Result: each quote prints to the console, one per line (screenshot omitted).

Complete code: https://github.com/Liangjianghao/everyDay_spider.git mingyan_11-14

Original link: https://www.haomeiwen.com/subject/jmfdictx.html