Code Collection for "Python Web Scraping: A Hands-On Introduction for Beginners" (《Python 爬虫:零基础全实战入门》)

Author: DC学院 | Published 2017-11-21 11:38 · 2,366 reads

    DC学院's systematic course "Python Web Scraping: Beginner to Advanced" (《Python 爬虫:入门+进阶》)
      

    Lesson 1: Downloading the Baidu homepage

    import requests

    # Fetch the Baidu homepage and decode it as UTF-8 before printing the HTML
    data = requests.get('https://www.baidu.com/')
    data.encoding = 'utf-8'

    print(data.text)
    

      

    Lesson 2: Scraping Douban movie pages with Requests + XPath

    1. Extracting a single piece of information

    import requests
    from lxml import etree

    url = 'https://movie.douban.com/subject/1292052/'
    data = requests.get(url).text
    s = etree.HTML(data)

    # text() selects the text of the matched node; the result is a list
    film = s.xpath('//*[@id="content"]/h1/span[1]/text()')
    print(film)
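A point worth noting before moving on: `xpath()` always returns a list, even when the selector matches a single node, so index `[0]` only after confirming the list is non-empty. A minimal sketch on an inline HTML snippet (the snippet is a made-up stand-in for the Douban page):

```python
from lxml import etree

# A made-up snippet shaped like the Douban title markup
html = '<div id="content"><h1><span>肖申克的救赎</span></h1></div>'
s = etree.HTML(html)

# text() selects text nodes; the result is always a list
titles = s.xpath('//*[@id="content"]/h1/span[1]/text()')
print(titles)   # ['肖申克的救赎']

# Index [0] only after checking the list is non-empty
title = titles[0] if titles else None
print(title)    # 肖申克的救赎
```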
    

    2. Extracting multiple pieces of information

    import requests
    from lxml import etree

    url = 'https://movie.douban.com/subject/1292052/'
    data = requests.get(url).text
    s = etree.HTML(data)

    # Each xpath() call returns a list of matches
    film = s.xpath('//*[@id="content"]/h1/span[1]/text()')
    director = s.xpath('//*[@id="info"]/span[1]/span[2]/a/text()')
    actor = s.xpath('//*[@id="info"]/span[3]/span[2]/a/text()')
    time = s.xpath('//*[@id="info"]/span[13]/text()')

    print('Title:', film)
    print('Director:', director)
    print('Cast:', actor)
    print('Runtime:', time)
    

      

    Lesson 4: Scraping the Douban Books Top 250

    from lxml import etree
    import requests
    import time

    # The Top 250 list spans 10 pages, 25 books per page
    for a in range(10):
        url = 'https://book.douban.com/top250?start={}'.format(a * 25)
        data = requests.get(url).text

        s = etree.HTML(data)
        books = s.xpath('//*[@id="content"]/div/div[1]/div/table')
        time.sleep(3)   # pause between pages to avoid hammering the server

        for div in books:
            title = div.xpath("./tr/td[2]/div[1]/a/@title")[0]
            href = div.xpath("./tr/td[2]/div[1]/a/@href")[0]
            score = div.xpath("./tr/td[2]/div[2]/span[2]/text()")[0]
            # The rating count reads like "( 123456人评价 )" with extra whitespace
            num = div.xpath("./tr/td[2]/div[2]/span[3]/text()")[0].strip("(").strip().strip(")").strip()
            scrible = div.xpath("./tr/td[2]/p[2]/span/text()")

            # Not every book has a one-line blurb
            if len(scrible) > 0:
                print("{},{},{},{},{}\n".format(title, href, score, num, scrible[0]))
            else:
                print("{},{},{},{}\n".format(title, href, score, num))
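The chained `strip()` calls on the rating count look redundant, but each does work: the text node contains newlines and spaces inside the parentheses, so stripping only the parentheses would leave whitespace behind. A sketch with a made-up sample string of that shape:

```python
# A made-up sample shaped like Douban's rating-count text node
raw = '(\n                123456人评价\n            )'

# Alternately strip parentheses and whitespace from both ends
num = raw.strip("(").strip().strip(")").strip()
print(num)   # 123456人评价
```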
    

      

    Lesson 5: Scraping Xiaozhu short-term rental listings

    from lxml import etree
    import requests
    import time

    # Listing pages 1 through 5
    for a in range(1, 6):
        url = 'http://cd.xiaozhu.com/search-duanzufang-p{}-0/'.format(a)
        data = requests.get(url).text

        s = etree.HTML(data)
        listings = s.xpath('//*[@id="page_list"]/ul/li')
        time.sleep(3)   # pause between pages

        for div in listings:
            title = div.xpath("./div[2]/div/a/span/text()")[0]
            price = div.xpath("./div[2]/span[1]/i/text()")[0]
            scrible = div.xpath("./div[2]/div/em/text()")[0].strip()
            pic = div.xpath("./a/img/@lazy_src")[0]   # real image URL on this lazily loaded page

            print("{}   {}   {}   {}\n".format(title, price, scrible, pic))
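Note the `@lazy_src` selector: on lazily loaded pages the real image URL often lives in a custom attribute rather than `src`, and `@name` in an XPath extracts attribute values instead of text. A sketch on a made-up listing snippet (the markup and URL are stand-ins):

```python
from lxml import etree

# A made-up snippet shaped like one listing entry
html = '''<ul id="page_list"><li>
  <a href="/fangzi/1.html"><img lazy_src="https://img.example.com/room1.jpg" src="placeholder.gif"/></a>
</li></ul>'''
s = etree.HTML(html)

for li in s.xpath('//*[@id="page_list"]/li'):
    # @lazy_src selects the attribute value, not the element text
    pic = li.xpath('./a/img/@lazy_src')[0]
    print(pic)   # https://img.example.com/room1.jpg
```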
    

      

    Lesson 6: Saving the scraped data locally

    1. Saving the Xiaozhu data

    from lxml import etree
    import requests
    import time

    # Write one CSV line per listing (adjust the path for your machine)
    with open('/Users/mac/Desktop/xiaozhu.csv', 'w', encoding='utf-8') as f:
        for a in range(1, 6):
            url = 'http://cd.xiaozhu.com/search-duanzufang-p{}-0/'.format(a)
            data = requests.get(url).text

            s = etree.HTML(data)
            listings = s.xpath('//*[@id="page_list"]/ul/li')
            time.sleep(3)

            for div in listings:
                title = div.xpath("./div[2]/div/a/span/text()")[0]
                price = div.xpath("./div[2]/span[1]/i/text()")[0]
                scrible = div.xpath("./div[2]/div/em/text()")[0].strip()
                pic = div.xpath("./a/img/@lazy_src")[0]

                f.write("{},{},{},{}\n".format(title, price, scrible, pic))
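One caveat with hand-rolled `"{},{},...\n"` lines: if a title or description itself contains a comma, the columns shift. The standard-library `csv` module quotes such fields automatically; a sketch writing to an in-memory buffer (the field values are made up):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)

# A made-up row whose first field contains a comma; csv quotes it for us
writer.writerow(['漂亮的房子, 近地铁', '128', '4.9', 'https://img.example.com/room.jpg'])
line = buf.getvalue()
print(line)   # "漂亮的房子, 近地铁",128,4.9,https://img.example.com/room.jpg
```

Passing a real file object opened with `newline=''` instead of the `StringIO` buffer gives the same protection when writing to disk.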
    

    2. Saving the Douban Books Top 250 data

    from lxml import etree
    import requests
    import time

    # Same scrape as Lesson 4, but writing CSV lines instead of printing
    with open('/Users/mac/Desktop/top250.csv', 'w', encoding='utf-8') as f:
        for a in range(10):
            url = 'https://book.douban.com/top250?start={}'.format(a * 25)
            data = requests.get(url).text

            s = etree.HTML(data)
            books = s.xpath('//*[@id="content"]/div/div[1]/div/table')
            time.sleep(3)

            for div in books:
                title = div.xpath("./tr/td[2]/div[1]/a/@title")[0]
                href = div.xpath("./tr/td[2]/div[1]/a/@href")[0]
                score = div.xpath("./tr/td[2]/div[2]/span[2]/text()")[0]
                num = div.xpath("./tr/td[2]/div[2]/span[3]/text()")[0].strip("(").strip().strip(")").strip()
                scrible = div.xpath("./tr/td[2]/p[2]/span/text()")

                if len(scrible) > 0:
                    f.write("{},{},{},{},{}\n".format(title, href, score, num, scrible[0]))
                else:
                    f.write("{},{},{},{}\n".format(title, href, score, num))
    

      

    Lesson 7: Scraping Douban movies by category and handling dynamically loaded pages

    import requests
    import time

    for a in range(3):
        url_visit = 'https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=&start={}'.format(a * 20)
        # Unlike the earlier lessons, this endpoint returns JSON rather than HTML
        res = requests.get(url_visit).json()
        time.sleep(2)

        for i in range(20):
            movie = res['data'][i]   # the i-th movie under the 'data' key
            urlname = movie['url']
            title = movie['title']
            rate = movie['rate']
            cast = movie['casts']

            print('{}  {}  {}  {}\n'.format(title, rate, '  '.join(cast), urlname))
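The loop above assumes a particular response shape: a top-level 'data' key holding a list of movie dicts. A sketch against a trimmed, made-up sample of that shape (the real endpoint returns 20 entries per page):

```python
import json

# A made-up, trimmed sample of the JSON shape the endpoint returns
sample = '''{"data": [
  {"title": "肖申克的救赎", "rate": "9.7",
   "casts": ["蒂姆·罗宾斯", "摩根·弗里曼"],
   "url": "https://movie.douban.com/subject/1292052/"}
]}'''

payload = json.loads(sample)
for movie in payload['data']:
    # 'casts' is a list of names, so it is joined before formatting
    print('{}  {}  {}  {}'.format(movie['title'], movie['rate'],
                                  '  '.join(movie['casts']), movie['url']))
```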
    

      
      

    DC学院: "Python Web Scraping: Beginner to Advanced" (《Python爬虫:入门+进阶》)

    An efficient learning path
    Start from concrete cases and pick up each topic through hands-on practice

    Study materials with every lesson
    Curated materials so you can focus on practice instead of hunting for and filtering resources

    Many cases covering mainstream sites
    Dozens of site case studies, including Zhihu, Taobao, Weibo, Qunar, and 58.com

    Multiple anti-anti-scraping skills
    Handle IP limits, dynamically loaded pages, CAPTCHAs, and other anti-scraping measures

    Advanced distributed scraping
    Master distributed techniques, build a scraping framework, and do large-scale crawling with database storage

      



        Original title: Code Collection for "Python Web Scraping: A Hands-On Introduction for Beginners" (《Python 爬虫:零基础全实战入门》)

        Original link: https://www.haomeiwen.com/subject/arpivxtx.html