Striking While the Iron Is Hot: Scraping the Sunshine Movie Site

Author: 谁占了我的一年的称号 | Published 2017-04-09 15:25

My girlfriend has been watching movies every night lately, ╮(╯▽╰)╭, leaving me to amuse myself. Hmph, it's just movies, right? I'll give you a whole library!

So I got right to work. A few days ago, with 程老哥's guidance, I finally understood how data gets passed along when crawling multi-level pages. Today's pick, the Sunshine Movie site (http://www.ygdy8.com/), has exactly that structure. I went with my favorite category, European and American films.
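As a quick illustration of that hand-off (the spider name, URLs, and field names here are made up for the sketch, not taken from the real project below): the list-page callback attaches a half-filled item to the request through `meta`, and the detail-page callback picks it back up from `response.meta` and finishes it.

    import scrapy


    class DemoSpider(scrapy.Spider):
        name = "demo"
        start_urls = ["http://example.com/list"]

        def parse(self, response):
            # Level 1: collect what the list page knows about the item.
            item = {"title": "a title scraped from the list page"}
            # Attach it to the request; Scrapy carries it to the next level.
            yield scrapy.Request(
                url="http://example.com/detail",
                meta={"item": item},
                callback=self.parse_detail,
            )

        def parse_detail(self, response):
            # Level 2: the same dict comes back out of response.meta.
            item = response.meta["item"]
            item["detail"] = "data only the detail page has"
            yield item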

Heh heh heh.

The starting page looks like this: http://www.ygdy8.com/html/gndy/oumei/list_7_1.html

Enough talk; here's the spider code first:

    # -*- coding: utf-8 -*-
    import scrapy
    from yangguang.items import YangguangItem


    # A plain Spider is enough here, since parse() is defined by hand
    # (CrawlSpider is only useful when you rely on its Rule machinery).
    class Ygdy8ComSpider(scrapy.Spider):
        name = "ygdy8.com"
        allowed_domains = ["ygdy8.com"]
        start_urls = ['http://www.ygdy8.com/html/gndy/oumei/list_7_1.html']

        def parse(self, response):
            print(response.url)
            # Each movie entry on the list page: the second <a> holds the
            # detail-page link and the movie title.
            infos = response.xpath('//table[@border="0"]/tr[2]/td[2]/b/a[2]')
            for info in infos:
                item = YangguangItem()
                next_page_link = info.xpath('@href')[0].extract()
                next_page_name = info.xpath('text()')[0].extract()
                # The href is relative, so the http:// prefix must be added
                # here, otherwise the request errors out.
                full_page_link = 'http://www.ygdy8.com' + next_page_link
                item['next_page_name'] = next_page_name
                item['full_page_link'] = full_page_link
                # Same trick as before: pass the detail-page URL (and the
                # half-filled item) along to the detail-page callback.
                yield scrapy.Request(url=full_page_link, meta={'item_1': item},
                                     callback=self.parse_page)
            # Build the requests for the remaining list pages.
            for i in range(2, 164):
                url = 'http://www.ygdy8.com/html/gndy/oumei/list_7_%s.html' % i
                yield scrapy.Request(url, callback=self.parse)

        def parse_page(self, response):
            # Parses the detail page whose URL was handed down via meta.
            items = response.meta['item_1']
            item = YangguangItem()
            dy_link = response.xpath('//table[@border="0"]/tbody/tr/td/a/@href').extract()
            # extract() returns a list; join it into one string so the
            # pipeline can store it in a single MySQL column.
            item['dy_link'] = ','.join(dy_link)
            item['next_page_name'] = items['next_page_name']
            item['full_page_link'] = items['full_page_link']
            print(item)
            yield item
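A small aside: instead of hard-coding the host, Scrapy's `response.urljoin()` builds the absolute URL straight from the relative href, which sidesteps the missing-`http://` error noted in the comment above:

    # Drop-in replacement for 'http://www.ygdy8.com' + next_page_link;
    # the scheme and host are taken from the page that was just fetched.
    full_page_link = response.urljoin(next_page_link)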
    

items.py

    import scrapy


    class YangguangItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        full_page_link = scrapy.Field()   # absolute URL of the movie's detail page
        dy_link = scrapy.Field()          # download link(s) found on the detail page
        next_page_name = scrapy.Field()   # movie title as shown on the list page
    

pipelines.py

    import pymysql


    def dbHandle():
        # Open a MySQL connection; replace passwd with your own password.
        conn = pymysql.connect(
            host="localhost",
            user="root",
            passwd="password",  # placeholder
            charset="utf8",
            use_unicode=False
        )
        return conn


    class YangguangPipeline(object):
        def process_item(self, item, spider):
            dbObject = dbHandle()
            cursor = dbObject.cursor()
            sql = "insert into ygdy.dy(dy_link,next_page_name,full_page_link) values (%s,%s,%s)"
            try:
                cursor.execute(sql, (item['dy_link'], item['next_page_name'], item['full_page_link']))
                dbObject.commit()
            except BaseException as e:
                print("error here >>>>", e, "<<<<<< error here")
                dbObject.rollback()
            return item
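One thing the pipeline takes for granted is that the `ygdy` database and its `dy` table already exist. Here is a one-time setup sketch, run separately; the column types and lengths are my assumption, not something from the original post:

    import pymysql

    # One-time setup for the schema the pipeline writes into.
    conn = pymysql.connect(host="localhost", user="root",
                           passwd="password",  # placeholder, as above
                           charset="utf8")
    cur = conn.cursor()
    cur.execute("CREATE DATABASE IF NOT EXISTS ygdy DEFAULT CHARACTER SET utf8")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS ygdy.dy (
            id INT AUTO_INCREMENT PRIMARY KEY,
            dy_link TEXT,                  -- comma-joined download links
            next_page_name VARCHAR(255),   -- movie title
            full_page_link VARCHAR(255)    -- detail-page URL
        ) DEFAULT CHARSET=utf8
    """)
    conn.commit()
    conn.close()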
    

settings.py

At first nothing was getting saved to the database. After asking 程老哥, the culprit turned out to be this block:

    ITEM_PIPELINES = {
       'yangguang.pipelines.YangguangPipeline': 300,
    }

It is commented out by default, so you have to enable it in settings.py.
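With the pipeline enabled, running `scrapy crawl ygdy8.com` from the project root starts the whole thing off (the command takes the spider's `name` attribute).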
    
Finally, a look at what got stored:
    
![Rows stored in MySQL](https://img.haomeiwen.com/i4324326/f4be3cfed4d4663d.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
    
Honey, come ask me for movies now...
![Screenshot](https://img.haomeiwen.com/i4324326/da0013c37346bd4d.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
