Scraping Dangdang Data

Author: whong736 | Published 2018-02-26 08:26

    Goal: practice scraping book data for a specific keyword from the Dangdang website and store the scraped data in a MySQL database.

    1. Create a new Scrapy project:

    scrapy startproject dd
    

    2. cd into the project directory:

    cd dd
    

    3. Create the Dangdang spider using the basic spider template:

    scrapy genspider -t basic dd_spider dangdang.com
    
    
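    The basic template should generate a spider skeleton in dd/spiders/dd_spider.py roughly like the following (the exact contents may vary slightly with the Scrapy version):

    # -*- coding: utf-8 -*-
    import scrapy
    
    
    class DdSpiderSpider(scrapy.Spider):
        name = 'dd_spider'
        allowed_domains = ['dangdang.com']
        start_urls = ['http://dangdang.com/']
    
        def parse(self, response):
            pass
    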

    4. Open the dd project in PyCharm.



    5. Open Dangdang, search for books with a specific keyword, and analyze the result page to decide which fields to scrape. Then define those fields in items.py:

    # -*- coding: utf-8 -*-
    
    import scrapy
    
    class DdItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
    
        title = scrapy.Field()
        link = scrapy.Field()
        now_price = scrapy.Field()
        comment_num = scrapy.Field()
        detail = scrapy.Field()
        
    
    

    6. Open the spider file, import the DdItem class just defined, and change the start URL to the search page:

    from dd.items import DdItem
    
    

    Create the item and fill in its fields from the search-result page:

            item = DdItem()
            item["title"] = response.xpath("//p[@class='name']/a/@title").extract()
            item["link"] = response.xpath("//p[@class='name']/a/@href").extract()
            item["now_price"] = response.xpath("//p[@class='price']/span[@class='search_now_price']/text()").extract()
            item["comment_num"] = response.xpath("//p/a[@class='search_comment_num']/text()").extract()
            item["detail"] = response.xpath("//p[@class='detail']/text()").extract()
            yield item
    

    Define the loop that requests the remaining result pages (note that callback takes the method object self.parse, not a call to it):

            for i in range(2,27):
                url = "http://search.dangdang.com/?key=python&act=input&page_index="+str(i)
                yield Request(url, callback=self.parse)
    
    

    Full code:

    # -*- coding: utf-8 -*-
    import scrapy
    from dd.items import DdItem
    from scrapy.http import Request
    
    class DdSpiderSpider(scrapy.Spider):
        name = 'dd_spider'
        allowed_domains = ['dangdang.com']
        start_urls = ['http://search.dangdang.com/?key=python&act=input&page_index=1']
    
        def parse(self, response):
            item = DdItem()
            item["title"] = response.xpath("//p[@class='name']/a/@title").extract()
            item["link"] = response.xpath("//p[@class='name']/a/@href").extract()
            item["now_price"] = response.xpath("//p[@class='price']/span[@class='search_now_price']/text()").extract()
            item["comment_num"] = response.xpath("//p/a[@class='search_comment_num']/text()").extract()
            item["detail"] = response.xpath("//p[@class='detail']/text()").extract()
            yield item
    
            # Queue up result pages 2 to 26, reusing the same parse callback
            for i in range(2,27):
                url = "http://search.dangdang.com/?key=python&act=input&page_index="+str(i)
                yield Request(url, callback=self.parse)
    
    

    7. In settings.py, uncomment the ITEM_PIPELINES setting and set ROBOTSTXT_OBEY to False:

    ITEM_PIPELINES = {
       'dd.pipelines.DdPipeline': 300,
    }
    
    ROBOTSTXT_OBEY = False
    
    
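    As an aside not covered in the original post, when crawling a live site it can also help to slow requests down and send a browser-like User-Agent; both are standard Scrapy settings that go in the same settings.py (the values below are only illustrative):

    # Optional politeness settings (illustrative values)
    DOWNLOAD_DELAY = 1            # wait about one second between requests
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13) AppleWebKit/537.36'
    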

    8. Open pipelines.py. Loop over the scraped item values with a for loop and print them to verify what was captured:

    class DdPipeline(object):
        def process_item(self, item, spider):
    
            for i in range(0,len(item["title"])):
                title = item["title"][i]
                link = item["link"][i]
                now_price = item["now_price"][i]
                comment_num = item["comment_num"][i]
                detail = item["detail"][i]
                print(title)
                print(link)
                print(now_price)
                print(comment_num)
                print(detail)
            return item
    

    9. Run the spider to check the results. In PyCharm's Terminal or the macOS terminal, change into the dd project directory and run:

    scrapy crawl dd_spider --nolog
    
    

    10. The crawl works, so next the scraped data needs to be stored in a MySQL database using the third-party library PyMySQL. Install it in advance with the command pip install pymysql.

    11. Open a terminal, connect to MySQL, create a database named dd, and switch to it:

    create database dd;
    
    use dd;
    

    Create a books table with the fields to store: an auto-increment id, plus title, link, now_price, comment_num, and detail:

    create table books(id int AUTO_INCREMENT PRIMARY KEY, title char(200), link char(100) unique, now_price int(10), comment_num char(100), detail char(255));
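
    Given the encoding problems described later, it can help to create the database and table with an explicit utf8 character set. A minimal sketch that does the same setup through PyMySQL (same credentials as the pipeline below; adjust to your environment):

    import pymysql
    
    # Connect without selecting a database first
    conn = pymysql.connect(host="127.0.0.1", user="root", passwd="654321", charset="utf8")
    cursor = conn.cursor()
    cursor.execute("CREATE DATABASE IF NOT EXISTS dd CHARACTER SET utf8")
    cursor.execute("USE dd")
    cursor.execute(
        "CREATE TABLE IF NOT EXISTS books("
        "id INT AUTO_INCREMENT PRIMARY KEY,"
        "title CHAR(200), link CHAR(100) UNIQUE,"
        "now_price INT(10), comment_num CHAR(100), detail CHAR(255)"
        ") CHARACTER SET utf8"
    )
    conn.commit()
    cursor.close()
    conn.close()
    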
    

    12. In pipelines.py, import pymysql and write the scraped data to MySQL:

    import pymysql
    
    
    # -*- coding: utf-8 -*-
    
    import pymysql
    
    class DdPipeline(object):
        def process_item(self, item, spider):
            # Create the connection
            conn = pymysql.connect(host="127.0.0.1",user="root",passwd="654321",db="dd")
            for i in range(0,len(item["title"])):
                title = item["title"][i]
                link = item["link"][i]
                now_price = item["now_price"][i]
                comment_num = item["comment_num"][i]
                detail = item["detail"][i]
                # Build the SQL insert statement
                sql = "insert into books(title,link,now_price,comment_num,detail) VALUES ('"+title+"','"+link+"','"+now_price+"','"+comment_num+"','"+detail+"')"
                conn.query(sql)
            # Close the connection
            conn.close()
            return item
    
    

    At first the data could not be written to the database correctly; the run reported ModuleNotFoundError: No module named 'pymysql', and no fix was found at that point.



    Workaround: change how the SQL statement is written and use a parameterized query:

            conn = pymysql.connect(host="127.0.0.1",user="root",passwd="654321",db="dd",charset='utf8')
            cursor = conn.cursor()
            cursor.execute('set names utf8')  # force utf8 on the connection
            cursor.execute('set autocommit=1')  # enable autocommit

                # inside the loop over the scraped rows:
                sql = "insert into books(title,link,now_price,comment_num,detail) VALUES (%s,%s,%s,%s,%s)"
                param = (title,link,now_price,comment_num,detail)
                cursor.execute(sql,param)
                conn.commit()
    

    Full code:

    # -*- coding: utf-8 -*-
    
    import pymysql
    
    class DdPipeline(object):
        def process_item(self, item, spider):
            # Create the connection
            conn = pymysql.connect(host="127.0.0.1",user="root",passwd="654321",db="dd",charset='utf8')
            cursor = conn.cursor()
            cursor.execute('set names utf8')  # force utf8 on the connection
            cursor.execute('set autocommit=1')  # enable autocommit
            for i in range(0,len(item["title"])):
                title = item["title"][i]
                link = item["link"][i]
                now_price = item["now_price"][i]
                comment_num = item["comment_num"][i]
                detail = item["detail"][i]
                sql = "insert into books(title,link,now_price,comment_num,detail) VALUES (%s,%s,%s,%s,%s)"
                param = (title,link,now_price,comment_num,detail)
                cursor.execute(sql,param)
                conn.commit()
            cursor.close()
            # Close the connection
            conn.close()
            return item
    
    
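    As an aside that goes beyond the original post: process_item opens and closes a new connection for every batch of items, which is wasteful. A common refinement is to open the connection once when the spider starts and close it when it finishes, using the pipeline's open_spider and close_spider hooks. A minimal sketch of that pattern (same credentials, fields, and table as above):

    # -*- coding: utf-8 -*-
    import pymysql
    
    class DdPipeline(object):
        def open_spider(self, spider):
            # One connection for the whole crawl
            self.conn = pymysql.connect(host="127.0.0.1", user="root", passwd="654321",
                                        db="dd", charset="utf8")
            self.cursor = self.conn.cursor()
    
        def process_item(self, item, spider):
            sql = "insert into books(title,link,now_price,comment_num,detail) VALUES (%s,%s,%s,%s,%s)"
            for i in range(len(item["title"])):
                param = (item["title"][i], item["link"][i], item["now_price"][i],
                         item["comment_num"][i], item["detail"][i])
                self.cursor.execute(sql, param)
            self.conn.commit()
            return item
    
        def close_spider(self, spider):
            # Close the connection when the crawl finishes
            self.cursor.close()
            self.conn.close()
    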

    Takeaway: most of the problems encountered were character-encoding issues. If the encoding of a table column does not match the encoding of the data being inserted, the insert may fail or the stored text may come out garbled.

    Possible improvements:

    1. The comment counts and prices scraped from Dangdang are strings; converting them to numbers makes sorting easier (a small helper is shown below).
    2. Wrapping the database writes in try/except makes the code more robust (see the sketch after the helper).

            import re

            def getNumber(string):
                # Pull the first integer or decimal out of a string such as "45.60"
                lastStr = re.findall(r"\d+\.?\d*", string)
                return float(lastStr[0]) if lastStr else 0
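
    For point 2, a rough sketch of what the insert loop in the pipeline could look like with error handling and the numeric conversion applied to the price (same cursor, connection, sql, and item as in the pipeline above):

            for i in range(len(item["title"])):
                try:
                    param = (item["title"][i],
                             item["link"][i],
                             getNumber(item["now_price"][i]),   # price converted to a number
                             item["comment_num"][i],
                             item["detail"][i])
                    cursor.execute(sql, param)
                    conn.commit()
                except Exception as err:
                    # Skip the offending row but keep the crawl going
                    print("insert failed:", err)
    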
    

    Reference: http://blog.csdn.net/think_ma/article/details/78900218
