Goal: practice scraping book data for a given keyword from Dangdang and storing the scraped data in a MySQL database.
1. Create the Dangdang project:
scrapy startproject dd
2. cd into the project directory:
cd dd
3. Generate the Dangdang spider from the basic spider template:
scrapy genspider -t basic dd_spider dangdang.com
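For reference, the generated dd/spiders/dd_spider.py skeleton looks roughly like this (exact output varies by Scrapy version):

# -*- coding: utf-8 -*-
import scrapy

class DdSpiderSpider(scrapy.Spider):
    name = 'dd_spider'
    allowed_domains = ['dangdang.com']
    start_urls = ['http://dangdang.com/']

    def parse(self, response):
        pass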
4. Open the dd project in PyCharm.
5. Open Dangdang, search for books with the target keyword, and analyze the page to decide which fields to scrape. Define them in items.py:
# -*- coding: utf-8 -*-
import scrapy

class DdItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    now_price = scrapy.Field()
    comment_num = scrapy.Field()
    detail = scrapy.Field()
6. Open the spider file, import the item we just defined, and update the start URL:
from dd.items import DdItem
Build the item:
item = DdItem()
item["title"] = response.xpath("//p[@class='name']/a/@title").extract()
item["link"] = response.xpath("//p[@class='name']/a/@href").extract()
item["now_price"] = response.xpath("//p[@class='price']/span[@class='search_now_price']/text()").extract()
item["comment_num"] = response.xpath("//p/a[@class='search_comment_num']/text()").extract()
item["detail"] = response.xpath("//p[@class='detail']/text()").extract()
yield item
Then loop over the remaining result pages:
for i in range(2, 27):
    url = "http://search.dangdang.com/?key=python&act=input&page_index=" + str(i)
    yield Request(url, callback=self.parse)
The complete code:
# -*- coding: utf-8 -*-
import scrapy
from dd.items import DdItem
from scrapy.http import Request
class DdSpiderSpider(scrapy.Spider):
    name = 'dd_spider'
    allowed_domains = ['dangdang.com']
    start_urls = ['http://search.dangdang.com/?key=python&act=input&page_index=1']

    def parse(self, response):
        item = DdItem()
        item["title"] = response.xpath("//p[@class='name']/a/@title").extract()
        item["link"] = response.xpath("//p[@class='name']/a/@href").extract()
        item["now_price"] = response.xpath("//p[@class='price']/span[@class='search_now_price']/text()").extract()
        item["comment_num"] = response.xpath("//p/a[@class='search_comment_num']/text()").extract()
        item["detail"] = response.xpath("//p[@class='detail']/text()").extract()
        yield item
        for i in range(2, 27):
            url = "http://search.dangdang.com/?key=python&act=input&page_index=" + str(i)
            yield Request(url, callback=self.parse)
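Note that parse above yields a single item whose fields are parallel lists for the whole page. A per-book variant (not the original approach; the li-level selector is an assumption about Dangdang's result markup) keeps the fields aligned even when a listing lacks one of them:

# Variant: yield one item per book instead of lists of parallel values.
def parse(self, response):
    for book in response.xpath("//ul[@class='bigimg']/li"):
        item = DdItem()
        item["title"] = book.xpath(".//p[@class='name']/a/@title").extract_first()
        item["link"] = book.xpath(".//p[@class='name']/a/@href").extract_first()
        item["now_price"] = book.xpath(".//span[@class='search_now_price']/text()").extract_first()
        item["comment_num"] = book.xpath(".//a[@class='search_comment_num']/text()").extract_first()
        item["detail"] = book.xpath(".//p[@class='detail']/text()").extract_first()
        yield item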
7. In settings.py, uncomment the item pipeline and set ROBOTSTXT_OBEY to False:
ITEM_PIPELINES = {
    'dd.pipelines.DdPipeline': 300,
}
ROBOTSTXT_OBEY = False
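Optionally (not in the original), two more settings.py lines can make the crawl gentler and less likely to be blocked; both are standard Scrapy settings:

DOWNLOAD_DELAY = 1  # wait a second between requests
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) AppleWebKit/537.36'  # browser-like UA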
8. Open pipelines.py. Loop over the scraped item values and print them to verify the results:
class DdPipeline(object):
    def process_item(self, item, spider):
        for i in range(0, len(item["title"])):
            title = item["title"][i]
            link = item["link"][i]
            now_price = item["now_price"][i]
            comment_num = item["comment_num"][i]
            detail = item["detail"][i]
            print(title)
            print(link)
            print(now_price)
            print(comment_num)
            print(detail)
        return item
9. Run the spider and check the output. In PyCharm's Terminal or the macOS terminal, cd into the dd project directory and run:
scrapy crawl dd_spider --nolog
10. The crawl works, so the next step is storing the scraped data in MySQL using the third-party library PyMySQL. Install it ahead of time with pip install pymysql.
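A quick sanity check that the install worked (a minimal sketch, assuming the same local credentials used later in the pipeline):

import pymysql

# Connect to the local server and print its version; any exception here means
# the install or the credentials are wrong.
conn = pymysql.connect(host="127.0.0.1", user="root", passwd="654321")
cursor = conn.cursor()
cursor.execute("SELECT VERSION()")
print(cursor.fetchone())
conn.close()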
11. Open a terminal, connect to MySQL, create the dd database, and switch to it:
create database dd;
use dd;
Create the books table with the fields to store: an auto-increment id, plus title, link, now_price, comment_num, and detail:
create table books(
    id int AUTO_INCREMENT PRIMARY KEY,
    title char(200),
    link char(100) unique,
    now_price int(10),
    comment_num char(100),
    detail char(255)
);
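Since the takeaway at the end notes that mismatched column encodings can make inserts fail or come out garbled, it may help to give the table an explicit charset. A sketch of the same DDL issued through PyMySQL, with a DEFAULT CHARSET=utf8 clause that is my addition, not in the original:

import pymysql

# Same table as above, plus an explicit utf8 default charset so Chinese titles
# and details store cleanly (the charset clause is an addition to the original DDL).
conn = pymysql.connect(host="127.0.0.1", user="root", passwd="654321", db="dd", charset="utf8")
cursor = conn.cursor()
cursor.execute(
    "create table if not exists books("
    "id int AUTO_INCREMENT PRIMARY KEY, title char(200), link char(100) unique, "
    "now_price int(10), comment_num char(100), detail char(255)"
    ") DEFAULT CHARSET=utf8"
)
conn.close()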
12. Import pymysql in pipelines.py and insert the scraped rows. First attempt:
# -*- coding: utf-8 -*-
import pymysql

class DdPipeline(object):
    def process_item(self, item, spider):
        # open the connection
        conn = pymysql.connect(host="127.0.0.1", user="root", passwd="654321", db="dd")
        for i in range(0, len(item["title"])):
            title = item["title"][i]
            link = item["link"][i]
            now_price = item["now_price"][i]
            comment_num = item["comment_num"][i]
            detail = item["detail"][i]
            # build the insert statement by string concatenation
            sql = "insert into books(title,link,now_price,comment_num,detail) VALUES ('"+title+"','"+link+"','"+now_price+"','"+comment_num+"','"+detail+"')"
            conn.query(sql)
        # close the connection
        conn.close()
        return item
This failed to write to the database correctly, raising ModuleNotFoundError: No module named 'pymysql', and at first I could not find a fix.
The fix: switch to a parameterized SQL statement, which lets the driver handle quoting and encoding of the values:
conn = pymysql.connect(host="127.0.0.1", user="root", passwd="654321", db="dd", charset='utf8')
cursor = conn.cursor()
cursor.execute('set names utf8')    # fixed boilerplate for the connection encoding
cursor.execute('set autocommit=1')  # enable autocommit
sql = "insert into books(title,link,now_price,comment_num,detail) VALUES (%s,%s,%s,%s,%s)"
param = (title, link, now_price, comment_num, detail)
cursor.execute(sql, param)
conn.commit()
The complete code:
# -*- coding: utf-8 -*-
import pymysql

class DdPipeline(object):
    def process_item(self, item, spider):
        # open the connection
        conn = pymysql.connect(host="127.0.0.1", user="root", passwd="654321", db="dd", charset='utf8')
        cursor = conn.cursor()
        cursor.execute('set names utf8')    # fixed boilerplate for the connection encoding
        cursor.execute('set autocommit=1')  # enable autocommit
        for i in range(0, len(item["title"])):
            title = item["title"][i]
            link = item["link"][i]
            now_price = item["now_price"][i]
            comment_num = item["comment_num"][i]
            detail = item["detail"][i]
            sql = "insert into books(title,link,now_price,comment_num,detail) VALUES (%s,%s,%s,%s,%s)"
            param = (title, link, now_price, comment_num, detail)
            cursor.execute(sql, param)
            conn.commit()
        cursor.close()
        # close the connection
        conn.close()
        return item
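A further refactor (not in the original): open the connection once per crawl with open_spider/close_spider, the usual Scrapy pipeline pattern, instead of reconnecting inside every process_item call:

# -*- coding: utf-8 -*-
import pymysql

class DdPipeline(object):
    def open_spider(self, spider):
        # one connection for the whole crawl
        self.conn = pymysql.connect(host="127.0.0.1", user="root", passwd="654321",
                                    db="dd", charset="utf8")
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        sql = "insert into books(title,link,now_price,comment_num,detail) VALUES (%s,%s,%s,%s,%s)"
        # zip keeps the five parallel lists aligned row by row
        for row in zip(item["title"], item["link"], item["now_price"],
                       item["comment_num"], item["detail"]):
            self.cursor.execute(sql, row)
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()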
Takeaway: the most frequent problems were data-encoding issues. If the encoding of the table columns does not match the encoding of the data being stored, inserts can fail outright or the stored text can end up garbled.
Optimizations:
1. The comment counts and prices scraped from Dangdang are strings; convert them to numbers so they can be sorted (see the helper below).
2. Wrap the database writes in try/except to make the code more robust (see the sketch after the helper).
import re

def getNumber(string):
    # pull the first numeric token out of strings like "¥45.60" or "1234条评论"
    matches = re.findall(r"\d+\.?\d*", string)
    return float(matches[0]) if matches else None
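A quick illustration of both points; cursor, sql, param, and conn are the names from the pipeline above, and the sample strings are made up:

# 1. Convert the scraped strings to numbers before sorting or storing.
print(getNumber("¥45.60"))     # -> 45.6
print(getNumber("1234条评论"))  # -> 1234.0

# 2. Guard the insert so one bad row doesn't abort the whole run.
try:
    cursor.execute(sql, param)
    conn.commit()
except pymysql.MySQLError as e:
    conn.rollback()
    print("insert failed:", e)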