Supplementary basics
XPath expressions vs. regular expressions:
1. XPath expressions are usually more efficient.
2. Regular expressions are more powerful and flexible.
3. In general, prefer XPath, and fall back to regular expressions only for the cases XPath cannot handle (see the sketch after this list).
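A minimal side-by-side sketch of the two approaches, assuming scrapy is installed and using an invented HTML snippet:

from scrapy.selector import Selector
import re

html = '<div class="tools"><a href="/item/1">Item one</a></div>'

# XPath: navigate the parsed tree by structure.
print(Selector(text=html).xpath("//div[@class='tools']/a/@href").extract())  # ['/item/1']

# Regex: match the raw text; more flexible, but brittle on messy HTML.
print(re.findall(r'<a href="(.*?)">', html))                                  # ['/item/1']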
/                     extract level by level
text()                extract the text inside a tag
//tag                 extract every tag with the given name
//tag[@attr='value']  extract the tags whose attribute equals the given value
@attr                 extract the value of an attribute
Examples:
Extract the page title: /html/head/title/text()
Extract all div tags: //div
Extract the content of the <div class="tools"> tag: //div[@class='tools']
Extract the link text inside <ul class="ddnewhead_operate_nav">: //ul[@class='ddnewhead_operate_nav']/li/a/text()
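These expressions can be tried directly against scrapy's Selector; a small sketch with made-up HTML:

from scrapy.selector import Selector

html = """
<html><head><title>Demo page</title></head>
<body>
  <div class="tools">box</div>
  <ul class="ddnewhead_operate_nav"><li><a href="#">My Orders</a></li></ul>
</body></html>
"""
sel = Selector(text=html)
print(sel.xpath("/html/head/title/text()").extract())        # ['Demo page']
print(sel.xpath("//div[@class='tools']/text()").extract())   # ['box']
print(sel.xpath("//ul[@class='ddnewhead_operate_nav']/li/a/text()").extract())  # ['My Orders']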
D:/scrapy>scrapy startproject dangdang
cd dangdang
scrapy genspider -t basic dd dangdang.com
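After these two commands the generated project looks roughly like this (details vary a little by Scrapy version; spiders/dd.py is the file created by genspider):
dangdang/
    scrapy.cfg
    dangdang/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            dd.py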
items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class DangdangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()    # product title
    link = scrapy.Field()     # product detail-page URL
    comment = scrapy.Field()  # review-count text shown in the listing
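A DangdangItem behaves like a dict whose keys are restricted to the declared fields, e.g.:

item = DangdangItem()
item["title"] = ["Some coat"]
print(item["title"])   # ['Some coat']
item["price"] = 99     # raises KeyError: the field is not declared above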
dd.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request

from dangdang.items import DangdangItem


class DdSpider(scrapy.Spider):
    name = 'dd'
    # Bare domain only; with a full URL here the offsite filter would drop every request.
    allowed_domains = ['dangdang.com']
    start_urls = ['http://search.dangdang.com/?key=%CD%E2%CC%D7%C5%AE&act=input&page_index=1']

    def start_requests(self):
        # Build the URLs of the first 80 result pages and schedule them all.
        urls = []
        for i in range(1, 81):
            current_url = 'http://search.dangdang.com/?key=%CD%E2%CC%D7%C5%AE&act=input&page_index=' + str(i)
            urls.append(current_url)
        print(urls)
        for url in urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        # Each field is a list covering every product on the current results page.
        item = DangdangItem()
        item["title"] = response.xpath("//a[@name='itemlist-picture']/@title").extract()
        item["link"] = response.xpath("//a[@name='itemlist-picture']/@href").extract()
        item["comment"] = response.xpath("//a[@name='itemlist-review']/text()").extract()
        # print(item["title"])
        yield item
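Before running the full crawl, the XPath expressions can be checked against a single results page in scrapy shell (what comes back depends on what Dangdang returns at the time):

D:\scrapy\dangdang>scrapy shell "http://search.dangdang.com/?key=%CD%E2%CC%D7%C5%AE&act=input&page_index=1"
>>> response.xpath("//a[@name='itemlist-picture']/@title").extract()[:3]
>>> response.xpath("//a[@name='itemlist-review']/text()").extract()[:3]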
In cmd:
D:\scrapy\dangdang>scrapy crawl dd --nolog
D:\scrapy\dangdang>scrapy crawl dd
settings.py
Search for "robot" in settings.py; the default value is True, change it to False.
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    'dangdang.pipelines.DangdangPipeline': 300,
}
This block is commented out by default; uncomment it. The number 300 is just the pipeline's run order (0-1000, lower runs first).
cmd
C:\Windows\System32>pip install pymysql
Create the database first; if MySQL is not installed yet, phpStudy is a convenient way to get it.
create database dd;
use dd;
create table goods (
    id int(32) auto_increment primary key,
    title varchar(100),
    link varchar(100) unique,
    comment varchar(100)
);
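A quick way to confirm that pymysql can reach the database before wiring up the pipeline (a sketch; the host/user/password values match those used in pipelines.py below and may differ on your machine):

import pymysql

conn = pymysql.connect(host="127.0.0.1", user="root", passwd="root", db="dd", charset="utf8")
cursor = conn.cursor()
cursor.execute("show tables")
print(cursor.fetchall())   # should list the goods table
cursor.close()
conn.close()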
pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql


class DangdangPipeline(object):
    def process_item(self, item, spider):
        conn = pymysql.connect(host="127.0.0.1", user="root", passwd="root",
                               db="dd", charset="utf8")
        cursor = conn.cursor()
        # title, link and comment are parallel lists holding one results page.
        for i in range(0, len(item["title"])):
            title = item["title"][i]
            print("Processing: " + title)
            link = item["link"][i]
            comment = item["comment"][i]
            # print(title + ":" + link + ":" + comment)
            # Insert into the goods table created above; a parameterized query
            # avoids breaking on quotes inside the title.
            sql = "insert into goods(title, link, comment) values(%s, %s, %s)"
            try:
                cursor.execute(sql, (title, link, comment))
                conn.commit()
            except Exception as err:
                print(err)
        cursor.close()
        conn.close()
        return item
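Opening a new connection for every item works, but a common refinement (a sketch, not part of the code above; the class name is made up) is to keep one connection for the whole crawl via the pipeline's open_spider/close_spider hooks:

import pymysql


class DangdangMySQLPipeline(object):
    def open_spider(self, spider):
        # Called once when the spider starts: open the connection here.
        self.conn = pymysql.connect(host="127.0.0.1", user="root", passwd="root",
                                    db="dd", charset="utf8")
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.cursor.close()
        self.conn.close()

    def process_item(self, item, spider):
        sql = "insert into goods(title, link, comment) values(%s, %s, %s)"
        for title, link, comment in zip(item["title"], item["link"], item["comment"]):
            try:
                self.cursor.execute(sql, (title, link, comment))
                self.conn.commit()
            except Exception as err:
                print(err)
        return item

If you use this variant, register it in ITEM_PIPELINES instead of DangdangPipeline.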