Scrapy for Beginners: Crawling a Movie List into a Small Database

Author: 圣_狒司机 | Published 2019-04-07 23:31

    Step 1: Create the crawler project

    1. Go to your desktop folder
    cd desktop
    2. Create the crawler project (named imovie, matching the package the code below imports; the layout it generates is sketched after this list)
    scrapy startproject imovie
    3. Create the spider, named movie
    cd imovie
    scrapy genspider movie www.dytt8.net
    4. Adjust settings.py
    Change the user agent:
    USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36'
    Ignore the robots.txt protocol:
    ROBOTSTXT_OBEY = False
    Enable the item pipeline:
    ITEM_PIPELINES = {
       'imovie.pipelines.ImoviePipeline': 300,
    }
    
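    For orientation, the two commands above generate Scrapy's standard project
    layout, roughly as follows (a sketch; only the files this tutorial touches
    are annotated):

    imovie/
        scrapy.cfg                # deploy/config entry point
        imovie/
            __init__.py
            items.py              # step 2: the item definition
            middlewares.py
            pipelines.py          # step 5: the SQLite pipeline
            settings.py           # step 1: USER_AGENT, ROBOTSTXT_OBEY, ITEM_PIPELINES
            spiders/
                __init__.py
                movie.py          # steps 3-4: the spider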

    Step 2: Initialization

    1. In the spider, fill in the site to crawl: http://****.com
      and the starting page: http://****.com/index.html
    allowed_domains = ['www.dytt8.net']
    start_urls = ['https://www.dytt8.net/html/gndy/dyzz/index.html']
    2. Define the structured data type for the scraped content in items.py
    import scrapy
    
    class ImovieItem(scrapy.Item):
        title = scrapy.Field()
        date = scrapy.Field()
        url = scrapy.Field()
    
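    A Scrapy Item behaves like a dict that only accepts its declared fields,
    which catches field-name typos early. A minimal illustration (the "year"
    line is a deliberate mistake, not part of the project):

    item = ImovieItem()
    item["title"] = "Example"   # fine: title is a declared Field
    item["year"] = 2019         # KeyError: ImovieItem does not support field: year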

    Step 3: Write the crawling rules

    Inspect the site: we need each movie's title, date, and detail-page URL
    (the URL lets us crawl deeper later). Each listing sits in its own
    <table>, so the spider iterates over the tables and applies relative
    XPath expressions to each one.

    The page rules (in XPath):
    //table
    title = .//a/text()
    date  = .//td[@style='padding-left:3px']/font/text()
    URL   = domain + .//a/@href
    
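    These expressions can be verified interactively before writing any spider
    code. A quick scrapy shell session (the [5] index is illustrative; pick
    any table that actually contains a movie row):

    scrapy shell "https://www.dytt8.net/html/gndy/dyzz/index.html"
    >>> table = response.xpath("//table")[5]
    >>> table.xpath(".//a/text()").extract_first()
    >>> table.xpath(".//td[@style='padding-left:3px']/font/text()").extract_first()
    >>> table.xpath(".//a/@href").extract_first()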

    Step 4: Automatic pagination

    1. Check whether a next page exists ('下一页' is the site's literal "next page" link text)
    if response.xpath("//a[text()='下一页']"):
    2. Extract the next page's address
    (XPath)
    //a[text()='下一页']/@href
    3. Follow it!
    yield scrapy.Request(next_page, callback=self.parse)
    
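    A note on the request call: Scrapy's old make_requests_from_url helper is
    deprecated and removed in Scrapy 2.0, which is why scrapy.Request appears
    above. On Scrapy >= 1.4 the same step can also use response.follow, which
    resolves the relative href against the current page so the directory
    prefix need not be hard-coded (a sketch, not the code in the full listing
    below):

    next_href = response.xpath("//a[text()='下一页']/@href").extract_first()
    if next_href:
        yield response.follow(next_href, callback=self.parse)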

    Step 5: Save to the database

    1. import sqlite3
    2. SQL to create the table:
    create table if not exists movies (title text ,date text , url text);
    3. Insert a row (parameterized to avoid quoting problems):
    cur.execute("insert into movies (title,date,url) values (?,?,?);", (item["title"], item["date"], item["url"]))
    4. Verify the database (optional), parse_sqlite.py:
    
    import sqlite3
    import pandas as pd
    
    conn = sqlite3.connect("data.sqlite")
    df = pd.read_sql_query("select * from movies limit 5;", conn)
    print(df)
    
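    One sqlite3 detail that matters for the pipeline: cursors are not context
    managers, but connections are, and "with conn:" wraps the block in a
    transaction (commit on success, rollback on error). A minimal sketch with
    dummy values:

    import sqlite3

    conn = sqlite3.connect("data.sqlite")
    with conn:  # commits automatically if the block succeeds
        conn.execute("insert into movies (title,date,url) values (?,?,?);",
                     ("Example", "2019-04-07", "https://www.dytt8.net/x.html"))
    conn.close()  # "with" ends the transaction but does not close the connection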

    Step 6: Run

    scrapy crawl movie
    
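    The crawl can also be started from a plain Python script instead of the
    command line, using Scrapy's documented CrawlerProcess API (a sketch; the
    run.py filename is arbitrary, and the script should sit in the project
    root so the settings are found):

    # run.py (hypothetical helper script)
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    from imovie.spiders.movie import MovieSpider

    process = CrawlerProcess(get_project_settings())
    process.crawl(MovieSpider)
    process.start()  # blocks until the crawl finishes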

    Result:

    [screenshot of the crawl output omitted]

    Everything is saved in the database, ready for the next steps.

    Full code:

    # movie.py
    # -*- coding: utf-8 -*-
    import scrapy
    from imovie.items import ImovieItem


    class MovieSpider(scrapy.Spider):
        name = 'movie'
        allowed_domains = ['www.dytt8.net']
        start_urls = ['https://www.dytt8.net/html/gndy/dyzz/index.html']

        def parse(self, response):
            for table in response.xpath("//table"):
                # Create a fresh item per table; reusing one instance across
                # iterations would let later rows overwrite earlier ones.
                item = ImovieItem()
                try:
                    item["title"] = table.xpath(".//a/text()").extract_first()
                    item["date"] = table.xpath(".//td[@style='padding-left:3px']/font/text()").extract_first().split()[0]
                    item["url"] = "https://www.dytt8.net" + table.xpath(".//a/@href").extract_first()
                except (AttributeError, TypeError, IndexError):
                    # Not every <table> on the page is a movie entry; skip those.
                    continue
                yield item

            if response.xpath("//a[text()='下一页']"):
                next_page = "https://www.dytt8.net/html/gndy/dyzz/" + response.xpath("//a[text()='下一页']/@href").extract_first()
                yield scrapy.Request(next_page, callback=self.parse)
    
    
    # items.py
    import scrapy
    
    
    class ImovieItem(scrapy.Item):
        title = scrapy.Field()
        date = scrapy.Field()
        url = scrapy.Field()
    
    
    # pipelines.py

    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    import sqlite3


    class ImoviePipeline(object):
        def __init__(self):
            self.conn = sqlite3.connect("data.sqlite")
            # sqlite3 cursors are not context managers, so open/close explicitly
            cur = self.conn.cursor()
            cur.execute("create table if not exists movies (title text ,date text , url text);")
            cur.close()

        def process_item(self, item, spider):
            cur = self.conn.cursor()
            cur.execute("insert into movies (title,date,url) values (?,?,?);",
                        (item["title"], item["date"], item["url"]))
            self.conn.commit()
            cur.close()
            return item

        def close_spider(self, spider):
            # Release the connection when the crawl ends
            self.conn.close()
    
    # parse_sqlite.py
    
    import sqlite3
    import pandas as pd
    
    conn = sqlite3.connect("data.sqlite")
    df = pd.read_sql_query("select * from movies;", conn)
    print(df)
    
    
