Scrapy for Beginners: Crawling a Movie List into a Small Database

Author: 圣_狒司机 | Published 2019-04-07 23:31

    Step 1: Create the crawler project

    1. Go to your desktop folder
    cd desktop
    2. Create the crawler project (named imovie, matching the package the code below imports; the layout it generates is sketched after this list)
    scrapy startproject imovie
    3. Create the spider, named movie
    cd imovie
    scrapy genspider movie www.dytt8.net
    4. Adjust settings.py
    Change the user agent:
    USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36'
    Ignore the robots.txt protocol:
    ROBOTSTXT_OBEY = False
    Enable the item pipeline:
    ITEM_PIPELINES = {
       'imovie.pipelines.ImoviePipeline': 300,
    }
    
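    For orientation, the two commands above generate Scrapy's standard project
    layout, roughly as follows (a sketch; only the files this tutorial touches
    are annotated):

    imovie/
        scrapy.cfg                # deploy/config entry point
        imovie/
            __init__.py
            items.py              # step 2: the item definition
            middlewares.py
            pipelines.py          # step 5: the SQLite pipeline
            settings.py           # step 1: USER_AGENT, ROBOTSTXT_OBEY, ITEM_PIPELINES
            spiders/
                __init__.py
                movie.py          # steps 3-4: the spider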

    Step 2: Initialization

    1. In the spider, fill in the site to crawl: http://****.com
      and the starting page: http://****.com/index.html
    allowed_domains = ['www.dytt8.net']
    start_urls = ['https://www.dytt8.net/html/gndy/dyzz/index.html']
    2. Define the structured data type for the scraped content in items.py
    import scrapy
    
    class ImovieItem(scrapy.Item):
        title = scrapy.Field()
        date = scrapy.Field()
        url = scrapy.Field()
    
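    A Scrapy Item behaves like a dict that only accepts its declared fields,
    which catches field-name typos early. A minimal illustration (the "year"
    line is a deliberate mistake, not part of the project):

    item = ImovieItem()
    item["title"] = "Example"   # fine: title is a declared Field
    item["year"] = 2019         # KeyError: ImovieItem does not support field: year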

    Step 3: Write the crawling rules

    Inspect the site: we need each movie's title, date, and detail-page URL
    (the URL lets us crawl deeper later). Each listing sits in its own
    <table>, so the spider iterates over the tables and applies relative
    XPath expressions to each one.

    The page rules (in XPath):
    //table
    title = .//a/text()
    date  = .//td[@style='padding-left:3px']/font/text()
    URL   = domain + .//a/@href
    
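    These expressions can be verified interactively before writing any spider
    code. A quick scrapy shell session (the [5] index is illustrative; pick
    any table that actually contains a movie row):

    scrapy shell "https://www.dytt8.net/html/gndy/dyzz/index.html"
    >>> table = response.xpath("//table")[5]
    >>> table.xpath(".//a/text()").extract_first()
    >>> table.xpath(".//td[@style='padding-left:3px']/font/text()").extract_first()
    >>> table.xpath(".//a/@href").extract_first()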

    Step 4: Automatic pagination

    1. Check whether a next page exists ('下一页' is the site's literal "next page" link text)
    if response.xpath("//a[text()='下一页']"):
    2. Extract the next page's address
    (XPath)
    //a[text()='下一页']/@href
    3. Follow it!
    yield scrapy.Request(next_page, callback=self.parse)
    
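    A note on the request call: Scrapy's old make_requests_from_url helper is
    deprecated and removed in Scrapy 2.0, which is why scrapy.Request appears
    above. On Scrapy >= 1.4 the same step can also use response.follow, which
    resolves the relative href against the current page so the directory
    prefix need not be hard-coded (a sketch, not the code in the full listing
    below):

    next_href = response.xpath("//a[text()='下一页']/@href").extract_first()
    if next_href:
        yield response.follow(next_href, callback=self.parse)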

    Step 5: Save to the database

    1. import sqlite3
    2. SQL to create the table:
    create table if not exists movies (title text ,date text , url text);
    3. Insert a row (parameterized to avoid quoting problems):
    cur.execute("insert into movies (title,date,url) values (?,?,?);", (item["title"], item["date"], item["url"]))
    4. Verify the database (optional), parse_sqlite.py:
    
    import sqlite3
    import pandas as pd
    
    conn = sqlite3.connect("data.sqlite")
    df = pd.read_sql_query("select * from movies limit 5;", conn)
    print(df)
    
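    One sqlite3 detail that matters for the pipeline: cursors are not context
    managers, but connections are, and "with conn:" wraps the block in a
    transaction (commit on success, rollback on error). A minimal sketch with
    dummy values:

    import sqlite3

    conn = sqlite3.connect("data.sqlite")
    with conn:  # commits automatically if the block succeeds
        conn.execute("insert into movies (title,date,url) values (?,?,?);",
                     ("Example", "2019-04-07", "https://www.dytt8.net/x.html"))
    conn.close()  # "with" ends the transaction but does not close the connection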

    Step 6: Run

    scrapy crawl movie
    
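    The crawl can also be started from a plain Python script instead of the
    command line, using Scrapy's documented CrawlerProcess API (a sketch; the
    run.py filename is arbitrary, and the script should sit in the project
    root so the settings are found):

    # run.py (hypothetical helper script)
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    from imovie.spiders.movie import MovieSpider

    process = CrawlerProcess(get_project_settings())
    process.crawl(MovieSpider)
    process.start()  # blocks until the crawl finishes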

    Result:

    [screenshot of the crawl output omitted]

    Everything is saved in the database, ready for the next steps.

    Full code:

    # movie.py
    # -*- coding: utf-8 -*-
    import scrapy
    from imovie.items import ImovieItem


    class MovieSpider(scrapy.Spider):
        name = 'movie'
        allowed_domains = ['www.dytt8.net']
        start_urls = ['https://www.dytt8.net/html/gndy/dyzz/index.html']

        def parse(self, response):
            for table in response.xpath("//table"):
                # Create a fresh item per table; reusing one instance across
                # iterations would let later rows overwrite earlier ones.
                item = ImovieItem()
                try:
                    item["title"] = table.xpath(".//a/text()").extract_first()
                    item["date"] = table.xpath(".//td[@style='padding-left:3px']/font/text()").extract_first().split()[0]
                    item["url"] = "https://www.dytt8.net" + table.xpath(".//a/@href").extract_first()
                except (AttributeError, TypeError, IndexError):
                    # Not every <table> on the page is a movie entry; skip those.
                    continue
                yield item

            if response.xpath("//a[text()='下一页']"):
                next_page = "https://www.dytt8.net/html/gndy/dyzz/" + response.xpath("//a[text()='下一页']/@href").extract_first()
                yield scrapy.Request(next_page, callback=self.parse)
    
    
    # items.py
    import scrapy
    
    
    class ImovieItem(scrapy.Item):
        title = scrapy.Field()
        date = scrapy.Field()
        url = scrapy.Field()
    
    
    # pipelines.py

    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    import sqlite3


    class ImoviePipeline(object):
        def __init__(self):
            self.conn = sqlite3.connect("data.sqlite")
            # sqlite3 cursors are not context managers, so open/close explicitly
            cur = self.conn.cursor()
            cur.execute("create table if not exists movies (title text ,date text , url text);")
            cur.close()

        def process_item(self, item, spider):
            cur = self.conn.cursor()
            cur.execute("insert into movies (title,date,url) values (?,?,?);",
                        (item["title"], item["date"], item["url"]))
            self.conn.commit()
            cur.close()
            return item

        def close_spider(self, spider):
            # Release the connection when the crawl ends
            self.conn.close()
    
    # parse_sqlite.py
    
    import sqlite3
    import pandas as pd
    
    conn = sqlite3.connect("data.sqlite")
    df = pd.read_sql_query("select * from movies;", conn)
    print(df)
    
    
