# spider 程序
import scrapy# from scrapy.http.response.html import HtmlResponse
from ..items import GswwItem
# settings.py 中 ITEM_PIPELINES 的数值越小,优先级越高,对应 pipeline 越先执行
class GswwSpiderSpider(scrapy.Spider):
    """Spider for gushiwen.cn list pages.

    Yields one :class:`GswwItem` (title, dynasty, author, content)
    per poem found on the page.
    """

    name = 'gsww_spider'
    allowed_domains = ['gushiwen.cn']
    start_urls = ['https://www.gushiwen.cn/default.aspx?page=2']

    def myprint(self, value1, *args):
        """Debug helper: print the given values followed by a separator line."""
        print(value1, *args)
        print("=" * 30)

    def parse(self, response):
        """Parse one list page and yield a GswwItem per complete poem entry.

        ``response.xpath`` returns Selector objects (tag objects) that
        support further ``.xpath()``/``.css()`` calls; ``.get()`` returns
        the first matched value, ``.getall()`` returns all of them.
        """
        gsw_divs = response.xpath("//div[@class='left']/div[@class='sons']")
        for gsw_div in gsw_divs:
            titles = gsw_div.xpath('.//b/text()').getall()
            # Expected layout is [dynasty, author]; malformed entries may
            # have fewer <a> tags, so length is checked below.
            source = gsw_div.xpath(".//p[@class='source']/a/text()").getall()
            # //text() collects every descendant text node under contson.
            content_list = gsw_div.xpath(".//div[@class='contson']//text()").getall()
            # Bug fix: the original only tested `source` for truthiness but
            # then indexed source[1], raising IndexError when exactly one
            # <a> tag was present. Require both entries explicitly.
            if not (titles and len(source) >= 2 and content_list):
                continue
            yield GswwItem(
                title=titles[0],
                dynasty=source[0],
                author=source[1],
                content="".join(content_list).strip(),
            )
爬取一页古诗文网
怎么创建一个scrapy项目?
mac
在终端 cd 进入想要创建项目的文件夹,再输入 scrapy startproject [项目名称]
再cd [项目名称]
再scrapy genspider [爬虫名称] [爬虫作用的域名]
在 spider 里面提取数据,然后在 pipelines 里面存储数据。
在 settings.py 里面设置请求头,并启用(取消注释)
ITEM_PIPELINES

网友评论