This post scrapes the text jokes from Qiushibaike (糗事百科). The pages are fairly simple and not hard to scrape, but we first need to determine whether the content is loaded dynamically.
Entering this command in the terminal:
scrapy shell https://www.qiushibaike.com/text/
only produced: Connection was closed cleanly.
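That message usually means the server dropped the connection before returning a response, often because the default Scrapy user agent was rejected. One workaround worth trying is to pass a browser User-Agent to the shell via the -s settings override (the UA string below is just an example):

scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36" https://www.qiushibaike.com/text/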
I guessed the headers were the problem, so I wrote a short script to test it (actually exported from Postman).
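The script itself was only shown as a screenshot; below is a minimal sketch of what such a Postman-style export looks like with the requests library (the header values are assumptions, borrowed from the spider's headers further down):

import requests

url = 'https://www.qiushibaike.com/text/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}

resp = requests.get(url, headers=headers)
print(resp.status_code)
# If the joke text shows up here, the page is rendered statically.
print(resp.text[:500])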

The script worked right away, and the test settled it: the jokes are loaded statically.
The author and the joke text are easy to locate:

# Hotness (vote count; not visible in the screenshot)
hot_num = int(each.css('div.stats>span.stats-vote>i.number::text').extract_first())
# Author
content['author'] = each.css('div.author.clearfix>a:nth-child(2)>h2::text').extract_first().strip()
# Joke text
content['funny_text'] = each.css('div.content>span::text').extract_first().strip()
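These selectors are easy to verify interactively; a sketch of a scrapy shell session (assuming the shell is launched with a browser User-Agent as above):

>>> each = response.css('div.article')[0]
>>> each.css('div.stats>span.stats-vote>i.number::text').extract_first()
>>> each.css('div.author.clearfix>a:nth-child(2)>h2::text').extract_first()
>>> each.css('div.content>span::text').extract_first()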
Build the Item around the three fields we want to scrape:
import scrapy

class FunnyItem(scrapy.Item):
    author = scrapy.Field()
    funny_text = scrapy.Field()
    hot_num = scrapy.Field()
There are also special cases to handle, and isn't analyzing and dealing with special cases exactly how you level up at scraping?
The special cases here: anonymous users, and posts with a "view full text" (查看全文) link.
Clicking "view full text" behaves exactly like clicking the text itself: both open the full-text page for that post (its URL appeared in the first line of the omitted screenshot).
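For orientation, the page structure implied by the selectors in the spider below looks roughly like this (my reconstruction, not the site's actual markup):

<div class="article">
  <div class="author clearfix">
    <a href="..."><img></a>            <!-- avatar link; absent for anonymous users -->
    <a href="..."><h2>username</h2></a>
  </div>
  <a class="contentHerf" href="/article/...">
    <div class="content">
      <span>joke text, possibly truncated...</span>
      <span class="contentForAll">查看全文</span>  <!-- present only when truncated -->
    </div>
  </a>
  <div class="stats"><span class="stats-vote"><i class="number">666</i></span></div>
</div>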
The main spider code:
import scrapy
from scrapy import Request
from ..items import FunnyItem

class Funny66Spider(scrapy.Spider):
    name = 'funny66'
    allowed_domains = ['www.qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/']
    # Custom request headers
    headers = {
        'If-None-Match': "1ab399a69d7f90aff3a3ead3bafb0eba47dc4607",
        'DNT': "1",
        'Accept-Encoding': "gzip, deflate, br",
        'Accept-Language': "zh-CN,zh;q=0.9",
        'Upgrade-Insecure-Requests': "1",
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
        'Accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        'Referer': "https://www.qiushibaike.com/",
        'Cookie': "_ga=GA1.2.614688305.1518532238; _gid=GA1.2.1133015715.1518532238; __cur_art_index=1900; _xsrf=2|9fe9f0b4|e3aec0cb1f03a169d8ca96e869adb6ea|1518577666; Hm_lvt_2670efbdd59c7e3ed3749b458cafaa37=1518532238,1518573213,1518577671; Hm_lpvt_2670efbdd59c7e3ed3749b458cafaa37=1518577787",
        'Connection': "keep-alive",
        'Cache-Control': "no-cache",
    }

    def parse(self, response):
        for each in response.css('div.article'):
            content = FunnyItem()
            hot_num = int(each.css('div.stats>span.stats-vote>i.number::text').extract_first())
            if hot_num >= 666:  # Filter: only keep posts with at least 666 votes
                content['hot_num'] = hot_num
                # Detect anonymous users and handle them accordingly
                if each.css('div.author.clearfix>a'):
                    content['author'] = each.css('div.author.clearfix>a:nth-child(2)>h2::text').extract_first().strip()
                else:
                    content['author'] = '匿名用户'  # "anonymous user"
                # Check whether the full text is already shown
                if each.css('div.content>span.contentForAll'):
                    # Truncated post: follow the full-text link, carrying the
                    # fields we already have along in meta
                    url = each.css('a.contentHerf::attr(href)').extract_first()
                    url = response.urljoin(url)
                    meta = {
                        'author': content['author'],
                        'hot_num': content['hot_num'],
                    }
                    yield Request(url, callback=self.parse_all, meta=meta, headers=self.headers)
                else:
                    content['funny_text'] = each.css('div.content>span::text').extract_first().strip()
                    yield content
            else:
                yield None  # below the threshold; Scrapy silently ignores None

        # Next page
        if response.css('ul.pagination>li:last-child span.next'):
            next_url = response.css('ul.pagination>li:last-child>a::attr(href)').extract_first()
            next_url = response.urljoin(next_url)
            yield Request(next_url, callback=self.parse, headers=self.headers)

    # Fetch the full text of long posts
    def parse_all(self, response):
        haha = FunnyItem()
        haha['funny_text'] = 'Long Content: ' + str(response.css('div.content::text').extract_first().strip())
        haha['author'] = response.meta['author']
        haha['hot_num'] = response.meta['hot_num']
        yield haha
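One caveat in parse_all: extract_first() returns only the first text node of div.content, so long posts broken up by <br> tags would come back truncated. A hedged fix is to join all the text nodes instead:

haha['funny_text'] = 'Long Content: ' + ''.join(response.css('div.content::text').extract()).strip()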
That wraps up the spider itself; next, let's look at getting the data into a database.
First, create the database:
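The original screenshot of the Workbench session is not reproduced; a sketch of the equivalent SQL, with column names, types, and order inferred from the Item and the positional INSERT in the pipeline below:

CREATE DATABASE funny CHARACTER SET utf8;
USE funny;
CREATE TABLE content (
    author     VARCHAR(255),
    funny_text TEXT,
    hot_num    INT
);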

Personally I find Workbench far easier to use than the command-line client. I also hit a problem earlier where the MySQL service refused to start; uninstalling the service with MySQL Installer, rebooting, and reinstalling fixed it.
Next, write pipelines.py:
import MySQLdb

class MySQLPipeline(object):
    def open_spider(self, spider):
        db = spider.settings.get('MYSQL_DB_NAME', 'funny')
        host = spider.settings.get('MYSQL_HOST', 'localhost')
        port = spider.settings.get('MYSQL_PORT', 3306)
        user = spider.settings.get('MYSQL_USER', 'root')
        passwd = spider.settings.get('MYSQL_PASSWORD', 'xxxxxxxxxxxxx')  # set your own password
        self.db_conn = MySQLdb.connect(host=host, port=port, db=db, user=user, passwd=passwd, charset='utf8')
        self.db_cur = self.db_conn.cursor()

    def close_spider(self, spider):
        self.db_conn.commit()
        self.db_conn.close()

    def process_item(self, item, spider):
        self.insert_db(item)
        return item

    def insert_db(self, item):
        values = (
            item['author'],
            item['funny_text'],
            item['hot_num'],
        )
        sql = 'INSERT INTO content VALUES(%s,%s,%s)'
        self.db_cur.execute(sql, values)
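Two hedged refinements worth considering: naming the columns explicitly decouples the insert from the table's column order, and committing per item (instead of only in close_spider) means a crash mid-run doesn't discard everything inserted so far:

sql = 'INSERT INTO content (author, funny_text, hot_num) VALUES (%s, %s, %s)'
self.db_cur.execute(sql, values)
self.db_conn.commit()  # per-item commit: safer against crashes, slower overall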
Finally, configure settings.py:
import random

BOT_NAME = 'funny'
SPIDER_MODULES = ['funny.spiders']
NEWSPIDER_MODULE = 'funny.spiders'
ROBOTSTXT_OBEY = False
MYSQL_DB_NAME = 'funny'
MYSQL_HOST = 'localhost'
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'lovewangqian'
ITEM_PIPELINES = {
    'funny.pipelines.MySQLPipeline': 401,
}
FEED_EXPORT_FIELDS = ['author', 'funny_text', 'hot_num']
# Set the delay according to your conscience
DOWNLOAD_DELAY = random.random() + random.random()
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
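Note that the DOWNLOAD_DELAY expression is evaluated once at startup, so it fixes a single delay for the whole run; Scrapy's RANDOMIZE_DOWNLOAD_DELAY (enabled by default) then varies each actual wait between 0.5x and 1.5x of that value anyway. With everything configured, the spider can be run from the project root (the output file name here is my choice):

scrapy crawl funny66 -o funny66.csv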
That's it; we're done.
A note to myself:
I've been reading up on Linux basics these past few days, and today I grabbed a random site to practice on so I don't get rusty. This time I also brought a database into the mix; once I'm done with Linux I should study databases properly.
And a note to everyone else:

if not '单身狗':  # "single dog", slang for a single person; a non-empty string is truthy, so this is always False
    print('祝大家情人节快乐!')  # "Happy Valentine's Day, everyone!"
else:
    print('好好吃狗粮,来年继续扛')  # "Eat your dog food and keep soldiering on into next year"