Why use the Scrapy framework
- Looking up gene information on GeneCards means clicking through several pages, which Scrapy handles conveniently
- Pipelines make it easy to write the scraped data into a database
- GeneCards responds slowly, so adding more machines with asynchronous processing and distributed crawling would speed things up
Later I found that Redis-based distributed crawling and database storage were not really needed, so those features are left out for now.
Environment
- Windows 10
- Anaconda environment
- PowerShell terminal
- VS Code editor
- MongoDB and Redis already installed and configured
Create a regular Scrapy project
PS D:\Data\桌面\Scrapy> scrapy startproject GeneCard
New Scrapy project 'GeneCard', using template directory 'D:\Anaconda\lib\site-packages\scrapy\templates\project', created in:
D:\Data\桌面\Scrapy\GeneCard
You can start your first spider with:
cd GeneCard
scrapy genspider example example.com
PS D:\Data\桌面\Scrapy> cd .\GeneCard\
PS D:\Data\桌面\Scrapy\GeneCard> scrapy genspider gene genecards.org
Created spider 'gene' using template 'basic' in module:
GeneCard.spiders.gene
PS D:\Data\桌面\Scrapy\GeneCard> ls
Directory: D:\Data\桌面\Scrapy\GeneCard
Mode LastWriteTime Length Name
---- ------------- ------ ----
d----- 2020/5/28 15:13 GeneCard
-a---- 2020/5/28 15:13 259 scrapy.cfg
Open the project in VS Code

Analyze the site pages and extract the key URL
https://www.genecards.org/cgi-bin/carddisp.pl?gene={}
Method 1: override the start_requests() method
The requests yielded by this method take the place of start_urls: inside start_requests() you can pull in the gene keywords from outside, build the query URLs from them, and yield the requests.
# -*- coding: utf-8 -*-
import scrapy
from pyquery import PyQuery


class GeneSpider(scrapy.Spider):
    name = 'gene'
    allowed_domains = ['genecards.org']
    start_urls = ['https://www.genecards.org/']

    # Override start_requests() and hand every URL to the scheduler
    def start_requests(self):
        # Queue all the URLs with the scheduler
        for gene in ['AFP', 'GLUL']:
            print(gene)
            url = 'https://www.genecards.org/cgi-bin/carddisp.pl?gene={}'.format(gene)
            # Hand the request to the scheduler
            yield scrapy.Request(
                url=url,
                callback=self.parse
            )

    def parse(self, response):
        print(response.text)

A quick test run showed that the site has anti-crawler measures.

Solution:
Set a browser User-Agent in settings.py so the spider identifies itself as a regular browser
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
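If you would rather not change the project-wide setting, the User-Agent can also be scoped to this one spider via Scrapy's custom_settings class attribute. A minimal sketch (the UA string below is just an example browser string):

import scrapy


class GeneSpider(scrapy.Spider):
    name = 'gene'
    allowed_domains = ['genecards.org']

    # Per-spider settings override the values in settings.py for this spider only
    custom_settings = {
        'USER_AGENT': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/80.0.3987.149 Safari/537.36'),
    }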
Write the parse() callback
Work out the selectors by inspecting where the elements sit in the page, combined with breakpoint debugging and injecting test code at the breakpoints.
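Before wiring the selectors into parse(), it can also help to try them out interactively in scrapy shell; a quick session might look like this (AFP is just an example gene):

PS D:\Data\桌面\Scrapy\GeneCard> scrapy shell "https://www.genecards.org/cgi-bin/carddisp.pl?gene=AFP"
>>> from pyquery import PyQuery
>>> jpy = PyQuery(response.text)          # response is provided by scrapy shell
>>> jpy('#summaries > div:nth-child(3) > p').text()   # the GeneCards summary paragraph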
Declare the Item fields in items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class GenecardItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    Aliases = scrapy.Field()
    Entrez = scrapy.Field()
    GeneCards = scrapy.Field()
    US = scrapy.Field()
    gene = scrapy.Field()
Writing the main spider
Original plan:
- If the gene page loads normally, store the data in an item and yield it to the pipeline.
- If the gene does not exist, send another request to https://www.genecards.org/Search/Keyword?queryString={} to find the closest matching gene and query that one instead (sketched below).
I later noticed that when you query a symbol, the site automatically returns the closest matching gene, so it is enough to query the gene directly, store the scraped data in an item, and yield it to the pipeline for storage.
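For reference, the abandoned fallback would have been a couple of extra methods on GeneSpider, roughly along these lines; this is only a sketch of the idea (build_item and the CSS selector for the first search result are hypothetical), not code that was actually used:

    def parse(self, response):
        jpy = PyQuery(response.text)
        if jpy('#summaries'):
            # The gene card loaded normally: fill and yield the item
            # (build_item would be a hypothetical helper that populates GenecardItem)
            yield self.build_item(jpy, response)
        else:
            # Unknown symbol: fall back to the keyword search page
            gene = response.meta['gene']
            yield scrapy.Request(
                url='https://www.genecards.org/Search/Keyword?queryString={}'.format(gene),
                callback=self.parse_search,
                meta={'gene': gene},
                dont_filter=True,
            )

    def parse_search(self, response):
        # Follow the closest match (the selector here is an assumption about the results page)
        first_hit = response.css('#searchResults a.gene-symbol::attr(href)').get()
        if first_hit:
            yield response.follow(first_hit, callback=self.parse, meta=response.meta)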
# -*- coding: utf-8 -*-
import scrapy
from pyquery import PyQuery
from GeneCard.items import GenecardItem
import pprint
import pandas


class GeneSpider(scrapy.Spider):
    name = 'gene'
    allowed_domains = ['genecards.org']
    start_urls = ['https://www.genecards.org/cgi-bin/carddisp.pl?gene={}']

    # Override start_requests() and hand every URL to the scheduler
    def start_requests(self):
        # Queue all the URLs with the scheduler
        with open(r'C:\Users\Administrator\Desktop\markers.xls', 'r') as f:
            for row in f:
                gene = row.split('\t')[6]
                gene = gene.strip('\n')
                print(gene)
                url = 'https://www.genecards.org/cgi-bin/carddisp.pl?gene={}'.format(gene)
                # Hand the request to the scheduler
                yield scrapy.Request(
                    url=url,
                    callback=self.parse,
                    meta={'gene': gene},  # pass the gene symbol along with the request
                    dont_filter=True      # do not deduplicate these requests
                )

    def parse(self, response):
        jpy = PyQuery(response.text)
        item = GenecardItem()
        lst = jpy('#aliases_descriptions > div.row > div.col-xs-12.col-md-9 > div:nth-child(1) > div > ul > li')
        # collect the text of every alias <li>
        alias_lst = [li.text() for li in lst.items()]
        item['gene'] = response.meta['gene']
        item['Aliases'] = alias_lst
        item['Entrez'] = jpy('#summaries > div:nth-child(2) > ul > li > p').text()
        item['GeneCards'] = jpy('#summaries > div:nth-child(3) > p').text()
        item['US'] = jpy('#summaries > div:nth-child(4) > ul > li > div').text()
        # pprint.pprint(item)
        yield item  # hand the item over to the pipeline
Configure the pipeline file, which is responsible for processing the data and storing it (here the items are written to a file)
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class GenecardPipeline(object):
    # open_spider() is called once when the spider starts
    def open_spider(self, spider):
        # Create (or overwrite) a plain text file to hold the scraped data
        # (utf-8 avoids encoding errors when writing summaries on Windows)
        self.file = open(r"C:\Users\Administrator\Desktop\HCC.xls", "w", encoding="utf-8")
        # line = 'gene\tAliases\tEntrez\tGeneCards\tUS\n'
        # self.file.write(line)

    # process_item() is the main pipeline method; Scrapy calls it for every item
    def process_item(self, item, spider):
        # Build the fields for one output line
        line = [str(item["gene"]),
                str(item["Aliases"]),
                str(item["Entrez"]).split('\n')[0],
                str(item["GeneCards"]).split('\n')[0],
                str(item["US"]).split('\n')[0]]
        # Print the line to make debugging easier
        line = '\t'.join(line) + '\n'
        print(line)
        # Write the line to the output file
        self.file.write(line)
        return item

    # close_spider() is called once when the spider finishes
    def close_spider(self, spider):
        # Close the output file
        print("closing output file")
        self.file.close()
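Since MongoDB is already installed (see the environment section) and database storage was part of the original motivation, the file pipeline could later be replaced by, or combined with, a database pipeline. A minimal sketch using pymongo (the database and collection names are made up, and it would still need to be registered in ITEM_PIPELINES):

# -*- coding: utf-8 -*-
import pymongo


class MongoPipeline(object):
    def open_spider(self, spider):
        # Connect to a local MongoDB instance when the spider starts
        self.client = pymongo.MongoClient('mongodb://localhost:27017')
        self.collection = self.client['genecard']['gene_info']  # example db/collection names

    def process_item(self, item, spider):
        # Store each item as one document
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()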
Configure settings.py
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

DOWNLOAD_DELAY = 3  # wait a few seconds between requests

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'GeneCard.pipelines.GenecardPipeline': 300,
}
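As an alternative to a fixed DOWNLOAD_DELAY, Scrapy's AutoThrottle extension can adapt the delay to how quickly GeneCards responds; a possible addition to settings.py (the numbers are only examples):

# Optional: let Scrapy adjust the delay based on server response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 3         # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30          # upper bound when the site is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1  # aim for roughly one request at a time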
main.py
from scrapy.cmdline import execute
execute('scrapy crawl gene'.split())
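An equivalent way to start the crawl programmatically, without going through the command-line wrapper, is CrawlerProcess; this is just an alternative sketch, not what the post uses:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # load the project's settings.py
process.crawl('gene')                             # spider name as defined by GeneSpider.name
process.start()                                   # blocks until the crawl finishes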
Input file format
p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene
0 0.971505038 0.989 0.923 0 0 RBP4
0 0.970032765 0.818 0.626 0 0 CYP2E1
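The spider imports pandas but ends up splitting each line by hand, which also processes the header line as if it were a gene symbol; reading the gene column with pandas would sidestep that. A small sketch using the same placeholder path as the spider:

import pandas

# markers.xls is a tab-separated table with a header row; 'gene' is the last column
markers = pandas.read_csv(r'C:\Users\Administrator\Desktop\markers.xls', sep='\t')
genes = markers['gene'].dropna().tolist()
print(genes[:5])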
Output file format
