Scraping ClinVar data

Author: 蚂蚁爱吃饭 | Published: 2021-02-01 16:05

I wanted to scrape mutation information for some genes, so I started by looking at the URL format: https://www.ncbi.nlm.nih.gov/clinvar/?term=UBR5%5Bgene%5D

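Here UBR5 is the gene used for testing, and %5Bgene%5D is the URL-encoded form of [gene]. To cover other genes we only need to prepare a list of gene names, gene.list, and substitute each name into the link.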
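As a minimal sketch of that substitution, the helpers below read gene names from a file and build one ClinVar link per gene. The names read_genes and clinvar_link are made up for illustration, and the one-gene-per-line layout of gene.list is an assumption, since the post does not show the file.

```python
from urllib.parse import quote

BASE = "https://www.ncbi.nlm.nih.gov/clinvar/?term="


def read_genes(path):
    """Return gene symbols from a text file (assumed: one per line), skipping blanks."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def clinvar_link(gene):
    """Build the ClinVar search link for one gene symbol."""
    # quote() percent-encodes the square brackets: [gene] becomes %5Bgene%5D
    return BASE + quote(f"{gene}[gene]")
```

Calling clinvar_link on each entry returned by read_genes("gene.list") gives the links to pass to dr.get() later on.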

Code:

import sys,os,re
import time
import shutil
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

We use Selenium to drive Firefox. First, configure the browser:

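profile = webdriver.FirefoxProfile()
# Save downloads to a custom directory (folderList = 2) without showing the download manager
profile.set_preference('browser.download.dir', '/tmp/mozilla_baowenjuan0')
profile.set_preference('browser.download.folderList', 2)
profile.set_preference('browser.download.manager.showWhenStarting', False)
# neverAsk.saveToDisk expects a comma-separated list of MIME types, not a path;
# the types listed here are an assumption about what ClinVar serves
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'text/plain,text/csv,application/octet-stream')
option = webdriver.FirefoxOptions()
option.add_argument('--headless')  # run headless
option.profile = profile
dr = webdriver.Firefox(options=option)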

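Next, inspect the page to work out how to locate each element:

Step 1: open the link built in the format above;
Step 2: click the Pathogenic filter;
Step 3: click Download (a popup appears);
Step 4: click Create File (the download starts automatically).

Overview of the steps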

Locate Pathogenic

link='https://www.ncbi.nlm.nih.gov/clinvar/?term=UBR5%5Bgene%5D'
dr.get(link)
dr.find_element(By.XPATH, "//a[@data-value_id='Pathogenic']").click()  # the page now shows only pathogenic variants
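One thing to watch for when looping over many genes: the filter link may not be in the page yet when the click runs, or a gene may have no Pathogenic entry at all. The sketch below is a small hand-rolled polling helper for that case. The name click_when_present and its timeout values are my own choices, not part of Selenium; it relies only on the driver's find_elements(by, value) method and on the fact that By.XPATH is the string "xpath".

```python
import time

XPATH = "xpath"  # same string value as selenium's By.XPATH


def click_when_present(driver, xpath, timeout=10.0, poll=0.5):
    """Click the first element matching xpath.

    Returns True after clicking, or False if nothing matched before the timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        # find_elements returns an empty list instead of raising when nothing matches
        matches = driver.find_elements(XPATH, xpath)
        if matches:
            matches[0].click()
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

With the driver above, click_when_present(dr, "//a[@data-value_id='Pathogenic']") returns False for a gene without the filter, so the loop can skip it instead of crashing. Selenium's own WebDriverWait covers the same need and is the more common choice.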

Click Download to open the popup

dr.find_element(By.XPATH, "//*[@sourcecontent='send_to_menu' and @class='tgt_dark']").click()

Download the file

dr.find_element(By.XPATH, "//button[@name='EntrezSystem2.PEntrez.clinVar.clinVar_Entrez_ResultsPanel.Entrez_DisplayBar.SendToSubmit' and @cmd='File']").click()
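The click returns before the file has finished downloading, and every gene presumably downloads under the same default file name, so in a loop it helps to wait for the file and rename it per gene. The following is a rough sketch under those assumptions. The function names and the .txt extension are hypothetical; the check for *.part files relies on Firefox keeping such a temporary file next to an unfinished download.

```python
import os
import shutil
import time


def wait_for_new_file(download_dir, before, timeout=60.0, poll=0.5):
    """Return the path of a file that appeared in download_dir since `before`, or None."""
    deadline = time.monotonic() + timeout
    while True:
        current = set(os.listdir(download_dir))
        # A *.part file means Firefox is still writing the download
        if not any(name.endswith(".part") for name in current):
            finished = sorted(current - before)
            if finished:
                return os.path.join(download_dir, finished[0])
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)


def save_as_gene(downloaded_path, gene, out_dir):
    """Move the downloaded file to out_dir/<gene>.txt and return the new path."""
    os.makedirs(out_dir, exist_ok=True)
    target = os.path.join(out_dir, f"{gene}.txt")  # .txt is an assumed extension
    shutil.move(downloaded_path, target)
    return target
```

Take before = set(os.listdir('/tmp/mozilla_baowenjuan0')) just before clicking Create File, then pass it to wait_for_new_file together with the same directory.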