Scraping PubMed paper information with Selenium

Author: puxiaotaoc | Published: 2018-11-20 11:15 | 104 views

1. Task description

Scrape paper titles, abstracts and keywords from PubMed.
Search terms used: leukemia, hypertension, cancer, anemia, gastritis, tuberculosis.
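
As a small illustration of how each search term turns into a list-page URL, here is a minimal sketch that mirrors the URL construction used in the full script. The helper name `build_search_url` and the `DISEASES` list are introduced here for illustration only; the base URL is the one the script uses, and PubMed's URL scheme may have changed since this post was written.

```python
from urllib.parse import quote

# Base URL as used in the script in this post (assumed unchanged here).
START_URL = 'https://www.ncbi.nlm.nih.gov/pubmed/?term='

# The six search terms listed in the task description.
DISEASES = ['leukemia', 'hypertension', 'cancer', 'anemia', 'gastritis', 'tuberculosis']


def build_search_url(terms):
    """Percent-encode each term and join them with an encoded comma (%2C)."""
    return START_URL + '%2C'.join(quote(term) for term in terms)


for disease in DISEASES:
    print(build_search_url([disease]))
```

Running one crawl per disease, rather than one combined query, keeps the results for each term separate.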

2. Full code

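# Full code
import urllib.parse  # urllib.parse.quote is used below; a bare "import urllib" does not guarantee it is loaded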
import time
from lxml import etree
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException


class crabInfo(object):
    browser = webdriver.Chrome()
    start_url = 'https://www.ncbi.nlm.nih.gov/pubmed/?term='
    wait = WebDriverWait(browser, 5)

    def __init__(self, keywordlist):
        self.temp = [urllib.parse.quote(i) for i in keywordlist]
        self.keyword = '%2C'.join(self.temp)
        self.title = ' AND '.join(self.temp)
        self.url = crabInfo.start_url + self.keyword
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}
        self.file = open('information.txt', 'w')
        self.status = True
        self.yearlist = []

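    # Initialise the result list: apply the filter and display settings
    def click_init(self):
        self.browser.get(self.url)
        # Sidebar filter link (presumably the five-year date range, per the message printed below)
        self.wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#_ds1 > li > ul > li:nth-child(1) > a'))).click()
        # Open the display-settings menu, then pick a results-per-page option
        self.wait.until(
            EC.element_to_be_clickable(
                (By.XPATH, '//ul[@class="inline_list left display_settings"]/li[3]/a/span[4]'))).click()
        self.wait.until(EC.element_to_be_clickable(
            (By.CSS_SELECTOR, '#display_settings_menu_ps > fieldset > ul > li:nth-child(1) > label'))).click()
        print("Crawling papers from the last five years, 200 results per page......")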

    # Parse the source of the page currently shown in the browser
    def get_response(self):
        self.html = self.browser.page_source
        self.doc = etree.HTML(self.html)

    # Collect the PMIDs on the current list page and visit each paper
    def get_info(self):
        self.baseurl = 'https://www.ncbi.nlm.nih.gov/pubmed/'
        self.art_timeanddoi = self.doc.xpath('//div[@class="rprt"]/div[2]/div[2]/div/dl/dd/text()')
        for pmid in self.art_timeanddoi:
            url_content = self.baseurl + pmid  # build the URL of the paper's detail page
            print(url_content)
            self.browser.get(url_content)  # open the detail page
            self.get_response()  # re-parse the page now that it has changed
            self.get_detail(pmid)  # extract and save the paper's details
            self.browser.back()  # return from the detail page to the list page
            self.get_response()

    def get_detail(self, pmid):
        abstract = self.doc.xpath('//div[@class="abstr"]/div/p/text()')  # abstract text
        keywords = self.doc.xpath('//div[@class="keywords"]/p/text()')  # keywords
        title = self.doc.xpath('//div[@class="rprt abstract"]/h1/text()')  # title
        # One output .txt file per paper, named after its PMID
        fileName = "/Users/mac/Desktop/pubmed/data/" + str(pmid) + ".txt"
        # The with-block closes the file even if a write fails
        with open(fileName, 'w', encoding='utf-8') as result:
            result.write("[Title]\r\n")
            result.write(''.join(str(i) for i in title))
            result.write("\r\n[Abstract]\r\n")
            result.write(''.join(str(i) for i in abstract))
            result.write("\r\n[Keywords]\r\n")
            result.write(''.join(str(i) for i in keywords))
        print(str(pmid) + ".txt written")

    # Locate the link to the next page of results
    def next_page(self):
        try:
            self.nextpage = self.wait.until(  # do not click yet; first wait until the link is clickable
                EC.element_to_be_clickable((By.XPATH, '//*[@title="Next page of results"]')))
        except TimeoutException:
            self.status = False

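    def main(self):
        self.click_init()  # apply the initial page settings
        time.sleep(3)  # wait for the page to refresh
        self.get_response()  # parse the refreshed list page
        count = 0  # number of list pages crawled so far
        while True:
            self.get_info()  # crawl the papers on the current list page
            self.next_page()  # look for the next-page link
            if self.status:  # a clickable next-page link was found
                self.nextpage.click()  # go to the next page
                time.sleep(3)  # let the next list page load before parsing it
                self.get_response()
            else:
                print("Could not move to the next page......")
                break
            count = count + 1
            print(str(count))
            if count == 2:  # stop after 2 list pages; raise this value to crawl more
                break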


if __name__ == '__main__':
    arr = ['tuberculosis']  # search keywords for the papers to find, e.g. cancer
    a = crabInfo(arr)
    print(str(arr))
    a.main()
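
The script writes every paper to a hard-coded directory and rewrites a file each time the same PMID turns up again. A minimal sketch of an alternative, assuming a configurable output directory: the helpers `record_path` and `save_record` below are hypothetical names, not part of the original script. They create the directory on demand and skip PMIDs that were already saved, which also means a crawler could check `record_path(...).exists()` before opening a detail page.

```python
from pathlib import Path


def record_path(out_dir, pmid):
    """Return the path of the .txt file for one paper, named after its PMID."""
    return Path(out_dir) / f"{pmid}.txt"


def save_record(out_dir, pmid, title, abstract, keywords):
    """Write one paper in the [Title]/[Abstract]/[Keywords] layout.

    Returns False without writing if a file for this PMID already exists.
    """
    path = record_path(out_dir, pmid)
    if path.exists():
        return False
    # Create the output directory (and any missing parents) on first use
    path.parent.mkdir(parents=True, exist_ok=True)
    text = "[Title]\n{}\n[Abstract]\n{}\n[Keywords]\n{}".format(title, abstract, keywords)
    path.write_text(text, encoding="utf-8")
    return True
```

In `get_detail`, the three joined XPath results would be passed as `title`, `abstract` and `keywords`.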

3. Summary

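The code still has a small bug. In my tests with 5 results per page everything worked, but in the real run with 200 results per page, moving to the next page failed. I have not found the cause yet and will keep debugging. Because I crawled several diseases, I ended up with 1000 records across 5 diseases, and about ten of those papers appear in the search results for two diseases. The data format is as follows: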

[Title]
Hemotrophic mycoplasma in Simmental cattle in Bavaria: prevalence, blood parameters, and transplacental transmission of 'Candidatus Mycoplasma haemobos' and Mycoplasma wenyonii.
[Abstract]
The significance of hemotrophic mycoplasma in cattle remains unclear. Especially in Europe, their epidemiological parameters as well as pathophysiological influence on cows are lacking. The objectives of this study were: (1) to describe the prevalence of 'Candidatus Mycoplasma haemobos' ('C. M. haemobos') and Mycoplasma wenyonii (M. wenyonii) in Bavaria, Germany; (2) to evaluate their association with several blood parameters; (3) to explore the potential of vertical transmission in Simmental cattle; and (4) to evaluate the accuracy of acridine-orange-stained blood smears compared to real-time polymerase chain reaction (PCR) results to detect hemotrophic mycoplasma. A total of 410 ethylenediaminetetraacetic acid-blood samples from cows from 41 herds were evaluated by hematology, acridine-orange-stained blood smears, and real-time PCR. Additionally, blood samples were taken from dry cows of six dairy farms with positive test results for hemotrophic mycoplasma to investigate vertical transmission of infection.The period prevalence of both species was 60.24% (247/410), C. M. haemobos 56.59% (232/410), M. wenyonii 8.54% (35/410) and for coinfection 4.88% (20/410). Of the relevant blood parameters, only mean cell volume (MCV), mean cell hemoglobin (MCH), and white blood cell count (WBC) showed differences between the groups of infected and non-infected individuals. There were lower values of MCV (P < 0.01) and MCH (P < 0.01) and higher values of WBC (P < 0.05) in 'C. M. haemobos'-infected cows. In contrast, co-infected individuals had only higher WBC (P < 0.05). In M. wenyonii-positive blood samples, MCH was significantly lower (P < 0.05). Vertical transmission of 'C. M. haemobos' was confirmed in two calves. The acridine-orange-method had a low sensitivity (37.39%), specificity (65.97%), positive predictive value (63.70%) and negative predictive value (39.75%) compared to PCR.'Candidatus Mycoplasma haemobos' was more prevalent than M. wenyonii in Bavarian Simmental cattle, but infection had little impact on evaluated blood parameters. Vertical transmission of the infection was rare. Real-time PCR is the preferred diagnostic method compared to the acridine-orange-method.
[Keywords]
Acridine-orange-stained blood smears; ; Blood parameters; Cattle; Hemotrophic mycoplasma; M. wenyonii; Prevalence; Real-time PCR; Vertical transmission; ‘C. M. haemobos’

Each file is named after the paper's PubMed ID (PMID).
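
To use the saved files later, the three sections need to be read back. The following is a minimal sketch of a parser for the layout shown above; `parse_record` is a hypothetical helper, not part of the original script. It accepts both the `[Abstract]` header and the misspelled `[Astract]` variant, and drops empty keyword entries such as the stray `; ;` in the sample.

```python
def parse_record(text):
    """Split a saved record into title, abstract and a list of keywords."""
    headers = {
        "[Title]": "title",
        "[Abstract]": "abstract",
        "[Astract]": "abstract",  # tolerate the misspelled header variant
        "[Keywords]": "keywords",
    }
    sections = {"title": [], "abstract": [], "keywords": []}
    current = None
    for line in text.splitlines():
        key = headers.get(line.strip())
        if key is not None:
            current = key  # a header line starts a new section
        elif current is not None:
            sections[current].append(line.strip())
    # Re-join lines inside each section, skipping blank ones
    record = {name: " ".join(part for part in lines if part)
              for name, lines in sections.items()}
    # Keywords are separated by semicolons; drop empty entries
    record["keywords"] = [k.strip() for k in record["keywords"].split(";") if k.strip()]
    return record
```

Reading a file with `open(path, encoding="utf-8").read()` and passing the text to `parse_record` returns a dict with the keys `title`, `abstract` and `keywords`.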

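Because I was not familiar with Selenium's API, I took quite a few detours. The code here is adapted from another author's script to fit my own needs, and I plan to keep studying web scraping more systematically.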

4. References:

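[Python crawler] Targeted scraping of PubMed biomedical abstracts with Selenium
Scraping PubMed with Selenium to get the number of articles published in the last five years for a search keyword
Writing Python crawlers from scratch: introduction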
