Batch paper screening with a super-simple Python script under the Selenium + XPath framework


Author: 瓶瓶瓶平平 | Published 2021-01-01 10:49

    In these last few days of the not-so-ordinary 2020, Miss Lin's share of the group project turned out to be judging, for 155 papers, whether each one is an RCT (Randomized Controlled Trial) study.


    (Image: "Damn! So many")

    Miss Lin said she would read them one by one herself and screen them carefully.
    "My respects. Great! That's so like you."
    Then she lay down and fell asleep.
    Fine! Looking at Miss Lin's not-exactly-dazzling sleeping face, I decided to dust off the Selenium I hadn't touched in thousands of hours (I originally wanted to use Scrapy, but sadly found I'd forgotten most of it).


    First, of course, look for a pattern!
    I searched a few papers on PubMed and found that their RCTs are already labeled. So let's just do it.


    (Image: the RCT label on PubMed)
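
    Since the label sits right on the article page, a rough pre-check can in principle be read straight out of the page source once the driver in the script below is up. A minimal sketch; the marker text "Randomized Controlled Trial" is my assumption here, not something the script below uses, so verify it against a real article page:

    # hypothetical pre-check: after driver.get() on an article page,
    # look for the publication-type label in the raw HTML (assumed marker text)
    if "Randomized Controlled Trial" in driver.page_source:
        print("labeled as RCT")

    The full script: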
    from selenium import webdriver
    import time

    goin = "C:/Users/LIFANGPING/Desktop/inclusion1.txt"    # your list of paper titles
    goout = "C:/Users/LIFANGPING/Desktop/inclusionout.csv" # your output file

    file = open(goin, "r")
    lines = list(file.readlines())
    file.close()
    outfile = open(goout, "w")

    chromedriver = r"C:\Users\LIFANGPING\AppData\Local\Google\Chrome\Application\chromedriver"  # start the browser
    driver = webdriver.Chrome(chromedriver)
    url = "https://pubmed.ncbi.nlm.nih.gov/29747957/"  # a PubMed article page, used as the starting point


    driver.get(url)
    time.sleep(2)  # give the browser and server time to talk; every time.sleep below serves the same purpose
    
    for i in lines:
        driver.refresh()
        time.sleep(1)
        print(i.strip(),end = ",",file = outfile)
       
        time.sleep(2)
        need = i.strip()
        scan = driver.find_element_by_xpath('/html/body/form/div/div[1]/div/span/input')  # locate the search box
        scan.send_keys(need)  # type in the search text
        
        time.sleep(2)
        scanclick = driver.find_element_by_xpath('/html/body/form/div/div[1]/div/button')  # locate the search button
        scanclick.click()  # click the search button
        time.sleep(2)
        try:  # if the search returns several papers, take the best match
            bestmeet = driver.find_element_by_xpath('/html/body/main/div[9]/div[2]/section[1]/div[1]/article/div/a')
            bestmeet.click()
            time.sleep(2)
            doi = driver.find_element_by_xpath('.//*[@class="citation-doi"]').text #get the doi
            time.sleep(1)
            print(doi,file = outfile)
             
        except:  # no result list: the search may have jumped straight to the article page
            try:
                time.sleep(4)
                doi = driver.find_element_by_xpath('.//*[@class="citation-doi"]').text
                time.sleep(2)
                print(doi,file = outfile)
             
            except:  # nothing found; write a blank field and reset to the start page
                driver.get(url)
                time.sleep(2)
                print("",file = outfile)
                continue
                
    outfile.close()
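
    A side note: the fixed time.sleep calls do the job, but Selenium's explicit waits poll until an element actually appears, which is usually quicker and less brittle. A minimal sketch, using the same old-style Selenium 3 API as above:

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # wait up to 10 seconds for the search box to appear, instead of a fixed sleep
    scan = WebDriverWait(driver, 10).until(EC.presence_of_element_located(
        (By.XPATH, '/html/body/form/div/div[1]/div/span/input')))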
    

    The results were quite moving, with a youthful glow!


    (Image: partial results; almost all of them RCT papers!)

    Of course, some papers PubMed either hasn't classified or hasn't indexed at all. Miss Lin, now awake, opened her big little eyes and innocently insisted those had to be gone through carefully. (Fine! It later turned out hardly any of the leftovers were RCTs.)

    Alright. About all the script can still do is download the papers by DOI. Downloading needs a source, and this is where the work of science goddess Alexandra Elbakyan comes in (it seems that in the new version of the site she no longer waves; it snows instead).


    (Image: the DOI list, tidied by hand)
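
    If you'd rather skip the hand-tidying: the citation-doi text scraped above carries a "doi:" prefix and trailing punctuation, so a tiny helper can cut it down to the bare DOI. A sketch; the exact text format is my assumption, so check it against your own output:

    import re

    def clean_doi(raw):
        # e.g. "doi: 10.1056/NEJMoa1800566. Epub ..." -> "10.1056/NEJMoa1800566"  (assumed format)
        m = re.search(r'10\.\d{4,9}/\S+', raw)
        return m.group(0).rstrip('.') if m else ""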

    Here we go. Damn!
    For the record: I run the script in the Ubuntu subsystem on Windows (WSL), so wget works out of the box.

    from selenium import webdriver
    import os 
    
    download_dir = "C:/Users/LIFANGPING/Desktop/allpdf/" # for linux/*nix, download_dir="/usr/Public"
    options = webdriver.ChromeOptions()
    
    profile = {"plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}], # Disable Chrome's PDF Viewer
                   "download.default_directory": download_dir , "download.extensions_to_open": "applications/pdf"}
    options.add_experimental_option("prefs", profile)
    driver = webdriver.Chrome("C:/Users/LIFANGPING/AppData/Local/Google/Chrome/Application/chromedriver", options=options)
    # Optional argument, if not specified will search path.
    file = open("C:/Users/LIFANGPING/Desktop/doi-part2.txt","r")
    lines = list(file.readlines())
    file.close()
    
    srclist = []
    for i in lines:
        doi = i.strip()
        print(doi)
        try:
            driver.get("https://sci-hub.se/"+doi)
            src = driver.find_element_by_xpath("//*[@id='pdf']").get_attribute("src")  # the PDF sits in an iframe; take its src
            srclist.append(src)
        except:
            continue
    
    for i in srclist:  # hand each collected PDF URL to wget
        command = "wget " + i
        os.system(command)
    

    One thing to note: the PDF the hub page displays isn't actually on that page; it is pulled in through an iframe from another page. So you first locate that frame with XPath, then hand its src straight to wget.
    The results were excellent.
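
    If you're not on WSL and have no wget handy, the same download loop can be done in pure Python with requests. A minimal sketch (that the src values may come back protocol-relative, starting with "//", is my assumption, so check what your srclist actually holds):

    import requests

    for src in srclist:
        if src.startswith("//"):  # assumed: the iframe src may be protocol-relative
            src = "https:" + src
        fname = src.split("/")[-1].split("#")[0] or "paper.pdf"
        resp = requests.get(src, timeout=60)
        if resp.ok:
            with open(download_dir + fname, "wb") as f:
                f.write(resp.content)  # drop the PDF into the download folder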


    (Image: all downloaded, but I really don't want to read them)

    Is there anything that can read the PDFs for me? (What follows is not exactly reliable.)
    Let's use PyPDF2 and search for keywords. Mine are:

    word_list = ['randomly assigned','randomlyassigned','random assi','randomass','randomizedcontrolledtrial','randomized controlled trial','randomlyallo','randomallo','random allo','Randomizedcontrolledtrial','Randomizedclinicaltrial','randomizedclinicaltrial']
    

    The full code:

    import PyPDF2
    import os
    
    
    path = r"C:\Users\LIFANGPING\Desktop\newpdf"
    pdflist = os.listdir(path)
    
    for pdfgo in pdflist:
        pdf_File = open(path + "/" + pdfgo, 'rb')
        print(pdfgo, end=",")
        try:
            pdf_Obj = PyPDF2.PdfFileReader(pdf_File)  # old PyPDF2 (1.x) reader API
            pages = pdf_Obj.getNumPages()

            word_list = ['randomly assigned','randomlyassigned','random assi','randomass','randomizedcontrolledtrial','randomized controlled trial','randomlyallo','randomallo','random allo','Randomizedcontrolledtrial','Randomizedclinicaltrial','randomizedclinicaltrial']
    
    
            for w in word_list:
                page_list=[]
                for p in range(0,pages):
                    text=pdf_Obj.getPage(p).extractText().strip()
    
                    if text.find(w) != -1:
                        page_list.append(p+1)
    
                print(w,page_list,end = ",")
    
            print()
        except:
            continue
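
    One caveat: the keyword list carries run-together forms like 'randomlyassigned' because extractText often drops the spaces between words. An alternative, just a sketch rather than what I actually ran, is to squash the page text once so a single spelling per keyword is enough:

    import re

    def page_hits(text, keywords):
        # collapse all whitespace and lowercase, so one form per keyword suffices
        squashed = re.sub(r'\s+', '', text).lower()
        return [k for k in keywords if k in squashed]

    keywords = ['randomlyassigned', 'randomassi', 'randomizedcontrolledtrial',
                'randomizedclinicaltrial', 'randomlyallo', 'randomallo']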
    

    It's actually pretty reliable: if a paper isn't an RCT there are simply no hits, and if it is, every keyword turns up on several pages!

    (Image: non-RCTs get no hits; RCTs get several)

    Still, she couldn't quite trust it and read them one by one anyway (why not trust the machine?).
    I'm beat.

    Miss Lin's group project leader: "Li, who isn't even in our group, is an absolute treasure!"
    Meow meow meow?

    Happy New Year, everyone!
