In the last few days of the eventful year 2020, Miss Lin's share of her group project came down to judging whether each of 155 papers was an RCT (Randomized Controlled Trial) study.
(Screenshot: damn, that's a lot of them.)
Miss Lin said she would read them one by one and screen them carefully herself.
"My deepest respect. Bravo! As expected of you."
Then she lay down and fell asleep.
Fine! Gazing at Miss Lin's not-exactly-dazzling sleeping face, I decided to dust off the Selenium I hadn't touched in thousands of hours (I originally wanted to use scrapy, but sadly found I'd forgotten most of it).
First, of course: look for a pattern!
I searched a few of the papers on PubMed and found that its RCTs already carry a label. So let's just get it done.
(Screenshot: the RCT label on PubMed.)
from selenium import webdriver
import time

goin = "C:/Users/LIFANGPING/Desktop/inclusion1.txt"     # your list of paper titles
goout = "C:/Users/LIFANGPING/Desktop/inclusionout.csv"  # your output file

file = open(goin, "r")
lines = list(file.readlines())
file.close()
outfile = open(goout, "w")

chromedriver = r"C:\Users\LIFANGPING\AppData\Local\Google\Chrome\Application\chromedriver"  # start the browser
driver = webdriver.Chrome(chromedriver)
url = "https://pubmed.ncbi.nlm.nih.gov/29747957/"  # the PubMed page of one paper, used as the entry point
driver.get(url)
time.sleep(2)  # leave enough time for the server to respond; every time.sleep below exists for the same reason

for i in lines:
    driver.refresh()
    time.sleep(1)
    print(i.strip(), end=",", file=outfile)
    time.sleep(2)
    need = i.strip()
    scan = driver.find_element_by_xpath('/html/body/form/div/div[1]/div/span/input')  # locate the search box
    scan.send_keys(need)  # type in the query
    time.sleep(2)
    scanclick = driver.find_element_by_xpath('/html/body/form/div/div[1]/div/button')  # locate the search button
    scanclick.click()  # click the search button
    time.sleep(2)
    try:  # if the search returns multiple papers, pick the best match
        bestmeet = driver.find_element_by_xpath('/html/body/main/div[9]/div[2]/section[1]/div[1]/article/div/a')
        bestmeet.click()
        time.sleep(2)
        doi = driver.find_element_by_xpath('.//*[@class="citation-doi"]').text  # grab the DOI
        time.sleep(1)
        print(doi, file=outfile)
    except:
        try:  # the search may have jumped straight to the article page
            time.sleep(4)
            doi = driver.find_element_by_xpath('.//*[@class="citation-doi"]').text
            time.sleep(2)
            print(doi, file=outfile)
        except:  # no usable result: reset to the entry page and leave this field empty
            driver.get(url)
            time.sleep(2)
            print("", file=outfile)
            continue
outfile.close()
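An aside: this lookup doesn't strictly need a browser. PubMed also offers the official E-utilities HTTP API, and a single esearch request restricted to the Randomized Controlled Trial publication type answers the same yes/no question. A minimal sketch, assuming the requests package is installed; looks_like_rct and the sample title are my own placeholders, not part of the original script:

import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def looks_like_rct(title):
    # Search the title, restricted to PubMed's Randomized Controlled Trial
    # publication type; a non-zero hit count means PubMed itself tags it as an RCT.
    params = {
        "db": "pubmed",
        "term": f"({title}[Title]) AND Randomized Controlled Trial[pt]",
        "retmode": "json",
    }
    r = requests.get(ESEARCH, params=params, timeout=30)
    r.raise_for_status()
    return int(r.json()["esearchresult"]["count"]) > 0

print(looks_like_rct("some paper title here"))  # hypothetical title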
The results were quite moving, with a touch of youthful energy!
(Screenshot: partial results — almost all of the RCT papers!)
Of course, some papers weren't classified or even indexed by PubMed. Miss Lin woke up, opened her big little eyes, and innocently insisted those leftovers had to be gone through carefully. (Fine! As it later turned out, barely any of them were RCTs.)
Fine. About the only thing left for a script to do is download the papers by DOI. A download needs a source, so the work of science goddess Alexandra Elbakyan enters the stage. (Apparently on the new version of the site she no longer waves; it snows instead.)
(Screenshot: the hand-tweaked DOI list.)
Here we go. Damn it!
A note up front: I ran this script inside the Ubuntu subsystem on Windows (WSL), which is why I can call Wget directly.
from selenium import webdriver
import os

download_dir = "C:/Users/LIFANGPING/Desktop/allpdf/"  # for Linux/*nix, download_dir = "/usr/Public"
options = webdriver.ChromeOptions()
profile = {"plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}],  # disable Chrome's built-in PDF viewer
           "download.default_directory": download_dir, "download.extensions_to_open": "applications/pdf"}
options.add_experimental_option("prefs", profile)
driver = webdriver.Chrome("C:/Users/LIFANGPING/AppData/Local/Google/Chrome/Application/chromedriver", options=options)
# The driver path is optional; if not specified, Selenium searches the PATH.

file = open("C:/Users/LIFANGPING/Desktop/doi-part2.txt", "r")
lines = list(file.readlines())
file.close()

srclist = []
for i in lines:
    doi = i.strip()
    print(doi)
    try:
        driver.get("https://sci-hub.se/" + doi)
        src = driver.find_element_by_xpath("//*[@id='pdf']").get_attribute("src")  # the PDF sits in an iframe; grab its src
        srclist.append(src)
    except:
        continue

for src in srclist:  # download each collected PDF link with wget
    command = "wget " + src
    os.system(command)
Note that the PDF shown on the Sci-Hub page is not actually part of that page; it is pulled in from another page through an iframe. You first have to locate that frame's URL via XPath, and can then download it directly with Wget.
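If Wget isn't at hand (say, outside WSL), the same step can stay in Python. A minimal sketch, assuming the requests package and that the collected iframe src values are directly fetchable PDF URLs (Sci-Hub sometimes emits protocol-relative //... links, which the sketch patches up); download_pdfs is a made-up name:

import os
import requests

def download_pdfs(srclist, out_dir="allpdf"):
    os.makedirs(out_dir, exist_ok=True)
    for src in srclist:
        if src.startswith("//"):  # patch up protocol-relative links
            src = "https:" + src
        name = src.split("/")[-1].split("#")[0] or "paper.pdf"  # crude filename guess
        resp = requests.get(src, timeout=60)
        resp.raise_for_status()
        with open(os.path.join(out_dir, name), "wb") as f:
            f.write(resp.content)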
The results were excellent.
(Screenshot: all downloaded, but I really don't feel like reading them.)
Is there anything that can read the PDFs for me? (What follows is not entirely reliable.)
Let's use PyPDF2 and search for keywords. Mine are:
word_list = ['randomly assigned', 'randomlyassigned', 'random assi', 'randomass', 'randomizedcontrolledtrial', 'randomized controlled trial', 'randomlyallo', 'randomallo', 'random allo', 'Randomizedcontrolledtrial', 'Randomizedclinicaltrial', 'randomizedclinicaltrial']
Full code:
import PyPDF2
import os

path = r"C:\Users\LIFANGPING\Desktop\newpdf"
pdflist = os.listdir(path)
word_list = ['randomly assigned', 'randomlyassigned', 'random assi', 'randomass',
             'randomizedcontrolledtrial', 'randomized controlled trial',
             'randomlyallo', 'randomallo', 'random allo',
             'Randomizedcontrolledtrial', 'Randomizedclinicaltrial', 'randomizedclinicaltrial']

for pdfgo in pdflist:
    pdf_File = open(path + "/" + pdfgo, 'rb')
    print(pdfgo, end=",")
    try:
        pdf_Obj = PyPDF2.PdfFileReader(pdf_File)
        pages = pdf_Obj.getNumPages()
        for w in word_list:
            page_list = []
            for p in range(0, pages):
                text = pdf_Obj.getPage(p).extractText().strip()
                if text.find(w) != -1:
                    page_list.append(p + 1)  # record the 1-based pages where the keyword appears
            print(w, page_list, end=",")
        print()
    except:
        continue
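A tighter variant of the same idea: instead of hand-enumerating spacing and casing variants, match one case-insensitive regex that tolerates the squeezed whitespace extractText() tends to produce. Just a sketch; the pattern is my own guess at which phrases are worth catching, and rct_pages is a made-up name:

import re
import PyPDF2

RCT_PATTERN = re.compile(
    r"random(?:ly|ized|ised)?\s*(?:assign|alloc|controlled\s*(?:clinical\s*)?trial|clinical\s*trial)",
    re.IGNORECASE,
)

def rct_pages(pdf_path):
    # Return the 1-based page numbers whose extracted text matches the pattern.
    with open(pdf_path, "rb") as f:
        pdf_obj = PyPDF2.PdfFileReader(f)
        return [p + 1 for p in range(pdf_obj.getNumPages())
                if RCT_PATTERN.search(pdf_obj.getPage(p).extractText())]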
The keyword scan actually turned out fairly reliable: the non-RCTs got no hits at all, and the RCTs each got several!
(Screenshot: non-RCTs get nothing; RCTs get several hits.)
She still didn't dare trust it and read them one by one anyway. (Why won't anyone trust the machine?)
I'm beat, I'm beat.
Miss Lin's group project leader: "Student Li, who isn't even in our group, is such a treasure!"
Meow meow meow?