Detecting a specific word or phrase in scientific abstracts

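The approach from the previous article can be reused to detect a word or phrase in scientific abstracts. More generally, this example also works for very simple text mining, comparable to the "Find" tool in Microsoft Word.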
import urllib2
import re

# word to be searched; the flag makes the match case-insensitive
keyword = re.compile('schistosoma', re.IGNORECASE)
# list of PMIDs where we want to search the word
pmids = ['18235848', '22607149', '22405002', '21630672']
# patterns that locate the title and the abstract in the page HTML
title_regexp = re.compile('<h1>.{5,400}</h1>')
abstract_regexp = re.compile('<h3>Abstract</h3><p>.{20,3000}</p></div>')

for pmid in pmids:
    url = 'http://www.ncbi.nlm.nih.gov/pubmed?term=%s' % pmid
    handler = urllib2.urlopen(url)
    html = handler.read()
    title = title_regexp.search(html)
    abstract = abstract_regexp.search(html)
    # skip pages where the title or the abstract cannot be located
    if title is None or abstract is None:
        continue
    title = title.group()
    abstract = abstract.group()
    # first occurrence of the keyword in the abstract, or None
    word = keyword.search(abstract)
    if word:
        # display title and where the keyword was found
        print (title)
        print (word.group(), word.start(), word.end())
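One detail of case-insensitive matching is easy to get wrong: the second positional argument of a compiled pattern's search() is a start position, not a flag, so flags such as re.IGNORECASE have to be given to re.compile(). The short sketch below illustrates this on an in-memory string; the sample sentence is made up for the illustration and is not taken from PubMed.

```python
import re

# Hypothetical sample text standing in for a downloaded abstract.
text = "Schistosoma mansoni infects humans; schistosoma eggs were counted."

# Flags are passed to re.compile(), not to the compiled pattern's search().
case_sensitive = re.compile('schistosoma')
case_insensitive = re.compile('schistosoma', re.IGNORECASE)

first_sensitive = case_sensitive.search(text)
first_insensitive = case_insensitive.search(text)

# The case-sensitive pattern skips the capitalized occurrence at the start.
print(first_sensitive.span())
# The case-insensitive pattern finds the very first occurrence.
print(first_insensitive.span())
```

With the flag set at compile time, the capitalized genus name at the start of a sentence is matched as well.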
To find every match of the word in the text rather than only the first one, use the finditer() method:
import urllib2
import re

# word to be searched; the flag makes the match case-insensitive
keyword = re.compile('schistosoma', re.IGNORECASE)
# list of PMIDs where we want to search the word
pmids = ['18235848', '22607149', '22405002', '21630672']
# patterns that locate the title and the abstract in the page HTML
title_regexp = re.compile('<h1>.{5,400}</h1>')
abstract_regexp = re.compile('<h3>Abstract</h3><p>.{20,3000}</p></div>')

for pmid in pmids:
    url = 'http://www.ncbi.nlm.nih.gov/pubmed?term=%s' % pmid
    handler = urllib2.urlopen(url)
    html = handler.read()
    title = title_regexp.search(html)
    abstract = abstract_regexp.search(html)
    # skip pages where the title or the abstract cannot be located
    if title is None or abstract is None:
        continue
    title = title.group()
    abstract = abstract.group()
    # finditer() returns an iterator, which is always true in a test,
    # so collect the matches in a list before checking for hits
    words = list(keyword.finditer(abstract))
    if words:
        # display title and where the keyword was found
        print (title)
        for word in words:
            print (word.group(), word.start(), word.end())
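The finditer() step can also be tried without any network access. The following is a minimal sketch, assuming a made-up sample string in place of a downloaded abstract; the helper name find_all is hypothetical and introduced only for this illustration.

```python
import re


def find_all(pattern, text):
    """Return (matched_text, start, end) for every non-overlapping match."""
    return [(m.group(), m.start(), m.end()) for m in pattern.finditer(text)]


keyword = re.compile('schistosoma', re.IGNORECASE)

# Hypothetical sample text; not a real PubMed abstract.
sample = "Schistosoma mansoni and Schistosoma japonicum cause schistosomiasis."

hits = find_all(keyword, sample)
# A list is empty when nothing matched, so it can be tested directly.
if hits:
    for hit in hits:
        print(hit)
```

Note that the disease name "schistosomiasis" is not reported, because the pattern requires the final "a" of "schistosoma"; a shorter pattern such as 'schistosom' would be needed to catch both forms.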