Detecting specific words or phrases in scientific abstracts (self-study, day 43)


Author: 天明豆豆 | Published 2020-03-24 17:37

    Detecting specific words or phrases in scientific abstracts


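    The approach used in the previous post can also detect words or phrases in scientific abstracts. More generally, this example is suitable for very simple text mining, comparable to the "Find" tool in Microsoft Word.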

    from urllib.request import urlopen  # Python 3 replacement for urllib2
    import re

    # word to be searched; the flag is given at compile time
    # so that the match is case-insensitive
    keyword = re.compile('schistosoma', re.IGNORECASE)

    # patterns for the title and the abstract; they assume the
    # PubMed HTML layout that the original recipe was written for
    title_regexp = re.compile('<h1>.{5,400}</h1>')
    abstract_regexp = re.compile('<h3>Abstract</h3><p>.{20,3000}</p></div>')

    # list of PMIDs where we want to search the word
    pmids = ['18235848', '22607149', '22405002', '21630672']
    for pmid in pmids:
        url = 'http://www.ncbi.nlm.nih.gov/pubmed?term=%s' % pmid
        handler = urlopen(url)
        html = handler.read().decode('utf-8')
        title = title_regexp.search(html)
        abstract = abstract_regexp.search(html)
        if title is None or abstract is None:
            continue  # the page did not match the expected layout
        word = keyword.search(abstract.group())
        if word:
            # display title and where the keyword was found
            print(title.group())
            print(word.group(), word.start(), word.end())
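
Since the listing above depends on a live PubMed page, the search step can be tried offline first. The following is a minimal sketch under stated assumptions: the sample text and the helper name `find_keyword` are made up for illustration and are not part of the original recipe. It passes `re.IGNORECASE` to `re.compile()`, because the second positional argument of a compiled pattern's `search()` is a start position, not a flag.

```python
import re

# Hypothetical sample text standing in for a downloaded abstract.
SAMPLE_ABSTRACT = ("Schistosoma mansoni is a parasite. "
                   "Infection with schistosoma is common.")


def find_keyword(text, word):
    """Return (matched text, start, end) for the first
    case-insensitive hit of `word` in `text`, or None."""
    # re.escape treats the word literally, so characters such as
    # '.' or '+' in a search term are not read as regex syntax.
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    match = pattern.search(text)
    if match is None:
        return None
    return match.group(), match.start(), match.end()


print(find_keyword(SAMPLE_ABSTRACT, "schistosoma"))  # first hit only
print(find_keyword(SAMPLE_ABSTRACT, "plasmodium"))   # no hit
```

The first call reports the capitalized occurrence at the very start of the text, which a case-sensitive pattern would skip.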
    

    To find every occurrence of the word in the text, use the finditer() method:

    from urllib.request import urlopen  # Python 3 replacement for urllib2
    import re

    # word to be searched
    keyword = re.compile('schistosoma')

    # patterns for the title and the abstract; they assume the
    # PubMed HTML layout that the original recipe was written for
    title_regexp = re.compile('<h1>.{5,400}</h1>')
    abstract_regexp = re.compile('<h3>Abstract</h3><p>.{20,3000}</p></div>')

    # list of PMIDs where we want to search the word
    pmids = ['18235648', '22607149', '22405002', '21630672']
    for pmid in pmids:
        url = 'http://www.ncbi.nlm.nih.gov/pubmed?term=%s' % pmid
        handler = urlopen(url)
        html = handler.read().decode('utf-8')
        title = title_regexp.search(html)
        abstract = abstract_regexp.search(html)
        if title is None or abstract is None:
            continue  # the page did not match the expected layout
        # finditer() returns an iterator, which is always truthy,
        # so collect the matches into a list before testing
        words = list(keyword.finditer(abstract.group()))
        if words:
            # display title and where the keyword was found
            print(title.group())
            for word in words:
                print(word.group(), word.start(), word.end())
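
One detail of `finditer()` is easy to miss: it returns an iterator, and an iterator object is truthy even when it yields nothing, so a truth test is only meaningful once the matches have been collected into a list. A minimal offline sketch of this, assuming a made-up sample text and a hypothetical helper named `find_all` (neither comes from the original recipe):

```python
import re

# Hypothetical sample text standing in for a downloaded abstract.
SAMPLE_ABSTRACT = ("Schistosoma mansoni is a parasite. "
                   "Infection with schistosoma is common.")


def find_all(text, word):
    """Return a list of (matched text, start, end) tuples for every
    case-insensitive occurrence of `word` in `text`."""
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    return [(m.group(), m.start(), m.end()) for m in pattern.finditer(text)]


hits = find_all(SAMPLE_ABSTRACT, "schistosoma")
if hits:  # an empty list is falsy, unlike the iterator itself
    for hit in hits:
        print(hit)

# The raw iterator is truthy even though this word never occurs.
print(bool(re.compile("plasmodium").finditer(SAMPLE_ABSTRACT)))
```

Returning a list also lets the caller count the matches with `len()`, which an iterator does not support.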
