Fetching and Printing Information from a Given Drug Web Page with Python


Author: FANHONGZENG | Published: 2023-11-23 11:25

    Target page: https://www.fda.gov/drugs/resources-information-approved-drugs/atezolizumab-urothelial-carcinoma

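    In Chrome, right-click the data you need and choose Inspect to see the HTML tags around it.

    Use Google Chrome for this: the steps below rely on its Copy > Copy full XPath menu item.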
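Before running against the live page, it can help to see how an XPath expression maps onto HTML tags. The following is a minimal sketch on a made-up snippet: `SAMPLE_HTML`, its `id` and `class` values, and the title text are all hypothetical and do not come from the FDA page. It compares a full path from the root with a shorter attribute-based path.

```python
from lxml import etree

# Hypothetical, simplified HTML standing in for a real page
SAMPLE_HTML = """
<html><body>
<div id="content">
  <h1 class="content-title">Sample Drug Title</h1>
  <p>First paragraph.</p>
</div>
</body></html>
"""

tree = etree.HTML(SAMPLE_HTML)

# Full (absolute) path: every step from the root has to match
absolute = tree.xpath('/html/body/div/h1/text()')

# Attribute-based path: finds the h1 wherever it sits in the tree
by_class = tree.xpath('//h1[@class="content-title"]/text()')

print(absolute)
print(by_class)
```

Both expressions select the same text node here. The full path stops matching as soon as a wrapper element is added or removed, while the attribute-based path only depends on the `class` value.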

    Method 1

    
    from requests_html import HTMLSession 
    
    from lxml import etree 
    
    session = HTMLSession() 
    
    r = session.get('https://www.fda.gov/drugs/resources-information-approved-drugs/atezolizumab-urothelial-carcinoma')
    
    # Print the full HTML source of the page
    
    print(r.html.html)
    
    html=etree.HTML(r.html.html)
    
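    # Extract the title
    
    titles = html.xpath('/html/body/div[2]/div[1]/div/main/article/header/section/div/h1/text()')
    
    # This uses the full (absolute) XPath.
    
    # In Chrome, right-click the data you need and choose Inspect to see its HTML tag.
    
    # Find the HTML for the title, right-click it and choose Copy > Copy full XPath, then append /text() to the copied path.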
    
    


    
    print(titles)
    
    #['Atezolizumab for Urothelial Carcinoma']
    
    # Extract the first paragraph
    
    first_paragraph=html.xpath('/html/body/div[2]/div[1]/div/main/article/div/div[1]/text()')  
    
    print(first_paragraph)
    
    #['On May 18, 2016, the U. S. Food and Drug Administration gave accelerated approval to atezolizumab
    
    # injection (Tecentriq, Genentech, Inc.) for the treatment of patients with locally advanced or metastatic 
    
    #urothelial carcinoma who have disease progression during or following platinum-containing chemotherapy
    
    # or have disease progression within 12 months of neoadjuvant or adjuvant treatment with platinum-containing 
    
    #chemotherapy. \xa0\xa0Atezolizumab is a programmed death-ligand 1 (PD-L1) blocking antibody.']
    
    # Extract the date
    
    data = html.xpath('/html/body/div[2]/div[1]/div/main/article/aside[1]/section/div/aside/ul/div/li/div/p/time/text()')
    
    print(data)
    
    #['05/19/2016']
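The first-paragraph output above contains `\xa0` (non-breaking space) characters. One possible way to tidy such text before further use is sketched below; `clean_text` is a hypothetical helper name, not part of `lxml` or `requests_html`, and the sample string is a shortened copy of the output shown above.

```python
def clean_text(fragments):
    """Join text fragments and collapse runs of whitespace into single spaces."""
    joined = ' '.join(fragments)
    # str.split() with no arguments splits on any Unicode whitespace,
    # which includes the non-breaking space U+00A0
    return ' '.join(joined.split())


sample = ['platinum-containing chemotherapy. \xa0\xa0Atezolizumab is a '
          'programmed death-ligand 1 (PD-L1) blocking antibody.']

print(clean_text(sample))
```

Because `xpath(...)` with `text()` returns a list of strings, the helper takes a list and returns one cleaned string.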
    
    

    Method 2

    
    import requests
    
    import lxml.html
    
    # Extract the title
    
    html = requests.get('https://www.fda.gov/drugs/resources-information-approved-drugs/atezolizumab-urothelial-carcinoma')
    
    doc = lxml.html.fromstring(html.content)
    
    new_releases = doc.xpath('//section[@id="block-entityviewcontent-2"]')[0]
    
    titles = new_releases.xpath('.//h1[@class="content-title text-center"]/text()')
    
    print(titles)
    
    #['Atezolizumab for Urothelial Carcinoma']
    
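    # Extract the first paragraph
    
    new_releases2 = doc.xpath('//div[@class="col-md-8 col-md-push-2"]')[0]
    
    # The first paragraph has no identifying attribute in the HTML, so the full XPath is used here
    
    first_paragraph = new_releases2.xpath('/html/body/div[2]/div[1]/div/main/article/div/div[1]/text()')
    
    print(first_paragraph)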
    
    #['On May 18, 2016, the U. S. Food and Drug Administration gave accelerated approval to 
    
    #atezolizumab injection (Tecentriq, Genentech, Inc.) for the treatment of patients with locally
    
    #advanced or metastatic urothelial carcinoma who have disease progression during or following
    
    #platinum-containing chemotherapy or have disease progression within 12 months of neoadjuvant 
    
    #or adjuvant treatment with platinum-containing chemotherapy. \xa0\xa0Atezolizumab is a programmed
    
    # death-ligand 1 (PD-L1) blocking antibody.']
    
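    # Extract the date
    
    new_releases3 = doc.xpath('//div[@class="node-current-date"]')[0]
    
    # Use './/' to search only inside the selected div; a bare string predicate
    # such as ["2016-05-19T03:36:00Z"] is always true and filters nothing
    
    data = new_releases3.xpath('.//time/text()')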
    
    print(data)
    
    #['05/19/2016']
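Two XPath details matter when selecting the date: a path starting with `//` searches the whole document even when called on a sub-element, and a predicate that is only a string literal is always true. The sketch below illustrates both on made-up markup; `SAMPLE_HTML`, the second `time` element, and the assumption that the timestamp lives in a `datetime` attribute are hypothetical and not taken from the FDA page.

```python
import lxml.html

# Hypothetical markup with two <time> elements
SAMPLE_HTML = """
<html><body>
  <div class="node-current-date">
    <p><time datetime="2016-05-19T03:36:00Z">05/19/2016</time></p>
  </div>
  <div class="other">
    <p><time datetime="2020-01-01T00:00:00Z">01/01/2020</time></p>
  </div>
</body></html>
"""

doc = lxml.html.fromstring(SAMPLE_HTML)
date_div = doc.xpath('//div[@class="node-current-date"]')[0]

# '//' ignores the context node, and the string predicate is always true,
# so this matches every <time> element in the document
everything = date_div.xpath('//time["2016-05-19T03:36:00Z"]/text()')

# './/' restricts the search to descendants of date_div
scoped = date_div.xpath('.//time/text()')

# An attribute test filters on the actual attribute value
by_attr = doc.xpath('//time[@datetime="2016-05-19T03:36:00Z"]/text()')

print(everything)
print(scoped)
print(by_attr)
```

On a page with a single `time` element all three forms return the same list, which is why the difference is easy to miss.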
    
    

    References:

    https://timber.io/blog/an-intro-to-web-scraping-with-lxml-and-python/

    https://www.w3school.com.cn/xpath/xpath_syntax.asp


    Original article: https://www.haomeiwen.com/subject/dccvfrtx.html