Target page: https://www.fda.gov/drugs/resources-information-approved-drugs/atezolizumab-urothelial-carcinoma
In Chrome, right-click the data you need and choose Inspect to see the HTML tag that holds it.
Be sure to use Google Chrome.
Method 1
from requests_html import HTMLSession
from lxml import etree
session = HTMLSession()
r = session.get('https://www.fda.gov/drugs/resources-information-approved-drugs/atezolizumab-urothelial-carcinoma')
## Print the full HTML of the page
print(r.html.html)
html=etree.HTML(r.html.html)
#### Extract the title
titles=html.xpath('/html/body/div[2]/div[1]/div/main/article/header/section/div/h1/text()')
# This uses the full (absolute) XPath.
# In Chrome, right-click the data you need and choose Inspect,
# find the HTML for the title, right-click it and choose Copy > Copy full XPath, then append /text() to the copied path.
print(titles)
#['Atezolizumab for Urothelial Carcinoma']
#### Extract the first paragraph
first_paragraph=html.xpath('/html/body/div[2]/div[1]/div/main/article/div/div[1]/text()')
print(first_paragraph)
#['On May 18, 2016, the U. S. Food and Drug Administration gave accelerated approval to atezolizumab
# injection (Tecentriq, Genentech, Inc.) for the treatment of patients with locally advanced or metastatic
#urothelial carcinoma who have disease progression during or following platinum-containing chemotherapy
# or have disease progression within 12 months of neoadjuvant or adjuvant treatment with platinum-containing
#chemotherapy. \xa0\xa0Atezolizumab is a programmed death-ligand 1 (PD-L1) blocking antibody.']
#### Extract the date
date = html.xpath('/html/body/div[2]/div[1]/div/main/article/aside[1]/section/div/aside/ul/div/li/div/p/time/text()')
print(date)
#['05/19/2016']
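The paragraph text above comes back as a list of text nodes that still contains non-breaking spaces (`\xa0`). A minimal sketch of one way to tidy such a result is shown here; `clean_text` is a hypothetical helper name, not part of requests_html or lxml, and the sample string is a shortened copy of the output shown earlier.

```python
def clean_text(fragments):
    """Join the text nodes returned by xpath() and collapse whitespace.

    str.split() with no argument splits on any Unicode whitespace,
    which includes the non-breaking space U+00A0.
    """
    joined = "".join(fragments)
    return " ".join(joined.split())


# Shortened sample of the list returned for the first paragraph
sample = ['chemotherapy. \xa0\xa0Atezolizumab is a programmed death-ligand 1 (PD-L1) blocking antibody.']
print(clean_text(sample))
```

The same helper can be applied to the title and date lists, since each is also a list of strings.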
Method 2
import requests
import lxml.html
#### Extract the title
html = requests.get('https://www.fda.gov/drugs/resources-information-approved-drugs/atezolizumab-urothelial-carcinoma')
doc = lxml.html.fromstring(html.content)
new_releases = doc.xpath('//section[@id="block-entityviewcontent-2"]')[0]
titles = new_releases.xpath('.//h1[@class="content-title text-center"]/text()')
print(titles)
#['Atezolizumab for Urothelial Carcinoma']
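#### Extract the first paragraph
new_releases2=doc.xpath('//div[@class="col-md-8 col-md-push-2"]')[0]
## The first paragraph's HTML has no identifying attribute, so the full XPath is used here.
## A path starting with /html is evaluated from the document root, whichever element it is called on.
first_paragraph=new_releases2.xpath('/html/body/div[2]/div[1]/div/main/article/div/div[1]/text()')
print(first_paragraph)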
#['On May 18, 2016, the U. S. Food and Drug Administration gave accelerated approval to
#atezolizumab injection (Tecentriq, Genentech, Inc.) for the treatment of patients with locally
#advanced or metastatic urothelial carcinoma who have disease progression during or following
#platinum-containing chemotherapy or have disease progression within 12 months of neoadjuvant
#or adjuvant treatment with platinum-containing chemotherapy. \xa0\xa0Atezolizumab is a programmed
# death-ligand 1 (PD-L1) blocking antibody.']
#### Extract the date
new_releases3=doc.xpath('//div[@class="node-current-date"]')[0]
# Relative path: the <time> element inside the node-current-date block
date=new_releases3.xpath('.//time/text()')
print(date)
#['05/19/2016']
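When an XPath query is run on an element that was itself selected by an earlier query, a leading `//` still searches the whole document, while `.//` stays inside that element. The sketch below illustrates the difference with lxml on a small inline HTML string; the markup is hypothetical and only loosely modeled on the class name used above, not copied from the FDA page.

```python
import lxml.html

# Hypothetical markup for illustration only
SAMPLE = """
<html><body>
  <div class="node-current-date"><time>05/19/2016</time></div>
  <div class="other"><time>01/01/2020</time></div>
</body></html>
"""

doc = lxml.html.fromstring(SAMPLE)
block = doc.xpath('//div[@class="node-current-date"]')[0]

# './/' is relative: only <time> elements inside `block` are matched
scoped = block.xpath('.//time/text()')

# '//' is absolute: every <time> element in the document is matched,
# even though the query is called on `block`
unscoped = block.xpath('//time/text()')

print(scoped)
print(unscoped)
```

Scoping the second query with `.//` keeps the result tied to the block that was selected first, which matters once a page contains more than one `<time>` element.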