Target site: https://www.geyanw.com/
First, grab the site's top-level category URLs:
import requests
from lxml import html

url = 'https://www.geyanw.com/'
response = requests.get(url).text
selector = html.fromstring(response)
# The category links sit in the bold headings of the left-hand column
myArr = selector.xpath('//*[@id="p_left"]/div/dl/dt/strong/a/@href')
print(len(myArr))
for detail in myArr:
    arrUrl = url + detail
    print(arrUrl)
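One caveat: url + detail assumes every href is a path relative to the site root; if an href carries a leading slash, the joined URL picks up a doubled slash. urllib.parse.urljoin from the standard library handles both forms. A minimal sketch (the two example category paths are made up for illustration):

from urllib.parse import urljoin

base = 'https://www.geyanw.com/'
# urljoin resolves relative and root-absolute hrefs the same way,
# so the result never contains a doubled slash.
for detail in ['lizhimingyan/', '/renshenggeyan/']:
    print(urljoin(base, detail))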
This yields 9 categories; each category is then handled separately:
def getDetailUrl(arrUrl):
    response = requests.get(arrUrl).text
    selector = html.fromstring(response)
    # Grab one pagination link (e.g. "list_2.html") to learn the URL pattern
    page_two = selector.xpath('//*[@id="p_left"]/div/ul[2]/li[3]/a/@href')[0]
    print(page_two)
    page = 1
    while 1:
        # Drop the trailing "2.html" and splice in the current page number
        detailUrl = arrUrl + page_two[:-6] + '%s.html' % page
        print(detailUrl)
        # Loop over each listing page and collect its article links
        response = requests.get(detailUrl).text
        selector = html.fromstring(response)
        detailList = selector.xpath('//*[@id="p_left"]/div/ul[1]/li/h2/a/@href')
        print(len(detailList))
        if len(detailList) == 0:
            break  # an empty listing means the last page has been passed
        page += 1
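Because this loop fires one HTTP request per listing page (and the article loop below adds many more), it helps to reuse a single connection and retry transient failures instead of calling requests.get each time. A minimal sketch under those assumptions; the fetch helper, timeout, and backoff values are my own, not part of the original code:

import time
import requests

session = requests.Session()  # one TCP connection reused across requests

def fetch(pageUrl, retries=3):
    # Return the page body, retrying a few times on network errors.
    for attempt in range(retries):
        try:
            resp = session.get(pageUrl, timeout=10)
            resp.raise_for_status()  # treat HTTP 4xx/5xx as failures too
            return resp.text
        except requests.RequestException:
            time.sleep(1 + attempt)  # brief linear backoff between tries
    return None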
Extract the famous quotes from each detail page:
for articleUrl in detailList:
    print(articleUrl)
    response = requests.get(url + articleUrl)
    # The site serves GB-encoded pages, so set the decoding explicitly
    response.encoding = 'gb2312'
    selector = html.fromstring(response.text)
    # Each quote is a <p> inside the article body
    P_element = selector.xpath('//*[@id="p_left"]/div[1]/div[4]/p')
    print(len(P_element))
    for p in P_element:
        print(p.text)
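Printing is fine for a quick check, but the quotes are easier to reuse if written to disk. A minimal sketch that appends each article's paragraphs to a UTF-8 text file; the saveQuotes name and quotes.txt path are arbitrary choices, not from the original project:

def saveQuotes(P_element, path='quotes.txt'):
    # Append one quote per line; the strings were already decoded from
    # gb2312 above, so they can be written straight out as UTF-8.
    with open(path, 'a', encoding='utf-8') as f:
        for p in P_element:
            if p.text:  # skip <p> tags that carry no direct text
                f.write(p.text.strip() + '\n')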
Result (screenshot omitted): the quotes print one per line for each article.
Full code: https://github.com/Liangjianghao/everyDay_spider.git (mingyan_11-14)