Scraping

Author: Rain师兄 | Published 2020-11-13 18:11

import requests
from lxml import etree
from bs4 import BeautifulSoup as bf

# Chapter URLs on the site look like:
# https://www.soxscc.com/SuiTangWoLaoPoShiChangSunWuGou/157152.html
# https://www.soxscc.com/SuiTangWoLaoPoShiChangSunWuGou/157153.html
#                      /SuiTangWoLaoPoShiChangSunWuGou/864881.html

url = 'https://www.soxscc.com/SuiTangWoLaoPoShiChangSunWuGou/'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0"}

# Fetch the table of contents and collect every chapter link.
resp = requests.get(url, headers=headers)
resp_xpath = etree.HTML(resp.text)
hrefs = resp_xpath.xpath("//div[@id='novel150661']//dd/a/@href")

for i in range(400):
    url = 'https://www.soxscc.com' + hrefs[i]
    resp = requests.get(url, headers=headers)

    # BeautifulSoup (with an explicit parser) joins the chapter body
    # into a single string via get_text().
    soup = bf(resp.text, 'html.parser')
    content = soup.find('div', class_='content').get_text()

    # The chapter title is still grabbed with xpath; text() returns a list.
    r_x = etree.HTML(resp.text)
    title = r_x.xpath("//div[@class='read_title']/h1/text()")

    output = "\n{}\n\n{}-----------\n"
    outputs = output.format(title[0], content)
    print(outputs)

    # Append each chapter to the output file.
    with open('biquge.txt', 'a', encoding='utf-8') as f:
        f.write(outputs)
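The output template used above can be exercised on its own. The title list and content string below are hypothetical sample values, just to show how one chapter is assembled (and why `title[0]` is needed: xpath `text()` returns a list):

```python
# Hypothetical sample values standing in for one scraped chapter.
output = "\n{}\n\n{}-----------\n"
title = ["第一章 示例"]      # xpath text() returns a list, hence title[0]
content = "正文内容……"

outputs = output.format(title[0], content)
print(outputs)
```

The trailing dashes act as a chapter separator when everything is appended to one text file.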

At first I scraped the novel with xpath alone, and ran into a problem: when extracting the chapter body, the content would not write to the file correctly.

output ="\n{}\n\n{}-----------\n"

Writing to the file that way, every line of body text ended up paired with a chapter title, so in the end I switched to BeautifulSoup to parse the content.
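The root cause can be shown with a minimal stdlib sketch: an xpath like `//div//text()` returns a *list* of text fragments (one per text node), so formatting the template per fragment repeats the title each time, while BeautifulSoup's `get_text()` joins everything into one string. Here `xml.etree` stands in for lxml; the markup and names are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Toy chapter body with two paragraphs, i.e. two separate text nodes.
html = "<div class='content'><p>第一行</p><p>第二行</p></div>"
div = ET.fromstring(html)

# itertext() yields one fragment per text node, like xpath //text().
fragments = list(div.itertext())

output = "\n{}\n\n{}-----------\n"

# Formatting once per fragment repeats the chapter title for every line:
bad = [output.format("章节名", frag) for frag in fragments]

# Joining the fragments first gives one block per chapter, like get_text():
good = output.format("章节名", "".join(fragments))
print(good)
```

Either joining the xpath fragments or switching to `get_text()` fixes the repeated-title symptom; the post takes the second route.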


Original title: 爬取

Original link: https://www.haomeiwen.com/subject/vnvmbktx.html