Scraping Novels from 笔趣看 (biqukan.com)

Author: 莫辜负自己的一世韶光 | Published 2018-10-26 15:50
    • Step 1: analyze the chapter index page on 笔趣看

    url = https://www.biqukan.com/1_1094/  (the chapter index page of the novel 一念永恒)
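    Before writing the spider, it helps to confirm where the chapter links live. In the page source at the time, all chapter links sat inside a <div class="listmain">, with the first dozen <a> tags being "latest chapter" shortcuts that duplicate real entries further down. A quick inspection sketch (this layout is an assumption from the 2018-era page and may have changed since):

    # Quick look at the chapter list: print the first few links so we can see
    # which leading <a> tags are "latest chapter" shortcuts to skip.
    import requests
    from bs4 import BeautifulSoup

    requests.packages.urllib3.disable_warnings()
    html = requests.get("https://www.biqukan.com/1_1094/",
                        headers={"User-Agent": "Mozilla/5.0"}, verify=False).text
    soup = BeautifulSoup(html, "lxml")
    for a in soup.find("div", class_="listmain").find_all("a")[:15]:
        print(a.text, "->", a.get("href"))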

    # encoding: utf-8
    __author__ = 'Fioman'
    __date__ = '2018/10/25 16:17'
    import os
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    """
    Class description: downloads novels from the 笔趣看 site (https://www.biqukan.com/).
    Params:
            url - chapter-index URL of the target novel on 笔趣看 (string)
    """
    
    
    class BiquSpider(object):
        def __init__(self):
            self.base_url = "https://www.biqukan.com/"  # site home page
            self.url = "https://www.biqukan.com/1_1094/"  # chapter-index URL of the novel
            self.chapter_names = []  # chapter titles
            self.chapter_urls = []  # URL of each chapter
            self.numbers = 0  # total number of chapters
            self.headers = {
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36",
                "Host": "www.biqukan.com",
                "Referer": "https://www.biqukan.com/",
            }
    
        # Collect every chapter title and its URL, then download and save each chapter.
        def get_chapter_urls(self):
            # Suppress the InsecureRequestWarning raised because SSL verification is off.
            requests.packages.urllib3.disable_warnings()
            res = requests.get(url=self.url, headers=self.headers, verify=False)
            html = res.text
            # Build the soup object.
            soup = BeautifulSoup(html, 'lxml')
            # The chapter list lives in a div with class "listmain".
            div = soup.find('div', class_='listmain')
            # Grab the <a> tags inside that div, skipping the first 12, which are
            # "latest chapter" shortcuts rather than real chapter entries.
            chapters = div.find_all('a')[12:]
            self.numbers = len(chapters)
            # Record each chapter's title and absolute URL.
            for each in chapters:
                self.chapter_names.append(each.text)
                self.chapter_urls.append(urljoin(self.base_url, each['href']))

            # Fetch each chapter's content and write it to a local file.
            for chapter_name, chapter_url in zip(self.chapter_names, self.chapter_urls):
                content = self.get_chapter_content(chapter_url)
                self.save_page(chapter_name, content)
    
        # Fetch the body text of a single chapter from its URL.
        def get_chapter_content(self, url):
            res = requests.get(url, headers=self.headers, verify=False)
            html = res.text
            soup = BeautifulSoup(html, 'lxml')
            # The chapter body lives in <div id="content" class="showtxt">.
            div = soup.find('div', id='content', class_='showtxt')
            return div.text
    
        # Save one chapter's content to a text file under novel/.
        def save_page(self, name, content):
            os.makedirs('novel', exist_ok=True)  # make sure the output directory exists
            filename = 'novel/' + name + '.txt'
            with open(filename, 'a', encoding='utf-8') as f:
                f.write(content)
    
    
    
    if __name__ == '__main__':
        spider = BiquSpider()
        spider.get_chapter_urls()
    
    
    • The formatting of the downloaded text isn't great; a quick cleanup is sketched below, and further improvements are left for later.
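    One quick improvement, assuming the 2018-era chapter markup: the text inside the showtxt div separates paragraphs with runs of eight non-breaking spaces (\xa0), so replacing each run with a blank line makes the saved files far more readable. A minimal sketch (the run length of 8 is an assumption from inspecting pages at the time; adjust it if the markup differs):

    # Hypothetical cleanup for the value returned by get_chapter_content, assuming
    # paragraphs are separated by runs of eight non-breaking spaces (\xa0).
    def clean_content(text):
        return text.replace('\xa0' * 8, '\n\n').strip()

    Calling save_page(chapter_name, clean_content(content)) would then write paragraph-broken text instead of one long line.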
