Python 3.6: Scraping Jianshu's Trending Articles

Author: james_chang | Published 2018-01-10 18:01


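To scrape the trending articles, we first need to know where they are listed: 20 on the home page, 20 under 7-day trending, and another 20 under 30-day trending. Let's grab all 60.
Sixty articles means sixty URLs. I'm not going to look them up one by one or crawl the pages by hand, so the plan is to scrape the 60 URLs into a list or a dict first, then use a for loop to fetch each page in turn.
So the first step is to collect the URLs.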
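The plan above can be sketched roughly like this. The three listing URLs are the ones used later in this post; `collect_articles` and `fake_extract` are hypothetical names made up for illustration, and `fake_extract` only stands in for a real scraper so the sketch runs without touching the network.

```python
LISTING_URLS = [
    'https://www.jianshu.com',                   # home page
    'https://www.jianshu.com/trending/weekly',   # 7-day trending
    'https://www.jianshu.com/trending/monthly',  # 30-day trending
]


def collect_articles(listing_urls, extract):
    """Merge the {title: url} dicts that `extract` returns for each listing page."""
    merged = {}
    for listing_url in listing_urls:
        merged.update(extract(listing_url))
    return merged


def fake_extract(listing_url):
    # Stand-in for a real scraper: one made-up article per listing page.
    return {'title from ' + listing_url: listing_url + '/p/example'}


articles = collect_articles(LISTING_URLS, fake_extract)
for title, url in articles.items():
    print(title, url)
```

One thing to keep in mind: an article that shows up on more than one list has the same title each time, so a dict keyed by title keeps it only once, and the total may come out below 60.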

Start by looking at the page source.

Wow, so tidy!

The URLs must be in there somewhere, so let's take a look:

No luck (0-0). Why not? Let's look more carefully, search around, and try a few more things.

So that's where the URL was hiding: the Jianshu home page address plus this string (the a tag's href value) gives the article's URL.
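Joining the two pieces can be done by plain string concatenation, but `urllib.parse.urljoin` from the standard library handles it too. A small sketch, where the path `/p/0123456789ab` is a made-up example and `article_url` is a hypothetical helper name:

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.jianshu.com'


def article_url(href):
    """Turn a relative href from the listing page into an absolute URL."""
    return urljoin(BASE_URL, href)


# '/p/0123456789ab' is an invented path, used only for illustration.
print(article_url('/p/0123456789ab'))  # https://www.jianshu.com/p/0123456789ab
```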

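Also, every a tag has such a regular layout that my laziness kicked in again. I can't be bothered to write a regex: why not just grab the a tags with BeautifulSoup and slice the string?
Let's try it: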

import requests
from bs4 import BeautifulSoup


class Grapjianshu(object):

    def __init__(self, url):
        self.url = url
        self.data_dict = {}

    def url_list(self):
        # Fetch the page source
        response = requests.get(self.url).text
        # Create a BeautifulSoup object
        soup = BeautifulSoup(response, 'html.parser')
        # Collect the <a class="title"> tags that link to the trending articles
        data = soup.find_all(name='a', class_='title')
        self.data_dict = {}
        for i in data:
            s = str(i)
            # Fixed offsets that rely on this page's tag layout:
            # the title starts at position 56 and ends before the closing </a>
            name = s[56:-4]
            # s[23:38] is the 15-character slice where the href value sits
            url = 'https://www.jianshu.com'+s[23:38]
            self.data_dict[name] = url
        for a in self.data_dict:
            print(a, self.data_dict[a])

Here's the result:

The article titles and links both came through, without any regex, and everything is stored in the data_dict dictionary.
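The slicing trick works only while every a tag has exactly the same layout, so a change in attribute order or href length would break it silently. As a hedged alternative, here is a sketch that reads the href attribute and the tag text through BeautifulSoup instead. `SAMPLE_HTML` is invented markup modelled on the tags described above, and `extract_links` is a hypothetical helper, not part of the original script.

```python
from bs4 import BeautifulSoup

# Made-up markup; note the two tags list their attributes in a different order.
SAMPLE_HTML = '''
<ul>
  <li><a class="title" target="_blank" href="/p/0123456789ab">First post</a></li>
  <li><a class="title" href="/p/ba9876543210" target="_blank">Second post</a></li>
</ul>
'''


def extract_links(html, base='https://www.jianshu.com'):
    """Return {title: absolute_url} for every <a class="title"> in the HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    links = {}
    for tag in soup.find_all('a', class_='title'):
        href = tag.get('href')
        if not href:
            continue  # skip tags that have no link
        links[tag.get_text(strip=True)] = base + href
    return links


for title, url in extract_links(SAMPLE_HTML).items():
    print(title, url)
```

Because the values come from the parsed tag rather than from character positions, both sample tags are handled the same way even though their attributes are ordered differently.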
Next, let's fetch each article page and write it to a file (the full code is below):

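import requests
from bs4 import BeautifulSoup


class Grapjianshu(object):

    def __init__(self, url):
        self.url = url
        self.data_dict = {}

    def url_list(self):
        # Fetch the page source
        response = requests.get(self.url).text
        # Create a BeautifulSoup object
        soup = BeautifulSoup(response, 'html.parser')
        # Collect the <a class="title"> tags that link to the trending articles
        data = soup.find_all(name='a', class_='title')
        self.data_dict = {}
        for i in data:
            s = str(i)
            # Fixed offsets that rely on this page's tag layout:
            # the title starts at position 56 and ends before the closing </a>
            name = s[56:-4]
            # s[23:38] is the 15-character slice where the href value sits
            url = 'https://www.jianshu.com'+s[23:38]
            self.data_dict[name] = url
        for a in self.data_dict:
            print(a, self.data_dict[a])

    def url_content(self):
        for i in self.data_dict:
            # Fetch the article page
            response = requests.get(self.data_dict[i]).text
            # Save the page as UTF-8 text so the .html file opens in a browser
            with open(i+'.html', 'w', encoding='utf-8') as f:
                f.write(response)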


a = Grapjianshu('https://www.jianshu.com')
a.url_list()
a.url_content()
# c = Grapjianshu('https://www.jianshu.com/trending/monthly')
# c.url_list()
# c.url_content()
# b = Grapjianshu('https://www.jianshu.com/trending/weekly')
# b.url_list()
# b.url_content()
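The code above uses the article title directly as the file name, which can fail when a title contains characters such as `/` or `?`. A minimal sketch of one way to clean the title first; `safe_filename` and its `max_len` parameter are hypothetical, and the sample title is made up:

```python
import re


def safe_filename(title, max_len=80):
    """Build a file name from a title, replacing characters that are
    not allowed in file names on common operating systems."""
    # Collapse each run of forbidden characters or whitespace into one underscore.
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', '_', title).strip('_')
    # Fall back to a placeholder if nothing usable is left.
    return (cleaned or 'untitled')[:max_len] + '.html'


print(safe_filename('Python 3.6: why/how?'))  # Python_3.6_why_how.html
```

Inside url_content, this would mean opening `safe_filename(i)` instead of `i+'.html'`.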

Let's run it: