Crawler Homework 03: Crawling All Articles in the 解密大数据 (Decoding Big Data) Collection

Author: pnjoe | Source: published 2017-07-29 19:53, read 175 times

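Course assignment

Use the URL chosen in the second course assignment.

Crawl every element on that page that can be crawled; at minimum, the main body of each article.

Optionally, try crawling with lxml.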

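This assignment was hard for me. The deadline had arrived, and for the sake of the ticket I racked my brains and barely managed to hand this in.

Teacher Zeng kindly provided the function code, which already covers about 95% of the work; he practically walked us to the front door. The code is as follows: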

# coding: utf-8
"""
All rights reserved. Not to be used in any commercial setting without written permission.
For copyright matters, contact: WilliamZeng2017@outlook.com
"""

import os
import time
import urllib2
import urlparse
from bs4 import BeautifulSoup  # parses the downloaded HTML; install with: pip install beautifulsoup4


def download(url, retry=2):
    """
    Download a page and return its complete content.
    :param url: URL to download
    :param retry: number of retries
    :return: raw HTML, or None on failure
    """
    print "downloading: ", url
    # Set the request header to mimic a browser
    header = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36'
    }
    try:  # a request may fail, so catch errors with try-except
        request = urllib2.Request(url, headers=header)  # build the request
        html = urllib2.urlopen(request).read()  # fetch the URL
    except urllib2.URLError as e:  # error handling
        print "download error: ", e.reason
        html = None
        if retry > 0:  # retries left, so another attempt is allowed
            if hasattr(e, 'code') and 500 <= e.code < 600:  # retry only on 5xx (server-side) errors
                print e.code
                return download(url, retry - 1)
    time.sleep(1)  # wait 1 s to avoid stressing the server and getting blocked
    return html

def crawled_links(url_seed, url_root):
    """
    Collect article links.
    :param url_seed: URL template of the seed (list) pages
    :param url_root: root URL of the site being crawled
    :return: set of article URLs to crawl
    """
    crawled_url = set()  # pages to crawl
    i = 1
    flag = True  # whether to keep crawling
    while flag:
        url = url_seed % i  # the list page actually requested
        i += 1  # page number for the next round

        html = download(url)  # download the list page
        if html is None:  # nothing downloaded: treat it as the end of the list
            break

        soup = BeautifulSoup(html, "html.parser")  # parse the downloaded page
        links = soup.find_all('a', {'class': 'title'})  # title elements
        if len(links) == 0:  # no articles left on this page: stop crawling
            flag = False

        for link in links:  # collect the article URLs
            link = link.get('href')
            realUrl = urlparse.urljoin(url_root, link)
            if realUrl not in crawled_url:  # compare absolute URLs, since that is what the set stores
                crawled_url.add(realUrl)  # record pages not seen before
            else:
                print 'end'
                flag = False  # stop crawling

    paper_num = len(crawled_url)
    print 'total paper num: ', paper_num
    return crawled_url

def crawled_page(crawled_url):
    """
    Crawl article content.
    :param crawled_url: set of article URLs to crawl
    """
    for link in crawled_url:  # crawl the articles one by one
        html = download(link)
        if html is None:  # download failed: skip this article
            continue
        soup = BeautifulSoup(html, "html.parser")
        title = soup.find('h1', {'class': 'title'}).text  # article title
        content = soup.find('div', {'class': 'show-content'}).text  # article body

        if not os.path.exists('spider_res/'):  # make sure the output directory exists
            os.mkdir('spider_res')

        file_name = 'spider_res/' + title + '.txt'  # name of the file to save
        if os.path.exists(file_name):
            # os.remove(file_name) # delete the file
            continue  # do not rewrite files that already exist
        file = open('spider_res/' + title + '.txt', 'wb')  # write the file
        content = unicode(content).encode('utf-8', errors='ignore')
        file.write(content)
        file.close()

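Note: the # coding: utf-8 on the first line is not a mere comment. It is the source-encoding declaration, which Python 2 needs whenever the file contains non-ASCII characters (such as Chinese comments), so do not delete it.

Combining this with the code from the lecture slides, I split the source into the following four parts.

Part 1: importing the modules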

import os
import time
import urllib2
import urlparse
from bs4 import BeautifulSoup  # parses the downloaded HTML; install with: pip install beautifulsoup4

This part is easy to understand.

Part 2: defining the download function

def download(url, retry=2):
    """
    Download a page and return its complete content.
    :param url: URL to download
    :param retry: number of retries
    :return: raw HTML, or None on failure
    """
    print "downloading: ", url
    # Set the request header to mimic a browser
    header = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36'
    }
    try:  # a request may fail, so catch errors with try-except
        request = urllib2.Request(url, headers=header)  # build the request
        html = urllib2.urlopen(request).read()  # fetch the URL
    except urllib2.URLError as e:  # error handling
        print "download error: ", e.reason
        html = None
        if retry > 0:  # retries left, so another attempt is allowed
            if hasattr(e, 'code') and 500 <= e.code < 600:  # retry only on 5xx (server-side) errors
                print e.code
                return download(url, retry - 1)
    time.sleep(1)  # wait 1 s to avoid stressing the server and getting blocked
    return html

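This whole function does one thing: it downloads the source of a web page.
Give it a page URL, and it returns that page's HTML source (or None if the download fails).

I could not understand every line of it right away, so I first picture it as a steamed-bun factory.
Hand it the raw materials, dough (url) and meat filling (retry), and the factory turns out a bun (the page's HTML source).

As for the technique used inside to make the bun, I am setting that aside for now; it involves too many topics.
The factory is called download (the name can be anything).
If the customer placing the order forgets to say how much filling to use, the factory defaults to 2 spoonfuls (retry=2).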

# So I can think of it simply like this.
def download(url, retry=2):
    # ... (body omitted)
    return html

# From now on, whenever I need a page's source, I just call download and it hands the source back to me.
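
The "default two spoonfuls of filling" idea is what Python calls a default argument value. A minimal sketch of just that mechanism follows. It is written for Python 3 (the course code above is Python 2), and make_buns is a made-up function for illustration, not part of the crawler.

```python
def make_buns(dough, filling=2):
    """Toy stand-in for download(url, retry=2); the names are illustrative only."""
    return "bun made from %s with %d spoon(s) of filling" % (dough, filling)


# The caller says nothing about the filling, so the default of 2 is used.
print(make_buns("dough"))

# The caller can still override the default explicitly.
print(make_buns("dough", filling=5))
```

Calling download(url) without a second argument works the same way: retry silently takes the value 2.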

# For example, to get the source of the Baidu home page:
url = 'http://www.baidu.com'
baidu = download(url)
print (baidu)

# Output
downloading:  http://www.baidu.com
<!DOCTYPE html><!--STATUS OK-->
<html>
<head>
    <meta http-equiv="content-type" content="text/html;charset=utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=Edge">
    <meta content="always" name="referrer">
    <meta name="theme-color" content="#2932e1">
    <link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" />
    <link rel="search" type="application/opensearchdescription+xml" href="/content-search.xml" title="百度搜索" /> 
    <link rel="icon" sizes="any" mask href="//www.baidu.com/img/baidu.svg">
    <link rel="dns-prefetch" href="//s1.bdstatic.com"/>
    <link rel="dns-prefetch" href="//t1.baidu.com"/>
    <link rel="dns-prefetch" href="//t2.baidu.com"/>
    <link rel="dns-prefetch" href="//t3.baidu.com"/>
    <link rel="dns-prefetch" href="//t10.baidu.com"/>
    <link rel="dns-prefetch" href="//t11.baidu.com"/>
    <link rel="dns-prefetch" href="//t12.baidu.com"/>
    <link rel="dns-prefetch" href="//b1.bdstatic.com"/>
    
    <title>百度一下，你就知道</title>
#......   The source is too long to paste in full; this is just to give the general idea.
<script>
if(bds.comm.supportis){
    window.__restart_confirm_timeout=true;
    window.__confirm_timeout=8000;
    window.__disable_is_guide=true;
    window.__disable_swap_to_empty=true;
}
initPreload({
    'isui':true,
    'index_form':"#form",
    'index_kw':"#kw",
    'result_form':"#form",
    'result_kw':"#kw"
});
</script>

<script>
if(navigator.cookieEnabled){
    document.cookie="NOJS=;expires=Sat, 01 Jan 2000 00:00:00 GMT";
}
</script>
</body>
</html>
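
For readers who do want a look inside the factory, the retry rule of download (try again only when the server answers with a 5xx status, give up on anything else) can be isolated in a smaller sketch. This is a Python 3 rewrite under my own assumptions, not the teacher's code: fetch, download_with_retry and the fetch_func parameter are hypothetical names, and the loop replaces the recursion used above.

```python
import time
import urllib.error
import urllib.request


def fetch(url):
    """Hypothetical helper: one plain HTTP GET, returning the body as bytes."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request) as response:
        return response.read()


def download_with_retry(url, retry=2, delay=1.0, fetch_func=fetch):
    """Return the page body, or None on failure; retry only on 5xx errors."""
    for attempt in range(retry + 1):  # first try plus up to `retry` retries
        try:
            html = fetch_func(url)
        except urllib.error.URLError as e:
            code = getattr(e, "code", None)  # only HTTPError carries a status code
            is_server_error = code is not None and 500 <= code < 600
            if not is_server_error or attempt == retry:
                return None  # 4xx, network error, or retries used up
            continue  # 5xx with retries left: try again
        time.sleep(delay)  # pause after a successful request to go easy on the server
        return html
    return None
```

Passing the fetching function in as a parameter is a design choice made only so the retry rule can be exercised without touching the network.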

Part 3: defining the crawled_links function

def crawled_links(url_seed, url_root):
    """
    Collect article links.
    :param url_seed: URL template of the seed (list) pages
    :param url_root: root URL of the site being crawled
    :return: set of article URLs to crawl
    """
    crawled_url = set()  # pages to crawl
    i = 1
    flag = True  # whether to keep crawling
    while flag:
        url = url_seed % i  # the list page actually requested
        i += 1  # page number for the next round

        html = download(url)  # download the list page
        if html is None:  # nothing downloaded: treat it as the end of the list
            break

        soup = BeautifulSoup(html, "html.parser")  # parse the downloaded page
        links = soup.find_all('a', {'class': 'title'})  # title elements
        if len(links) == 0:  # no articles left on this page: stop crawling
            flag = False

        for link in links:  # collect the article URLs
            link = link.get('href')
            realUrl = urlparse.urljoin(url_root, link)
            if realUrl not in crawled_url:  # compare absolute URLs, since that is what the set stores
                crawled_url.add(realUrl)  # record pages not seen before
            else:
                print 'end'
                flag = False  # stop crawling

    paper_num = len(crawled_url)
    print 'total paper num: ', paper_num
    return crawled_url

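Reading it the same way as the download function above:
Input: url_seed (the seed URL) and url_root (the site's home page).
Output: crawled_url (the set of links to every individual article in the collection; 278 articles at the moment).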
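
The three building blocks of crawled_links (filling the page number into the seed URL, turning a relative href into a full URL, and de-duplicating with a set) can be tried without any network access. The sketch below is Python 3, where urljoin lives in urllib.parse rather than urlparse. collect_links and fake_pages are made-up stand-ins for the downloaded list pages, and unlike the original the sketch stops only on an empty page, not on the first repeated link.

```python
from urllib.parse import urljoin

url_root = "http://www.jianshu.com"
url_seed = "http://www.jianshu.com/c/9b4685b6357c?page=%d"

# "%d" in the seed is replaced by the page number.
page_3 = url_seed % 3
print(page_3)

# urljoin turns a relative href from the list page into an absolute article URL.
article = urljoin(url_root, "/p/45df7e3ecc78")
print(article)


def collect_links(pages, root):
    """Walk per-page href lists until an empty page, de-duplicating via a set."""
    seen = set()
    for hrefs in pages:
        if not hrefs:  # an empty page means we ran past the last page
            break
        for href in hrefs:
            seen.add(urljoin(root, href))  # adding an existing URL changes nothing
    return seen


# Made-up hrefs standing in for three downloaded list pages (the last is empty).
fake_pages = [["/p/aaa", "/p/bbb"], ["/p/bbb", "/p/ccc"], []]
links = collect_links(fake_pages, url_root)
print(len(links))
```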

Part 4: defining the crawled_page function

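def crawled_page(crawled_url):
    """
    Crawl article content.
    :param crawled_url: set of article URLs to crawl
    """
    for link in crawled_url:  # crawl the articles one by one
        html = download(link)
        if html is None:  # download failed: skip this article
            continue
        soup = BeautifulSoup(html, "html.parser")
        title = soup.find('h1', {'class': 'title'}).text  # article title
        
        title = title.replace('|', ' ')      # I added this block myself.
        title = title.replace('"', ' ')      # Some article titles contain special characters,
        title = title.replace(':', ' ')      # and the file naming rules on my computer
        title = title.replace('\x08', ' ')   # do not allow those characters.
        title = title.replace('<', ' ')      # So I replace each of them with a space.
        title = title.replace('>', ' ')      # I learned this from teaching assistant polo.
        title = title.replace('/', ' ')      # '/' would be read as a path separator
        print (title)                        # print the title so progress is visible while running
        
        content = soup.find('div', {'class': 'show-content'}).text  # article body

        if not os.path.exists('spider_res/'):  # make sure the output directory exists
            os.mkdir('spider_res')

        file_name = 'spider_res/' + title + '.txt'  # name of the file to save
        if os.path.exists(file_name):
            # os.remove(file_name) # delete the file
            continue  # do not rewrite files that already exist
        file = open(file_name, 'wb')  # write the file; the teacher had already defined file_name, so I use it directly
        content = unicode(content).encode('utf-8', errors='ignore')
        file.write(content)
        file.close()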

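Input: crawled_url (the URLs of all the articles).
Output: one txt file per article, named after the article title and stored in the spider_res directory. The content of each file is the article body.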

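  title = title.replace('|', ' ')     # I added this block myself.
  title = title.replace('"', ' ')     # Some article titles contain special characters,
  title = title.replace(':', ' ')     # and the file naming rules on my computer
  title = title.replace('\x08', ' ')  # do not allow those characters.
  title = title.replace('<', ' ')     # So I replace each of them with a space.
  title = title.replace('>', ' ')     # I learned this from teaching assistant polo.
  title = title.replace('/', ' ')     # '/' would be read as a path separator
  print (title)                       # print the title so progress is visible while running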

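Here is an example of why I added this block. The title below contains a | character, which is not allowed in a file name (on Windows, for instance), so the output file cannot be created until it is replaced:
商业分析之数据词典| 得到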
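
The chain of replace calls can also be written as one regular expression. The sketch below is Python 3 and rests on an assumption of mine: the character set is the list of characters Windows rejects in file names plus ASCII control characters, which is slightly broader than the list used above. safe_file_name is a hypothetical helper name.

```python
import re

# Characters Windows rejects in file names, plus ASCII control characters.
_FORBIDDEN = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_file_name(title):
    """Replace every forbidden character with a single space."""
    return _FORBIDDEN.sub(" ", title)


print(safe_file_name("商业分析之数据词典| 得到"))
```

Full-width punctuation such as ： or ？ is left alone, since it is legal in file names.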

Final part: the main code that runs the crawler

url_root = 'http://www.jianshu.com'
url_seed = 'http://www.jianshu.com/c/9b4685b6357c?page=%d'

crawled_url = crawled_links(url_seed, url_root)
crawled_page(crawled_url)

And finally: my run and its results

Importing the modules and defining the functions print nothing. Only the four lines of main code produce output.

# Code
url_root = 'http://www.jianshu.com'
url_seed = 'http://www.jianshu.com/c/9b4685b6357c?page=%d'

crawled_url = crawled_links(url_seed, url_root)

# Output
downloading:  http://www.jianshu.com/c/9b4685b6357c?page=1
downloading:  http://www.jianshu.com/c/9b4685b6357c?page=2
downloading:  http://www.jianshu.com/c/9b4685b6357c?page=3
downloading:  http://www.jianshu.com/c/9b4685b6357c?page=4
downloading:  http://www.jianshu.com/c/9b4685b6357c?page=5
downloading:  http://www.jianshu.com/c/9b4685b6357c?page=6
downloading:  http://www.jianshu.com/c/9b4685b6357c?page=7
downloading:  http://www.jianshu.com/c/9b4685b6357c?page=8
downloading:  http://www.jianshu.com/c/9b4685b6357c?page=9
downloading:  http://www.jianshu.com/c/9b4685b6357c?page=10
downloading:  http://www.jianshu.com/c/9b4685b6357c?page=11
downloading:  http://www.jianshu.com/c/9b4685b6357c?page=12
downloading:  http://www.jianshu.com/c/9b4685b6357c?page=13
downloading:  http://www.jianshu.com/c/9b4685b6357c?page=14
downloading:  http://www.jianshu.com/c/9b4685b6357c?page=15
downloading:  http://www.jianshu.com/c/9b4685b6357c?page=16
downloading:  http://www.jianshu.com/c/9b4685b6357c?page=17
downloading:  http://www.jianshu.com/c/9b4685b6357c?page=18
downloading:  http://www.jianshu.com/c/9b4685b6357c?page=19
downloading:  http://www.jianshu.com/c/9b4685b6357c?page=20
downloading:  http://www.jianshu.com/c/9b4685b6357c?page=21
downloading:  http://www.jianshu.com/c/9b4685b6357c?page=22
downloading:  http://www.jianshu.com/c/9b4685b6357c?page=23
downloading:  http://www.jianshu.com/c/9b4685b6357c?page=24
downloading:  http://www.jianshu.com/c/9b4685b6357c?page=25
downloading:  http://www.jianshu.com/c/9b4685b6357c?page=26
downloading:  http://www.jianshu.com/c/9b4685b6357c?page=27
downloading:  http://www.jianshu.com/c/9b4685b6357c?page=28
downloading:  http://www.jianshu.com/c/9b4685b6357c?page=29
total paper num:  278

# Code
crawled_page(crawled_url)

# Output
downloading:  http://www.jianshu.com/p/45df7e3ecc78
数据之缘                         # The print (title) added in Part 4 prints each title.
downloading:  http://www.jianshu.com/p/99ae5b28a51f
网购小细节你注意到了吗？            # This shows the progress and makes debugging easier.
downloading:  http://www.jianshu.com/p/d6243f087bd9
1.1 利用Python进行数据分析
downloading:  http://www.jianshu.com/p/ea40c6da9fec
我与“解密大数据 ”社群
downloading:  http://www.jianshu.com/p/59e0da43136e
爬虫课程作业2
downloading:  http://www.jianshu.com/p/dc07545c6607
某宝与某东购物流程
downloading:  http://www.jianshu.com/p/d1acbed69f45
爬虫作业01-获取网络数据的原理
downloading:  http://www.jianshu.com/p/02f33063c258
批发市场淘宝与购物中心京东的购物流程对比
downloading:  http://www.jianshu.com/p/ad10d79255f8
爬虫作业2
downloading:  http://www.jianshu.com/p/062b8dfca144
商业数据分析之数据字典设计
downloading:  http://www.jianshu.com/p/cb4f8ab1b380
商业分析之数据词典  得到
downloading:  http://www.jianshu.com/p/8f7102c74a4f
【用Python学统计】数据描述之可视化
downloading:  http://www.jianshu.com/p/77876ef45ab4
2016年终总结 无量化 不成长
downloading:  http://www.jianshu.com/p/e5475131d03f
正态分布作业一
downloading:  http://www.jianshu.com/p/e0bd6bfad10b
Python 爬虫入门课作业2－ 网页基础与结构分析
downloading:  http://www.jianshu.com/p/a425acdaf77e
红葡萄酒质量探索分析
downloading:  http://www.jianshu.com/p/729edfc613aa
离数据科学家还有多远-硅谷数据科学家成长之路
downloading:  http://www.jianshu.com/p/e50c863bb465
关于数据分析的学习方法
downloading:  http://www.jianshu.com/p/7107b67c47bc
直方图作业中我遇到的那些 坑 
downloading:  http://www.jianshu.com/p/020f0281f1df
直方图的绘制
downloading:  http://www.jianshu.com/p/1292d7a3805e
解密大数据课程作业-正态分布的应用
downloading:  http://www.jianshu.com/p/7cb84cfa56fa
解密大数据专栏文章分类
downloading:  http://www.jianshu.com/p/41c14ef3e59a
个人数据的Who's&How   数据中间商 读书笔记
downloading:  http://www.jianshu.com/p/1a2a07611fd8
机器学习实战之决策树（三）
downloading:  http://www.jianshu.com/p/217a4578f9ab
认知限制和预测局限  黑天鹅读书笔记
downloading:  http://www.jianshu.com/p/d234a015fa90
迟到的第四次作业- 推论统计
downloading:  http://www.jianshu.com/p/e08d1a03045f
大脑使用手册 ： 学习如何学习（下篇）
downloading:  http://www.jianshu.com/p/6f4a7a1ef85c
Tiger 大数据二十二问
downloading:  http://www.jianshu.com/p/faf2f4107b9b
课程作业-爬虫入门01-获取网络数据的原理-WilliamZeng-20170629
downloading:  http://www.jianshu.com/p/9dee9886b140
对比某宝和某东的购物流程异同
downloading:  http://www.jianshu.com/p/e2ee86a8a32b
突然不运行的—jupyter notebook
downloading:  http://www.jianshu.com/p/9258b0495021
2017-03-21
downloading:  http://www.jianshu.com/p/7e2fccb4fad9
爬虫课程作业01-解密大数据社群
downloading:  http://www.jianshu.com/p/74042ba10c0d
译 大数据科普系列-数据预处理
downloading:  http://www.jianshu.com/p/d882831868fb
【泰阁志-数据分析】作业2：直方图
downloading:  http://www.jianshu.com/p/d5bc50d8e0a2
Python数据分析的起手式（2）Python 列表 list 
downloading:  http://www.jianshu.com/p/2e64c2045be5
正态分布作业二
downloading:  http://www.jianshu.com/p/565500cfb5a4
[泰阁志-数据分析课]作业3：正态分布
downloading:  http://www.jianshu.com/p/1729787990e7
Python数据分析的起手式（4）Numpy入门
downloading:  http://www.jianshu.com/p/8ca518b3b2d5
《黑天鹅》读后感想
downloading:  http://www.jianshu.com/p/9c7fbcac3461
商业数据分析——股票数据分析
downloading:  http://www.jianshu.com/p/13d76e7741c0
给你更多和给你好的
downloading:  http://www.jianshu.com/p/81d17436f29e
数据分析和统计——直方图
downloading:  http://www.jianshu.com/p/148b7cc83bcd
小白的商业数据分析
downloading:  http://www.jianshu.com/p/70b7505884e9
爬虫第一课作业
downloading:  http://www.jianshu.com/p/ba4100af215a
爬虫作业3
downloading:  http://www.jianshu.com/p/819a202adecd
商业数据分析之打车软件反作弊案例
downloading:  http://www.jianshu.com/p/a4beefd8cfc2
大作业 - 某东和某宝的购物流程对比
downloading:  http://www.jianshu.com/p/eb01f9002091
第四次作业
downloading:  http://www.jianshu.com/p/ba43beaa186a
与加速变化的世界共舞
downloading:  http://www.jianshu.com/p/d44cc7e9a0a9
大作业-某宝和某东的购物比较
downloading:  http://www.jianshu.com/p/d0de8ee83ea1
您的好友小聋瞎已上线
downloading:  http://www.jianshu.com/p/b4670cb9e998
商业数据分析之三——反作弊案例分析
downloading:  http://www.jianshu.com/p/9f9fb337be0c
中篇-泰坦尼克号 
downloading:  http://www.jianshu.com/p/542f41879879
译 在Stack Overflow做数据科学家的一年
downloading:  http://www.jianshu.com/p/e9f6b15318be
Numpy数据存取与常用函数
downloading:  http://www.jianshu.com/p/f1ef93a6c033
【泰阁志-数据分析】作业6：商业数据分析02
downloading:  http://www.jianshu.com/p/872a67eed7af
分析后学推论总计、假设检验
downloading:  http://www.jianshu.com/p/f0063d735a5c
某宝和某东购物流程浅析
downloading:  http://www.jianshu.com/p/856c8d648e20
大数据学习第五次作业
downloading:  http://www.jianshu.com/p/b9407b2c22a4
【泰阁志-数据分析】作业3：正态分布
downloading:  http://www.jianshu.com/p/a36e997b8e59
python学统计第二课 复现与作业
downloading:  http://www.jianshu.com/p/c28207b3c71d
走一步进一步— 数据之路
downloading:  http://www.jianshu.com/p/8448ac374dc1
【泰阁志-数据分析】作业7：商业数据分析03
downloading:  http://www.jianshu.com/p/4a3fbcb06981
来，尝尝我这两菜一汤
downloading:  http://www.jianshu.com/p/b1a9daef3423
第六次作业——业务指标字典的设计
downloading:  http://www.jianshu.com/p/5eb037498c48
《数据中间商》笔记1--进门地毯的故事
downloading:  http://www.jianshu.com/p/f756bf0beb26
python入门挣扎指南 - 安装及直方图
downloading:  http://www.jianshu.com/p/673b768c6084
爬虫第一次作业
downloading:  http://www.jianshu.com/p/6233788a8abb
数据中间商读书笔记
downloading:  http://www.jianshu.com/p/087ce1951647
直方图
downloading:  http://www.jianshu.com/p/7240db1ba0af
大数据学习分享
downloading:  http://www.jianshu.com/p/289e51eb6446
我的翻译步骤和工具  写在第一次翻译实践之后
downloading:  http://www.jianshu.com/p/0565cd673282
如何通过数据指标判断刷单司机 - 商业数据分析03 作业
downloading:  http://www.jianshu.com/p/873613065502
眼前的黑不是真正的黑
downloading:  http://www.jianshu.com/p/605644d688ff
课程作业-爬虫入门01
downloading:  http://www.jianshu.com/p/1ea730c97aae
爬虫入门01-获取网络数据的原理作业
downloading:  http://www.jianshu.com/p/bab0c09416ee
第三次作业——正态分布
downloading:  http://www.jianshu.com/p/c6591991d1ca
商业数据分析之四——投资人对投资企业的选择
downloading:  http://www.jianshu.com/p/fd9536a0acfb
利用python进行数据分析之数据加载、存储与文件格式
downloading:  http://www.jianshu.com/p/f89c4032a0b2
商业数据分析第二次课作业-0719
downloading:  http://www.jianshu.com/p/1fa23219270d
Tiger 人人都能用数据-统计学和直方图
downloading:  http://www.jianshu.com/p/412f8eab2599
菜市场和超市购物----某宝和某东购物体验对比
downloading:  http://www.jianshu.com/p/05c15b9f16f1
三分钟读懂什么是置信区间
downloading:  http://www.jianshu.com/p/4931d66276c3
黑天鹅与数据分析
downloading:  http://www.jianshu.com/p/b5165468a32b
做好数据分析的秘诀在于讲好一个故事
downloading:  http://www.jianshu.com/p/2c02a7b0b382
用python绘制直方图
downloading:  http://www.jianshu.com/p/dffdaf11bd4c
数据分析的流程 -- 数据探索之开篇
downloading:  http://www.jianshu.com/p/71c02ef761ac
统计学学习笔记
downloading:  http://www.jianshu.com/p/6920d5e48b31
商业数据分析作业之二
downloading:  http://www.jianshu.com/p/71b968bd8abb
大数据社群作业-商业数据分析
downloading:  http://www.jianshu.com/p/5a6c4b8e7700
数据中间商读书笔记2—— 数据时代正确的生活姿势
downloading:  http://www.jianshu.com/p/c1163e39a42e
商业数据分析大作业1
downloading:  http://www.jianshu.com/p/bd9a27c4e2a8
电商购物流程指标
downloading:  http://www.jianshu.com/p/88d0addf64fa
Python 爬虫入门课作业1－ 获取网络数据的原理
downloading:  http://www.jianshu.com/p/6a7afc98c868
机器学习之综述与案例分析笔记
downloading:  http://www.jianshu.com/p/9ee12067f35e
作业-天气数据简单分析
downloading:  http://www.jianshu.com/p/c41624a83b71
爬虫入门03作业
downloading:  http://www.jianshu.com/p/67ae9d87cf3c
迟来的第一讲作业
downloading:  http://www.jianshu.com/p/b5c292e093a2
第七次作业--防作弊分析
downloading:  http://www.jianshu.com/p/0a6977eb686d
时间与情感都是钱 - 数据分析第一课作业
downloading:  http://www.jianshu.com/p/8088d1bede8d
2.26数据分析课程作业-逻辑思考题
downloading:  http://www.jianshu.com/p/d578d5e2755f
概率统计第一次作业
downloading:  http://www.jianshu.com/p/c9e1dffad756
正态分布作业及我的收获
downloading:  http://www.jianshu.com/p/81819f27a7d8
机器学习驱动的北美商业决策-听课笔记
downloading:  http://www.jianshu.com/p/799c51fbe5f1
解密大数据课程作业-直方图
downloading:  http://www.jianshu.com/p/5e4a86f8025c
爬虫作业02-html页面分析
downloading:  http://www.jianshu.com/p/7acf291b2a5e
第五次作业习题1与2——商业数据分析
downloading:  http://www.jianshu.com/p/6ef6b9a56b50
大作业-某宝和某东比较
downloading:  http://www.jianshu.com/p/210aacd31ef7
浅学正态分布(进阶历程)
downloading:  http://www.jianshu.com/p/9a9280de68f8
淘宝与京东WEB购物流程思考
downloading:  http://www.jianshu.com/p/39eb230e6f15
Python数据分析的起手式（1）Python 基础
downloading:  http://www.jianshu.com/p/c0c0a3ed35d4
黑天鹅之我们的思考
downloading:  http://www.jianshu.com/p/74db357c7252
04作业推论统计-你有遇到坑吗？
downloading:  http://www.jianshu.com/p/3a95a09cda40
大数据作业6 - 商业分析之数据词典 微信
downloading:  http://www.jianshu.com/p/bc75ab89fac0
大数据第一次作业 20170303
downloading:  http://www.jianshu.com/p/460a8eed5cfa
黑天鹅笔记：重识熵信息论与数据科学
downloading:  http://www.jianshu.com/p/8ca88a90ea17
#发生在别人身上是故事，发生在自己身边是事故——人人都需重视黑天鹅事件
downloading:  http://www.jianshu.com/p/a8037a38e219
Lending Club贷款数据分析（上）
downloading:  http://www.jianshu.com/p/3dfedf60de62
机器学习入门课听课后思考
downloading:  http://www.jianshu.com/p/ada67bd7c56f
利用python进行数据分析之数据规整化(三)
downloading:  http://www.jianshu.com/p/486afcd4c36c
商业数据分析第一次课学习笔记
downloading:  http://www.jianshu.com/p/8a0479f55b21
数据探索之统计分布
downloading:  http://www.jianshu.com/p/e492d3acfe38
【泰阁志-数据分析】作业5：商业数据分析01
downloading:  http://www.jianshu.com/p/b4e2e5e31154
一次脑洞大开的阅读体验
downloading:  http://www.jianshu.com/p/75fc36aec98e
读书笔记——黑天鹅：如何应对不可预知的未来
downloading:  http://www.jianshu.com/p/a015b756a803
“黑天鹅”的启发
downloading:  http://www.jianshu.com/p/29062bca16aa
天气分析 - 解密大数据作业001
downloading:  http://www.jianshu.com/p/910662d6e881
大数据课程第一节作业--购物网站流程分析
downloading:  http://www.jianshu.com/p/8fbe3a7b4764
硅谷数据科学家成长之路
downloading:  http://www.jianshu.com/p/0329f87c9ae4
数据分析第三次课随堂作业 - 0725
downloading:  http://www.jianshu.com/p/e1b28de0a1e4
所有的涓涓细流，终汇成江河大海！
downloading:  http://www.jianshu.com/p/b5c31a2eeb8b
某宝与某东购物体验
downloading:  http://www.jianshu.com/p/7e556f17021a
数据探索之参数估计
downloading:  http://www.jianshu.com/p/23144099e9f8
统计学作业03
downloading:  http://www.jianshu.com/p/a91c54f96ded
数据分析第二次课随堂作业_0718
downloading:  http://www.jianshu.com/p/74ef104a9f45
第七次作业——反作弊分析
downloading:  http://www.jianshu.com/p/afa17bc391b7
同一购物网站，为什么你比别人更容易买到假货——你不知道的《数据中间商》
downloading:  http://www.jianshu.com/p/90914aef3636
爬虫第二次作业-0706
downloading:  http://www.jianshu.com/p/0c0e3ace0da1
爬虫作业1
downloading:  http://www.jianshu.com/p/b7eef4033a09
爬虫作业一
downloading:  http://www.jianshu.com/p/7b2e81589a4f
爬虫作业
downloading:  http://www.jianshu.com/p/2f7d10b2e508
X宝与X东的一些比较
downloading:  http://www.jianshu.com/p/ed499f4ecdd1
HTML基础知识
downloading:  http://www.jianshu.com/p/11c103c03d4a
“逛X宝”vs“上X东”，到底有什么区别？
downloading:  http://www.jianshu.com/p/333dacb0e1b2
数据为个人赋能
downloading:  http://www.jianshu.com/p/7c54cd046d4b
黑色的翅膀扇出一记耳光
downloading:  http://www.jianshu.com/p/cfaf85b24281
爬虫课02
downloading:  http://www.jianshu.com/p/356a579062aa
利用Python处理Excel数据  总结
downloading:  http://www.jianshu.com/p/46e82e4fe324
数字时代的新通识
downloading:  http://www.jianshu.com/p/ba00a9852a02
作业-美国票选之小白数据分析
downloading:  http://www.jianshu.com/p/b6359185fc26
第四次作业-推论统计
downloading:  http://www.jianshu.com/p/a1a2dabb4bc2
爬虫入门01作业
downloading:  http://www.jianshu.com/p/4077cbc4dd37
【泰阁志-数据分析】作业1：关于某宝购物流程的思考
downloading:  http://www.jianshu.com/p/90efe88727fe
认识黑天鹅，成为黑马
downloading:  http://www.jianshu.com/p/17f99100525a
泰坦尼克号-数据分析
downloading:  http://www.jianshu.com/p/01385e2dd129
第六次作业——设计指标字典
downloading:  http://www.jianshu.com/p/ec3c57d6a4c7
第二次数据分析作业----做出一组数据的直方图
downloading:  http://www.jianshu.com/p/3a5975d6ac55
爬虫03作业。（没有成功）
downloading:  http://www.jianshu.com/p/85da47fddad7
大作业-在某宝与某东的购物流程对比
downloading:  http://www.jianshu.com/p/3b47b36cc8e8
3.26大作业习题1
downloading:  http://www.jianshu.com/p/29e304a61d32
商业数据分析第一次课作业-0708
downloading:  http://www.jianshu.com/p/649167e0e2f4
第四次推论统计作业
downloading:  http://www.jianshu.com/p/13840057782d
网购平台浅析
downloading:  http://www.jianshu.com/p/11b3dbb05c39
Pandas to_sql将DataFrame保存的数据库中
downloading:  http://www.jianshu.com/p/9632ba906ca2
利用python进行数据分析之pandas入门(一)
downloading:  http://www.jianshu.com/p/41b1ee54d766
商业数据分析01 3.26
downloading:  http://www.jianshu.com/p/0ee1f0bfc8cb
购物对比
downloading:  http://www.jianshu.com/p/09b19b8f8886
Juputer 利用python的pandas数据分析人群收入模型1
downloading:  http://www.jianshu.com/p/3c71839bc660
大脑使用手册 ： 学习如何学习 —— Coursera最火课程之一（上篇）
downloading:  http://www.jianshu.com/p/f0436668cb72
爬虫课程作业1
downloading:  http://www.jianshu.com/p/c0f3d36d0c7a
pandas 常见操作第一课
downloading:  http://www.jianshu.com/p/be0192aa6486
债务违约预测之一：数据探索
downloading:  http://www.jianshu.com/p/ee43c55123f8
用 python 绘制正态分布曲线
downloading:  http://www.jianshu.com/p/af4765b703f0
商业数据分析课程5习题1
downloading:  http://www.jianshu.com/p/ff772050bd96
商业数据分析&作业1
downloading:  http://www.jianshu.com/p/e121b1a420ad
爬虫课程作业02-解密大数据社群
downloading:  http://www.jianshu.com/p/ed93f7f344d0
爬虫课01
downloading:  http://www.jianshu.com/p/8f6ee3b1efeb
爬虫第三次作业-0706
downloading:  http://www.jianshu.com/p/3f06c9f69142
Python数据分析的起手式（3）函数、方法和包
downloading:  http://www.jianshu.com/p/ff2d4eadebde
关于某宝和某东的购物体验——大数据2月26日课程作业
downloading:  http://www.jianshu.com/p/ce0e0773c6ec
用Python浅析股票数据
downloading:  http://www.jianshu.com/p/be384fd73bdb
大数据课程第一节作业-购物网站流程分析
downloading:  http://www.jianshu.com/p/acc47733334f
爬虫入门L2   网页结构&元素标签位置
downloading:  http://www.jianshu.com/p/bf5984fb299a
爬虫第一课作业
downloading:  http://www.jianshu.com/p/1a935c2dc911
商业数据分析1 作业 - 习题1
downloading:  http://www.jianshu.com/p/8982ad63eb85
发掘数据中的信息 -- 数据探索之描述性统计
downloading:  http://www.jianshu.com/p/99fd951a0b8b
数据探索之假设检验
downloading:  http://www.jianshu.com/p/98cc73755a22
爬虫入门02作业
downloading:  http://www.jianshu.com/p/bb736600b483
【读书笔记】数据中间商
downloading:  http://www.jianshu.com/p/f75128ec3ea3
大数据作业7 反作弊案例分析
downloading:  http://www.jianshu.com/p/23a905cf936b
第二次作业——直方图
downloading:  http://www.jianshu.com/p/169403f7e40c
致Python初学者们 - Anaconda入门使用指南
downloading:  http://www.jianshu.com/p/a9c7970bc949
数据分析课程作业之一
downloading:  http://www.jianshu.com/p/ed9ec88e71e4
Python 爬虫入门课作业3－爬虫基础
downloading:  http://www.jianshu.com/p/5057ab6f9ad5
测试感知现象
downloading:  http://www.jianshu.com/p/1b42a12dac14
自我量化的数字生活
downloading:  http://www.jianshu.com/p/5dc5dfe26148
大数据第二次作业 20170309
downloading:  http://www.jianshu.com/p/c88a4453dd6d
《谁说菜鸟不会数据分析》入门篇图表练习
downloading:  http://www.jianshu.com/p/cd971afcb207
Growth Hacking精要-跟Airbnb、Pinterest和Uber学习增长秘笈
downloading:  http://www.jianshu.com/p/2ccd37ae73e2
爬虫入门02（html笔记）
downloading:  http://www.jianshu.com/p/926013888e3e
上海市房价数据分析报告
downloading:  http://www.jianshu.com/p/888a580b2384
机器学习实战之准备（一）
downloading:  http://www.jianshu.com/p/e72c8ef71e49
Tiger：我眼中的大数据(2)-转行和实战经验
downloading:  http://www.jianshu.com/p/bb4a81624af1
【泰阁志-数据分析】作业4：区间估计
downloading:  http://www.jianshu.com/p/4b944b22fe83
数据的核心还是人 - 商业数据分析01 作业2，3
downloading:  http://www.jianshu.com/p/fa7dd359d7a8
第一次作业，分析weatherdata.csv
downloading:  http://www.jianshu.com/p/bfd9b3954038
商标代理业务的数据指标设计 - 商业数据分析1 作业
downloading:  http://www.jianshu.com/p/2364064e0bc9
用python制作正态分布图
downloading:  http://www.jianshu.com/p/56967004f8c4
直方图绘制+简单文件处理
downloading:  http://www.jianshu.com/p/394856545ab0
Python问题汇总 for 统计方法论
downloading:  http://www.jianshu.com/p/aed64f7e647b
作业-数据描述之统计量
downloading:  http://www.jianshu.com/p/a32f27199846
网购流程的小差异背后体现了什么？
downloading:  http://www.jianshu.com/p/4b4e0c343d3e
Lending Club贷款数据分析（下）
downloading:  http://www.jianshu.com/p/8f6b5a1bb3fa
数据中·见商
downloading:  http://www.jianshu.com/p/f7354d1c5abf
对链家网租房进行分析
downloading:  http://www.jianshu.com/p/1fe31cbddc78
爬虫入门01作业 phsyke
downloading:  http://www.jianshu.com/p/f7dc92913f33
第五次作业习题3——商业数据分析
downloading:  http://www.jianshu.com/p/296ae7538d1f
第三次作业-正态分布分析
downloading:  http://www.jianshu.com/p/d43125a4ff44
硅谷数据科学家成长之路-笔记
downloading:  http://www.jianshu.com/p/0b0b7c33be57
数据分析入门常见问题汇总
downloading:  http://www.jianshu.com/p/979b4c5c1857
推论统计作业 
downloading:  http://www.jianshu.com/p/4b57424173a0
Tiger：我眼中的大数据-新生大学分享(1)
downloading:  http://www.jianshu.com/p/e0ae002925bd
说说统计图表 - 黑客
downloading:  http://www.jianshu.com/p/5250518f5cc5
商业数据分析第三次课作业-0725
downloading:  http://www.jianshu.com/p/7b946e6d6861
小白进阶历程-直方图学习
downloading:  http://www.jianshu.com/p/62e127dbb73c
2017- 我的敏捷学习之年
downloading:  http://www.jianshu.com/p/430b5bea974d
顶级风投的宿命
downloading:  http://www.jianshu.com/p/e5d13e351320
人人都会用数据（一）——直方图&平均数
downloading:  http://www.jianshu.com/p/5d8a3205e28e
     用python 制作直方图
downloading:  http://www.jianshu.com/p/1099c3a74336
作业：商业数据分析之X地气象分析
downloading:  http://www.jianshu.com/p/761a73b7eea2
大数据作业5 - 习题2&3 电商流程剖析& 股票分析
downloading:  http://www.jianshu.com/p/83cc892eb24a
第三课-正态分布
downloading:  http://www.jianshu.com/p/b223e54fe5ee
商业分析之数据指标
downloading:  http://www.jianshu.com/p/366c2594f24b
USDA食品数据库分析
downloading:  http://www.jianshu.com/p/cc3b5d76c587
解密大数据0305作业 直方图
downloading:  http://www.jianshu.com/p/6dbadc78d231
机器学习笔记1-监督学习
downloading:  http://www.jianshu.com/p/733475b6900d
作业4-推论统计
downloading:  http://www.jianshu.com/p/e71e5d7223bb
你所不知道的网购搜索
downloading:  http://www.jianshu.com/p/f26085aadd47
机器学习实战之K-近邻算法（二）
downloading:  http://www.jianshu.com/p/df7b35249975
python 直方图
downloading:  http://www.jianshu.com/p/68423bfc4c4e
 数据中间商 正在改变的世界
downloading:  http://www.jianshu.com/p/601d3a488a58
协作翻译复盘 Buddha篇
downloading:  http://www.jianshu.com/p/1d6fc1a9406b
【泰阁志-数据分析】作业5：商业数据分析01
downloading:  http://www.jianshu.com/p/76238014a03f
商业数据分析02作业
downloading:  http://www.jianshu.com/p/9e7cfcc85a57
爬虫入门01（笔记）
downloading:  http://www.jianshu.com/p/4a8749704ebf
爬虫入门02作业
downloading:  http://www.jianshu.com/p/d2dc5aa9bf8f
商业数据分析课程6作业7
downloading:  http://www.jianshu.com/p/4dda2425314a
浅谈黑天鹅
downloading:  http://www.jianshu.com/p/8baa664ea613
初试商业数据分析之股票分析
downloading:  http://www.jianshu.com/p/cbfab5db7f6f
简书的数据指标体系及分析 - 商业数据分析02 作业
downloading:  http://www.jianshu.com/p/bd78a49c9d23
 第二次作业正态分布图
downloading:  http://www.jianshu.com/p/cf2edecdba77
利用python数据分析之数据聚合与分组（七）
downloading:  http://www.jianshu.com/p/3b3bca4281aa
第五次作业--股票数据分析
downloading:  http://www.jianshu.com/p/f382741c2736
第二次作业
downloading:  http://www.jianshu.com/p/4ffca0a43476
Python数据分析入门 - 正态分布
downloading:  http://www.jianshu.com/p/e04bcac99c8d
精益创业
downloading:  http://www.jianshu.com/p/37e927476dfe
解密大数据0226大作业
downloading:  http://www.jianshu.com/p/4981df2eefe7
爬虫入门01作业
downloading:  http://www.jianshu.com/p/86117613b7a6
左手程序员，右手作家：你必须会的Jupyter Notebook
downloading:  http://www.jianshu.com/p/233ff48d668e
课程作业-商业数据分析技术篇01-Python热身-DrFish-20170708
downloading:  http://www.jianshu.com/p/13a68ac7afdd
爬虫作业03-爬取解密大数据专栏下的所有文章
downloading:  http://www.jianshu.com/p/aa1121232dfd
爬虫课02作业
downloading:  http://www.jianshu.com/p/e99dacbf5c44
为什么释迦摩尼提倡过午不食？
downloading:  http://www.jianshu.com/p/40cc7d239513
为电影公司创建可视化图表-tableau
downloading:  http://www.jianshu.com/p/5a8b8ce0a395
课程笔记-商业数据分析技术篇01-Python热身-DrFish-20170708
downloading:  http://www.jianshu.com/p/14967ec6e954
机器学习入门课听课后思考
downloading:  http://www.jianshu.com/p/8266f0c736f9
【读书笔记】如何应对不确定性事件 -《黑天鹅》的终极奥秘
downloading:  http://www.jianshu.com/p/b3e8e9cb0141
作业-01获取网络数据的原理
downloading:  http://www.jianshu.com/p/ae5f78b40f17
数据分析入门毕业项目：电商购物平台母亲节礼品特征分析
downloading:  http://www.jianshu.com/p/10b429fd9c4d
课程作业-爬虫入门02-网页基础与结构分析-WilliamZeng-20170706
downloading:  http://www.jianshu.com/p/2c557a1bfa04
2.26大作业——逻辑思考题
downloading:  http://www.jianshu.com/p/9457100d8763
如何同时在 Anaconda 同时配置 python 2和3
downloading:  http://www.jianshu.com/p/62c0a5122fa8
利用python进行数据分析之数据规整化(一)
downloading:  http://www.jianshu.com/p/59ca82a11f87
3.5作业--直方图制作及分析
downloading:  http://www.jianshu.com/p/27a78b2016e0
Python学习利器——我的小白 Anaconda安装之路
downloading:  http://www.jianshu.com/p/0c007dbbf728
爬虫入门L1   数据url
downloading:  http://www.jianshu.com/p/f6420cce3040
利用python进行数据分析之数据规整化(二)

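The collection holds a lot of articles: 278.
When defining the download function in Part 2, the teacher put in time.sleep(1), a 1-second delay.

So 278 articles take at least 278 seconds, nearly 5 minutes. I did not want to wait that long, so I quietly changed it to time.sleep(0.2), a 200-millisecond delay. The crawl still worked, the Jianshu server did not block my IP, and it saved time: everything finished in about a minute.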
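
The waiting time can be checked with a little arithmetic. The sketch below is Python 3 and counts only the time spent in time.sleep, not the time the requests themselves take. sleep_budget is a made-up helper, and the 29 list pages are taken from the log above.

```python
def sleep_budget(request_count, delay):
    """Total seconds spent in time.sleep() for a given number of requests."""
    return request_count * delay


articles = 278    # article pages, from "total paper num" in the log
list_pages = 29   # list pages requested in the log (page=1 to page=29)

for delay in (1.0, 0.2):
    print("delay %.1f s: articles only %.1f s, list pages included %.1f s" % (
        delay,
        sleep_budget(articles, delay),
        sleep_budget(articles + list_pages, delay),
    ))
```

With a 0.2-second delay the sleeping adds up to roughly 56 to 61 seconds, which fits the "about a minute" observed.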

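Check the results in the spider_res directory on my computer:

273 articles under the 解密大数据 collection

I was puzzled that 278 articles were reported earlier while only 273 files were actually saved. A likely explanation is duplicate titles: crawled_page skips a file that already exists, and five titles appear twice in the log above (爬虫第一课作业, 机器学习入门课听课后思考, 【泰阁志-数据分析】作业5：商业数据分析01, 爬虫入门01作业 and 爬虫入门02作业), so 278 - 5 = 273.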
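
Because crawled_page uses the title as the file name and skips names that already exist, articles that share a title end up as a single file. The sketch below (Python 3) shows that effect on three titles from the log and one hypothetical way around it, appending the article id from the URL. All three helper names are my own.

```python
from collections import Counter


def count_files(titles):
    """One file is written per distinct title, so repeats collapse."""
    return len(set(titles))


def repeated_titles(titles):
    """Return the titles that occur more than once, with their counts."""
    return {t: n for t, n in Counter(titles).items() if n > 1}


def unique_file_name(title, url):
    """Hypothetical fix: append the article id from the URL to the title."""
    article_id = url.rstrip("/").rsplit("/", 1)[-1]
    return "%s_%s.txt" % (title, article_id)


# Three titles taken from the log above; one of them occurs twice.
sample = ["爬虫第一课作业", "爬虫作业3", "爬虫第一课作业"]
print(count_files(sample))
print(repeated_titles(sample))
print(unique_file_name("爬虫第一课作业", "http://www.jianshu.com/p/70b7505884e9"))
```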

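I opened a few articles at random.
If the author wrote the article in markdown, the paragraphs in the txt file are clearly separated.

If the article was not written in markdown, there are no paragraph breaks and the text is tiring to read.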
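
One plausible cause, which I have not verified against Jianshu's HTML, is that BeautifulSoup's .text simply concatenates the text nodes, so line breaks survive only when the HTML source itself has newlines between paragraphs. Under that assumption, a sketch of an extractor that adds a newline at block-level tags, using only the Python 3 standard library, could look like this. TextWithBreaks, html_to_text and the tag list are my own choices.

```python
from html.parser import HTMLParser

# Closing any of these tags ends a visual block, so a line break is emitted.
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6", "blockquote"}


class TextWithBreaks(HTMLParser):
    """Collect text and emit a newline at every block-level tag boundary."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":  # an explicit line break
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML fragment, one non-empty line per block."""
    parser = TextWithBreaks()
    parser.feed(html)
    parser.close()
    lines = [line.strip() for line in "".join(parser.parts).splitlines()]
    return "\n".join(line for line in lines if line)


print(html_to_text("<div><p>first paragraph</p><p>second paragraph</p></div>"))
```

With BeautifulSoup itself, passing a separator to get_text, for example get_text("\n"), is another way to keep text nodes apart.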
