Scraping Douban Book Reviews with Python and Displaying Them as a Word Cloud

Author: money666 | Published 2020-02-02 21:07

    Preface:

    Tools: Python 3.7, VS Code, Chrome

    Install beautifulsoup4, jieba, and wordcloud (pip install <package>); urllib is part of the Python standard library and needs no installation.

    1. Analyzing the Douban pages

    First, let's look at Douban's search page.

    (Screenshot: the search page and its URL)

    Looking at the navigation bar on the left together with the URL, we can see that the value after cat and the book or movie title after q determine the search results, which gives the following mapping:

    读书 (books): 1001

    电影 (movies): 1002

    音乐 (music): 1003

    Inspecting the page source (F12) shows that everything we need sits inside an <a> tag. Since Douban's ranking already puts the best match first, we can simply take the first search result as the item to crawl; all we need from it is its sid, and the crawler for the detail page handles the rest.

    (Screenshot: source code of the search page)

    The source code is given below:

    import ssl
    import string
    import urllib
    import urllib.request
    import urllib.parse
    
    from bs4 import BeautifulSoup
    
    
    def create_url(keyword: str, kind: str) -> str:
        '''
        Create url through keywords
        Args:
            keyword: the keyword you want to search
            kind: a string indicating the kind of search result
                type: 读书; num: 1001
                type: 电影; num: 1002
                type: 音乐; num: 1003
        Returns: url
        '''
        num = ''
        if kind == '读书':
            num = 1001
        elif kind == '电影':
            num = 1002
        elif kind == '音乐':
            num = 1003
        url = 'https://www.douban.com/search?cat=' + \
            str(num) + '&q=' + keyword
        return url
    
    
    def get_html(url: str) -> str:
        '''send a request'''
    
        headers = {
            # 'Cookie': your cookie,
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
            'Connection': 'keep-alive'
        }
        ssl._create_default_https_context = ssl._create_unverified_context
    
        s = urllib.parse.quote(url, safe=string.printable)  # safe lists characters that should not be percent-encoded
        req = urllib.request.Request(url=s, headers=headers)
        req = urllib.request.urlopen(req)
        content = req.read().decode('utf-8')
        return content
    
    
    def get_content(keyword: str, kind: str) -> str:
        '''
        Search Douban for the keyword and return the first result's HTML
        Args:
            keyword: the keyword you want to search
            kind: a string indicating the kind of search result
                type: 读书; num: 1001
                type: 电影; num: 1002
                type: 音乐; num: 1003
        Returns: the HTML string of the first <h3> tag in the search results
        '''
        url = create_url(keyword=keyword, kind=kind)
        html = get_html(url)
        # print(html)
        soup_content = BeautifulSoup(html, 'html.parser')
        contents = soup_content.find_all('h3', limit=1)
        result = str(contents[0])
        return result
    
    
    def find_sid(raw_str: str) -> str:
        '''
        find sid in raw_str
        Args:
            raw_str: a html info string contains sid
        Returns:
            sid
        '''
        assert type(raw_str) == str, \
            '''the type of raw_str must be str'''
        start_index = raw_str.find('sid:')
        sid = raw_str[start_index + 5: start_index + 13]  # grab the digits that follow 'sid: '
        sid = sid.strip(',')  # strip() returns a new string, so reassign it
        return sid
    
    
    if __name__ == "__main__":
        raw_str = get_content('看见', '读书')
        print(find_sid(raw_str))
    

    Now we have the sid that uniquely identifies the book (or movie).

    Next, look at the page we want to crawl and inspect its source (F12).

    (Screenshots: the target page and its source code)

    We can see that the comments we need all sit under <span class="short"> tags, while the author, date, and star rating each hide in other child tags. The relevant code is:

    comments = soupComment.findAll('span', 'short')
    
    time = soupComment.select( '.comment-item > div > h3 > .comment-info > span:nth-of-type(2)')
    
    name = soupComment.select('.comment-item > div > h3 > .comment-info > a')
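
    As a minimal usage sketch (assuming the three selectors above each return one element per comment, in the same order), the pieces can be zipped together:

    for name_tag, time_tag, comment_tag in zip(name, time, comments):
        # reviewer, review date, and review text for one comment
        print(name_tag.get_text(strip=True),
              time_tag.get_text(strip=True),
              comment_tag.get_text(strip=True))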
    

    URL of the first page of comments: https://book.douban.com/subject/20427187/comments/hot?p=1

    URL of the second page: https://book.douban.com/subject/20427187/comments/hot?p=2

    ...

    URL of page n: https://book.douban.com/subject/20427187/comments/hot?p=n

    Paging through the comments reveals the URL pattern: only the value after p changes.

    2. Scraping the Douban comment data

    We need to give the crawler a fake request header to get around the site's anti-crawling measures:

    headers = {
        # 'Cookie': your cookie,
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
        'Referer': 'https://movie.douban.com/subject/20427187/comments?status=P',
        'Connection': 'keep-alive'
    }
    
    

    To get the cookie, log into your Douban account in the browser, then look under F12 -> Network -> All -> Headers.


    The crawler code is as follows:

    import time
    import urllib.request

    from bs4 import BeautifulSoup

    import crawler_tools  # the functions from section one (save that script as crawler_tools.py)


    def creat_url(num):
        '''Build the list of comment-page URLs for the given subject id'''
        urls = []
        for page in range(1, 20):
            url = 'https://book.douban.com/subject/' + \
                str(num) + '/comments/hot?p=' + str(page)
            urls.append(url)
        print(urls)
        return urls


    def get_html(urls):
        '''Fetch every page and return the concatenated HTML'''
        headers = {
            # 'Cookie': your cookie,
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
            'Connection': 'keep-alive'
        }
        content = ''
        for url in urls:
            print('Crawling: ' + url)
            req = urllib.request.Request(url=url, headers=headers)
            resp = urllib.request.urlopen(req)
            content += resp.read().decode('utf-8')  # accumulate; otherwise only the last page is kept
            time.sleep(10)  # be polite to the server
        return content


    def get_comment(num):
        a = creat_url(num)
        html = get_html(a)
        soupComment = BeautifulSoup(html, 'html.parser')
        comments = soupComment.findAll('span', 'short')
        onePageComments = []
        for comment in comments:
            onePageComments.append(comment.getText() + '\n')
        print(onePageComments)
        f = open('数据.txt', 'a', encoding='utf-8')
        for sentence in onePageComments:
            f.write(sentence)
        f.close()


    raw_str = crawler_tools.get_content('看见', '读书')
    sid = crawler_tools.find_sid(raw_str)
    print('sid:' + sid)
    get_comment(sid)
    

    3. Data cleaning, feature extraction, and word cloud display

    First, use the jieba library to segment the text, and use its built-in TF-IDF algorithm to compute a weight for each term.
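
    As a quick illustration of that TF-IDF weighting, here is a minimal sketch (the sample sentence below is made up purely for demonstration) that asks jieba for the weights directly:

    import jieba.analyse

    sample = '新闻让我们看见世界, 看见也让我们理解彼此'  # made-up sample text for demonstration
    for word, weight in jieba.analyse.extract_tags(sample, topK=5, withWeight=True):
        print(word, round(weight, 3))  # keyword and its TF-IDF weight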

    Then use the wordcloud library to generate the word cloud, with the following parameters (a sketch of the full call follows this list):

    font_path='FZQiTi-S14S.TTF', # font

    max_words=66, # maximum number of words displayed

    max_font_size=600, # maximum font size

    random_state=666, # random seed for the layout

    width=1400, height=900, # image size

    background_color='black', # background color

    stopwords=stopWords_list # the stopword list
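
    Put together, a minimal sketch of the call might look like this (the font path, the placeholder stopword list, and the sample text are assumptions; point font_path at a font file that actually exists on your machine and pass your own stopwords):

    import wordcloud

    stop_list = ['的', '了', '是']  # placeholder stopword list; load your own
    w = wordcloud.WordCloud(font_path='FZQiTi-S14S.TTF',    # font file (assumed to exist locally)
                            max_words=66,                   # maximum number of words shown
                            max_font_size=600,              # largest font size
                            random_state=666,               # fixed seed so the layout is reproducible
                            width=1400, height=900,         # image size in pixels
                            background_color='black',       # background color
                            stopwords=set(stop_list))       # words to drop before counting
    w.generate('看见 记录 新闻 理解 真实')  # any space-separated text works here
    w.to_file('demo_wordcloud.png')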

    Here is the word cloud built from the cleaned data compared with a plain one:

    (Screenshots: unprocessed vs. processed word cloud)

    The data-processing code is as follows:

    import jieba
    import jieba.analyse
    import wordcloud

    # read the raw comments collected by the crawler
    f = open('/Users/money666/Desktop/The_new_crawler/看见.txt',
             'r', encoding='utf-8')
    contents = f.read()
    f.close()

    # read the stopword list from file and split it into a list
    stopWords_dic = open(
        '/Users/money666/Desktop/stopwords.txt', 'r', encoding='gb18030')
    stopWords_content = stopWords_dic.read()
    stopWords_list = stopWords_content.splitlines()
    stopWords_dic.close()

    # extract the 75 highest-weighted keywords with jieba's TF-IDF
    keywords = jieba.analyse.extract_tags(
        contents, topK=75, withWeight=False)
    print(keywords)

    w = wordcloud.WordCloud(background_color="black",
                            font_path='/Users/money666/Desktop/字体/粗黑.TTF',
                            width=1400, height=900, stopwords=stopWords_list)
    txt = ' '.join(keywords)
    w.generate(txt)
    w.to_file("/Users/money666/Desktop/The_new_crawler/看见.png")
    

    4. Problems and solutions

    1. pip timeout

    Option 1: create or edit the pip configuration file:

    $ sudo vi ~/.pip/pip.conf

    timeout = 500  # pip timeout, in seconds

    Option 2: use a domestic mirror

    Use a mirror in place of the official index, in either of two ways:

    1. pip install redis -i https://pypi.douban.com/simple

    -i: specifies the mirror index URL

    2. Create or edit pip.conf and point it at the mirror:

    [global]
    timeout = 6000
    index-url = http://pypi.douban.com/simple/

    [install]
    use-mirrors = true
    mirrors = http://pypi.douban.com/simple/
    trusted-host = pypi.douban.com

    Note: pip.conf can live in several locations (create it if it does not exist), and its location can also be set through an environment variable.

    Linux: /etc/pip.conf, ~/.pip/pip.conf, ~/.config/pip/pip.conf

    Windows: %APPDATA%\pip\pip.ini, %HOME%\pip\pip.ini, C:\Documents and Settings\All Users\Application Data\PyPA\pip\pip.conf (Windows XP), C:\ProgramData\PyPA\pip\pip.conf (Windows 7 and later)

    Mac OS X: ~/Library/Application Support/pip/pip.conf, ~/.pip/pip.conf, /Library/Application Support/pip/pip.conf
