
The Witcher 3 Game Resource Ingestion Program, Part 1

Author: 小肥爬爬 | Published 2018-10-08 21:54

    Introduction

    I got into The Witcher 3 around the middle of the year, and by now I've pretty much set every other game aside; my days consist of wandering the map clearing question marks. Out of fondness for the game and the novel series, plus a middle-aged man's lingering old-fashioned notion that even play should involve learning, I plan to build a small English-study tool to make it easier to memorize in-game vocabulary and improve my English reading skills.

    The planned features are:

    • Crawler and database import: scrape all of the game's entries (characters, books, monsters, etc.) into a database. Start with the books to get the pipeline running, then add the other material at the end.
    • A small app for organizing and reading the material. The UI and implementation details are still undecided.
      Let's finish the crawler first.

    Data Source

    After looking around, I found this site: http://witcher.wikia.com/wiki/Category:Books_in_the_games
    Many thanks to the fans for their diligent curation.

    The Code

    crawl_helper.py is a small crawler helper module; for a requirement this simple, I really didn't want to pull in a framework:

    #! /usr/bin/python
    # -*- coding: UTF-8 -*-
    """
        Uses requests, to build up practical crawling skills.

        For an IP proxy pool, see this article: http://ju.outofmemory.cn/entry/246458

        FIXME :
        1. Cookies could be stored in the database, so that after the first visit
           later requests automatically carry the latest cookie.

        Author: 萌萌哒小肥他爹
        Jianshu: https://www.jianshu.com/u/db796a501972
    
    """
    
    import random
    import requests
    import string
    from bs4 import BeautifulSoup
    import logging
    
    
    UA_LIST = [
        'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:56.0) Gecko/20100101 Firefox/56.0',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    ]
    
    
    def get_header(crawl_for):
        headers = {
            # 'Host': 'www.zhihu.com',
            'User-Agent': random.choice(UA_LIST),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
            'Accept-Encoding': 'gzip, deflate',
            'Referer': 'http://www.baidu.com',
            'Connection': 'keep-alive',
            'Cache-Control': 'max-age=0',
        }
    
        if crawl_for == 'douban':
            headers['Host'] = 'www.douban.com'
            headers['Cookie'] = "bid=%s" % "".join(random.sample(string.ascii_letters + string.digits, 11))
    
        elif crawl_for == 'zhihu_answer':
            headers['Host'] = 'www.zhihu.com'
            headers['Cookie'] = 'd_c0="ACCCRm0uzAuPTn3djjdlWBFiQWJ0oQUIhpU=|1495460939"; _zap=3b7aeef8-23a0-4de9-a16b-5fece66e5498; q_c1=cb07a3b06a6e4efaa2b78015c6c2243f|1507801443000|1493628570000; r_cap_id="MjI5OTUyMTk2MzgyNDYwODg1N2RjNWE0ZTEzN2FlNDI=|1510280548|f6d71498966574559ce3f3a64ee848f9b148ffbe"; cap_id="ODJlZjNmOTg5YmQ0NDM0MWJjMDM1M2M0NjgzYWY0MmU=|1510280548|3bd34d6d0f9672659fbd3845ce08a78ca2fd634f"; z_c0=Mi4xaU1jRkFBQUFBQUFBSUlKR2JTN01DeGNBQUFCaEFsVk5jbHZ5V2dDd2Z2Sk1YZXZoVGNLUFRqcVFTY1ExMFhJNjhn|1510280562|94f746f7f48dab3490583fdc65f18ec4df358782; _xsrf=f6fff6c2cd1f55e60b61b29d098f8342; q_c1=cb07a3b06a6e4efaa2b78015c6c2243f|1510311164000|1493628570000; aliyungf_tc=AQAAAGpRUB+dlAoApSHeb6jmDrXePee1; __utma=155987696.1981921582.1510908314.1510908314.1510911826.2; __utmc=155987696; __utmz=155987696.1510908314.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); _xsrf=f6fff6c2cd1f55e60b61b29d098f8342'
    
        elif crawl_for == 'zhihu_question':
            headers['Host'] = 'www.zhihu.com'
            headers['Cookie'] = '_zap=6b9be63d-3724-40c4-9bd2-3e2c6c533472; d_c0="ACCCRm0uzAuPTn3djjdlWBFiQWJ0oQUIhpU=|1495460939"; _zap=3b7aeef8-23a0-4de9-a16b-5fece66e5498; q_c1=cb07a3b06a6e4efaa2b78015c6c2243f|1507801443000|1493628570000; r_cap_id="MjI5OTUyMTk2MzgyNDYwODg1N2RjNWE0ZTEzN2FlNDI=|1510280548|f6d71498966574559ce3f3a64ee848f9b148ffbe"; cap_id="ODJlZjNmOTg5YmQ0NDM0MWJjMDM1M2M0NjgzYWY0MmU=|1510280548|3bd34d6d0f9672659fbd3845ce08a78ca2fd634f"; z_c0=Mi4xaU1jRkFBQUFBQUFBSUlKR2JTN01DeGNBQUFCaEFsVk5jbHZ5V2dDd2Z2Sk1YZXZoVGNLUFRqcVFTY1ExMFhJNjhn|1510280562|94f746f7f48dab3490583fdc65f18ec4df358782; _xsrf=f6fff6c2cd1f55e60b61b29d098f8342; q_c1=cb07a3b06a6e4efaa2b78015c6c2243f|1510311164000|1493628570000; aliyungf_tc=AQAAAGpRUB+dlAoApSHeb6jmDrXePee1; __utma=155987696.1981921582.1510908314.1510911826.1510921296.3; __utmc=155987696; __utmz=155987696.1510908314.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); _xsrf=f6fff6c2cd1f55e60b61b29d098f8342'
    
        else:
            logging.error(u'---- unrecognized crawl_for value, using the default headers')
    
        return headers
    
    
    def do_get(url, crawl_for="", is_json=False):
    
        headers = get_header(crawl_for)
    
        """
        如果不设置verify=False, 会抛出以下异常:
          File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 385, in send
    raise SSLError(e)
    requests.exceptions.SSLError: [Errno 1] _ssl.c:510: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
    
        看作者讨论, 是因为 ubuntu14.04 的openssl版本太低. 这里: https://github.com/PHPMailer/PHPMailer/issues/1022
        """
    
        resp = requests.get(url, headers=headers, verify=False)
        # print resp.apparent_encoding
        # On garbled encodings when scraping, there is a good article: http://liguangming.com/python-requests-ge-encoding-from-headers
        # real_encoding = requests.utils.get_encodings_from_content(resp.content)[0]
        #
        # content = resp.content.decode(real_encoding).encode('utf8')
    
        if is_json:
            return resp.json()
    
        soup = BeautifulSoup(resp.content, 'html.parser')
    
        return soup
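
    The module docstring mentions an IP proxy pool, but do_get does not actually route requests through a proxy yet. Below is a minimal sketch of how one could be wired in via requests' standard proxies parameter; the helper name and the single proxy_addr argument are my own for illustration, not part of the original program:

    import requests

    def do_get_with_proxy(url, headers, proxy_addr=None):
        # requests accepts a dict mapping scheme -> proxy URL, e.g. 'http://1.2.3.4:8080'
        proxies = {'http': proxy_addr, 'https': proxy_addr} if proxy_addr else None
        return requests.get(url, headers=headers, proxies=proxies, verify=False, timeout=10)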
    
    

    The main script is below; it can already locate each book's title and text.

    #! /usr/bin/python
    # -*- coding: UTF-8 -*-
    """
        Links to The Witcher 3 books.
        The original listing is here: http://witcher.wikia.com/wiki/Category:Books_in_the_games
        That page shows the data structure more clearly; the URL used in the program is the
        paginated AJAX endpoint, which is easier to page through and has exactly the same structure.

        Author: 萌萌哒小肥他爹
        Jianshu: https://www.jianshu.com/u/db796a501972
    
    """
    from bs4 import BeautifulSoup
    from crawler import crawl_helper
    import time
    
    witcher3_books_url_template = 'http://witcher.wikia.com/index.php?action=ajax&articleId=The+Witcher+3+books&method=axGetArticlesPage&rs=CategoryExhibitionAjax&page=%d'
    test_url = witcher3_books_url_template % 1
    
    g_domain = 'http://witcher.wikia.com'
    
    # print(soup)
    
    # main_books = soup.find_all('div', {'id': 'mw-pages'})[0]
    
    
    def do_it():
    
        # with is_json=True, do_get returns the parsed JSON dict rather than a soup
        page_json = crawl_helper.do_get(test_url, '', True)
        main_books = BeautifulSoup(page_json['page'], 'html.parser')
        main_books = main_books.find_all('div', {'class': 'category-gallery-item'})
    
        for div in main_books:
            a_tag = div.find_all('a')[0]
            title = a_tag['title']
            book_url = g_domain + a_tag['href']
            print('---- book: %s, url: %s' % (title, book_url))
    
            time.sleep(1.17)
    
            find_book_detail(book_url)
    
    
    def find_book_detail(book_url):
        """
        For the exact page format, see: http://witcher.wikia.com/wiki/Hieronymus%27_notes
    
        :param book_url:
        :return:
        """
    
        book_html = crawl_helper.do_get(book_url, '', False)
        article_div = book_html.find_all('div', {'class': 'WikiaArticle'})[0]
    
        # the wiki sometimes uses dl and sometimes p for the body text...
        content_tag_list = article_div.find_all('dl')
        if not content_tag_list:  # find_all returns an empty list, never None
            content_tag_list = article_div.find_all('p')
    
        for dl_tag in content_tag_list:
            print(dl_tag.text)
            # print(dl_tag.encode_contents())
    
    
    if __name__ == '__main__':
        do_it()
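
    do_it above only fetches page 1 of the listing. Here is a rough sketch, reusing the names defined in the script, of how the remaining pages could be walked; the stop-on-empty-page condition and the max_pages cap are my assumptions, since I haven't checked how the wiki's endpoint behaves past the last page:

    def do_all_pages(max_pages=20):
        # walk the paged listing until a page yields no gallery items
        for page in range(1, max_pages + 1):
            url = witcher3_books_url_template % page
            data = crawl_helper.do_get(url, '', True)
            items = BeautifulSoup(data['page'], 'html.parser').find_all(
                'div', {'class': 'category-gallery-item'})
            if not items:
                break
            for div in items:
                a_tag = div.find('a')
                print('---- book: %s, url: %s' % (a_tag['title'], g_domain + a_tag['href']))
                find_book_detail(g_domain + a_tag['href'])
                time.sleep(1.17)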
    
    
    

    That's it for now; I'll implement the database side tomorrow (a rough sketch of the idea follows below). The comments in the code already explain what it does, so there's no need to say more.
    From now on I'll post a new installment each time a major feature is finished, both to push myself to see this through and in the hope that some readers will enjoy it.
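
    Since the storage layer hasn't been written yet, here is only a minimal sketch of what it might look like with the standard sqlite3 module; the table name and columns are my own assumption, not the final schema:

    import sqlite3

    def save_book(db_path, title, url, content):
        # hypothetical schema: one row per book, keyed by its wiki URL
        conn = sqlite3.connect(db_path)
        conn.execute("""CREATE TABLE IF NOT EXISTS books (
                            url TEXT PRIMARY KEY,
                            title TEXT,
                            content TEXT)""")
        conn.execute("INSERT OR REPLACE INTO books (url, title, content) VALUES (?, ?, ?)",
                     (url, title, content))
        conn.commit()
        conn.close()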
