Upcoming Plans: A Series of Web-Crawler Articles (3)


Author: nonoBoy | Published 2017-04-12 18:10 | 256 views

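    This article shows how to scrape data from 一亩三分地 (1point3acres), a forum for students going abroad, and how to build a simple local search engine on top of the scraped data.
    1. Goal: scrape the thread listings of the forum's "院校介绍" (school introductions) board and write them to a database: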

    import requests
    from lxml import etree
    from mongodb_queue import MongoQueue
    import time
    
    spider_queue = MongoQueue('1point3acres', 'schoolInfo')
    
    def method1(url):
        response = requests.get(url)
        selector = etree.HTML(response.text)
    
        all_title = selector.xpath("//a[@class='s xst']")
        all_href = selector.xpath("//a[@class='s xst']/@href")
    
        for i in range(len(all_title)):
            title = all_title[i].xpath('string(.)')
            href = all_href[i]
            print(title + ', ' + href)
            # Write the link and title to the database
            spider_queue.push(href, title)
    
    for i in range(1,32):
        url = "http://www.1point3acres.com/bbs/forum-71-" + str(i) + ".html"
        method1(url)
        time.sleep(2)
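
To see what the XPath expression `//a[@class='s xst']` selects without hitting the live site, here is a minimal offline sketch. It is not the author's code: it swaps lxml for the standard library's `xml.etree.ElementTree`, which only handles well-formed markup, and `SAMPLE_PAGE`, its shortened `href` values, and `extract_threads` are hypothetical stand-ins invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for one page of the thread list.
SAMPLE_PAGE = """
<table>
  <tr><th><a class="s xst" href="thread-1-1-1.html">UCLA EE MS <b>intro</b></a></th></tr>
  <tr><th><a class="other" href="ignored.html">not a thread title</a></th></tr>
  <tr><th><a class="s xst" href="thread-2-1-1.html">CMU-MSP intro</a></th></tr>
</table>
"""

def extract_threads(page_source):
    """Return (title, href) pairs for every <a class="s xst"> element."""
    root = ET.fromstring(page_source.strip())
    threads = []
    for link in root.findall(".//a[@class='s xst']"):
        # itertext() also collects text inside nested tags,
        # much like XPath's string(.) in the crawler
        title = "".join(link.itertext()).strip()
        threads.append((title, link.get("href")))
    return threads

threads = extract_threads(SAMPLE_PAGE)
for title, href in threads:
    print(title + ", " + href)
```

Only the two links whose `class` attribute is exactly `s xst` are returned. Real forum pages are rarely well-formed XML, which is why the crawler itself relies on the more forgiving `etree.HTML` parser from lxml.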
    

    Running the crawler above prints the following data:

    # Thread title, link to the thread
    北美留学生求职宝典 & 资源超级入口, http://www.1point3acres.com/bbs/thread-226233-1-1.html
    直播:一波三折留学路,他是如何进入FACEBOOK的?, http://www.1point3acres.com/bbs/thread-273145-1-1.html
    地里新站beta上线:一亩三分地instant - 问题和建议汇报帖, http://www.1point3acres.com/bbs/thread-149010-1-1.html
    想在美国找CS软件工作?Warald提供需要学习的书籍和课程名单!, http://www.1point3acres.com/bbs/thread-50411-1-1.html
    想加入一亩三分地的开发团队,给小伙伴们带来惊喜吗?, http://www.1point3acres.com/bbs/thread-144800-1-1.html
    美国大学院系专业信息精选206篇, http://www.1point3acres.com/bbs/thread-145548-1-1.html
    介绍学校院系program的十个问题模板, http://www.1point3acres.com/bbs/thread-35347-1-1.html
    UCSD CSE风景超好学费超便宜排名又好你们不来吗!!!(半年生活学习实习分享, http://www.1point3acres.com/bbs/thread-259453-1-1.html
    UCLA EE MS介绍(2016入学,PWE->SS), http://www.1point3acres.com/bbs/thread-270916-1-1.html
    USC新项目data informatics介绍, http://www.1point3acres.com/bbs/thread-222674-1-1.html
    UPENN CGGT介绍, http://www.1point3acres.com/bbs/thread-214794-1-1.html
    Lehigh ISE Department, http://www.1point3acres.com/bbs/thread-271493-1-1.html
    快毕业了,谈谈UMich化工, http://www.1point3acres.com/bbs/thread-158888-1-1.html
    UCLA EE(SS Track) 给想转码的但纠结于选校和专业的同学一些建议, http://www.1point3acres.com/bbs/thread-229797-1-1.html
    Brown CS 2013 Fall 情况, http://www.1point3acres.com/bbs/thread-72777-1-1.html
    CMU-SV SE信息帖, http://www.1point3acres.com/bbs/thread-125194-1-1.html
    bu cs master不给CPT 请大家慎重择校, http://www.1point3acres.com/bbs/thread-180266-1-1.html
    来安利一下UCI,找工作为目的的同学看过来, http://www.1point3acres.com/bbs/thread-172914-1-1.html
    UVA EECS 17 Fall 新生群, http://www.1point3acres.com/bbs/thread-269841-1-1.html
    UVA 2017 FALL 新生群 欢迎加入, http://www.1point3acres.com/bbs/thread-259511-1-1.html
    CMU-MSP 介绍, http://www.1point3acres.com/bbs/thread-200340-1-1.html
    ......
    

    2. Building the local search engine mainly comes down to adding one method to mongodb_queue:

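    # Add this method to the MongoQueue class in mongodb_queue.py
    # (that module needs `from datetime import datetime`).
    def popFindings(self, keyword, count):
        # Atomically claim one OUTSTANDING record whose title ('主题')
        # matches the keyword, case-insensitively, and mark it PROCESSING.
        # find_one_and_update replaces find_and_modify, which PyMongo 4 removed.
        record = self.db.find_one_and_update(
            {'status': self.OUTSTANDING,
             '主题': {'$regex': keyword, '$options': 'i'}},
            {'$set': {'status': self.PROCESSING, 'timestamp': datetime.now()}}
        )
        if record:
            print(count, end=', ')
            print(record['主题'] + ": \n" + record['_id'])
            return record['_id']
        return None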
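
The article does not show the `MongoQueue` class itself, so the following is only a rough in-memory sketch of the claim-and-mark behaviour that `popFindings` relies on. `InMemoryQueue`, its status values, and its dict storage are assumptions made for illustration, not the author's implementation, and the `count` argument is left out because it is only used for printing.

```python
import re
from datetime import datetime

class InMemoryQueue:
    """Hypothetical stand-in for MongoQueue that keeps records in a dict."""
    # Assumed status codes; the real class may use different values.
    OUTSTANDING, PROCESSING, COMPLETE = 1, 2, 3

    def __init__(self):
        self.records = {}  # url -> record; the url plays the role of _id

    def push(self, url, title):
        # Skip duplicates, as a unique _id would
        if url not in self.records:
            self.records[url] = {'_id': url, '主题': title,
                                 'status': self.OUTSTANDING}

    def popFindings(self, keyword):
        # Claim the first outstanding record whose title matches the keyword
        pattern = re.compile(keyword, re.IGNORECASE)
        for record in self.records.values():
            if (record['status'] == self.OUTSTANDING
                    and pattern.search(record['主题'])):
                record['status'] = self.PROCESSING
                record['timestamp'] = datetime.now()
                return record['_id']
        return None

    def complete(self, url):
        self.records[url]['status'] = self.COMPLETE

queue = InMemoryQueue()
queue.push('http://www.1point3acres.com/bbs/thread-72344-1-1.html', '哥大ME介绍')
queue.push('http://www.1point3acres.com/bbs/thread-200340-1-1.html', 'CMU-MSP 介绍')

first = queue.popFindings('哥大')   # claims the matching record
second = queue.popFindings('哥大')  # nothing outstanding is left to match
```

A consequence of this design is that every match is consumed: once a record leaves the OUTSTANDING state, a later search for the same keyword no longer returns it unless the statuses are reset.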
    

    3. Implementing the local search engine (file name: searchEngine.py):

    from mongodb_queue import MongoQueue
    spider_queue = MongoQueue('1point3acres', 'schoolInfo')
    
    def popResult(keyword):
        count = 0
        url = spider_queue.popFindings(keyword, count)
    
        # popFindings returns None once no outstanding record matches
        while url is not None:
            # Mark the record just printed as complete, then claim the next one
            spider_queue.complete(url)
            count += 1
            url = spider_queue.popFindings(keyword, count)
    
    if __name__ == '__main__':
        keyword = '哥大'
        popResult(keyword)
    

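    Changing the search string in the `__main__` block lets you look up different threads and their links. For example, searching for '哥大' (short for Columbia University) returns the following results: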
    0, 深爱哥大,Columbia Statistics MA 一学期感触:
    http://www.1point3acres.com/bbs/thread-115164-1-1.html
    1, 哥大生统16fall新生入学超详细介绍~~~:
    http://www.1point3acres.com/bbs/thread-200879-1-1.html
    2, 哥大(Columbia University)在校生所了解到的哥大的CS【楠哥原创】:
    http://www.1point3acres.com/bbs/thread-21691-1-1.html
    3, 哥大16Fall CS:
    http://www.1point3acres.com/bbs/thread-202227-1-1.html
    4, 最近看好多同学报了哥大DS的录取,15 Fall的童鞋来讲讲亲身体验:
    http://www.1point3acres.com/bbs/thread-178456-1-1.html
    5, 哥大cs其实没几个人去花街(顺便部分去年就业数据):
    http://www.1point3acres.com/bbs/thread-125134-1-1.html
    6, 哥大16Fall,IEOR-OR新生:
    http://www.1point3acres.com/bbs/thread-200017-1-1.html
    7, 哥大cs新生 感受:
    http://www.1point3acres.com/bbs/thread-107494-1-1.html
    8, 哥大CS 15fall录取简单介绍:
    http://www.1point3acres.com/bbs/thread-141455-1-1.html
    9, 硕士选ucsd business analytics 还是 哥大applied analytics ???:
    http://www.1point3acres.com/bbs/thread-190680-1-1.html
    10, 求哥大的统计新生2016微信群~求找组织!!:
    http://www.1point3acres.com/bbs/thread-170747-1-1.html
    11, 哥大CS系选课实际情况介绍:
    http://www.1point3acres.com/bbs/thread-126931-1-1.html
    12, 哥大的留念:
    http://www.1point3acres.com/bbs/thread-563-1-1.html
    13, 芝加哥大学统计2013 Fall:
    http://www.1point3acres.com/bbs/thread-62631-1-1.html
    14, 哥大ME开学一周介绍:
    http://www.1point3acres.com/bbs/thread-104188-1-1.html
    15, 哥大CS&CE2013Fall就业不完全统计:
    http://www.1point3acres.com/bbs/thread-116677-1-1.html
    16, 哥大QMSS 13 fall录取情况etc.:
    http://www.1point3acres.com/bbs/thread-71060-1-1.html
    17, 哥大2013Fall CS MS意识流小记:
    http://www.1point3acres.com/bbs/thread-71166-1-1.html
    18, 关于哥大统计的一点个人分析以及转载资讯:
    http://www.1point3acres.com/bbs/thread-25020-1-1.html
    19, 哥大ME介绍:
    http://www.1point3acres.com/bbs/thread-72344-1-1.html
    20, 哥大admitted student Open House @ Manhattan:
    http://www.1point3acres.com/bbs/thread-11733-1-1.html
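
Result 13 in the list above is a University of Chicago thread: its title contains 芝加哥大学, and an unanchored regular expression matches '哥大' anywhere inside a title. The sketch below reproduces this with Python's `re` module on three titles taken from the output above. It assumes MongoDB's `$regex` behaves the same way for a plain literal like this, and `search_titles` is a hypothetical helper, not part of the article's code.

```python
import re

# Titles copied from the crawler and search output above
titles = [
    '哥大ME介绍',
    '芝加哥大学统计2013 Fall',
    'UCLA EE MS介绍(2016入学,PWE->SS)',
]

def search_titles(keyword, candidates):
    """Return the titles that contain the keyword, ignoring case."""
    # re.escape makes the keyword literal, so input such as 'C++'
    # is not misread as regex syntax
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return [title for title in candidates if pattern.search(title)]

matches = search_titles('哥大', titles)
for title in matches:
    print(title)
```

Both the Columbia thread and the Chicago thread match, because the keyword is only compared as a substring. Escaping is a separate precaution: a raw keyword such as 'C++' is not a valid regular expression on its own.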
