Analyzing Data with MongoDB


Author: esskeetit | Published 2018-03-28 09:27

1. Introduction

Explore data using MongoDB's aggregation framework.

2. The Twitter Dataset

The biggest advantage of keeping data in a database is that many databases have built-in analysis tools, which let us explore and understand the data.

MongoDB's built-in analysis tool comes in the form of the aggregation framework.

3. Aggregation Framework Example

Which user in our data set has produced the most tweets?
- group tweets by user
- count each user's tweets
- sort into descending order
- select user at top

Aggregation queries in MongoDB are issued with aggregate().
Aggregations are done with a pipeline.

from pymongo import MongoClient
import pprint

client=MongoClient("mongodb://localhost:27017")
db=client.twitter

def most_tweets():
    result = db.tweets.aggregate([
              {"$group": {"_id": "$user.screen_name",
                          "count": {"$sum": 1}}},
              {"$sort": {"count": -1}}])
    return result

if __name__ == '__main__':
    result = most_tweets()
    pprint.pprint(list(result))

4. The Aggregation Pipeline

aggregation pipeline:
collection --- stage 1 --- ··· --- stage N --- result

Because the stages use different kinds of operators, a stage may change the shape of the data, and sometimes that change is substantial.

The whole idea of the aggregation pipeline is that you use aggregation operators to construct stages that, in a series of steps, process your data so that it produces the results you need. Sometimes a single stage is enough; other times you need several stages. Which operators you use in a given stage depends on what you want to do: you are not wedded to using $group in the first stage or $sort in the last.
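Conceptually, a pipeline is just function composition over a stream of documents. A plain-Python sketch (not MongoDB; the stage functions and sample data are illustrative) of the group-then-sort pipeline above:

```python
# Plain-Python sketch of a two-stage pipeline: $group by screen_name, then $sort by count.
# MongoDB executes the real thing server-side; this only illustrates the data flow.
from collections import Counter

def group_by_screen_name(docs):
    # like {"$group": {"_id": "$user.screen_name", "count": {"$sum": 1}}}
    counts = Counter(d["user"]["screen_name"] for d in docs)
    return [{"_id": name, "count": n} for name, n in counts.items()]

def sort_by_count_desc(groups):
    # like {"$sort": {"count": -1}}
    return sorted(groups, key=lambda g: g["count"], reverse=True)

tweets = [{"user": {"screen_name": "a"}},
          {"user": {"screen_name": "b"}},
          {"user": {"screen_name": "a"}}]

result = sort_by_count_desc(group_by_screen_name(tweets))
# result[0] == {"_id": "a", "count": 2}
```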

5. Using $group

#!/usr/bin/env python
def get_db(db_name):
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    db = client[db_name]
    return db

def make_pipeline():
    # complete the aggregation pipeline
    pipeline = [{"$group":{"_id":"$source",
                           "count":{"$sum":1}}},
                {"$sort":{"count":-1}}]
    return pipeline

def tweet_sources(db, pipeline):
    return [doc for doc in db.tweets.aggregate(pipeline)]

if __name__ == '__main__':
    db = get_db('twitter')
    pipeline = make_pipeline()
    result = tweet_sources(db, pipeline)
    import pprint
    pprint.pprint(result[0])
    assert result[0] == {u'count': 868, u'_id': u'web'}

6. Aggregation Operators 1

$project (projection)
Use projection to select the fields you are interested in and pull out those values.
Projection also lets you perform a range of different computations on the fields found in a single document.

$match
Filters documents: set matching criteria, and only documents that satisfy them pass through.

$group

$sort

$skip
Skips an initial portion of the documents flowing into this stage of the pipeline.

$limit
The counterpart of $skip: limiting the documents passed to the next pipeline stage to 3 keeps 3 documents and filters out the rest.
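$skip and $limit map directly onto list slicing; a rough plain-Python sketch of their semantics (the sample documents are illustrative):

```python
# Sketch: $skip drops the first N input documents, $limit keeps only the first N.
docs = [{"n": i} for i in range(10)]

def skip(docs, n):
    return docs[n:]        # like {"$skip": n}

def limit(docs, n):
    return docs[:n]        # like {"$limit": n}

# {"$skip": 2} followed by {"$limit": 3} keeps documents 2, 3 and 4
kept = limit(skip(docs, 2), 3)
```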

7. Aggregation Operators 2

$unwind
In MongoDB a field value can be an array. When we $unwind an array field, one instance of the document is created for each value in the array.


The point of this is that in the next stage of the pipeline we can group on the individual values, so that our analysis is based on those values.
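What $unwind does can be sketched in plain Python (the `unwind` helper and sample document are illustrative, not MongoDB internals): one output document per array element, with the array field replaced by that element.

```python
# Sketch of $unwind on an array field: each array value yields its own document copy.
def unwind(doc, field):
    for value in doc[field]:
        copy = dict(doc)
        copy[field] = value   # replace the array with a single value
        yield copy

doc = {"_id": 1, "tags": ["a", "b", "c"]}
unwound = list(unwind(doc, "tags"))
# -> [{"_id": 1, "tags": "a"}, {"_id": 1, "tags": "b"}, {"_id": 1, "tags": "c"}]
```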

8. The $match Operator

Who has the highest followers-to-friends ratio?

from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017")
db = client.examples


def highest_ratio():
    result = db.tweets.aggregate([
        {"$match" : {"user.friends_count" : {"$gt" : 0},
                     "user.followers_count" : {"$gt" : 0}}},
        {"$project" : {"ratio" : {"$divide" : ["$user.followers_count",
                                               "$user.friends_count"]},
                       "screen_name":"$user.screen_name"}},
        {"$sort":{"ratio":-1}},
        {"$limit":1}])
    return result

9. The $project Operator

See the $project operator documentation.

Use $project to:
- include fields from the original document
- insert computed fields
- rename fields
- create fields that hold subdocuments
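The $project stage from the ratio example in section 8 can be mimicked in plain Python (the `project_ratio` helper and the sample tweet are illustrative): keep some fields, rename others, and insert a computed one.

```python
# Sketch of a $project stage: compute "ratio" with $divide and rename a nested field.
def project_ratio(doc):
    user = doc["user"]
    return {"ratio": user["followers_count"] / user["friends_count"],
            "screen_name": user["screen_name"]}

tweet = {"user": {"followers_count": 300, "friends_count": 100,
                  "screen_name": "example_user"}}
projected = project_ratio(tweet)
# -> {"ratio": 3.0, "screen_name": "example_user"}
```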

10. Using the $match and $project Operators

Write an aggregation query that answers the following question:
Of the users in the Brasilia time zone who have tweeted 100 times or more, who has the most followers?
The following hints will help you solve this problem:
You can find the time zone in the "time_zone" field of each tweet's user object.
You can find each user's tweet count in the "statuses_count" field.
Note that you will need to create "followers", "screen_name" and "tweets" fields.
Only modify the "make_pipeline" function so that it creates and returns an aggregation pipeline that can be passed to the MongoDB aggregate function. As in the examples in this lesson, the aggregation pipeline should be a list of one or more dictionary objects.

#!/usr/bin/env python
"""
def get_db(db_name):
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    db = client[db_name]
    return db

def make_pipeline():
    # complete the aggregation pipeline
    pipeline = [
        {"$match":{"user.time_zone":"Brasilia",
                   "user.statuses_count":{"$gte":100}}},
        {"$project":{"followers":"$user.followers_count",
                     "screen_name":"$user.screen_name",
                     "tweets":"$user.statuses_count"}},
        {"$sort":{"followers":-1}},
        {"$limit":1}]
    return pipeline

def aggregate(db, pipeline):
    return [doc for doc in db.tweets.aggregate(pipeline)]

if __name__ == '__main__':
    db = get_db('twitter')
    pipeline = make_pipeline()
    result = aggregate(db, pipeline)
    import pprint
    pprint.pprint(result)

11. The $unwind Operator

Who includes the most user mentions?

def user_mentions():
    result = db.tweets.aggregate([
            {"$unwind":"$entities.user_mentions"},
            {"$group":{"_id":"$user.screen_name",
                       "count":{"$sum":1}}},
            {"$sort":{"count":-1}},
            {"$limit":1}])
    return result

12. Exercise: Using the $unwind Operator

For this exercise we return to the cities infobox dataset. We want you to answer the following question: which region of India contains the most cities? (Make sure to store the city count in a field called "count"; see the assertions at the end of the script.)
One thing to note about the city data: the "isPartOf" field contains an array of the regions in which a city can be found. See the example document in the instructor notes below.
Only modify the "make_pipeline" function so that it creates and returns an aggregation pipeline that can be passed to the MongoDB aggregate function. As in the examples in this lesson, the aggregation pipeline should be a list of one or more dictionary objects.

#!/usr/bin/env python
def get_db(db_name):
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    db = client[db_name]
    return db

def make_pipeline():
    # complete the aggregation pipeline
    pipeline = [
        {"$match":{"country":"India"}},
        {"$unwind":"$isPartOf"},
        {"$group":{"_id":"$isPartOf",
                   "count":{"$sum":1}}},
        {"$sort":{"count":-1}},
        {"$limit":1}
        ]
    return pipeline
    


def aggregate(db, pipeline):
    return [doc for doc in db.cities.aggregate(pipeline)]

if __name__ == '__main__':
    db = get_db('examples')
    pipeline = make_pipeline()
    result = aggregate(db, pipeline)
    print "Printing the first result:"
    import pprint
    pprint.pprint(result[0])
 

13. $group Accumulator Operators

$group operators
$sum
$first
$last
$min
$max
$avg
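A plain-Python sketch of what these accumulators compute within one group (the sample values are illustrative; in MongoDB each accumulator is applied to the values a grouped field takes across the documents of the group):

```python
# Sketch: each $group accumulator reduces the values within one group to a single value.
values = [4, 8, 6]   # e.g. the retweet_count values of one hashtag's group

acc = {"$sum": sum(values),                 # 18
       "$avg": sum(values) / len(values),   # 6.0
       "$min": min(values),                 # 4
       "$max": max(values),                 # 8
       "$first": values[0],                 # 4 (first document seen)
       "$last": values[-1]}                 # 6 (last document seen)
```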

Compute, for each hashtag, the average number of times tweets using that hashtag were retweeted:

def hashtag_retweet_avg():
    result = db.tweets.aggregate([
            {"$unwind" : "$entities.hashtags"},
            {"$group" : {"_id" : "$entities.hashtags.text",
                               "retweet_avg":{"$avg":"$retweet_count"}}},
            {"$sort":{"retweet_avg":-1}}])
    return result

Two operators for dealing with arrays:
$push
$addToSet

$addToSet accumulates values into an array, but treats that array as a set: it will not add the same value to the accumulated array more than once.
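The difference between $push and $addToSet, sketched in plain Python (the helper functions and sample tags are illustrative):

```python
# Sketch: $push appends every value; $addToSet only appends values not already present.
def push(arr, value):
    arr.append(value)

def add_to_set(arr, value):
    if value not in arr:
        arr.append(value)

pushed, unique = [], []
for tag in ["mongodb", "python", "mongodb"]:
    push(pushed, tag)
    add_to_set(unique, tag)
# pushed -> ["mongodb", "python", "mongodb"]; unique -> ["mongodb", "python"]
```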

Collect, grouped by user screen name, the distinct hashtags each user has used, ignoring hashtags a user used multiple times in their tweets:

def unique_hashtags_by_user():
    result = db.tweets.aggregate([
            {"$unwind" : "$entities.hashtags"},
            {"$group" : {"_id" : "$user.screen_name",
                               "unique_hashtags":{
                                   "$addToSet":"$entities.hashtags.text"}}},
            {"$sort":{"_id":-1}}])
    return result

Using $addToSet ensures that no matter how many times a user repeats a hashtag, it appears in the unique_hashtags array only once.

14. Exercise: Using $push

$push is similar to $addToSet. The difference is that $push aggregates all values into the array, not just unique ones.
Write an aggregation query that counts the number of tweets for each user. In the same $group stage, use $push to collect all the tweet texts for each user. Include only the five users with the most tweets.
Your final documents should contain only these fields:
"_id" (the user's screen name),
"count" (the number of the user's tweets),
"tweet_texts" (a list of the user's tweets).
Only modify the "make_pipeline" function so that it creates and returns an aggregation pipeline that can be passed to the MongoDB aggregate function. As in the examples in this lesson, the aggregation pipeline should be a list of one or more dictionary objects.

def make_pipeline():
    # complete the aggregation pipeline
    pipeline = [
        {"$group":{"_id":"$user.screen_name",
                   "tweet_texts":{"$push":"$text"},
                   "count":{"$sum":1}}},
        {"$sort":{"count":-1}},
        {"$limit":5}]
    return pipeline

15. Using a Given Operator in Multiple Stages

Many tasks require using the same operator in multiple stages.
Who has mentioned the most unique users?

def unique_user_mentions():
    result = db.tweets.aggregate([
            {"$unwind": "$entities.user_mentions"},
            {"$group": {
                 "_id": "$user.screen_name",
                 "mset": {
                     "$addToSet": "$entities.user_mentions.screen_name"
                 }}},
            {"$unwind": "$mset"},
            {"$group": {"_id": "$_id", "count": {"$sum": 1}}},
            {"$sort": {"count": -1}},
            {"$limit": 10}])
    return result

16. Exercise: Using the Same Operator in Multiple Stages

In the previous exercise we looked at the cities dataset and asked which region of India contains the most cities. In this exercise, we'd like you to answer a related question about the regions of India: what is the average city population of India's regions? You will need to first calculate the average population of the cities in each region, and then the average of the regional averages.
Hint: if you want to accumulate values from all input documents into a single group stage, you can use a constant as the value of the "_id" field. For example:
{ "$group" : {"_id" : "India Regional City Population Average", ... } }
Only modify the "make_pipeline" function so that it creates and returns an aggregation pipeline that can be passed to the MongoDB aggregate function. As in the examples in this lesson, the aggregation pipeline should be a list of one or more dictionary objects.

def make_pipeline():
    # complete the aggregation pipeline
    pipeline = [
        {"$match":{"country":"India"}},
        {"$unwind":"$isPartOf"},
        {"$group":{"_id":"$isPartOf",
                   "avgp":{"$avg":"$population"}}},
        {"$group":{"_id":"India Regional City Population avg",
                   "avg":{"$avg":"$avgp"}}}]
    return pipeline

17. Indexes

In MongoDB, an index is an ordered list of keys.
Indexes take up disk space, and they take time to update.
You should not create an index for every possible way of querying a collection; instead, create indexes for the queries you are most likely to run against it.
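The "ordered list of keys" idea is why indexed lookups are fast: an ordered list supports binary search. A toy plain-Python sketch of the principle (not MongoDB internals; the collection and helper are illustrative):

```python
# Toy sketch: an index as a sorted list of (key, position) pairs searched with bisect.
import bisect

docs = [{"name": "Chicago"}, {"name": "Austin"}, {"name": "Boston"}]
index = sorted((d["name"], i) for i, d in enumerate(docs))  # built once: costs space and time
keys = [k for k, _ in index]

def find(name):
    j = bisect.bisect_left(keys, name)   # O(log n) instead of scanning every document
    if j < len(keys) and keys[j] == name:
        return docs[index[j][1]]
    return None

find("Boston")
# -> {"name": "Boston"}
```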

18. Using Indexes

db.nodes.ensureIndex({"tg":1})
db.nodes.find({"tg":{"k":"name","v":"Giordanos"}}).pretty()

19. Geospatial Indexes

MongoDB's geospatial indexing lets us find places near a given location.
To create a geospatial index in MongoDB, you need to know three things:

  • store the location as [x, y]
  • create the index: ensureIndex('location': ...)
  • query with $near
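Conceptually, $near returns documents ordered by distance from a query point. A naive plain-Python sketch of that ordering (illustrative sample data; MongoDB uses the geospatial index rather than sorting every document):

```python
# Naive sketch of what a $near query computes: order locations by distance to a point.
import math

places = [{"name": "a", "location": [0, 0]},
          {"name": "b", "location": [3, 4]},
          {"name": "c", "location": [1, 1]}]

def near(places, point):
    def dist(p):
        # planar Euclidean distance; real geo queries account for the Earth's surface
        return math.hypot(p["location"][0] - point[0], p["location"][1] - point[1])
    return sorted(places, key=dist)

ordered = [p["name"] for p in near(places, [0, 0])]
# -> ["a", "c", "b"]
```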

Problem Set

1. The Most Common City Name

Use an aggregation query to answer the following question.
What is the most common city name in our cities collection?
At first you may find that None is the most frequently occurring city name. That actually means that many cities do not have a name field at all. It is strange that such documents appear in this collection; depending on your situation, you may want to clean the data further.
To answer this question properly, we should ignore cities that have no name specified. As a hint, think about which pipeline operator lets us filter the input documents, and how we can test whether a field exists.
Only modify the "make_pipeline" function so that it creates and returns an aggregation pipeline that can be passed to the MongoDB aggregate function. As in the examples in this lesson, the aggregation pipeline should be a list of one or more dictionary objects.

def get_db(db_name):
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    db = client[db_name]
    return db

def make_pipeline():
    # complete the aggregation pipeline
    pipeline = [ 
        {"$match":{"name":{"$ne":None}}},
        {"$group":{"_id":"$name",
                   "count":{"$sum":1}}},
        {"$sort":{"count":-1}},
        {"$limit":1}
        ]
    return pipeline

def aggregate(db, pipeline):
    return [doc for doc in db.cities.aggregate(pipeline)]


if __name__ == '__main__':
    # The following statements will be used to test your code by the grader.
    # Any modifications to the code past this point will not be reflected by
    # the Test Run.
    db = get_db('examples')
    pipeline = make_pipeline()
    result = aggregate(db, pipeline)
    import pprint
    pprint.pprint(result[0])
    assert len(result) == 1
    assert result[0] == {'_id': 'Shahpur', 'count': 6}

2. Region Cities

#!/usr/bin/env python
def get_db(db_name):
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    db = client[db_name]
    return db

def make_pipeline():
    # complete the aggregation pipeline
    pipeline = [
        {"$match":{"country":"India",
                   "lon":{"$gte":75,"$lte":80}}},
        {"$unwind":"$isPartOf"},
        {"$group":{"_id":"$isPartOf",
                   "count":{"$sum":1}}},
        {"$sort":{"count":-1}},
        {"$limit":1}
        ]
    return pipeline

def aggregate(db, pipeline):
    return [doc for doc in db.cities.aggregate(pipeline)]

if __name__ == '__main__':
    # The following statements will be used to test your code by the grader.
    # Any modifications to the code past this point will not be reflected by
    # the Test Run.
    db = get_db('examples')
    pipeline = make_pipeline()
    result = aggregate(db, pipeline)
    import pprint
    pprint.pprint(result[0])
    assert len(result) == 1
    assert result[0]["_id"] == 'Tamil Nadu'
    assert result[0]["count"] == 424

3. Average Population

#!/usr/bin/env python
"""
Use an aggregation query to answer the following question. 

Extrapolating from an earlier exercise in this lesson, find the average
regional city population for all countries in the cities collection. What we
are asking here is that you first calculate the average city population for each
region in a country and then calculate the average of all the regional averages
for a country.
  As a hint, _id fields in group stages need not be single values. They can
also be compound keys (documents composed of multiple fields). You will use the
same aggregation operator in more than one stage in writing this aggregation
query. I encourage you to write it one stage at a time and test after writing
each stage.

Please modify only the 'make_pipeline' function so that it creates and returns
an aggregation  pipeline that can be passed to the MongoDB aggregate function.
As in our examples in this lesson, the aggregation pipeline should be a list of
one or more dictionary objects. Please review the lesson examples if you are
unsure of the syntax.

Your code will be run against a MongoDB instance that we have provided. If you
want to run this code locally on your machine, you have to install MongoDB,
download and insert the dataset. For instructions related to MongoDB setup and
datasets please see Course Materials.

Please note that the dataset you are using here is a different version of the
cities collection provided in the course materials. If you attempt some of the
same queries that we look at in the problem set, your results may be different.
"""

def get_db(db_name):
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    db = client[db_name]
    return db

def make_pipeline():
    # complete the aggregation pipeline
    pipeline = [
        {"$unwind":"$isPartOf"},
        {"$group":{"_id":{"country":"$country",
                         "region":"$isPartOf"},
                   "avgPopulation":{"$avg":"$population"}}},
        {"$group":{"_id":"$_id.country",
                   "avgRegionalPopulation":{"$avg":"$avgPopulation"}}}]
    return pipeline

def aggregate(db, pipeline):
    return [doc for doc in db.cities.aggregate(pipeline)]

if __name__ == '__main__':
    # The following statements will be used to test your code by the grader.
    # Any modifications to the code past this point will not be reflected by
    # the Test Run.
    db = get_db('examples')
    pipeline = make_pipeline()
    result = aggregate(db, pipeline)
    import pprint
    if len(result) < 150:
        pprint.pprint(result)
    else:
        pprint.pprint(result[:100])
    key_pop = 0
    for country in result:
        if country["_id"] == 'Lithuania':
            assert country["_id"] == 'Lithuania'
            assert abs(country["avgRegionalPopulation"] - 14750.784447977203) < 1e-10
            key_pop = country["avgRegionalPopulation"]
    assert {'_id': 'Lithuania', 'avgRegionalPopulation': key_pop} in result

Source: https://www.haomeiwen.com/subject/zamxzxtx.html