Week3 hw2: Draw a Histogram in j

Author: 快要没时间了 | Published: 2016-05-30 09:22

    In the previous study (week 2 homework), we already collected all the item info from ganji.com. Now we are going to draw a histogram in Jupyter Notebook with the charts module.

    Target

    Import a JSON File into MongoDB

    Suppose there is a JSON file like this:

    [ 
    {
    "title":"Introduction",
    "url":"http://courses.engr.illinois.edu/cs598jhm/sp2013/Slides/Lecture01.pdf",
    "description":""
    }
    ,
    {
    "title":"Conjugate priors",
    "url":"http://courses.engr.illinois.edu/cs598jhm/sp2013/Slides/Lecture02.pdf",
    "description":"T. Griffiths and A. Yuille A primer on probabilistic inference; Chapters 8 and 9 of D. Barber Bayesian Reasoning and Machine Learning. See also this diagram of conjugate prior relationships"
    }
    ]
    

    It can be imported into MongoDB as a collection in two steps.

    1. Create an empty collection (in the mongo shell).

    db.createCollection('newCollect')

    2. Use mongoimport (in the terminal).

    mongoimport --db databaseName --collection newCollect --file /home/tmp/course_temp.json --jsonArray

    It can also be written with the short options:

    mongoimport -d dbName -c collectName --jsonArray path/file.json
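The same import could also be done from Python rather than from the terminal. The following is only a minimal sketch: `load_json_array` and `import_json_array` are hypothetical helper names, the database, collection, and path are placeholders, and it assumes the collection object provides pymongo's `insert_many`.

```python
import json


def load_json_array(path):
    """Read a file holding one JSON array and return it as a list of dicts."""
    with open(path, encoding='utf-8') as f:
        docs = json.load(f)
    if not isinstance(docs, list):
        raise ValueError('expected a top-level JSON array')
    return docs


def import_json_array(collection, path):
    """Insert every document of the JSON array into the given collection."""
    docs = load_json_array(path)
    if docs:  # insert_many raises an error on an empty list
        collection.insert_many(docs)
    return len(docs)


# Usage (needs a running MongoDB server; names are placeholders):
# import pymongo
# client = pymongo.MongoClient('localhost', 27017)
# import_json_array(client['dbName']['newCollect'], '/home/tmp/course_temp.json')
```

For a one-off load, mongoimport is simpler. A Python version is mainly useful when the records need cleaning before they are stored.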

    Data Cleaning

    Type "jupyter notebook" in the terminal to start the notebook server.

    jupyter notebook

    Now we can open Jupyter in a browser such as Safari (localhost:8888).
    Here is a sample record. The natural way to classify these items is by category, which can be read from the url value.

    {'title': '【图】很新的海信冰箱 - 西城西单二手家电 - 北京58同城', 'price': 260, 'look': '-', 'area': ['西城', '西单'], 'time': 0, '_id': ObjectId('5698f525a98063dbe6e91ca8'), 'cates': ['北京58同城', '北京二手市场', '北京二手家电', '北京二手冰箱'], 'pub_date': '2016.01.13', 'url': 'http://bj.58.com/jiadian/24652878967613x.shtml'}

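    import pymongo
    import charts
    
    # Connect to the local MongoDB server and select the collection
    client = pymongo.MongoClient('localhost', 27017)
    myDB = client['ganjiDB']
    myCollection = myDB['bjGanji']
    
    # Print the category segment of the url for the first 200 items
    for i in myCollection.find().limit(200):
        url = i['url']
        cate = url.split('/')[3]  # e.g. 'jiadian' in http://bj.58.com/jiadian/...
        print(cate)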
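The split on '/' works because every url in this collection has the same shape. A slightly more defensive variant, shown here only as a sketch, parses the url first and takes the first path segment. `category_from_url` is a hypothetical helper name, not part of the homework code.

```python
from urllib.parse import urlparse


def category_from_url(url):
    """Return the first path segment of the url, or None if there is none."""
    segments = [s for s in urlparse(url).path.split('/') if s]
    return segments[0] if segments else None


print(category_from_url('http://bj.58.com/jiadian/24652878967613x.shtml'))
# prints: jiadian
```

Unlike url.split('/')[3], this returns None instead of raising IndexError when a url has no path at all.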
    
    How to classify

    Since that works well, the rest is much easier.
    Can we just use the .find() method to select all items whose url contains the keyword?
    I checked a MongoDB cookbook. The basic conditional operators it covers, such as $gte and $lte, compare whole values and do not match part of a string (MongoDB does have a $regex operator for that, but it is not used in this homework).
    So I use a list and a set to count how many times each category occurs.

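    # Collect the category of every item, then reduce to the distinct names
    cate_list = []
    for each in myCollection.find():
        url = each['url']
        cate = url.split('/')[3]
        cate_list.append(cate)
    cate_index = set(cate_list)
    print(cate_index)
    # Total number of items, number of distinct categories
    print(len(cate_list), len(cate_index))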
    

    The result is:

    {'yingyou', 'ershoujiaju', 'fushi', 'meirong', 'ershoushebei', 'bangong', 'pingbandiannao', 'tushu', 'tiaozao', 'wenti', 'shouji', 'shuma', 'diannao', 'jiadian', 'bijibendiannao'}
    86850 15
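Coming back to the earlier question of selecting items by a keyword in the url with .find(): MongoDB's $regex query operator can express that. The sketch below only builds the filter document. `category_query` is a hypothetical helper, and the pattern assumes urls shaped like the sample record (scheme, host, then the category as the first path segment).

```python
import re


def category_query(cate):
    """Build a MongoDB filter matching urls whose first path segment is cate."""
    # ^https?://  scheme, [^/]+  host, then the escaped category and a slash
    pattern = r'^https?://[^/]+/' + re.escape(cate) + '/'
    return {'url': {'$regex': pattern}}


query = category_query('jiadian')
print(query)

# Usage (needs a running MongoDB server and pymongo 3.7+ for count_documents):
# myCollection.count_documents(query)
```

Anchoring the pattern to the first path segment matters here. A plain substring match on 'diannao' would also pick up 'pingbandiannao' and 'bijibendiannao'.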

    Draw a Histogram

    import charts

    series = []
    for each in cate_index:
        # One column series per category; 'data' holds its post count
        dat = {
            'name':each,
            'data':[cate_list.count(each)],
            'type':'column'
        }
        print(dat)
        series.append(dat)
    
    # Chart-level settings, here only the title
    options = {
        'title':{'text':'Post Numbers in each Category'}
    }
    print(options)
    

    The result is:

    {'name': 'yingyou', 'type': 'column', 'data': [7819]}
    {'name': 'ershoujiaju', 'type': 'column', 'data': [4891]}
    {'name': 'fushi', 'type': 'column', 'data': [9990]}
    {'name': 'meirong', 'type': 'column', 'data': [2794]}
    {'name': 'ershoushebei', 'type': 'column', 'data': [1639]}
    {'name': 'bangong', 'type': 'column', 'data': [6461]}
    {'name': 'pingbandiannao', 'type': 'column', 'data': [1525]}
    {'name': 'tushu', 'type': 'column', 'data': [4221]}
    {'name': 'tiaozao', 'type': 'column', 'data': [1143]}
    {'name': 'wenti', 'type': 'column', 'data': [9510]}
    {'name': 'shouji', 'type': 'column', 'data': [2822]}
    {'name': 'shuma', 'type': 'column', 'data': [7666]}
    {'name': 'diannao', 'type': 'column', 'data': [4855]}
    {'name': 'jiadian', 'type': 'column', 'data': [18863]}
    {'name': 'bijibendiannao', 'type': 'column', 'data': [2651]}
    {'title': {'text': 'Post Numbers in each Category'}}

    series and options are the two arguments passed to charts.plot; their structure resembles a JavaScript chart configuration.

    charts.plot(series=series, show='inline', options=options)
    
    QQ20160530-1.png
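The series list above calls cate_list.count() once per category, so the full list is scanned 15 times. As an alternative sketch, collections.Counter counts everything in one pass and can also order the columns by size. `build_series` is a hypothetical helper, and the sample list is made up to show the shape of the result.

```python
from collections import Counter


def build_series(categories):
    """Return one column-series dict per category, largest count first."""
    counts = Counter(categories)  # one pass over the list
    return [
        {'name': name, 'data': [count], 'type': 'column'}
        for name, count in counts.most_common()
    ]


# Tiny made-up sample, only to show the shape of the result
sample = ['jiadian', 'shouji', 'jiadian', 'tushu', 'jiadian', 'shouji']
for item in build_series(sample):
    print(item)

# In the notebook, with cate_list from above:
# charts.plot(series=build_series(cate_list), show='inline', options=options)
```

Sorting by count makes the largest categories easier to spot in the chart. The dict layout is the same as in the original loop.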

    Appendix

    chartsDemo

    Article link: https://www.haomeiwen.com/subject/ygmqdttx.html