In the previous study (week 2 homework), we collected all the item info from ganji.com. Now we are going to draw a histogram in Jupyter Notebook with the charts module.
Target
Import a JSON File into MongoDB
Suppose there is a JSON file like this:
[
{
"title":"Introduction",
"url":"http://courses.engr.illinois.edu/cs598jhm/sp2013/Slides/Lecture01.pdf",
"description":""
}
,
{
"title":"Conjugate priors",
"url":"http://courses.engr.illinois.edu/cs598jhm/sp2013/Slides/Lecture02.pdf",
"description":"T. Griffiths and A. Yuille A primer on probabilistic inference; Chapters 8 and 9 of D. Barber Bayesian Reasoning and Machine Learning. See also this diagram of conjugate prior relationships"
}
]
It can be imported into MongoDB as a collection in two steps.
- Create an empty collection. (mongo shell)
db.createCollection('newCollect')
- Use mongoimport. (in Terminal)
mongoimport --db databaseName --collection newCollect --file /home/tmp/course_temp.json --jsonArray
It can also be written in short form (keep --jsonArray when the file is a JSON array, as it is here):
mongoimport -d dbName -c collectName --jsonArray path/file.json
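Before running mongoimport with --jsonArray, it can help to confirm that the file really is a JSON array of objects. The following is a minimal sketch using only Python's standard json module; load_json_array is a hypothetical helper name, and the sample text is a shortened copy of the file shown above (the second description is left empty for brevity).

```python
import json

def load_json_array(text):
    """Parse text and confirm it is a JSON array of objects (hypothetical helper)."""
    docs = json.loads(text)
    if not isinstance(docs, list):
        raise ValueError("top-level JSON value is not an array")
    if not all(isinstance(d, dict) for d in docs):
        raise ValueError("every array element must be a JSON object")
    return docs

# Shortened copy of the sample file; the second description is emptied for brevity.
sample = '''
[
  {"title": "Introduction",
   "url": "http://courses.engr.illinois.edu/cs598jhm/sp2013/Slides/Lecture01.pdf",
   "description": ""},
  {"title": "Conjugate priors",
   "url": "http://courses.engr.illinois.edu/cs598jhm/sp2013/Slides/Lecture02.pdf",
   "description": ""}
]
'''

docs = load_json_array(sample)
print(len(docs), [d["title"] for d in docs])
```

For a real file, the text would come from reading the file from disk. As an alternative to mongoimport, the resulting list could presumably be passed to pymongo's insert_many on a collection, though that path is not what this post uses.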
Data Cleaning
Run "jupyter notebook" in a terminal to start the notebook server.
jupyter notebook
Now we can use Jupyter in Safari (localhost:8888).
Here is a sample record. The url value gives an obvious way to classify the items.
{'title': '【图】很新的海信冰箱 - 西城西单二手家电 - 北京58同城', 'price': 260, 'look': '-', 'area': ['西城', '西单'], 'time': 0, '_id': ObjectId('5698f525a98063dbe6e91ca8'), 'cates': ['北京58同城', '北京二手市场', '北京二手家电', '北京二手冰箱'], 'pub_date': '2016.01.13', 'url': 'http://bj.58.com/jiadian/24652878967613x.shtml'}
import pymongo
import charts
client = pymongo.MongoClient('localhost',27017)
myDB = client['ganjiDB']
myCollection = myDB['bjGanji']
for i in myCollection.find().limit(200):
    url = i['url']
    cate = url.split('/')[3]
    print(cate)
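The split('/')[3] trick relies on every url having the form scheme://host/category/... and raises an IndexError on a url without a path. A slightly more defensive sketch, assuming the category is always the first path segment, could use urllib.parse; category_from_url is a hypothetical helper name, not part of the original notebook.

```python
from urllib.parse import urlparse

def category_from_url(url):
    """Return the first path segment of url, or None if the path is empty."""
    segments = [s for s in urlparse(url).path.split('/') if s]
    return segments[0] if segments else None

# The url from the sample record above
print(category_from_url('http://bj.58.com/jiadian/24652878967613x.shtml'))  # jiadian
# A url with no category segment
print(category_from_url('http://bj.58.com/'))  # None
```

Records that return None can then be skipped or logged instead of stopping the loop.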
How to classify
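Since this works well, the rest is much easier.
Can we just use the .find() method to select all items whose url contains the keyword?
So I checked a MongoDB cookbook. The basic conditional operators I found there, such as $gte and $lte, are range comparisons and do not match a keyword inside a string. (MongoDB does provide a $regex operator for pattern matching, but I did not use it here.)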
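For reference, a keyword match on the url field can be expressed with MongoDB's $regex operator. The sketch below only builds the filter document and checks the pattern locally with Python's re module, so it runs without a database. category_filter is a hypothetical helper, and the pattern assumes the category is the first path segment of the url.

```python
import re

def category_filter(cate):
    """Build a MongoDB filter matching urls whose first path segment is cate."""
    pattern = r'^https?://[^/]+/' + re.escape(cate) + r'/'
    return {'url': {'$regex': pattern}}

query = category_filter('jiadian')
print(query)

# Check the same pattern locally against the url from the sample record.
sample_url = 'http://bj.58.com/jiadian/24652878967613x.shtml'
print(bool(re.search(query['url']['$regex'], sample_url)))

# With a live connection, the filter could be passed to pymongo, for example:
#   myCollection.count_documents(category_filter('jiadian'))
```

This would need one query per category, so it is most useful when only a single category is of interest.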
Instead, I use a list and a set to count how many times each category occurs.
cate_list = []
for each in myCollection.find():
    url = each['url']
    cate = url.split('/')[3]
    cate_list.append(cate)
cate_index = set(cate_list)
print(cate_index)
print(len(cate_list),len(cate_index))
The result is here:
{'yingyou', 'ershoujiaju', 'fushi', 'meirong', 'ershoushebei', 'bangong', 'pingbandiannao', 'tushu', 'tiaozao', 'wenti', 'shouji', 'shuma', 'diannao', 'jiadian', 'bijibendiannao'}
86850 15
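The list-plus-set approach can also be written with collections.Counter from the standard library, which tallies every category in a single pass. This is a minimal sketch; the short cate_list below is made-up stand-in data, not the real 86850 records.

```python
from collections import Counter

# Hypothetical stand-in for the cate_list built from the collection
cate_list = ['jiadian', 'shouji', 'jiadian', 'tushu', 'jiadian', 'shouji']

# Counter maps each distinct category to its number of occurrences
cate_counts = Counter(cate_list)

print(cate_counts.most_common())          # categories ordered by count, largest first
print(len(cate_list), len(cate_counts))   # total records, distinct categories
```

Here len(cate_counts) plays the role of len(cate_index), and cate_counts[name] replaces a separate cate_list.count(name) call for each category.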
Draw a Histogram
import charts
series = []
for each in cate_index:
    dat = {
        'name': each,
        'data': [cate_list.count(each)],
        'type': 'column'
    }
    print(dat)
    series.append(dat)
options = {
    'title': {'text': 'Post Numbers in each Category'}
}
print(options)
The result is here:
{'name': 'yingyou', 'type': 'column', 'data': [7819]}
{'name': 'ershoujiaju', 'type': 'column', 'data': [4891]}
{'name': 'fushi', 'type': 'column', 'data': [9990]}
{'name': 'meirong', 'type': 'column', 'data': [2794]}
{'name': 'ershoushebei', 'type': 'column', 'data': [1639]}
{'name': 'bangong', 'type': 'column', 'data': [6461]}
{'name': 'pingbandiannao', 'type': 'column', 'data': [1525]}
{'name': 'tushu', 'type': 'column', 'data': [4221]}
{'name': 'tiaozao', 'type': 'column', 'data': [1143]}
{'name': 'wenti', 'type': 'column', 'data': [9510]}
{'name': 'shouji', 'type': 'column', 'data': [2822]}
{'name': 'shuma', 'type': 'column', 'data': [7666]}
{'name': 'diannao', 'type': 'column', 'data': [4855]}
{'name': 'jiadian', 'type': 'column', 'data': [18863]}
{'name': 'bijibendiannao', 'type': 'column', 'data': [2651]}
{'title': {'text': 'Post Numbers in each Category'}}
series and options are the two arguments that charts expects. Their structure looks like a JavaScript chart configuration.
charts.plot(series=series, show='inline', options=options)
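Because cate_index is a set, the columns come out in arbitrary order. One possible variation is to sort the series by count before plotting. The sketch below uses a subset of the counts printed above and builds the same kind of series list. Whether charts draws the columns in list order is an assumption here.

```python
# Counts copied from the printed result above (a subset, for brevity)
counts = {
    'yingyou': 7819,
    'fushi': 9990,
    'wenti': 9510,
    'jiadian': 18863,
    'tiaozao': 1143,
}

# One series entry per category, largest count first
series = [
    {'name': name, 'data': [n], 'type': 'column'}
    for name, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
]

for s in series:
    print(s)
```

The sorted list can be passed to charts.plot in the same way as the unsorted one.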
[Figure: QQ20160530-1.png, the resulting column chart "Post Numbers in each Category"]