Week2 hw1: MongoDB

Author: 快要没时间了 | Published: 2016-05-29 14:56

    The relationship in MongoDB

    MongoDB can be compared to a set of MS Excel files.
    Each database (db) is like a separate .xls file, and each collection is like a table in that file.
    Each collection, in turn, holds many items, and every item is made up of key-value pairs.
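
    The analogy can be pictured with plain Python containers. This is only an illustrative sketch: the names server, myDatabase and itemList and the sample rows are made up, and a real MongoDB server stores its data on disk rather than in Python dicts.

```python
# A MongoDB server modeled with plain Python containers (illustration only).
server = {
    'myDatabase': {                                  # database   ~ one .xls file
        'itemList': [                                # collection ~ one table in that file
            {'title': 'room A', 'price': 300},       # item       ~ one row of key-value pairs
            {'title': 'room B', 'price': 650, 'link': 'http://example.com/b'},
        ],
    },
}

# Reaching one value walks the same three levels: database, collection, item.
first_item = server['myDatabase']['itemList'][0]
print(first_item['title'])
```

    Unlike the rows of a spreadsheet, the items in one collection do not have to share the same keys, which is why the second sample item can carry an extra 'link' field.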

    Start MongoDB

    Type mongod in a terminal. By default the server runs on localhost and listens on port 27017.
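
    A quick way to check from Python whether anything is listening on that port is a plain TCP connection attempt. This is a minimal sketch using only the standard library; the helper name is_port_open is made up here, and a successful connection only shows that some process accepts connections on the port, not that it is mongod.

```python
import socket


def is_port_open(host='localhost', port=27017, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError when the connection is refused or times out
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print('port 27017 reachable:', is_port_open())
```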

    Basic commands in the terminal

    Open another terminal tab and type mongo to enter the mongo console, which connects to the server running on your computer.

    Check the database

    show dbs

    This command lists the databases stored on your disk and shows how much space each of them takes.

    use xx

    xx is a db name. This command switches the current database to the one you select.

    show tables

    Prints all the tables (collections) in this db.
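
    The same three shell commands have rough counterparts in pymongo. The sketch below assumes a reasonably recent pymongo (list_database_names and list_collection_names); the helper name describe_server is made up for this example.

```python
def describe_server(client):
    """Return {database name: [collection names]} for a pymongo.MongoClient."""
    summary = {}
    for db_name in client.list_database_names():        # roughly `show dbs`
        db = client[db_name]                             # roughly `use db_name`
        summary[db_name] = db.list_collection_names()    # roughly `show tables`
    return summary


# Usage (needs the pymongo package and a running mongod):
#   import pymongo
#   print(describe_server(pymongo.MongoClient('localhost', 27017)))
```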

    Back up a table (collection)

    Here is an example of how to back up the collection "xxCollect" into "bakCollectionName". Note that copyTo() exists only in older MongoDB releases; it was removed in MongoDB 4.2.

    1. Create an empty collection.

    db.createCollection('bakCollectionName')

    2. Copy your collection into the backup collection.

    db.xxCollect.copyTo('bakCollectionName')
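
    A backup can also be made from Python by reading every record and inserting it into a second collection. This is a minimal sketch, not a replacement for proper backup tools: the helper name backup_collection is made up, db is assumed to be a pymongo database object, and all records are held in memory at once, so it only suits small collections.

```python
def backup_collection(db, source_name, backup_name):
    """Copy every record from db[source_name] into db[backup_name]."""
    docs = list(db[source_name].find())
    if docs:  # insert_many rejects an empty list, so skip the call in that case
        db[backup_name].insert_many(docs)
    return len(docs)


# Usage (needs the pymongo package and a running mongod):
#   import pymongo
#   db = pymongo.MongoClient('localhost', 27017)['myDatabase']
#   backup_collection(db, 'xxCollect', 'bakCollectionName')
```

    The copies keep their original _id values, so running the backup a second time into the same collection raises duplicate-key errors instead of silently doubling the data.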

    Import a JSON file into MongoDB

    Suppose there is a JSON file like this:

    [ 
    {
    "title":"Introduction",
    "url":"http://courses.engr.illinois.edu/cs598jhm/sp2013/Slides/Lecture01.pdf",
    "description":""
    }
    ,
    {
    "title":"Conjugate priors",
    "url":"http://courses.engr.illinois.edu/cs598jhm/sp2013/Slides/Lecture02.pdf",
    "description":"T. Griffiths and A. Yuille A primer on probabilistic inference; Chapters 8 and 9 of D. Barber Bayesian Reasoning and Machine Learning. See also this diagram of conjugate prior relationships"
    }
    ]
    

    It can be imported into mongo as a collection in 2 steps.

    1. Create an empty collection. (mongo shell)

    db.createCollection('newCollect')

    2. Use mongoimport. (in a terminal)

    mongoimport --db databaseName --collection newCollect --file /home/tmp/course_temp.json --jsonArray

    It can also be written in short form. The --jsonArray flag is still needed because the file holds a single JSON array:

    mongoimport -d dbName -c collectName --jsonArray path/file.json
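
    The same import can be done from Python with the standard json module. The sketch below only covers reading and checking the file; the helper name load_json_array is made up, and the final insert is shown as a comment because it needs pymongo and a running server.

```python
import json


def load_json_array(path):
    """Read a file holding a JSON array of objects and return a list of dicts."""
    with open(path, encoding='utf-8') as f:
        data = json.load(f)
    # mongoimport --jsonArray expects the same shape: one array whose elements are objects
    if not isinstance(data, list) or not all(isinstance(d, dict) for d in data):
        raise ValueError('expected a JSON array of objects')
    return data


# Usage (needs the pymongo package and a running mongod):
#   import pymongo
#   collection = pymongo.MongoClient('localhost', 27017)['databaseName']['newCollect']
#   collection.insert_many(load_json_array('/home/tmp/course_temp.json'))
```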

    Modify a table (collection) with PyMongo

    Suppose there is a table named itemList in a db named myDatabase.
    All the code below uses functions from the pymongo module, which lets us manage MongoDB from Python.

    Start a connection

    import pymongo
    
    client = pymongo.MongoClient('localhost', 27017)
    myDB = client['myDatabase']
    myTable = myDB['itemList']
    

    If the database or collection doesn't exist yet, MongoDB creates it when the first record is inserted, much like Python's open function creates a missing file in write mode.

    Add a record

    Every record must be a dict before it is added to a collection.

    myTable.insert_one(dataDict)

    Delete a record

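    myTable.delete_many({'words': 0})

    The argument is also a dict; every item whose key and value match it is deleted. Older PyMongo code used remove() for this, which was deprecated in PyMongo 3 and removed in PyMongo 4.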

    Modify a record

    myTable.update_one(arg1, arg2)
    e.g.
    myTable.update_one({'id': 1}, {'$set': {'name': 2}})

    arg1 is the filter that selects the record; arg2 describes the modification to apply.
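
    To see what a '$set' operation does to the selected record, its effect can be imitated on a plain dict. This is only a model for top-level fields, not how MongoDB implements updates; the helper name apply_set and the sample record are made up.

```python
def apply_set(document, update):
    """Return a copy of document with the fields listed under '$set' overwritten."""
    changed = dict(document)          # leave the caller's dict untouched
    changed.update(update['$set'])    # only the named fields change; the rest stay
    return changed


before = {'id': 1, 'name': 'old', 'price': 300}
after = apply_set(before, {'$set': {'name': 2}})
print(after)
```

    The fields that are not named under '$set' (here id and price) are left exactly as they were.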

    Query records

    myTable.find()

    Called without arguments, find() returns a cursor over every record in the collection.



    Homework 1: Find all rooms whose price is greater than 500

    Target

    First, crawl the info of all rooms on the first three pages;
    Second, select the rooms whose price is greater than 500.

    Coding

    import requests
    from bs4 import BeautifulSoup
    import pymongo
    
    
    def getBriefFromListPage(url):
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'lxml')
        # print(soup.prettify())
        itemsA = soup.select('#page_list > ul > li')
        itemsB = soup.select('#page_list li i')
        infos = soup.select('#page_list li em.hiddenTxt')
    
        dataList = []
        for itemA, itemB, info in zip(itemsA, itemsB, infos):
            link = itemA.a.get('href')
            # image = itemA.a.img.get('src')  # images are loaded asynchronously, so they cannot be fetched here
            title = itemA.a.img.get('title')
            price = int(itemB.string)
            otherInfo = info.get_text().replace(' ', '').replace('\n', '')
            data = {  # stored in the database as a dict
                'title': title,
                'price': price,
                'otherInfo': otherInfo,
                'link': link
            }
            dataList.append(data)
        return dataList
    
    
    def putListDataInMongo(ListData, DBname, SHEETname):
        'Put a list of dicts into the given place in the database: DBname -> SHEETname'
        client = pymongo.MongoClient('localhost', 27017)
        myDataBase = client[DBname]
        mysheet = myDataBase[SHEETname]
        for eachData in ListData:
            mysheet.insert_one(eachData)
        print('Inserted', len(ListData), 'records into the DB.')
    

    These are the utility functions. They are used as follows.

    start_url = 'http://bj.xiaozhu.com/search-duanzufang-p{pageNumber}-0/'  # pageNumber=1 is the first page
    
    for index in range(1, 4):
        listPageLink = start_url.format(pageNumber=index)
        listDataDict = getBriefFromListPage(listPageLink)
        print(listPageLink)
        print(listDataDict)
        putListDataInMongo(listDataDict, 'testDB', 'sheetXiaoZhu')
    
    client = pymongo.MongoClient('localhost', 27017)
    dbname = client['testDB']
    sheet = dbname['sheetXiaoZhu']
    for index, item in enumerate(sheet.find({'price': {'$gte': 500}})):
        print(index, item)
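
    The query above uses '$gte', which also returns rooms priced at exactly 500; a strict "greater than 500" would be '$gt'. The difference can be seen on a few made-up rooms, with list comprehensions standing in for the filtering that MongoDB does on the server:

```python
# Three hypothetical rooms around the 500 boundary.
rooms = [
    {'title': 'A', 'price': 499},
    {'title': 'B', 'price': 500},
    {'title': 'C', 'price': 501},
]

# What {'price': {'$gte': 500}} selects: price >= 500
at_least_500 = [room['title'] for room in rooms if room['price'] >= 500]

# What {'price': {'$gt': 500}} selects: price > 500
above_500 = [room['title'] for room in rooms if room['price'] > 500]

print(at_least_500, above_500)
```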
    

    Meanwhile, I found that MongoDB accepts duplicate items without complaint, so I wrote a piece of code to remove the duplicates.

    client = pymongo.MongoClient('localhost', 27017)
    dbname = client['testDB']
    sheet = dbname['sheetXiaoZhu']
    allData = sheet.find()
    
    seenLinks = set()
    for each in allData:
        linkAddr = each['link']
        if linkAddr in seenLinks:
            # a copy with this link was already kept, so delete this one by its unique _id
            sheet.delete_one({'_id': each['_id']})
        else:
            seenLinks.add(linkAddr)
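
    Duplicates can also be avoided at insert time, so that crawling the same pages twice does not add the same room again. The sketch below uses update_one with upsert=True, keyed on the 'link' field; the helper name upsert_by_link is made up, collection is assumed to be a pymongo collection, and the dicts are assumed not to carry an _id yet.

```python
def upsert_by_link(collection, docs):
    """Insert each dict, or overwrite the stored record that has the same 'link'."""
    for doc in docs:
        # upsert=True inserts a new record when no stored record matches the filter
        collection.update_one({'link': doc['link']}, {'$set': doc}, upsert=True)


# Usage (needs the pymongo package and a running mongod):
#   import pymongo
#   sheet = pymongo.MongoClient('localhost', 27017)['testDB']['sheetXiaoZhu']
#   upsert_by_link(sheet, listDataDict)
```

    Another option is a unique index, for example sheet.create_index('link', unique=True), which makes MongoDB reject a second record with the same link.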
    

    Appendix

    MongoDB_Tutorial (zh_CN)
    MongoDB_CheatSheet.pdf (en)
