Python Practice Project: Scraping Phone Numbers

Author: black_crow | Source: published 2016-10-07 17:00, read 1059 times

    Date:2016-10-7
    By:Black Crow

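    Preface:

    This assignment combines the homework for Week 2, Lessons 2 and 3, and scrapes phone-number listings from 58.com.
    The work has two parts: the first collects the listing URLs from the list pages, and the second scrapes the details from each individual listing page.
    The resume-after-interruption logic from Lesson 3 uses find_one(): it first checks whether the record is already in the database, skips it if so, and writes it otherwise.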
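
The check-then-write logic described above can be sketched as follows. This is a minimal illustration rather than the exact code used for the assignment: `save_if_new` and the in-memory `FakeCollection` are hypothetical names introduced here so the example runs without a MongoDB server. With pymongo, a real collection (which exposes `find_one()` and `insert_one()`) would be passed in instead.

```python
class FakeCollection:
    """Hypothetical in-memory stand-in for a pymongo collection (illustration only)."""

    def __init__(self):
        self.docs = []

    def find_one(self, query):
        # Return the first stored document whose fields all match the query, else None.
        for doc in self.docs:
            if all(doc.get(k) == v for k, v in query.items()):
                return doc
        return None

    def insert_one(self, doc):
        # Store a copy so later changes to the caller's dict do not affect the store.
        self.docs.append(dict(doc))


def save_if_new(collection, doc, key='url'):
    """Insert doc unless a document with the same key value is already stored.

    Returns True if the document was written, False if it was skipped.
    """
    if collection.find_one({key: doc[key]}):
        return False
    collection.insert_one(doc)
    return True
```

Calling `save_if_new` twice with the same `url` writes the record only the first time, so a crawl that is restarted after an interruption skips whatever it has already stored.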

    Results:

    手机urls.png (screenshot: phone URLs)
    手机详情.png (screenshot: phone details)

    My code:

    20161007 code PART 1: scraping the list pages

    from bs4 import BeautifulSoup
    import requests
    import time
    from pymongo import MongoClient

    # Sample list-page URL (not used by the functions below)
    p = 'http://bj.58.com/shoujihao/pn2/'

    client = MongoClient('localhost', 27017)
    tongcheng = client['tongcheng']
    mobile_pages = tongcheng['mobile_pages']  # collection of listing URLs
    def counter(i=[0]):
        # The mutable default list is shared across calls, so it keeps a running count.
        i[0] += 1
        return i[0]
    def get_shouji_urls(page_url):
        # Collect the phone number and detail-page URL of every listing on one list page.
        wb_data = requests.get(page_url)
        soup = BeautifulSoup(wb_data.text, 'lxml')
        phone_numbers = soup.select('a.t > strong')
        phone_urls = soup.select('a.t')
        # print(phone_numbers)
        for phone_number, phone_url in zip(phone_numbers, phone_urls):
            data = {
                'phone_number': phone_number.get_text(),
                'phone_url': phone_url.get('href').split('?')[0],  # drop the query string
            }
            # Skip links whose URL contains a 'jump' component
            if 'jump' in data['phone_url'].split('//')[1].split('.'):
                pass
            else:
                # print(data)
                mobile_pages.insert_one(data)
                print(counter())
    def page_get():
        for page_number in range(0, 200):
            page = 'http://bj.58.com/shoujihao/pn{}/'.format(page_number)
            wb_data = requests.get(page)
            soup = BeautifulSoup(wb_data.text, 'lxml')
            # Value shown on the page; '0' is treated as an empty page and skipped
            pages_check = soup.select('#infocont > span > b')
            for page_check in pages_check:
                page_check = page_check.get_text()
                # print(page_check)
                if page_check == '0':
                    pass
                else:
                    get_shouji_urls(page)
                    time.sleep(1)
    page_get()
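
The URL clean-up inside `get_shouji_urls` (dropping the query string and skipping 'jump' links) could also be written as a small standalone function with the standard library's `urllib.parse`. This is only a sketch: `clean_listing_url` is a hypothetical helper, and unlike the string splitting above it inspects only the host name, not the whole URL. The `jump.zhineng.58.com` host in the usage note is an illustrative example.

```python
from urllib.parse import urlsplit


def clean_listing_url(href):
    """Return href without its query string, or None if the host has a 'jump' label."""
    parts = urlsplit(href)
    # The host is split on dots, so 'jump.zhineng.58.com' yields the label 'jump'.
    if 'jump' in parts.netloc.split('.'):
        return None
    return '{}://{}{}'.format(parts.scheme, parts.netloc, parts.path)
```

For instance, `clean_listing_url('http://bj.58.com/shoujihao/27614539752242x.shtml?psid=1')` keeps only the part before the `?`, while a link on `jump.zhineng.58.com` comes back as `None` and can be skipped before writing to the database.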

    20161007 code PART 2: scraping the detail pages

    from bs4 import BeautifulSoup
    import requests,time
    from pymongo import MongoClient
    client = MongoClient('localhost',27017)
    tongcheng = client['tongcheng']
    mobile_info1 = tongcheng['mobile_info1']
    mobile_pages = tongcheng['mobile_pages']
    # path= 'http://bj.58.com/shoujihao/27614539752242x.shtml'
    def counter(i=[0]):
        # The mutable default list is shared across calls, so it keeps a running count.
        i[0] += 1
        return i[0]
    def get_shouji_info(url):
        wb_data= requests.get(url)
        soup =BeautifulSoup(wb_data.text,'lxml')
        titles = soup.select('h1')
        prices = soup.select('span.price')
        ymds = soup.select('li.time')
        # print(times)
        for title,price,ymd in zip(titles,prices,ymds):
            data={
                'title':title.get_text().strip(),
                'price':price.get_text().strip(),
                'ymd':ymd.get_text(),
                'url':url
            }
            if mobile_info1.find_one({'url':data['url']}):  # report if this URL is already stored, otherwise insert
                # if mobile_info1.find_one({'title':data['title']}):
                print('already exists')
            else:
                mobile_info1.insert_one(data)
                print(counter())
                time.sleep(1)
            #print(data)
    #get_shouji_info(path)
    for item in mobile_pages.find():
        get_shouji_info(item['phone_url'])
    

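    Summary:

    1. pool() has not been added yet, so the crawl is rather slow;
    2. How efficient is find_one()? Not yet measured.
    3. Some of the scraped results contain empty values; the cause still needs to be tracked down.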
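
As a rough idea of what adding a pool might look like, here is a minimal sketch. It assumes that "pool()" refers to the `Pool` class from Python's `multiprocessing` package, and it uses the thread-based `multiprocessing.dummy.Pool`, which offers the same `map()` interface. `fetch_detail` is a hypothetical stand-in for `get_shouji_info()` that does no network or database work, so the example runs on its own.

```python
from multiprocessing.dummy import Pool  # thread-based Pool with the multiprocessing.Pool API


def fetch_detail(url):
    """Hypothetical stand-in for get_shouji_info(); returns a small record, no network I/O."""
    return {'url': url, 'length': len(url)}


def crawl_all(urls, workers=4):
    """Apply fetch_detail to every URL using a pool of worker threads.

    pool.map() returns the results in the same order as the input URLs.
    """
    with Pool(workers) as pool:
        return pool.map(fetch_detail, urls)
```

In the real crawler the list of URLs would come from `mobile_pages.find()`. If a process-based `multiprocessing.Pool` is used instead, each worker process should create its own `MongoClient`, since pymongo clients are not fork-safe.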

    Article link: https://www.haomeiwen.com/subject/mrbyyttx.html