Scraping Taobao Model (淘女郎) Profiles with Python

Author: 贼噶人 | Published 2017-12-15 17:50

Prerequisites

1. This assumes you already have a working Python development environment and that pip is installed and usable.
2. Install BeautifulSoup with pip (a sample command follows below); the script also uses requests and Python's built-in sqlite3.
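
A typical install command (assuming pip belongs to the Python 3 interpreter you will run the script with; on PyPI the BeautifulSoup 4 package is named beautifulsoup4):

pip install beautifulsoup4 requests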

Turning the idea into Python

Creating the database with Python and SQLite3

import sqlite3 as sql

# Open (or create) the SQLite database file and the table that will hold the results.
connect = sql.connect('tnl.db')
connect.execute('CREATE TABLE IF NOT EXISTS info(_id INTEGER PRIMARY KEY, name TEXT NOT NULL, '
                'via_url TEXT NOT NULL UNIQUE, zone_url TEXT NOT NULL UNIQUE);')
# Walk through the listing pages; show_name_by_page is implemented below.
for i in range(10, 1000000):
    show_name_by_page(i, connect)
connect.close()

The main job of the code above is to create a database and a table for storing the models' information.

Now let's implement the show_name_by_page function.

We use requests to fetch https://mm.taobao.com/json/request_top_list.htm?page=1 and BeautifulSoup to parse out each model's nickname, avatar URL, and profile-page URL.

def show_name_by_page(page, connect: object):
    # Fetch one page of the listing and parse the HTML (imports are shown in the complete code below).
    response = net.get('https://mm.taobao.com/json/request_top_list.htm?page=%d' % page)
    soup = BeautifulSoup(response.text, 'html.parser')
    personal_info_list = soup.find_all(name='div', attrs={'class': 'personal-info'})
    for personal_info in personal_info_list:
        # Each personal-info block holds the avatar and the name/profile link.
        lady_avatar = personal_info.find(name='div', attrs={'class': 'pic s60'})
        lady_name = personal_info.find(name='a', attrs={'class': 'lady-name'})
        print('Name: %s' % lady_name.text)
        print('Avatar URL: http:%s' % lady_avatar.find('img').get('src'))
        print('Profile URL: http:%s' % lady_name.get('href'))
        try:
            # UNIQUE constraints on via_url/zone_url make duplicate rows raise IntegrityError.
            connect.execute('INSERT INTO info (name,via_url,zone_url) VALUES ("%s","http:%s","http:%s")'
                            % (lady_name.text, lady_avatar.find('img').get('src'), lady_name.get('href')))
            connect.commit()
        except sql.OperationalError as e:
            print(e)
        except sql.IntegrityError as e:
            print(e)
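
The INSERT above splices values into the SQL string with %-formatting, so a nickname containing a double quote would break the statement. A safer variant, not from the original post, is sqlite3's parameter binding; a minimal sketch of the same insert:

# Hypothetical alternative to the string-formatted INSERT above, using parameter binding.
connect.execute('INSERT INTO info (name, via_url, zone_url) VALUES (?, ?, ?)',
                (lady_name.text,
                 'http:' + lady_avatar.find('img').get('src'),
                 'http:' + lady_name.get('href')))
connect.commit()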

Sample output

Name: KingKing
Avatar URL: http://gtd.alicdn.com/sns_logo/i2/T1kCZiFlVXXXb1upjX.jpg_60x60.jpg
Profile URL: http://mm.taobao.com/self/model_card.htm?user_id=14306938
Name: 冷玩妹
Avatar URL: http://gtd.alicdn.com/sns_logo/i3/T1YIKwFqhgXXb1upjX.jpg_60x60.jpg
Profile URL: http://mm.taobao.com/self/model_card.htm?user_id=23104539
Name: 张瑞
Avatar URL: http://gtd.alicdn.com/sns_logo/i2/T1HmqKFrNdXXb1upjX.jpg_60x60.jpg
Profile URL: http://mm.taobao.com/self/model_card.htm?user_id=913423950
Name: 程汐儿
Avatar URL: http://gtd.alicdn.com/sns_logo/i4/TB1SRe5GXXXXXcfXXXXSutbFXXX.jpg_60x60.jpg
Profile URL: http://mm.taobao.com/self/model_card.htm?user_id=299135017
Name: 佟小七
Avatar URL: http://gtd.alicdn.com/sns_logo/i6/TB1SSNcHpXXXXcAXpXXSutbFXXX.jpg_60x60.jpg
Profile URL: http://mm.taobao.com/self/model_card.htm?user_id=654742757
Name: 陈孝霙
Avatar URL: http://gtd.alicdn.com/sns_logo/i6/T1RiWUXzRhXXb1upjX.jpg_60x60.jpg
Profile URL: http://mm.taobao.com/self/model_card.htm?user_id=280219592
Name: 魏媛
Avatar URL: http://gtd.alicdn.com/sns_logo/i6/T1NNwiFopXXXb1upjX.jpg_60x60.jpg
Profile URL: http://mm.taobao.com/self/model_card.htm?user_id=511955884
Name: 潘琪琪
Avatar URL: http://gtd.alicdn.com/sns_logo/i4/T1d8GgFydhXXb1upjX.jpg_60x60.jpg
Profile URL: http://mm.taobao.com/self/model_card.htm?user_id=736634638

Complete code

from bs4 import BeautifulSoup
import requests as net
import sqlite3 as sql


def show_name_by_page(page, connect: object):
    # Fetch one page of the listing and parse the HTML.
    response = net.get('https://mm.taobao.com/json/request_top_list.htm?page=%d' % page)
    soup = BeautifulSoup(response.text, 'html.parser')
    personal_info_list = soup.find_all(name='div', attrs={'class': 'personal-info'})
    for personal_info in personal_info_list:
        # Each personal-info block holds the avatar and the name/profile link.
        lady_avatar = personal_info.find(name='div', attrs={'class': 'pic s60'})
        lady_name = personal_info.find(name='a', attrs={'class': 'lady-name'})
        print('Name: %s' % lady_name.text)
        print('Avatar URL: http:%s' % lady_avatar.find('img').get('src'))
        print('Profile URL: http:%s' % lady_name.get('href'))
        try:
            # UNIQUE constraints on via_url/zone_url make duplicate rows raise IntegrityError.
            connect.execute('INSERT INTO info (name,via_url,zone_url) VALUES ("%s","http:%s","http:%s")'
                            % (lady_name.text, lady_avatar.find('img').get('src'), lady_name.get('href')))
            connect.commit()
        except sql.OperationalError as e:
            print(e)
        except sql.IntegrityError as e:
            print(e)


# Create the database and table, then crawl the listing pages.
connect = sql.connect('tnl.db')
connect.execute('CREATE TABLE IF NOT EXISTS info(_id INTEGER PRIMARY KEY,name TEXT NOT NULL'
                ',via_url TEXT NOT NULL UNIQUE,zone_url TEXT NOT NULL UNIQUE);')
for i in range(10, 1000000):
    show_name_by_page(i, connect)
connect.close()
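
To check what actually landed in tnl.db after a crawl, you can read a few rows back; a minimal verification sketch, not part of the original post:

import sqlite3 as sql

connect = sql.connect('tnl.db')
# Print the first few stored rows to confirm the crawler is writing data.
for row in connect.execute('SELECT _id, name, via_url, zone_url FROM info LIMIT 5'):
    print(row)
connect.close()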
