Batch-scraping images from a certain (censored) forum

Author: _weber_ | Source: published 2017-02-09 16:31, read 131 times

Overview

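Scrapes the images from every post in the "新时代的xxx" board of a certain site that cannot be named.
Due to forum restrictions, unregistered users can only access the first 100 pages of data.
This article is for learning purposes only; the addresses in the code have been censored. (✿◡‿◡)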

Part 1 code

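1. Define get_post_urls() to parse the addresses of all posts on one post-list page.
2. Call it to collect the post addresses from the first 100 pages.
3. Save the post addresses to a MongoDB database. Because the forum's address changes from time to time, only the second half of each post address is stored; the first half (forum_url in the code) can be changed later as needed.
4. The 'status': 'new' value marks the state of each address.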
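Points 3 and 4 can be illustrated with a minimal sketch. It is not the code used in this article: it uses urljoin from the standard library instead of plain string concatenation, and the helper names, base addresses, and post path are hypothetical placeholders.

```python
from urllib.parse import urljoin


def make_record(post_path):
    """Build the document stored for one post: relative path plus a status flag."""
    return {'status': 'new', 'post_url': post_path}


def full_url(forum_url, record):
    """Rebuild the absolute address from the current base URL and the stored path."""
    return urljoin(forum_url, record['post_url'])


# Hypothetical base addresses and path, for illustration only.
record = make_record('htm_data/8/1702/example.html')
print(full_url('http://forum.example/', record))
print(full_url('http://mirror.example/', record))
```

Because only the relative path is stored, switching to a new forum address means changing one string rather than rewriting every record.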

# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup
import pymongo


def get_post_urls(sub_url):
    """Return the addresses of all posts on one post-list page."""
    res = requests.get(sub_url, headers={'Accept-Encoding': ''})
    soup = BeautifulSoup(res.text, 'lxml')
    # post links are the <a> elements directly inside <h3>
    datas = soup.select('h3 > a')
    post_urls = []
    for data in datas:
        post_urls.append(data.get('href'))
    return post_urls


if __name__ == '__main__':
    client = pymongo.MongoClient('localhost', 27017)
    db_cl = client['db_cl']
    col_newworld = db_cl['col_newworld']
    forum_url = 'http://c*.o1t.***/'
    # range(1, 101) covers pages 1 through 100
    sub_urls = [forum_url + 'thread0806.php?fid=8&search=&page={}'.format(i) for i in range(1, 101)]
    for sub_url in sub_urls:
        for post_url in get_post_urls(sub_url):
            data = {
                'status': 'new',
                'post_url': post_url
            }
            col_newworld.insert_one(data)
            print(post_url)
        print(sub_url)
    print('END!!!')
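One thing to watch with the script above: running it a second time inserts every post address again. A possible way around that, sketched here as an assumption rather than as part of the original code, is to upsert on post_url with MongoDB's $setOnInsert operator. The function name save_new_posts is hypothetical; collection is expected to be a pymongo collection such as col_newworld.

```python
def save_new_posts(collection, post_urls):
    """Store each post path once; paths already stored keep their current status."""
    for post_url in post_urls:
        collection.update_one(
            {'post_url': post_url},               # match on the stored path
            {'$setOnInsert': {'status': 'new'}},  # applied only when a new document is created
            upsert=True,
        )
```

In the main loop this would replace the insert_one call, for example save_new_posts(col_newworld, get_post_urls(sub_url)). Posts already marked 'used' are left untouched because $setOnInsert does nothing when the document already exists.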

Part 2 code

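1. Read the post addresses from the database and parse out each post's title and image addresses.
2. Use the title as the folder name: one folder per post, holding all of that post's images.
3. Once an address has been processed, update its status in the database to "used" to avoid downloading it again.
4. For reasons I don't understand, post titles came out garbled by default. After several hours of searching, adding res.encoding = res.apparent_encoding partly solved it; only partly, because Traditional Chinese characters are still garbled.
5. .limit(5) controls how many posts are downloaded per run; adjust it as needed.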
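Regarding point 2, a post title can contain characters that Windows does not accept in folder names, in which case creating the folder fails. The following is a minimal sketch of one way to clean a title first. The helper name safe_dirname, the underscore replacement, and the 80-character limit are all assumptions for illustration, not part of the original script.

```python
import re

# Characters Windows does not allow in file or folder names, plus control characters.
_INVALID = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_dirname(title, max_len=80):
    """Turn a post title into a string that is safe to use as a folder name."""
    name = _INVALID.sub('_', title)[:max_len]
    name = name.strip(' .')  # Windows also rejects trailing spaces and dots
    return name or 'untitled'
```

The title would then be passed through safe_dirname() before it is joined onto the download directory.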

# -*- coding: utf-8 -*-

"""
Read post addresses from the database, download the images in each post, and save them locally
"""
import requests
from bs4 import BeautifulSoup
import pymongo
import os


def dl_img(post_url):
    """Parse all image addresses from a post and download them into a folder named after its title."""
    res = requests.get(post_url)
    res.encoding = res.apparent_encoding  # work around garbled Chinese text
    soup = BeautifulSoup(res.text, 'lxml')
    title = soup.select('h4')[0].get_text()
    img_urls = soup.select('input[type="image"]')
    # raw string keeps the backslashes literal; os.mkdir returns None, so build the path separately
    fdir = os.path.join(r'C:\WWB\python\clpic', title)
    os.makedirs(fdir, exist_ok=True)
    for index, img_url in enumerate(img_urls, start=1):
        img = requests.get(img_url.get('src'))
        file = os.path.join(fdir, str(index) + '.jpg')
        with open(file, 'wb') as f:
            f.write(img.content)
        print(index)
    print(post_url)


if __name__ == '__main__':
    client = pymongo.MongoClient('localhost', 27017)
    db_cl = client['db_cl']
    col_newworld = db_cl['col_newworld']
    forum_url = 'http://c*.o1t.***/'
    for item in col_newworld.find({'status': 'new'}).limit(5):
        try:
            dl_img(forum_url + item['post_url'])
            # update this exact document; filtering on status alone would mark an arbitrary 'new' post
            col_newworld.update_one({'_id': item['_id']}, {'$set': {'status': 'used'}})
        except Exception as e:
            # print the error too, so failures are not silent
            print(item, e)
    print('END!!!')
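On the garbled Traditional Chinese titles mentioned in point 4 of the notes for this part: one possible cause, and this is only an assumption about this forum, is that the detected encoding comes back as GB2312, which covers Simplified characters only. GB18030 is a superset that also covers Traditional characters, so mapping the detected label to it may help. The helper name pick_encoding is hypothetical.

```python
def pick_encoding(detected):
    """Map a detected GB2312/GBK label to GB18030, a superset that also covers Traditional characters."""
    if detected:
        label = detected.lower().replace('-', '').replace('_', '')
        if label in ('gb2312', 'gbk'):
            return 'gb18030'
    return detected


# Sample bytes containing Traditional characters, encoded the way a GBK page would send them.
raw = '新時代的圖片'.encode('gbk')
decoded = raw.decode(pick_encoding('GB2312'))
```

In dl_img this would be used as res.encoding = pick_encoding(res.apparent_encoding). If the page is actually served in a different encoding such as Big5, this sketch will not help, and the encoding declared in the page itself would have to be checked.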
