Baidu Tieba User Scraper


Author: MA木易YA | Published 2019-03-24 10:31

        Pick any tieba (here, the python bar), open its member page, and scrape each member's profile info: username, nickname, recently visited bars, bar age, and post count.


    items.py

    import scrapy

    class TiebaUserItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        nickname = scrapy.Field()
        username = scrapy.Field()
        attention = scrapy.Field()
        age = scrapy.Field()
        post_number = scrapy.Field()
    

        The simple fields are straightforward: extract each one directly with Scrapy, and pagination is not hard to handle either. If you wanted to scrape the full list of followed bars, or the followers/following lists, that would involve nested pages and get more complex; I'll cover it in a later update. This is just a simplified scrape of the basic profile info, and the code should be self-explanatory.

    spider.py

    import scrapy

    from ..items import TiebaUserItem  # assumes the standard Scrapy project layout

    class TiebaSpiderSpider(scrapy.Spider):
        name = 'tieba_spider'
        allowed_domains = ['tieba.baidu.com']
        start_urls = ['http://tieba.baidu.com/bawu2/platform/listMemberInfo?word=python&pn=1']
    
    
        def parse(self, response):
            href_list = response.xpath('//span[starts-with(@class,"member")]/div/a/@href').extract()
            for href in href_list:
                yield scrapy.Request(url="http://tieba.baidu.com" + href, callback=self.parse_tag)
            next_link = response.xpath('//div[@class="tbui_pagination tbui_pagination_left"]/ul/li/a[@class="next_page"]/@href').extract()
            if next_link:
                next_link = next_link[0]
                yield scrapy.Request(url="http://tieba.baidu.com" + next_link, callback=self.parse)
    
        def parse_tag(self, response):
            item = TiebaUserItem()
            item['nickname'] = response.xpath('//div[@class="userinfo_title"]/span/text()').extract_first()
            # The raw text carries a "用户名:" label prefix, hence the slice
            username = response.xpath('//div[@class="userinfo_userdata"]/span[@class="user_name"]/text()').extract_first()
            if username:
                item['username'] = username[4:]
            item['attention'] = response.xpath('//div[@class="ihome_forum_group ihome_section clearfix"]/div[@class="clearfix u-f-wrap"]/a//text()').extract()
            # "吧龄:…" and "发帖:…" labels are sliced off the same way
            age = response.xpath('//span[@class="user_name"]//span[2]/text()').extract_first()
            if age:
                item['age'] = age[3:]
            post_number = response.xpath('//span[@class="user_name"]//span[4]/text()').extract_first()
            if post_number:
                item['post_number'] = post_number[3:]
            return item
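A quick aside on those `[4:]` and `[3:]` slices: they cut fixed-width label prefixes such as "用户名:" and "吧龄:" off the raw text. Splitting on the colon is slightly more robust if the labels ever change length. A minimal sketch (the sample strings are hypothetical stand-ins for the page's text nodes, which use a fullwidth colon):

```python
def strip_label(text, sep=":"):
    """Drop a leading 'label:' prefix; return the raw text if no separator."""
    _, _, tail = text.partition(sep)
    return tail if tail else text

# Hypothetical raw values as they might appear on a profile page
print(strip_label("用户名:some_user"))  # some_user
print(strip_label("吧龄:2.4年"))        # 2.4年
print(strip_label("发帖:103"))          # 103
```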
    
    • For more scraper code, see my Github
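The post stops at yielding items, so here is one way to persist them. This is not from the original post: a minimal item-pipeline sketch that writes each user to a CSV file (the file name and field order are my assumptions):

```python
import csv

class CsvExportPipeline:
    """Sketch of a Scrapy item pipeline writing TiebaUserItem rows to CSV."""
    FIELDS = ["nickname", "username", "attention", "age", "post_number"]

    def open_spider(self, spider):
        # Scrapy calls this once when the spider starts
        self.file = open("tieba_users.csv", "w", newline="", encoding="utf-8")
        self.writer = csv.DictWriter(self.file, fieldnames=self.FIELDS)
        self.writer.writeheader()

    def process_item(self, item, spider):
        # Both dicts and scrapy.Item support .get(), so this works for either
        self.writer.writerow({k: item.get(k, "") for k in self.FIELDS})
        return item

    def close_spider(self, spider):
        self.file.close()
```

Enable it in settings.py with `ITEM_PIPELINES = {"<your_project>.pipelines.CsvExportPipeline": 300}` (project name depends on your layout), or skip the pipeline entirely and use Scrapy's built-in feed export: `scrapy crawl tieba_spider -o tieba_users.csv`.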


        Permalink: https://www.haomeiwen.com/subject/gfwyvqtx.html