Scraping Baidu Tieba Images with Python (1)

Author: lunabird | Published 2015-11-24 22:24

    Python version: 2.7.10
    To learn web scraping with Python, I started by writing a program that downloads the images from a Baidu Tieba thread, referring to 静觅's series of blog posts on crawlers.

    Alright, the code first:

    # -*- coding: utf-8 -*-
    import urllib
    import urllib2
    import re
    
    
    class imgTest:
    
        def __init__(self, baseUrl, seeLZ):
            self.baseUrl = baseUrl
            self.seeLZ = '?see_lz='+str(seeLZ)
        # save a single image from imageURL to filename
        def saveImg(self, imageURL, filename):
            u = urllib.urlopen(imageURL)
            data = u.read()
            with open(filename, 'wb') as f:
                f.write(data)
        # download every image in the list, numbering files from num
        def saveImgs(self, images, name, num):
            number = num
            for imageURL in images:
                splitPath = imageURL.split('.')
                fTail = splitPath.pop()
                # fall back to jpg when the URL has no usable extension
                if len(fTail) > 3:
                    fTail = "jpg"
                fileName = name + "/" + str(number) + "." + fTail
                self.saveImg(imageURL, fileName)
                number += 1
        # extract the image urls from one page of the thread
        def getAllImageURLs(self, pageNum):
            page = self.getPage(pageNum)
            patternImg = re.compile(r'<img class="BDE_Image" pic_type="0".*?src="(.+?\.jpg)" pic_ext="jpeg"')
            images = re.findall(patternImg, page)
            for item in images:
                print item
                self.printToLog(item)
            return images
        # append a line to txt/log.txt (the txt directory must exist)
        def printToLog(self, mystr):
            with open('txt/log.txt', 'a') as f:
                f.write(mystr + "\n")
    
        # get the title of the thread
        def getTitle(self):
            page = self.getPage(1)
            pattern = re.compile('<h3 class="core_title_txt.*?>(.*?)</h3>',re.S)
            result = re.search(pattern, page)
            if result:
                self.printToLog("bbs title:"+result.group(1))
                return result.group(1).strip()
            else:
                return None
        # get the total number of pages in the thread
        def getPageNum(self):
            page = self.getPage(1)
            pattern = re.compile('<li class="l_reply_num".*?<span .*?</span>.*?<span.*?>(.*?)</span>',re.S)
            result = re.search(pattern, page)
            if result:
                self.printToLog("page total num:"+result.group(1))
                return result.group(1).strip()
            else:
                return None
        # fetch the html source of one page
        def getPage(self, pageNum):
            try:
                url = self.baseUrl + self.seeLZ + '&pn=' + str(pageNum)
                request = urllib2.Request(url)
                response = urllib2.urlopen(request)
                return response.read()
            except urllib2.URLError, e:
                if hasattr(e, "reason"):
                    print "failed to connect to baidu tieba.", e.reason
                return None
    
    baseURL = 'http://tieba.baidu.com/p/3925387672'
    imgtest = imgTest(baseURL, 1)
    # note: the pic/ and txt/ directories must exist before running
    totalnum = int(imgtest.getPageNum())

    imageCount = 0
    for i in range(1, totalnum + 1):
        imageURLs = imgtest.getAllImageURLs(i)
        imgtest.saveImgs(imageURLs, "pic", imageCount)
        imageCount += len(imageURLs)
        print imageCount
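
The urllib/urllib2 calls above only exist under Python 2. As a rough sketch of what the fetch in getPage would look like under Python 3 (the names build_page_url and get_page are illustrative, not from the original code):

```python
# urllib2 was folded into urllib.request in Python 3.
from urllib.request import Request, urlopen
from urllib.error import URLError

def build_page_url(base_url, see_lz, page_num):
    # mirrors self.baseUrl + self.seeLZ + '&pn=' + str(pageNum)
    return base_url + '?see_lz=' + str(see_lz) + '&pn=' + str(page_num)

def get_page(base_url, see_lz, page_num):
    try:
        response = urlopen(Request(build_page_url(base_url, see_lz, page_num)))
        # Python 3 reads bytes, so decode to text before regex matching
        return response.read().decode('utf-8', errors='replace')
    except URLError as e:
        print("failed to connect to baidu tieba.", getattr(e, 'reason', e))
        return None

print(build_page_url('http://tieba.baidu.com/p/3925387672', 1, 2))
# → http://tieba.baidu.com/p/3925387672?see_lz=1&pn=2
```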
    

    Because my Sublime Text has an encoding problem I haven't gotten around to fixing, the function comments are in English (which apparently only I can read). The most critical step is the getAllImageURLs function, which extracts the image URLs from the page source, so learning regular expressions well really matters. One more note: I found that the image URL format differs between Baidu Tieba threads, so each thread has to be inspected on its own; still, a small tweak to this regular expression is usually enough.
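
To show that extraction step in isolation, here is a small self-contained sketch (Python 3 print syntax; the HTML fragment is made up for demonstration, not taken from a real Tieba page):

```python
import re

# A made-up fragment in the style of a Tieba post page; real threads
# may use different attributes, as noted above.
html = '''
<img class="BDE_Image" pic_type="0" width="560" src="http://imgsrc.baidu.com/forum/pic/item/abc123.jpg" pic_ext="jpeg">
<img class="other" src="http://example.com/not-wanted.png">
<img class="BDE_Image" pic_type="0" src="http://imgsrc.baidu.com/forum/pic/item/def456.jpg" pic_ext="jpeg">
'''

# Same idea as the pattern in getAllImageURLs: anchor on the BDE_Image
# class and lazily capture the src attribute ending in .jpg.
pattern = re.compile(r'<img class="BDE_Image" pic_type="0".*?src="(.+?\.jpg)" pic_ext="jpeg"')
urls = pattern.findall(html)
print(urls)
# → ['http://imgsrc.baidu.com/forum/pic/item/abc123.jpg', 'http://imgsrc.baidu.com/forum/pic/item/def456.jpg']
```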

    OK, now I'm off to scrape a comic book to read :)


        Article link: https://www.haomeiwen.com/subject/vaelhttx.html