Crawler Practice: Scraping Jokes and Images from Qiushibaike

Author: zp秋枫暮霞 | Source: published 2017-06-06 10:49, read 19 times

Create a Python file

Import the required libraries
import urllib
import urllib2
import re
import os
import json
Define the URL to scrape and the request headers
page = 2
url = 'http://www.qiushibaike.com/8hr/page/%s/?s=4988835' % (str(page))
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'
headers = {'User-Agent': user_agent}
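The URL being scraped is the Qiushibaike home page.
The header is taken from a network request made by the browser.
I use the Chrome browser.
The shortcut Shift+Command+C opens the developer tools.
Copy the request header information shown there and you have the header.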
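
To illustrate how the URL and the header fit together, here is a minimal sketch. It assumes Python 3, where `urllib2` was merged into `urllib.request`, and it only builds the request object without touching the network. The helper name `build_request` is hypothetical and not part of the original script.

```python
import urllib.request

USER_AGENT = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/47.0.2526.80 Safari/537.36')


def build_request(page):
    """Hypothetical helper: build a Request for the given page number."""
    url = 'http://www.qiushibaike.com/8hr/page/%s/?s=4988835' % page
    return urllib.request.Request(url, headers={'User-Agent': USER_AGENT})


req = build_request(2)
print(req.full_url)
# Request stores header names as 'User-agent' (first letter capitalized only)
print(req.get_header('User-agent'))
```

Passing the resulting object to `urllib.request.urlopen` would send the request with the browser-like header attached.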


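def createDocuments():
    # Print the absolute path of the current working directory
    print os.path.abspath('')
    # Create a new folder under the current directory
    absPath = os.path.abspath('')
    # Build the target path with os.path.join; plain string concatenation
    # may use the wrong separator on some operating systems
    createPath = os.path.join(absPath, 'duanzi')
    print createPath
    if os.path.isdir(createPath):
        print 'Already exists'
    else:
        os.mkdir(createPath)
    return createPath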
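
The check-then-create logic above can also be written more compactly. The following is a sketch under the assumption that Python 3 is available, where `os.makedirs` accepts `exist_ok=True`. The helper name `ensure_dir` and the use of a temporary directory are illustrative only.

```python
import os
import tempfile


def ensure_dir(base, name='duanzi'):
    """Hypothetical helper: create base/name if missing and return its path."""
    target = os.path.join(base, name)
    os.makedirs(target, exist_ok=True)  # no error if the folder already exists
    return target


base = tempfile.mkdtemp()    # throwaway directory for the demo
first = ensure_dir(base)
second = ensure_dir(base)    # the second call finds the folder and does nothing
print(first == second, os.path.isdir(first))
```

Calling the helper twice returns the same path both times, so the script can be rerun without a separate existence check.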



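def saveFile(url, path, name):
    # Download the image at url and save it as <name>.jpg inside path
    req = urllib2.urlopen(url)
    buf = req.read()
    # os.path.join keeps the separator portable; "with" closes the file
    with open(os.path.join(path, str(name) + '.jpg'), 'wb') as f:
        f.write(buf)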

# saveFile('3123123.jpg', path)
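
The file-writing half of the save step can be tried without any network access. This is a minimal Python 3 sketch. The helper name `save_bytes` is hypothetical, and the bytes written are a placeholder rather than a real image.

```python
import os
import tempfile


def save_bytes(data, directory, name, ext='.jpg'):
    """Hypothetical helper: write raw bytes to directory/<name><ext>."""
    target = os.path.join(directory, str(name) + ext)
    with open(target, 'wb') as f:   # the file is closed even if write() fails
        f.write(data)
    return target


demo_dir = tempfile.mkdtemp()
saved = save_bytes(b'\xff\xd8\xff', demo_dir, 0)   # placeholder bytes only
print(os.path.basename(saved), os.path.getsize(saved))
```

Keeping the download and the write in separate steps makes the write easy to test on its own.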

path = createDocuments()

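try:
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    content = response.read().decode('utf-8')
    # The commented-out code below scrapes the jokes
    # pattern = re.compile(r'<div class="content">.*?<span>(.*?)</span>.*?</div>', re.S)
    # items = re.findall(pattern, content)
    # for item in items:
    #     print item
    # Capture the src of each image hosted on pic.qiushibaike
    images = re.compile(r'<a.*?<img src="(//pic.qiushibaike.*?)".*?</a>', re.S)
    imageList = re.findall(images, content)
    print json.dumps(imageList)
    x = 0
    for imageUrl in imageList:
        # The captured URLs are protocol-relative, so prepend the scheme
        imageFullUrl = 'http:' + imageUrl
        print imageFullUrl
        saveFile(imageFullUrl, path, x)
        x = x + 1
except Exception as e:
    # Report the failure, then re-raise with the original traceback
    print 'Scrape failed:', e
    raise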
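
The two regular expressions can be tried offline against a small piece of markup. This is a sketch only: the sample HTML is made up for illustration, and it assumes the image pattern targets the `src` attribute of `<img>` tags inside links. The live page structure may differ.

```python
import re

# Hypothetical markup, loosely modeled on what the patterns expect
SAMPLE_HTML = '''
<div class="content">
<span>First joke</span>
</div>
<a href="/article/1"><img src="//pic.qiushibaike.com/a/1.jpg" alt="x"></a>
<div class="content">
<span>Second joke</span>
</div>
<a href="/article/2"><img src="//pic.qiushibaike.com/a/2.jpg" alt="y"></a>
'''

# re.S lets "." match newlines, so a pattern can span several lines
joke_pattern = re.compile(
    r'<div class="content">.*?<span>(.*?)</span>.*?</div>', re.S)
image_pattern = re.compile(
    r'<a.*?<img src="(//pic.qiushibaike.*?)".*?</a>', re.S)

jokes = joke_pattern.findall(SAMPLE_HTML)
# The captured URLs are protocol-relative, so a scheme is prepended
image_urls = ['http:' + src for src in image_pattern.findall(SAMPLE_HTML)]

print(jokes)
print(image_urls)
```

Lazy quantifiers (`.*?`) keep each match as short as possible, which stops one match from swallowing several entries at once.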

Article link: https://www.haomeiwen.com/subject/ykrgfxtx.html