Web Crawler 01

Author: 六六的建斌 | Published 2017-07-16 17:37

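I never learned regular expressions very well, so although I understand how a crawler works, I am still unsure how to write the pattern that matches the information I want.

import re
import urllib.request  # import the required modules

data = urllib.request.urlopen("url").read().decode("utf-8")  # open the URL and read the page

req = r".*?"  # placeholder: the regular expression that matches the information to extract

res = re.compile(req).findall(data)  # pull every match out of the page source

for line in res:
    print(line)  # prints every scraped item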
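To make the matching step more concrete, here is a minimal sketch that runs the same compile-and-findall flow on a small hard-coded string instead of a live page. The sample HTML and the pattern are assumptions chosen for illustration; a real page needs a pattern written against its own markup.

```python
import re

# Stand-in for the text that urlopen(...).read().decode("utf-8") would return.
html = '<a href="/a.html">First</a><a href="/b.html">Second</a>'

# Parentheses mark the parts to keep; .*? matches as few characters as possible.
pattern = re.compile(r'<a href="(.*?)">(.*?)</a>')

# With two groups, findall() returns a list of (href, text) tuples.
matches = pattern.findall(html)

for href, text in matches:
    print(href, text)
```

Without the parentheses, findall() would return the whole matched tag rather than just the link and its text.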

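Writing the results to a file:

with open(path, "w") as f:  # path is the output file path; on Windows, escape backslashes as \\
    for line in res:  # iterate over the matches
        f.write(line + "\n")  # write one item per line; "\n" ends the line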
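The file-writing step can be tried on its own with the sketch below. The list contents and the temporary output location are assumptions that keep the example self-contained; in practice the list would come from findall() and the path would be a real file.

```python
import os
import tempfile

res = ["first item", "second item"]  # stand-in for the findall() results

# A temporary directory is used here only so the sketch runs anywhere.
out_path = os.path.join(tempfile.mkdtemp(), "result.txt")

# An explicit encoding avoids surprises when the scraped text is not ASCII.
with open(out_path, "w", encoding="utf-8") as f:
    for line in res:
        f.write(line + "\n")

# Read the file back to confirm there is one item per line.
with open(out_path, encoding="utf-8") as f:
    written = f.read().splitlines()

print(written)
```

On Windows, a raw string such as r"C:\data\result.txt" is an alternative to doubling every backslash.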

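urllib basics:

import urllib.request

urllib.request.urlretrieve(url, local_path)  # downloads a resource from the web straight to a local file

urllib.request.urlcleanup()  # removes temporary files left behind by earlier urlretrieve() calls

The response object returned by urlopen() also has info(), which returns the response headers.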
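The three calls can be tried without a network connection by pointing them at a local file: URL, as in this rough sketch. The file names are made up for the example; with a real site the first argument would be an http or https address.

```python
import pathlib
import tempfile
import urllib.request

with tempfile.TemporaryDirectory() as tmp:
    # Create a small local file and address it with a file:// URL.
    source = pathlib.Path(tmp) / "source.txt"
    source.write_text("hello", encoding="utf-8")
    target = pathlib.Path(tmp) / "copy.txt"

    # urlretrieve() returns the local path and the response headers.
    local_path, headers = urllib.request.urlretrieve(source.as_uri(), str(target))
    copied = target.read_text(encoding="utf-8")
    content_length = headers["Content-Length"]

    # urlopen() responses expose the same kind of headers through info().
    with urllib.request.urlopen(source.as_uri()) as resp:
        info_length = resp.info()["Content-Length"]

    # Remove any temporary files left by earlier urlretrieve() calls.
    urllib.request.urlcleanup()

print(copied, content_length, info_length)
```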

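URL-encoding Chinese text:

keyword = "彭坤"

keyword = urllib.request.quote(keyword)  # percent-encode the keyword so it can go in a URL (documented as urllib.parse.quote)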
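As a small illustration of what the encoding step produces, the sketch below encodes the keyword 风景 (chosen here because its encoded form appears in the Pixabay URL later in these notes) and decodes it again. It uses urllib.parse.quote, which is the documented home of the function.

```python
import urllib.parse

keyword = "风景"

# quote() encodes the text as UTF-8 and percent-escapes each byte.
encoded = urllib.parse.quote(keyword)
print(encoded)  # %E9%A3%8E%E6%99%AF

# unquote() reverses the encoding.
decoded = urllib.parse.unquote(encoded)
print(decoded)
```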

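Setting a timeout:

How long a page takes to load depends on how quickly the site's server responds, so set a timeout that suits your needs.

urllib.request.urlopen("url", timeout=5)  # give up if the server has not responded within 5 seconds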

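Simulating HTTP requests automatically

There are two kinds of request: post(), used for forms such as login pages, and get().

A GET request usually has the form url?field=value&field=value&...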
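A GET address of this shape does not have to be assembled by hand. The sketch below builds one with urllib.parse.urlencode; the base address and the field names are hypothetical and only illustrate the pattern.

```python
import urllib.parse

base = "https://example.com/search"  # hypothetical address
params = {"q": "python", "page": 2}  # hypothetical fields

# urlencode() joins the fields as field=value pairs separated by "&"
# and percent-encodes any characters that need it.
url = base + "?" + urllib.parse.urlencode(params)
print(url)  # https://example.com/search?q=python&page=2
```

The resulting string can then be passed to urllib.request.urlopen() like any other URL.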

POST format (form submission):

import urllib.request
import urllib.parse

posturl = url

posttt = urllib.parse.urlencode({"name": "nideminzi", "password": "nidemima"}).encode("utf-8")  # encode the form fields as bytes

A POST needs urllib.request.Request(address, encoded data):

req = urllib.request.Request(posturl, posttt)

res = urllib.request.urlopen(req).read().decode("utf-8")
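To see what the POST steps produce before anything is sent, the sketch below builds the request and inspects it without contacting a server. The login address is hypothetical; the form fields are the ones used in the notes.

```python
import urllib.parse
import urllib.request

posturl = "https://example.com/login"  # hypothetical address
form = {"name": "nideminzi", "password": "nidemima"}

# urlencode() returns a str; the request body must be bytes, hence encode().
body = urllib.parse.urlencode(form).encode("utf-8")

# Supplying data is what turns the request into a POST.
req = urllib.request.Request(posturl, data=body)

method = req.get_method()
print(method)    # POST
print(req.data)  # b'name=nideminzi&password=nidemima'
```

Calling .decode() on the urlencode() result fails because a str has no decode method, which is why .encode() is needed at that step.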

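Exception handling in crawlers:

Without exception handling, the crawler crashes as soon as it hits an exception and has to start again from the beginning on the next run.

URLError causes: the server cannot be reached, the remote URL does not exist, there is no network connection, or an HTTPError was raised.

HTTPError: a subclass of URLError, raised when the server returns an HTTP error status; it carries the status code and reason.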
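One possible way to combine a timeout with these two exception types is sketched below. The helper name fetch and its return-None-on-failure convention are assumptions made for the example, not part of urllib. HTTPError is caught first because it is a subclass of URLError.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout=5):
    """Return the decoded body of url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        print("HTTP error:", e.code, e.reason)
    except urllib.error.URLError as e:
        # No usable answer: bad host, no network, missing resource, and so on.
        print("URL error:", e.reason)
    except socket.timeout:
        # The server stopped responding while the body was being read.
        print("Timed out:", url)
    return None


# A URL that cannot be opened returns None instead of crashing the crawler.
result = fetch("file:///this/path/should/not/exist-12345")
print(result)
```

Because a failed page yields None rather than an exception, a loop over many URLs can skip it and carry on.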

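Disguising the crawler as a browser: press Fn+F12 to open the browser's developer tools, where a real User-Agent value can be copied from any request.

Request header format: a ("User-Agent", value) tuple.

headers = ("User-Agent", " ")  # put the copied User-Agent string in place of " "

opener = urllib.request.build_opener()

opener.addheaders = [headers]

data = opener.open(url).read().decode("utf-8")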
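An alternative to the opener approach is to attach the header to a single Request object. The sketch below shows this under the assumption of a sample User-Agent string and a hypothetical address; the real value should be copied from the browser's developer tools.

```python
import urllib.request

# Sample value for illustration only; copy a real one from the developer tools.
ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

# Hypothetical address; headers passed here apply to this request only.
req = urllib.request.Request("https://example.com/", headers={"User-Agent": ua})

# Request stores header names capitalized, so the lookup key is "User-agent".
stored = req.get_header("User-agent")
print(stored)
```

The request would then be sent with urllib.request.urlopen(req), just as with a plain URL.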

Using a user-agent pool
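The notes do not spell out how the pool works, so the following is only a rough sketch of one common approach: keep several User-Agent strings in a list and pick one at random for each opener. The strings in UA_POOL and the helper name opener_with_random_ua are assumptions for illustration.

```python
import random
import urllib.request

# Sample User-Agent strings; a real pool would hold values copied from browsers.
UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]


def opener_with_random_ua(pool=UA_POOL):
    """Build an opener whose User-Agent is picked at random from pool."""
    ua = random.choice(pool)
    opener = urllib.request.build_opener()
    opener.addheaders = [("User-Agent", ua)]
    return opener


opener = opener_with_random_ua()
print(opener.addheaders)
```

Building a fresh opener every few requests makes successive requests appear to come from different browsers.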

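IP proxies and building an IP proxy pool: to crawl a site through proxy IPs, search Baidu for 西刺代理 to find proxy addresses.

Setting up a proxy IP follows almost the same steps as setting up a user agent.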
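A proxy pool can be sketched along the same lines as a user-agent pool, using urllib.request.ProxyHandler. The addresses below are placeholders from a range reserved for documentation and will not work as real proxies; the helper name opener_with_random_proxy is also an assumption.

```python
import random
import urllib.request

# Placeholder "ip:port" entries; replace with working proxy addresses.
PROXY_POOL = ["203.0.113.10:8080", "203.0.113.11:3128"]


def opener_with_random_proxy(pool=PROXY_POOL):
    """Build an opener that routes traffic through a randomly chosen proxy."""
    addr = random.choice(pool)
    handler = urllib.request.ProxyHandler({"http": addr, "https": addr})
    return urllib.request.build_opener(handler), addr


opener, chosen = opener_with_random_proxy()
print("using proxy", chosen)
```

Free proxies fail often, so in practice it helps to catch URLError and retry with another address from the pool.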

# Download landscape photos from Pixabay search result pages.
import os
import urllib.request

from bs4 import BeautifulSoup

# Create the output directory if it does not exist yet.
os.makedirs('photofirst', exist_ok=True)

url = "https://pixabay.com/zh/photos/?q=%E9%A3%8E%E6%99%AF&image_type=&min_width=&min_height=&cat=&pagi="

count = 0  # running image number across all pages
for page in range(1, 200):
    res = urllib.request.urlopen(url + str(page))
    data = BeautifulSoup(res, 'lxml')
    # Keep the first URL listed in each <img> tag's srcset attribute.
    links = []
    for img in data.find_all('img'):
        s = img.get('srcset')
        if s is None:
            continue
        links.append(s.split(' ')[0])
    # Download every image found on this page.
    for link in links:
        count += 1
        filename = os.path.join('photofirst', 'photofirst' + str(count) + '.jpg')
        urllib.request.urlretrieve(link, filename)

Article link: https://www.haomeiwen.com/subject/gbvohxtx.html
