The scraping code itself was easy to write; the hard part was getting past the site's anti-scraping measures.
I stepped in a lot of pits last night dealing with that. At first I assumed the server had banned my IP, but switching to several different IPs made no difference at all.
Then I tried using selenium to drive the browser and simulate a right-click "Save Image As", which turned out not to work = =
After that I saw posts suggesting you log in with selenium and then download the images with requests. It obviously wasn't going to work, but I couldn't resist trying it anyway, and sure enough: another 403 = =
Finally I read online that the likely cause was a missing Referer in the request headers. The Referer header carries a URL: the page the user was on when requesting the current resource.
So I tried adding a Referer, filling in the album's URL, and found I could only grab 6 images. In other words, every image has to be requested with the URL of the page it sits on as the Referer header; only then will requests.get() actually return the picture.
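A minimal sketch of that idea looks like this (the image address below is a made-up placeholder for illustration, not a real address from the site; the page URL is the album's second page from my example):

import requests

page_url = "https://www.meitulu.com/item/15653_2.html"  # the page the image appears on
img_url = "https://example.com/placeholder.jpg"          # hypothetical image address taken from that page

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0",
    "Referer": page_url,  # without this the server answers 403
}
resp = requests.get(img_url, headers=headers, allow_redirects=False)
print(resp.status_code)  # expect 200 once the Referer matches the hosting page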
With that figured out, the rest of the steps should be clear from the program:
import requests
from lxml import etree
import time
def getUrls(url="", pageNumber=25):
    '''
    The site's pagination pattern:
    if page 1's url is https://www.meitulu.com/item/15653.html,
    then page n's url is https://www.meitulu.com/item/15653_n.html
    '''
    urls = []
    url1 = url[:-5]  # strip the trailing ".html" from the first page's url
    urls.append(url)
    for i in range(2, pageNumber + 1):
        temp = url1 + "_" + str(i) + ".html"
        urls.append(temp)
    return urls
def downloadImageInUrls(urls=[], path=''):
    index = 1
    for u in urls:
        headers = {
            'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Referer": u
        }  # every image on this site must be requested with its hosting page's address as the Referer
        t = requests.get(u, headers=headers, allow_redirects=False)
        html = etree.HTML(t.text, etree.HTMLParser())  # build an XPath-queryable parse tree
        result = html.xpath('//img[@class="content_img"]')  # select the album images on this page
        for r in result:
            imgUrl = r.get("src")
            time.sleep(1)  # pause between requests
            print("Downloading: " + imgUrl + "...")
            # with open("D:\\佳苗pic\\urls.txt", 'a') as f1:
            #     f1.write(imgUrl + "\n")
            p = requests.get(imgUrl, headers=headers, allow_redirects=False)
            if p.status_code == 200:
                print("Download succeeded!")
                with open(path + str(index) + '.jpg', 'wb+') as f2:
                    f2.write(p.content)
                index = index + 1
if __name__ == "__main__":
    # album URL
    url = "https://www.meitulu.com/item/15653.html"
    # number of pages in the album
    pageNumber = 25
    # save path; it also acts as the filename prefix, so images land at MM1.jpg, MM2.jpg, ...
    path = "/home/zxsama/picture/MM"
    downloadImageInUrls(urls=getUrls(url=url, pageNumber=pageNumber), path=path)