2019-06-24
ProxyHandler (proxy IPs)
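The screenshot here is missing; presumably it showed urllib's ProxyHandler. A minimal sketch (the proxy address 127.0.0.1:8888 is a placeholder, not a real proxy):

```python
from urllib import request

# Placeholder proxy address; replace with a real proxy ip:port
proxy = {'http': 'http://127.0.0.1:8888'}
handler = request.ProxyHandler(proxy)    # routes matching requests through the proxy
opener = request.build_opener(handler)   # an OpenerDirector using that handler
# opener.open('http://httpbin.org/ip') would now go out via the proxy
print(type(opener).__name__)
```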
The difference between response.text and response.content
Decoding: decode, bytes -> str. When writing text to a file you must decode to a string first.
Encoding: encode, str -> bytes.
Sending a POST request
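The screenshot is gone; a sketch of a POST with form data (the URL and fields are made up for illustration). Building a PreparedRequest shows the encoded body without actually sending anything:

```python
import requests

data = {'username': 'alice', 'password': 'secret'}   # hypothetical form fields
req = requests.Request('POST', 'https://example.com/login', data=data).prepare()
print(req.body)   # urlencoded form body: username=alice&password=secret
# To actually send it: requests.post('https://example.com/login', data=data, headers=headers)
```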
Using a proxy
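In requests a proxy is just a dict passed via the proxies parameter; a sketch (the address is a placeholder):

```python
import requests

proxies = {
    'http': 'http://127.0.0.1:8888',    # placeholder proxy ip:port
    'https': 'http://127.0.0.1:8888',
}
session = requests.Session()
session.proxies.update(proxies)          # every request on this session uses the proxy
# or per request: requests.get(url, proxies=proxies)
print(session.proxies['http'])
```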
Cookies
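Cookies can be passed per request with the cookies parameter; preparing the request shows the resulting Cookie header (the token value is made up):

```python
import requests

req = requests.Request('GET', 'https://example.com/',
                       cookies={'token': 'abc123'}).prepare()
print(req.headers['Cookie'])   # token=abc123
```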
Sessions
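A Session keeps cookies (and headers) across requests, so a cookie set by one response is sent automatically on the next request. A sketch with made-up values, checking the merged header offline:

```python
import requests

session = requests.Session()
# e.g. a login response would set this; here we set it by hand
session.cookies.set('sid', 'xyz', domain='example.com')
req = session.prepare_request(requests.Request('GET', 'http://example.com/profile'))
print(req.headers.get('Cookie'))   # sid=xyz
```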
Untrusted certificates
Passing verify=False to requests.get skips certificate verification, so the SSL error goes away.
XPath and lxml
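The screenshots are gone; a minimal sketch of the lxml + XPath workflow on an inline HTML snippet (the movie title is made up):

```python
from lxml import etree

text = "<ul class='lists'><li data-title='Spirited Away'><img src='poster.jpg'/></li></ul>"
html = etree.HTML(text)                     # parse HTML into an element tree
ul = html.xpath("//ul[@class='lists']")[0]  # xpath always returns a list
lis = ul.xpath("./li")                      # ./ selects direct children of the current node
print(lis[0].xpath("@data-title")[0])       # Spirited Away
```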
Scraping the now-playing movie listings from Douban
Full code:
import requests
from lxml import etree

headers = {
    'Referer': 'https://movie.douban.com/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
url = 'https://movie.douban.com/cinema/nowplaying/hangzhou/'
response = requests.get(url, headers=headers)
text = response.content.decode('utf-8')
html = etree.HTML(text)
ul = html.xpath("//ul[@class='lists']")[0]
lis = ul.xpath("./li")
movies = []
for li in lis:
    title = li.xpath("@data-title")[0]
    score = li.xpath("@data-score")[0]
    region = li.xpath("@data-region")[0]
    director = li.xpath("@data-director")[0]
    actors = li.xpath("@data-actors")[0]
    poster = li.xpath(".//img/@src")[0]
    movie = {
        'title': title,
        'score': score,
        'region': region,
        'director': director,
        'actors': actors,
        'poster': poster
    }
    movies.append(movie)
print(movies)
Where I went wrong:
- dictionary keys must be quoted
- to select children of the current element use ./ not .//
- xpath always returns a list, so append [0] to get the string
Movie Heaven (dytt8) crawler
Optimized version:
A lambda is an anonymous function: map applies the lambda to every element of the iterable that follows.
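For example, prefixing each relative detail URL with the site domain (the paths are made up):

```python
MAINNET = 'https://www.dytt8.net'
detail_urls = ['/html/gndy/dyzz/1.html', '/html/gndy/dyzz/2.html']   # made-up paths
# map applies the lambda to every element; wrap in list() to materialize it
full_urls = list(map(lambda url: MAINNET + url, detail_urls))
print(full_urls[0])   # https://www.dytt8.net/html/gndy/dyzz/1.html
```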
Getting the detail page
XPath errors I ran into:
- lxml.etree.XPathEvalError: Unfinished literal (the class string was mistyped, e.g. an unclosed quote)
- lxml.etree.XPathEvalError: Invalid predicate (a closing ] was missing)
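Both errors can be reproduced on a tiny document; xpath compiles the expression before evaluating it, so a malformed predicate raises immediately:

```python
from lxml import etree

html = etree.HTML("<div class='co_content8'>hi</div>")
try:
    html.xpath("//div[@class='co_content8]")    # unclosed quote
except etree.XPathEvalError as e:
    print(e)                                    # Unfinished literal
try:
    html.xpath("//div[@class='co_content8'")    # missing closing ]
except etree.XPathEvalError as e:
    print(e)                                    # Invalid predicate
```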
Incomplete code:
import requests
from lxml import etree

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
MAINNET = 'https://www.dytt8.net'

def get_detail_url(url):
    response = requests.get(url, headers=HEADERS)
    text = response.text
    html = etree.HTML(text)
    detail_urls = html.xpath("//div[@class='co_content8']//table//a/@href")
    detail_urls = map(lambda url: MAINNET + url, detail_urls)
    return detail_urls

def handle_detail_url(detail_url):
    movie = {}
    response = requests.get(detail_url, headers=HEADERS)
    text = response.content.decode('gbk')
    html = etree.HTML(text)
    title = html.xpath("//div[@class='title_all']//font[@color='#07519a']/text()")[0]
    movie['title'] = title
    download = html.xpath("//td[@bgcolor='#fdfddf']/a/@href")[0]
    movie['download'] = download
    return movie

def spider():
    base_url = "https://www.dytt8.net/html/gndy/dyzz/list_23_{}.html"
    movies = []
    for x in range(1, 3):
        url = base_url.format(x)
        detail_urls = get_detail_url(url)
        for detail_url in detail_urls:
            movie = handle_detail_url(detail_url)
            movies.append(movie)
            print(movie)
            break   # only the first movie, for testing
        break       # only the first page, for testing

if __name__ == "__main__":
    spider()
A few strange things: I can't extract the thunder:// links, and the href URLs I do get won't open.