I. Summary
This article uses the requests and lxml libraries to scrape the name, director, lead actors, rating, and one-line synopsis of every film on Douban's Top 250 movie list. The approach is fairly conventional, but there is still something to be learned from it.
II. Environment
1. PyCharm
2. Python 3.6
3. requests
4. lxml
III. Approach
(1) The list page is https://movie.douban.com/top250. There are 10 pages in total, 25 films per page, and each film sits inside an <li> tag.
(2) There are two ways to paginate. Looking at the URLs, each page is closely related to the last: page 1 is https://movie.douban.com/top250?start=0&filter=, page 2 is https://movie.douban.com/top250?start=25&filter=, and page 3 is https://movie.douban.com/top250?start=50&filter=. Only the value of the start parameter changes, so a for loop can iterate over all ten pages. Alternatively, we can extract the link behind the "后页" (next page) button on each page and follow it to build the next page's URL.
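The first pagination method can be sketched as a simple loop. BASE and page_urls below are illustrative names, not from the original code:

```python
# Build the ten page URLs of the Top 250 list: the "start" query
# parameter advances by 25 per page (0, 25, 50, ..., 225).
BASE = "https://movie.douban.com/top250?start={}&filter="

page_urls = [BASE.format(page * 25) for page in range(10)]

print(page_urls[0])   # https://movie.douban.com/top250?start=0&filter=
print(page_urls[-1])  # https://movie.douban.com/top250?start=225&filter=
```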
IV. Implementation
import random

import requests
from lxml import etree

# Pool of browser User-Agent strings; one is chosen at random per run
# so the requests look less uniform.
UA_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'keep-alive',
    'Host': 'movie.douban.com',
    'User-Agent': random.choice(UA_LIST)
}

def downloadHtml(url):
    """Fetch a page and return its text, or "" on any request error."""
    try:
        r = requests.get(url, headers=headers, timeout=10)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

def parse(url):
    """Extract the name, director line, rating and quote of each film on one page."""
    response = downloadHtml(url)
    html = etree.HTML(response)
    try:
        names = html.xpath("//*[@id='content']/div/div[1]/ol/li/div/div[2]/div[1]/a/span[1]/text()")
        directors = html.xpath("//*[@id='content']/div/div[1]/ol/li/div/div[2]/div[2]/p[1]/text()[1]")
        ratings = html.xpath("//*[@id='content']/div/div[1]/ol/li/div/div[2]/div[2]/div/span[2]/text()")
        quotes = html.xpath("//*[@id='content']/div/div[1]/ol/li/div/div[2]/div[2]/p[2]/span/text()")
        for name, director, rating, quote in zip(names, directors, ratings, quotes):
            content = {
                'name': name,
                # strip newlines, spaces and non-breaking spaces from the director line
                'director': director.replace('\n', ' ').replace(' ', '').replace('\xa0', ''),
                'rating': rating,
                'quote': quote
            }
            print(content)
    except Exception:
        # html is None when the download failed, so xpath() raises
        print("parse error")

def URL(url):
    """Return the relative href behind the "后页" (next page) button, e.g. '?start=25&filter='."""
    try:
        response = downloadHtml(url)
        html = etree.HTML(response)
        new_url = html.xpath("//*[@id='content']/div/div[1]/div[2]/span[3]/a/@href")[0]
        return new_url
    except IndexError:
        # The last page has no "next page" link; returning None ends the main loop
        print("done")

if __name__ == '__main__':
    start_url = "https://movie.douban.com/top250"
    # Follow the "后页" (next page) button until it disappears on the last page
    while True:
        try:
            parse(start_url)
            start_url = "https://movie.douban.com/top250" + URL(start_url)
        except TypeError:
            # URL() returned None: no next page, so the crawl is finished
            break
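The XPath extraction in parse() can be exercised offline against a hand-made HTML fragment that reproduces only the nesting the expressions rely on. The fragment and its field values below are made-up stand-ins, not real Douban markup:

```python
from lxml import etree

# Minimal stand-in for one <li> movie entry of the list page.
FRAGMENT = """
<ol>
  <li><div>
    <div>poster</div>
    <div>
      <div><a href="#"><span>The Shawshank Redemption</span></a></div>
      <div>
        <p>Director: Frank Darabont</p>
        <div><span></span><span>9.7</span></div>
        <p><span>Hope sets you free.</span></p>
      </div>
    </div>
  </div></li>
</ol>
"""

html = etree.HTML(FRAGMENT)
# Same relative structure as the XPaths used in parse()
names = html.xpath("//ol/li/div/div[2]/div[1]/a/span[1]/text()")
ratings = html.xpath("//ol/li/div/div[2]/div[2]/div/span[2]/text()")
quotes = html.xpath("//ol/li/div/div[2]/div[2]/p[2]/span/text()")
print(names, ratings, quotes)
```

Testing the expressions against a fixed fragment like this makes it easy to tell whether a scraping failure comes from the XPath or from the network request.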