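I'm staying in tonight for Qixi and wanted to find a movie to watch, so I opened the Maoyan movie Top 100 board. Clicking through it page by page turned out to be too tedious, so I decided to scrape the page content instead.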
First we browse the movie page and analyze its structure.
The analysis gives us the following request URL and request headers:
url_main = "https://maoyan.com/board/4?offset=0"
user_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
}
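The offset query parameter in that URL selects which slice of the ranking is returned. As a small sketch, assuming 10 entries per page and 100 entries in total (the range and step used by the loop later in this post), the ten page URLs could be generated like this. page_url and BASE_URL are hypothetical names for illustration, not part of the original script.

```python
from urllib.parse import urlencode

# Same board as url_main, but without the query string
BASE_URL = "https://maoyan.com/board/4"


def page_url(offset):
    """Build the URL of the page that starts at the given offset (hypothetical helper)."""
    return BASE_URL + "?" + urlencode({"offset": offset})


# Assumption: 10 entries per page, 100 entries in total
page_urls = [page_url(offset) for offset in range(0, 100, 10)]

print(page_urls[0])
print(page_urls[-1])
```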
The code to fetch a page's content is then as follows (calling get_html_content(0) fetches the first page):
import requests
import time
import parsel
import threading

# Base URL of the board; the page is selected with the "offset" query parameter
url_main = "https://maoyan.com/board/4"
user_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
}


def get_html_content(page):
    # Pass the offset so that each call fetches the requested page
    response_main = requests.get(url_main, params={"offset": page}, headers=user_headers)
    assert response_main.status_code == 200  # raise an AssertionError when the status code is not 200
    html_main = response_main.text
    # print(html_main)
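With the content in hand, we can see the basic structure of the page, and Chrome's developer tools (F12) help us analyze it further. CSS selectors let us extract the movie title, the starring actors, the release date, and the score.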
The code is as follows, where dd is the selector for one dd element (one movie entry):
print(dd.css('p.name a::text').getall()[0])
print(dd.css('p.star::text').getall()[0].strip())
print(dd.css('p.releasetime::text').getall()[0])
print("".join(dd.css('p.score i::text').getall()))
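Indexing with getall()[0] raises an IndexError when a selector matches nothing, for example if an entry has no score. Below is a minimal sketch of a defensive helper, assuming you would rather get a default value than a crash. first_text is a hypothetical name, and it works on the plain list of strings that getall() returns, so plain lists stand in for real selector results here.

```python
def first_text(values, default=""):
    """Return the first non-blank string in values, stripped, or default if there is none."""
    for value in values:
        text = value.strip()
        if text:
            return text
    return default


# Plain lists stand in for the result of dd.css(...).getall()
print(first_text(["\n    Starring: Actor A, Actor B\n  "]))
print(first_text([], default="N/A"))
```

In the loop over the dd elements it could be used as first_text(dd.css('p.star::text').getall()).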
Then we put the code together:
import requests
import time
import parsel
import threading

# Base URL of the board; the page is selected with the "offset" query parameter
url_main = "https://maoyan.com/board/4"
user_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
}


def get_html_content(page):
    # Pass the offset so that each call fetches its own page instead of always the first one
    response_main = requests.get(url_main, params={"offset": page}, headers=user_headers)
    assert response_main.status_code == 200  # raise an AssertionError when the status code is not 200
    html_main = response_main.text
    # print(html_main)
    sel = parsel.Selector(html_main)
    dds = sel.css('dd')  # one dd element per movie
    for dd in dds:
        movie_name = dd.css('p.name a::text').getall()[0]
        movie_star = dd.css('p.star::text').getall()[0].strip()
        movie_release_time = dd.css('p.releasetime::text').getall()[0]
        # The score is split across two i elements (integer and fraction part), so join them
        movie_score = "".join(dd.css('p.score i::text').getall())
        # Store the fields in a dict
        movie_dict = {
            "Title": movie_name,
            "Starring": movie_star,
            "Release date": movie_release_time,
            "Score": movie_score
        }
        print(movie_dict)


if __name__ == '__main__':
    start_time = time.time()
    # Ten pages, with offsets 0, 10, ..., 90
    for page in range(0, 100, 10):
        get_html_content(page)
    print("Total run time: ", time.time() - start_time, 's', sep='')
The code includes a line that reports the run time:
print("Total run time: ", time.time() - start_time, 's', sep='')
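For measuring elapsed time, time.perf_counter() is a monotonic, high-resolution clock and is generally a better fit than time.time(). Here is a small sketch, assuming you want to time several blocks the same way. timed is a hypothetical helper, and the time.sleep call only stands in for the scraping loop.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print how long the enclosed block took (hypothetical helper)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so the time is always reported
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.2f}s")


with timed("Total run time"):
    time.sleep(0.1)  # placeholder for the loop over get_html_content(page)
```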
Running it, the total run time comes to a little over 2.4 seconds.
Next we switch to multiple threads and measure the time again:
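if __name__ == '__main__':
    start_time = time.time()
    threads = []
    for page in range(0, 100, 10):
        # Start one thread per page: ten pages, ten threads
        thread_tem = threading.Thread(target=get_html_content, args=(page,))
        thread_tem.start()
        threads.append(thread_tem)
    # Wait for every thread to finish; join() blocks without spinning in a busy loop
    for thread_tem in threads:
        thread_tem.join()
    print("Total run time: ", time.time() - start_time, 's', sep='')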
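The total run time is now only 0.7 seconds.
Comparing the two runs (about 2.4 s versus 0.7 s), the multithreaded version is more than three times as fast.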
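Starting one thread per page works for ten pages, but a thread pool caps the number of concurrent requests and also collects return values. Below is a minimal sketch using the standard library's concurrent.futures.ThreadPoolExecutor, which is a swapped-in alternative and not what the script above uses. fake_fetch is a hypothetical stand-in that sleeps instead of making a request, so the example runs without network access, and the pool size of 5 is an arbitrary assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(page):
    """Stand-in for get_html_content: sleeps to simulate network latency."""
    time.sleep(0.05)
    return page


def run_sequential(pages):
    """Fetch the pages one after another."""
    return [fake_fetch(page) for page in pages]


def run_threaded(pages, max_workers=5):
    """Fetch the pages with a bounded pool of worker threads."""
    # map() yields results in the same order as the input pages
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_fetch, pages))


if __name__ == "__main__":
    pages = list(range(0, 100, 10))
    for runner in (run_sequential, run_threaded):
        start = time.perf_counter()
        results = runner(pages)
        elapsed = time.perf_counter() - start
        print(f"{runner.__name__}: {len(results)} pages in {elapsed:.2f}s")
```

The speedup comes from overlapping the time each thread spends waiting on I/O, which is why threads help here even though the work per page is small.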
Source code: https://github.com/LesterZoeyXu/pachong
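If you are interested in Python, you can also follow me on Jianshu and on my WeChat official account. If you would like Python or web-scraping e-books, follow the WeChat official account and reply “python电子书” in its chat.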