Having just learned Python, I scraped the reviews of "Wolf Warrior 2" (《战狼2》) on Douban to get some practice. The whole workflow is wrapped in a single class to keep things tidy.
First, build the list of review listing-page URLs.
```python
def Url(self):
    # Each listing page shows 20 reviews, so page i starts at offset (i - 1) * 20.
    url_duan = "https://movie.douban.com/subject/26363254/reviews?start="
    start_urls = [url_duan + str((i - 1) * 20) for i in range(1, 343)]  # 342 pages in total
    for url in start_urls:
        self.spider(url)
```
Note: there are 342 review listing pages in total, as shown in the screenshot below:
![](https://img.haomeiwen.com/i2698797/132c50d46e7f6a98.png)
Looking at the URLs, taking each page number minus one and multiplying by 20 gives the start offset, which is enough to assemble every listing-page URL.
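As a quick sanity check, the first few listing-page URLs generated by that rule look like this:

```python
# The first three listing-page URLs produced by the (page - 1) * 20 rule.
base = "https://movie.douban.com/subject/26363254/reviews?start="
print([base + str((i - 1) * 20) for i in range(1, 4)])
# ['...reviews?start=0', '...reviews?start=20', '...reviews?start=40']
```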
Next, find the address of each individual review.
```python
import random
import time
import urllib.request
from bs4 import BeautifulSoup

def spider(self, url):
    time.sleep(2)  # pause between listing pages so the site is not hit too fast
    # urllib's ProxyHandler only understands HTTP(S) proxies, so the proxy is
    # registered under the 'https' scheme that the Douban URLs use.
    proxy_support = urllib.request.ProxyHandler({'https': random.choice(self.iplist)})
    opener = urllib.request.build_opener(proxy_support)
    # Attach the browser-like headers from self.data() to the request.
    # I have not yet worked out how to verify that the proxy really works
    # (a possible check is sketched below).
    req = urllib.request.Request(url, headers=self.data())
    html = opener.open(req)
    htm = html.read().decode("utf-8")
    soup = BeautifulSoup(htm, "lxml")
    # Every review title on the listing page is an <a class="title-link">
    # pointing at the full review.
    reviewer_urls = soup.find_all('a', 'title-link')
    reviewer_url = [a['href'] for a in reviewer_urls]
    for link in reviewer_url:
        self.spider_1(link)
```
For each listing page, the links to the individual reviews are extracted and stored in reviewer_url. A proxy is used to improve the chances of fetching the pages successfully.
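How to confirm that requests really go through the proxy is left open in the comment above. One rough check, shown here only as a sketch (httpbin.org is an assumption, any IP-echo service would do), is to ask for the apparent origin IP through the same kind of opener:

```python
import random
import urllib.request

def check_proxy(proxy):
    # Sketch only: return the origin IP seen by the remote side when going
    # through `proxy` (ip:port). If the proxy is in effect, this should be the
    # proxy's address rather than your own.
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({'http': proxy, 'https': proxy}))
    with opener.open('https://httpbin.org/ip', timeout=10) as resp:
        return resp.read().decode('utf-8')

# e.g. print(check_proxy(random.choice(iplist)))
```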
```python
import random

def data(self):
    # Build request headers that make the request look like an ordinary
    # browser visit; the User-Agent is picked at random from the pool in __init__.
    ua = random.choice(self.user_agent_list)
    headers = {
        "Host": "movie.douban.com",
        "User-Agent": ua,
    }
    return headers
```
Simulate a browser: the headers returned here are attached to every request made in spider and spider_1.
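A quick way to see exactly which headers will be sent is to attach them to a Request and inspect it; s below stands for an instance of the spider class:

```python
# Inspect the headers a listing-page request would carry (s is a class instance).
req = urllib.request.Request(
    "https://movie.douban.com/subject/26363254/reviews?start=0",
    headers=s.data())
print(req.header_items())  # e.g. [('Host', 'movie.douban.com'), ('User-agent', '...')]
```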
```python
def __init__(self):
    # Pool of desktop browser User-Agent strings; data() picks one at random.
    self.user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
    # Fill in your own proxy IPs here, in ip:port format, e.g. 127.0.0.1:30.
    self.iplist = []
```
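If the proxies are kept in a local text file, one ip:port per line, the list can be filled in inside __init__ like this (the file name proxies.txt is only an example):

```python
# Hypothetical: load proxies (one ip:port per line) from a local file in __init__.
with open('proxies.txt', encoding='utf-8') as f:
    self.iplist = [line.strip() for line in f if line.strip()]
```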
Now for actually scraping each review's content; images are not collected.
```python
import string

def spider_1(self, url):
    # Same proxy and header setup as spider().
    proxy_support = urllib.request.ProxyHandler({'https': random.choice(self.iplist)})
    opener = urllib.request.build_opener(proxy_support)
    req = urllib.request.Request(url, headers=self.data())
    resp = opener.open(req, timeout=10)
    soup = BeautifulSoup(resp.read().decode('utf-8'), 'lxml')
    # The reviewer's name.
    reviewer = soup.find_all('span', property='v:reviewer')[0].get_text()
    # if not os.path.exists(reviewer + '.txt'):
    #     exit()
    # Strip punctuation so the reviewer's name is safe to use as a file name.
    for ch in string.punctuation:
        reviewer = reviewer.replace(ch, '')
    content = []
    # Method one: I used to rely on find_all alone; here find and find_all are
    # combined, and each result has to be turned into a string before writing.
    content_1 = soup.find('div', property='v:description').find_all('p')
    for p in content_1:
        content.append(p.string)  # p.string is None when the <p> has nested tags
    # Write the review to <reviewer>.txt, one paragraph per line.
    with open(reviewer + '.txt', 'w+', encoding='utf-8') as f:
        for line in content:
            try:
                f.write(line + '\n')
            except TypeError:
                # the paragraph had nested tags, so its .string was None
                f.write('\n')
    # os.mknod('/ping/' + reviewer + '.txt')
```
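Putting it together, the class can be driven like this; the class name WolfWarriorSpider is only a placeholder, the author's actual class is in the zhanlang.py linked below:

```python
# Hypothetical driver, assuming the methods above live in a class
# named WolfWarriorSpider (placeholder; the real class is in zhanlang.py).
if __name__ == '__main__':
    s = WolfWarriorSpider()
    s.Url()  # walk all 342 listing pages and write each review to <reviewer>.txt
```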
Chinese-language documentation round-up
Complete code: https://github.com/Hadesghost/zhanlang2/blob/master/zhanlang.py
re module docs: http://python.usyiyi.cn/translate/python_352/library/re.html
time module docs: http://python.usyiyi.cn/translate/python_352/library/time.html
urllib module docs: http://python.usyiyi.cn/translate/python_352/library/urllib.html
bs4 (Beautiful Soup) docs: http://beautifulsoup.readthedocs.io/zh_CN/latest/
random module docs: http://python.usyiyi.cn/translate/python_352/library/random.html
string module docs: http://python.usyiyi.cn/translate/python_352/library/string.html
os.path module docs: http://python.usyiyi.cn/translate/python_352/library/os.path.html
Other Chinese-language documentation can be found at:
https://readthedocs.org/
http://python.usyiyi.cn/translate/python_352/library/index.html#library-index
http://docs.pythontab.com/