I couldn't include code in my previous post, so I'm posting this again.
2018/5/7
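The code the author posted looked very simple. Great, that suits a lazy beginner like me! But the problem turned out not to be that simple!
Original post: https://www.jianshu.com/p/ea0b56e3bd86
Python version: 2.7.13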
2018/5/8
2018/5/9
After reading up on how a more experienced developer handled it, the encoding problem got better.
Link: https://blog.csdn.net/gyafdxis/article/details/77923516
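This note does not say exactly what changed. Judging by the final code further down, the fix comes down to encoding unicode strings to UTF-8 bytes before storing them. Here is a minimal sketch of that idea. The helper name `to_utf8_bytes` is hypothetical, and treating a missing value (`None`) as an empty byte string is my own assumption, not something the original code does.

```python
# A hypothetical helper, not part of the original script.
# type(u'') is `unicode` on Python 2 and `str` on Python 3.
text_type = type(u'')


def to_utf8_bytes(value):
    """Return value as UTF-8 encoded bytes."""
    if value is None:
        # Assumption: treat missing text as an empty string.
        return b''
    if isinstance(value, text_type):
        return value.encode('utf-8')
    # Already a byte string: pass it through unchanged.
    return value


encoded = to_utf8_bytes(u'\u5f71\u8bc4')  # the two characters of "film review"
```

With a helper like this, the two `if type(...) == unicode` branches in the scraper could each collapse into a single call.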
I borrowed the data-storage approach from this expert (https://www.jianshu.com/p/d1bf2f0bdc51), and it finally worked!!
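The storage step in the final code writes a UTF-8 byte order mark (BOM) at the start of the CSV file, which is commonly what lets Excel show Chinese text correctly. For anyone on Python 3 rather than the 2.7 used here, a rough equivalent might look like the sketch below. The function name `save_rows` and its parameters are hypothetical, and the `utf-8-sig` codec stands in for the manual `codecs.BOM_UTF8` write.

```python
import csv


def save_rows(path, rows, header=('name', 'score', 'comment')):
    """Write rows to a CSV file that starts with a UTF-8 BOM (Python 3 only)."""
    # 'utf-8-sig' emits the BOM before the first character;
    # newline='' lets the csv module control line endings itself.
    with open(path, 'w', encoding='utf-8-sig', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
```

Calling `save_rows('DBtest.csv', data)` would then replace the `with open(...)` block at the end of the script, assuming `data` holds text strings rather than encoded bytes.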
2018/5/10
Today's task: refactor the code into a function.
The final code after all the changes:
import requests
from lxml import etree
import pandas as pd
import time
import random
from tqdm import tqdm
import csv
import codecs
'''
import sys
reload(sys)
sys.setdefaultencoding('utf8')
'''
data = []

def getyp(page):
    url = 'https://movie.douban.com/subject/6390825/comments?start=%d&limit=20&sort=new_score&status=P&percent_type=' % (page * 20)
    response = requests.get(url)
    response.encoding = 'utf-8'
    #print (response.content)
    response = etree.HTML(response.content)
    print (url)
    for i in range(1, 21):  # each page shows 20 comments
        # User name. xpath() returns a list; on each iteration it holds a single element.
        name1 = response.xpath('//*[@id="comments"]/div[%d]/div[2]/h3/span[2]/a' % (i))
        # Rating
        score1 = response.xpath('//*[@id="comments"]/div[%d]/div[2]/h3/span[2]/span[2]' % (i))
        # Review text
        comment1 = response.xpath('//*[@id="comments"]/div[%d]/div[2]/p' % (i))
        # Skip this slot if the page has fewer than 20 comments or the layout differs,
        # instead of failing with an IndexError.
        if not (name1 and score1 and comment1):
            continue
        if type(name1[0].text) == unicode:
            name_element = name1[0].text.encode('utf-8')
        else:
            name_element = name1[0].text
        # Read the class attribute and take its 8th character
        score_element = score1[0].attrib['class'][7]
        if type(comment1[0].text) == unicode:
            comment_element = comment1[0].text.encode('utf-8')
        else:
            comment_element = comment1[0].text
        print type(comment_element)
        data.append([name_element, score_element, comment_element])

# Crawl 2 pages. Note that page 1 maps to start=20, so the first 20 comments
# (start=0) are not fetched; use range(0, 2) to include them.
for i in tqdm(range(1, 3)):
    getyp(i)
    time.sleep(random.uniform(6, 9))  # random pause between requests

# Python 2: open in binary mode and write the UTF-8 BOM first
with open("DBtest.csv", "wb") as f:
    f.write(codecs.BOM_UTF8)
    writer = csv.writer(f)
    writer.writerow(['name', 'score', 'comment'])
    for k in data:
        writer.writerow(k)
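
One fragile spot in the script is the rating: it takes the 8th character of the `class` attribute, which only works if the class looks like `allstar40 rating`. If that span carries any other class, the character at index 7 is not a rating at all. A slightly more defensive sketch, assuming the class really follows an `allstarN0` pattern (the name `parse_stars` and the sample class strings are hypothetical):

```python
import re

# Assumption: rated comments carry a class such as "allstar40 rating",
# where the digit after "allstar" is the number of stars.
_STAR_RE = re.compile(r'allstar(\d)0')


def parse_stars(class_attr):
    """Return the star count as an int, or None if no rating is present."""
    match = _STAR_RE.search(class_attr or '')
    return int(match.group(1)) if match else None


stars = parse_stars('allstar40 rating')
missing = parse_stars('some-other-class')
```

In the scraper, this could replace `score1[0].attrib['class'][7]` with `parse_stars(score1[0].attrib.get('class'))`, leaving an empty cell in the CSV for comments without a rating.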