2019-04-29

Author: Emily0917 | Published 2019-04-29 11:53

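Scraping Toutiao street-snap data: the slider-CAPTCHA anti-scraping measure

Page being scraped: https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D

Toutiao is a site whose content is loaded dynamically with JavaScript.

I started with the requests library, scraping through the API, but found that the request URL carries a timestamp parameter. After some searching I guessed that this is one of Toutiao's newer anti-scraping measures (a beginner's guess!). I had never run into this before and could not get past it, so I switched to automating a browser with the selenium library.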
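For reference, the timestamp value in that URL looks like a Unix time in milliseconds. The sketch below shows one way to rebuild the query string with a fresh value. The parameter names are copied from the API URL used later in this post; `build_search_url` and its `now` argument are hypothetical helpers, and whether the server actually validates the timestamp is an assumption that has not been verified here.

```python
import time
from urllib.parse import urlencode

API = "https://www.toutiao.com/api/search/content/"


def build_search_url(offset, keyword="街拍", now=None):
    """Build the search API URL with a millisecond timestamp.

    `now` (seconds since the epoch) is a hypothetical hook so the
    timestamp can be pinned; by default the current time is used.
    """
    seconds = time.time() if now is None else now
    params = {
        "aid": 24,
        "app_name": "web_search",
        "offset": offset,
        "format": "json",
        "keyword": keyword,  # urlencode percent-encodes the UTF-8 bytes
        "autoload": "true",
        "count": 20,
        "en_qc": 1,
        "cur_tab": 1,
        "from": "search_tab",
        "pd": "synthesis",
        "timestamp": int(seconds * 1000),  # milliseconds, as in the original URL
    }
    return API + "?" + urlencode(params)


print(build_search_url(0))
```

A fresh timestamp alone may not be enough; the server can also check cookies or other headers, which is why the rest of this post drives a real browser instead.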

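Next I scraped the page with the selenium library.

The result was disappointing: a CAPTCHA appeared. This is presumably Toutiao's anti-scraping measure, and the data I want only becomes reachable once the CAPTCHA is solved.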

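Slider CAPTCHA

This was my first encounter with an anti-scraping CAPTCHA. After a round of searching and reading, I settled on the following approach.

The CAPTCHA pops up on its own, so its image can be captured directly.

Step 1: Capture the image without the gap (the untouched CAPTCHA).

Step 2: Capture the image with the gap.

Step 3: Compare the two images pixel by pixel; the x coordinate of the first differing pixel is the distance to drag.

Step 4: Imitate a human (uniform acceleration, then uniform deceleration) by splitting the drag distance into a series of small track segments.

Step 5: Perform the drag along those segments to pass verification.

Step 6: Fetch the data.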
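Step 3 can be tried out without a browser. The sketch below is a minimal illustration on synthetic data, not the code used against the real page: images are plain nested lists of RGB tuples rather than PIL objects, and `find_gap_x` is a hypothetical name. It scans column by column and returns the x of the first pixel whose difference reaches the threshold in any channel.

```python
def find_gap_x(full, notched, start=0, threshold=60):
    """Return the first x where the two images differ noticeably, else None.

    `full` and `notched` are lists of rows, each row a list of (r, g, b)
    tuples. A pixel counts as different when any channel differs by at
    least `threshold`.
    """
    height, width = len(full), len(full[0])
    for x in range(start, width):          # scan left to right
        for y in range(height):            # check every pixel in the column
            p, q = full[y][x], notched[y][x]
            if any(abs(a - b) >= threshold for a, b in zip(p, q)):
                return x
    return None


# Synthetic 10x4 "screenshots": a uniform grey image and a copy with a
# dark 2x2 notch whose left edge is at x = 6.
full = [[(200, 200, 200)] * 10 for _ in range(4)]
notched = [row[:] for row in full]
for y in (1, 2):
    for x in (6, 7):
        notched[y][x] = (40, 40, 40)

print(find_gap_x(full, notched))
```

Scanning columns first matters: the left edge of the gap is the quantity needed, so the outer loop runs over x.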

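from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from PIL import Image
from io import BytesIO  # imported in the original, not used below
import time
import re
import json
from bs4 import BeautifulSoup  # imported in the original, not used below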

def get_snap(driver):
    '''
    Screenshot the whole page, save it to disk, and open it with PIL.Image.

    :return: the screenshot as an image object
    '''
    driver.get_screenshot_as_file('snap.png')
    page_snap_obj = Image.open('snap.png')
    return page_snap_obj

def get_image(wait, driver):
    '''
    Crop the CAPTCHA image out of a full-page screenshot.

    :return: the CAPTCHA image
    '''
    img = wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'validate-main')))
    time.sleep(2)  # give the image time to finish rendering
    print(img)  # debug output
    location = img.location
    size = img.size
    # Bounding box of the CAPTCHA element within the page
    top = location['y']
    bottom = location['y'] + size['height']
    left = location['x']
    right = location['x'] + size['width']
    page_snap_obj = get_snap(driver)
    crop_imag_obj = page_snap_obj.crop((left, top, right, bottom))
    return crop_imag_obj

def get_distance(image1, image2):
    '''
    Work out how far the slider has to move.

    :param image1: image without the gap
    :param image2: image with the gap
    :return: distance to move, in pixels
    '''
    start = 57      # x at which the scan starts (value from the original code)
    threshold = 60  # per-channel difference that counts as "different"
    # Load the pixel accessors once instead of once per pixel
    pixels1 = image1.load()
    pixels2 = image2.load()
    for i in range(start, image1.size[0]):
        for j in range(image1.size[1]):
            rgb1 = pixels1[i, j]
            rgb2 = pixels2[i, j]
            res1 = abs(rgb1[0] - rgb2[0])
            res2 = abs(rgb1[1] - rgb2[1])
            res3 = abs(rgb1[2] - rgb2[2])
            # print(res1, res2, res3)
            if not (res1 < threshold and res2 < threshold and res3 < threshold):
                return i - 7  # 7 px correction used in the original code

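def get_tracks(distance):
    '''
    Build the movement track, imitating a human drag:
    uniform acceleration first, then uniform deceleration.

    Basic formulas for uniformly accelerated motion:
    ①v=v0+at
    ②s=v0t+½at²
    ③v²-v0²=2as

    :param distance: total distance to move
    :return: list of the distances covered in each 0.3 s interval
    '''
    # Current velocity (starts at rest)
    v = 0
    # Time step: each track entry is the displacement during 0.3 s
    t = 0.3
    # Track list; one element per 0.3 s step
    tracks = []
    # Distance covered so far
    current = 0
    # Start decelerating once mid is reached
    mid = distance * 4 / 5
    while current < distance:
        if current < mid:
            # A smaller acceleration gives smaller steps, hence a longer, finer track
            a = 2
        else:
            a = -3
        # Velocity at the start of this step
        v0 = v
        # Displacement during this 0.3 s step
        s = v0 * t + 0.5 * a * (t ** 2)
        # Update the current position
        current += s
        # Append to the track (rounded, so sum(tracks) may differ slightly from distance)
        tracks.append(round(s))
        # Velocity at the end of this step becomes the next step's initial velocity
        v = v0 + a * t
    return tracks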

def main():
    driver = webdriver.Chrome()
    driver.get('https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D')
    wait = WebDriverWait(driver, 20)
    # Step 2: capture the image without the gap
    image1 = get_image(wait, driver)
    # Step 3: click the drag button so the image with the gap appears
    button = wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'validate-drag-button')))
    button.click()
    # Step 4: capture the image with the gap
    image2 = get_image(wait, driver)
    print(image1, image1.size)
    print(image2, image2.size)
    # Step 5: compare the RGB pixels of both images; the x of the first differing pixel is the distance to move
    distance = get_distance(image1, image2)
    print(distance)
    # Step 6: imitate a human drag (accelerate, then decelerate) by splitting the distance into small segments
    tracks = get_tracks(distance)
    print(tracks)
    print(image1.size)
    print(distance, sum(tracks))
    # Step 7: drag along the track to complete verification
    # (the original used find_elements_by_class_name('ovalidate-drag-button'), which
    # returns a list and misspells the class name; click_and_hold needs a single element)
    button = driver.find_element(By.CLASS_NAME, 'validate-drag-button')
    ActionChains(driver).click_and_hold(button).perform()
    for track in tracks:
        ActionChains(driver).move_by_offset(xoffset=track, yoffset=0).perform()
    ActionChains(driver).move_by_offset(xoffset=3, yoffset=0).perform()   # overshoot slightly
    ActionChains(driver).move_by_offset(xoffset=-3, yoffset=0).perform()  # then pull back; a bit more human-like
    time.sleep(0.5)  # release the mouse after 0.5 s
    ActionChains(driver).release().perform()
    shixian(driver)

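That is the verification code (to be honest, I don't fully understand the algorithm myself). It finishes by calling shixian(), which scrapes the dynamically loaded JSON pages.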
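The part of the algorithm that is easiest to misread is the track generation in get_tracks, so here is a standalone sketch of the same idea. Each step applies s = v0·t + ½·a·t² and v = v0 + a·t, accelerating over the first 4/5 of the distance and decelerating after that. The name `build_tracks` and the rounding scheme are assumptions added for illustration: instead of rounding each step on its own, it rounds the cumulative position and caps it at the target, so the integer steps always add up to exactly the requested distance.

```python
def build_tracks(distance, t=0.3, accel=2.0, decel=-3.0, split=0.8):
    """Split an integer `distance` into integer steps that sum to it exactly."""
    tracks = []
    current = 0.0   # exact (float) position so far
    v = 0.0         # current velocity
    moved = 0       # sum of the integer steps emitted so far
    while current < distance:
        # Accelerate until `split` of the way there, then decelerate
        a = accel if current < split * distance else decel
        current += v * t + 0.5 * a * t * t   # s = v0*t + 1/2*a*t^2
        v += a * t                           # v = v0 + a*t
        # Emit the difference between rounded cumulative positions,
        # never going past the target
        target = min(round(current), distance)
        tracks.append(target - moved)
        moved = target
    return tracks


steps = build_tracks(100)
print(len(steps), sum(steps))
```

With these accelerations the slider is still moving forward when it reaches the target: at the 4/5 point v² = 2·2·0.8d = 3.2d, and stopping at a = -3 would take v²/6 ≈ 0.53d, more than the remaining 0.2d.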

def shixian(driver):
    # offset advances by 20 per page, so start at offset=0 (page 1) and step by 20 up to 1000.
    # The data runs out long before 1000 (there were only 7 pages, i.e. offset=140),
    # so stop as soon as a page comes back empty instead of crashing.
    for j in range(0, 1000, 20):
        # Toutiao's JSON endpoint; all the data we want is in this response
        url = "https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search&offset={}&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20&en_qc=1&cur_tab=1&from=search_tab&pd=synthesis&timestamp=1556371760933".format(j)
        driver.get(url=url)
        # page_source is not a clean JSON string: the browser wraps it in HTML tags,
        # so passing it straight to json.loads raises an error.
        # (The page source is long; most of it is not shown here.)
        text = driver.page_source
        # Regular expression matching the leading HTML wrapper
        pattern1 = re.compile(r'<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">')
        # re.sub replaces the match with '', which strips the leading tags
        out1 = re.sub(pattern1, '', text)
        # print(out1)
        # Strip the trailing tags the same way
        pattern = re.compile(r'</pre></body></html>')
        data = re.sub(pattern, '', out1)
        datad = json.loads(data)
        # print(datad)
        shuju = datad.get('data')  # pick out the contents of 'data'
        if not shuju:
            break  # no more data
        for i in shuju:
            # 'data' is a list of small dicts inside the big dict, so loop over them
            # and read each key with dict.get(key, default); the default '' is returned
            # when the key is missing, so the program does not stop with an error.
            print(i.get('abstract', ''))
            print(i.get('image_list', ''))


if __name__ == '__main__':
    main()
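The two exact-match patterns above break as soon as the browser changes the wrapper markup. A more tolerant variant is sketched below under stated assumptions: the JSON is taken to sit inside a single pre element, and `extract_json`, `iter_items`, and `fake_fetch` are hypothetical names. `fake_fetch` stands in for `driver.get(url)` followed by `driver.page_source`, so the sketch runs without a browser.

```python
import html
import json
import re

# Capture whatever sits between <pre ...> and </pre>, whatever the attributes
PRE_RE = re.compile(r"<pre[^>]*>(.*?)</pre>", re.S)


def extract_json(page_source):
    """Pull the JSON payload out of the HTML that the browser wraps around it."""
    match = PRE_RE.search(page_source)
    raw = match.group(1) if match else page_source
    # Serialized page source escapes &, < and >, so undo that before parsing
    return json.loads(html.unescape(raw))


def iter_items(fetch_page, step=20, max_offset=1000):
    """Yield items page by page, stopping at the first empty page."""
    for offset in range(0, max_offset, step):
        payload = extract_json(fetch_page(offset))
        items = payload.get("data") or []
        if not items:
            break
        yield from items


def fake_fetch(offset):
    """Stand-in for the browser: two pages of data, then an empty page."""
    data = [{"abstract": f"item {offset}"}] if offset < 40 else []
    body = html.escape(json.dumps({"data": data}), quote=False)
    return (
        '<html><head></head><body><pre style="word-wrap: break-word;">'
        f"{body}</pre></body></html>"
    )


for item in iter_items(fake_fetch):
    print(item.get("abstract", ""))
```

In the real script, `fetch_page` would be a small function that calls `driver.get` with the formatted URL and returns `driver.page_source`.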