RPA: Cracking a Slider CAPTCHA with pyautogui + PIL + pandas

Author: 小螺旋丸 | Published: 2019-08-14 14:12

    Introduction

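      Our company recently got a new requirement. The rough flow: open the Tianjin Market Entity Credit Information Publicity System (天津市市场主体信用信息公示系统), look up each company's shareholders by the company name or tax number listed in an Excel sheet, take the tax numbers of the shareholders found, query each shareholder's own shareholders in turn, and finally write the results back to Excel.
      Read Excel → query the company's shareholders → get the shareholders' tax numbers → query each shareholder's shareholders by tax number → write the results to Excel. Tedious, isn't it? In one sentence: look up the shareholders of the shareholders and record the results in Excel. When I first heard the requirement it seemed feasible in theory. The only real difficulty was the slider CAPTCHA; once that is cracked, the rest is just extracting data from web pages.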
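      The flow above boils down to a two-level lookup. Below is a minimal sketch, assuming a hypothetical `lookup` callable that stands in for the whole browser query (search, CAPTCHA, page scraping) and returns the tax numbers of a company's shareholders. Here it is faked with a dict, and every name in the block is illustrative rather than taken from the project:

```python
def collect_two_levels(tax_no, lookup):
    """Return rows of (company, shareholder, shareholder_of_shareholder).

    `lookup` is a hypothetical stand-in for the real browser query: it takes
    a tax number and returns a list of shareholder tax numbers.
    """
    rows = []
    for holder in lookup(tax_no):
        second_level = lookup(holder)
        if not second_level:
            # Keep the first-level shareholder even if it has no shareholders
            rows.append((tax_no, holder, None))
        for sub_holder in second_level:
            rows.append((tax_no, holder, sub_holder))
    return rows


# Fake registry used only for illustration
FAKE_REGISTRY = {"A": ["B", "C"], "B": ["D"], "C": []}

rows = collect_two_levels("A", lambda no: FAKE_REGISTRY.get(no, []))
print(rows)
```

      Keeping the lookup behind a plain callable makes it possible to test the traversal without opening a browser.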

    Cracking the Slider CAPTCHA

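      Without further ado, let's write a crawler. Because of the slider CAPTCHA, the only option is a browser-driven crawler. It is slower, but it can scrape anything, and it spares me from capturing packets and analyzing request data, which I find truly nauseating. Here I use the "艺赛旗 RPA 设计器" (the 艺赛旗 RPA designer) to help. I have to say it works really well: its Python library is powerful, and once the flow is designed it generates the Python code automatically, so I only need to care about the core algorithms and the business logic.
      First, a look at the CAPTCHA image: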

    image

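      This is the fairly common GeeTest ("极验") CAPTCHA, which many websites use. The government site is clearly a bit behind, though: GeeTest 3.0 is already out, while this site is still on 2.0. The difference is that 2.0 first shows the complete image and only shows the image with the gap after you click the slider button, whereas 3.0 shows the image with the gap from the start. It can still be cracked.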

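    • Cracking 2.0:
      Take a screenshot of the original image and of the image with the gap, then compare the two pixel by pixel to work out how far the slider must move. The accuracy can reach 80%.
    • Cracking 3.0:
      1. Some sites use CSS to hide the slider. Find the element, run a JS script from Python to remove display:none, and the rest is the same as 2.0.
      2. Take a screenshot of the image with the gap directly. The RGB values in the gap are usually all below 150, which is enough to locate it.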
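      The 2.0 comparison can be tried out without any image library. Below is a minimal sketch, assuming each image is a plain list of rows of (R, G, B) tuples rather than a PIL image. `find_gap_x`, the threshold of 60 and the synthetic images are illustrative assumptions, not part of the original code:

```python
def find_gap_x(full, gapped, threshold=60, start_x=0):
    """Return the first column where the two images differ noticeably.

    Both images are lists of rows, each row a list of (R, G, B) tuples.
    Columns are scanned left to right. Returns None when no pixel differs
    by at least `threshold` on any channel.
    """
    height = len(full)
    width = len(full[0])
    for x in range(start_x, width):
        for y in range(height):
            a, b = full[y][x], gapped[y][x]
            if any(abs(ca - cb) >= threshold for ca, cb in zip(a, b)):
                return x
    return None


# Synthetic 20x5 "screenshots": light background, dark gap at x = 12..15
WIDTH, HEIGHT = 20, 5
full_img = [[(200, 200, 200)] * WIDTH for _ in range(HEIGHT)]
gap_img = [row[:] for row in full_img]
for y in range(1, 4):
    for x in range(12, 16):
        gap_img[y][x] = (40, 40, 40)

gap_x = find_gap_x(full_img, gap_img)
print(gap_x)
```

      Scanning column by column means the first hit is the left edge of the gap, which is the value the drag distance is derived from.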

      Here we take 2.0 as the example; I will also post the core code for 3.0. First, the steps for cracking 2.0:

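    Click search --> capture the CAPTCHA image --> click the slider button --> capture the image with the gap --> compare pixels to compute the offset --> drag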
    

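      Because we use the RPA designer, we don't have to write the code for mouse clicks, screenshots and the like ourselves. Pick the matching component, click the element on the page, and the designer generates the Python code for us. The generated code is highly encapsulated, of course, but the source can be inspected at any time, and underneath it is the same familiar machinery. The only parts we write by hand are the offset calculation and the mouse movement. The designer does ship a mouse-drag component, but its drag is too straight and uniform, so it gets detected and the CAPTCHA reports "被怪物吃掉" ("eaten by a monster"). I therefore modified its source slightly and wrapped my own method. First, the flowchart for CAPTCHA recognition: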


    image
    image
    image
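      The reason a straight, uniform drag gets detected is that it does not look like a human hand. Below is a minimal sketch of one way to build a less mechanical path: ease in and out, overshoot the target slightly, then settle back. The function names, step count and overshoot size are assumptions for illustration. The sketch only computes x positions; feeding them to the mouse (for example with pyautogui.moveTo while the button is held down) is left out.

```python
def ease_in_out_quad(t):
    """Quadratic ease-in-out: slow start, fast middle, slow end (t in [0, 1])."""
    if t < 0.5:
        return 2 * t * t
    return 1 - 2 * (1 - t) * (1 - t)


def build_track(start_x, distance, steps=20, overshoot=6):
    """Return x positions that ease towards the target, overshoot, then settle.

    All parameters are illustrative assumptions; tune them per site.
    """
    target = start_x + distance
    peak = target + overshoot
    track = []
    for i in range(1, steps + 1):
        t = i / steps
        track.append(round(start_x + (peak - start_x) * ease_in_out_quad(t)))
    # Come back from the overshoot to the real target in small steps
    for back in range(overshoot - 2, -1, -2):
        track.append(target + back)
    if track[-1] != target:
        track.append(target)
    return track


track = build_track(start_x=847, distance=53)
print(track)
```

      The track ends exactly on the target, so the overshoot changes only the shape of the movement, not the final position.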

      Once the flowchart is designed, the designer generates the code for us automatically. The code is as follows:

    # coding=utf-8
    # Build date: 2019-08-14 10:09:34
    import time
    import pdb
    from ubpa.ilog import ILog
    from ubpa.base_img import *
    import getopt
    from sys import argv
    import sys
    from ubpa.itools import rpa_import
    GlobalFun = rpa_import.import_global_fun(__file__)
    import ubpa.ibox as ibox
    import ubpa.iexcel as iexcel
    import ubpa.ifile as ifile
    import ubpa.iie as iie
    import ubpa.iimg as iimg
    import ubpa.ikeyboard as ikeyboard
    
    class getTjInfo:
         
        def __init__(self,**kwargs):
            self.__logger = ILog(__file__)
            self.path = set_img_res_path(__file__)
            self.robot_no = ''
            self.proc_no = ''
            self.job_no = ''
            if('robot_no' in kwargs.keys()):
                self.robot_no = kwargs['robot_no']
            if('proc_no' in kwargs.keys()):
                self.proc_no = kwargs['proc_no']
            if('job_no' in kwargs.keys()):
                self.job_no = kwargs['job_no']
        # CAPTCHA recognition
        def checkCode(self):
            existFlg=None
            distance=None
            xy=None
            imageTwo=None
            imageOne=None
            # Capture screenshot
            self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530184,Note:')
            imageOne = iimg.capture_image(win_title=r'天津市市场主体',win_text=r'',in_img_path=r'C:/Users/Administrator/Desktop/',left_indent=823,top_indent=521,width=266,height=121,waitfor=30)
            # Mouse click
            self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530183,Note:')
            iie.do_click_pos(win_title=r'天津市市场主体',url=r'http://credit.scjg.tj.gov.cn/gsxt/',selector=r'.gt_holder gt_popup gt_show > DIV:nth-of-type(2) > DIV:nth-of-type(2) > DIV:nth-of-type(2) > DIV:nth-of-type(2)',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=10,scroll_view='no')
            time.sleep(4)
            # Capture screenshot
            self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530186,Note:')
            imageTwo = iimg.capture_image(win_title=r'天津市市场主体',win_text=r'',in_img_path=r'C:/Users/Administrator/Desktop/',left_indent=823,top_indent=521,width=266,height=121,waitfor=30)
            # Custom function
            self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530182,Note:')
            distance = GlobalFun.get_distance(imageOne,imageTwo)
            # Get element position
            self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530181,Note:')
            xy = iie.get_element_rect(win_title=r'天津市市场主体',url=r'http://credit.scjg.tj.gov.cn/gsxt*',selector=r'.gt_holder gt_popup gt_show > DIV:nth-of-type(2) > DIV:nth-of-type(2) > DIV:nth-of-type(2) > DIV:nth-of-type(2)',curson=r'center',waitfor=10)
            # Code block
            self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530180,Note:')
            print(xy)
            lastxy=(xy[0]+distance,xy[1],xy[2],xy[3])
            print(lastxy)
            if(xy==(847.0, 682.0, 44, 44) and lastxy==(900.0, 682.0, 44, 44)):
              print('修正')
              lastxy=(895.0, 682.0, 44, 44)
            if(xy==(847.0, 682.0, 44, 44) and lastxy==(976.0, 682.0, 44, 44)):
              print('修正')
              lastxy=(868.0, 682.0, 44, 44)
            # Custom function
            self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530185,Note:')
            GlobalFun.myDo_drag_to(win_title=r'天津市市场主体', srcpos=xy,distpos=lastxy)
            # Delete file
            self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530187,Note:')
            ifile.del_file(file=imageOne)
            # Delete file
            self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530188,Note:')
            ifile.del_file(file=imageTwo)
            # Image detection
            self.__logger.debug('Flow:checkCode,StepNodeTag:13140204311151,Note:')
            time.sleep(3.5)
            existFlg = iimg.img_exists(win_title=r'天津市市场主体',img_res_path=self.path,image=r'snapshot_20190813135330024.png',fuzzy=True,confidence=0.85,waitfor=3)
            # IF branch
            self.__logger.debug('Flow:checkCode,StepNodeTag:13140549531176,Note:')
            if existFlg:
                # Message box
                self.__logger.debug('Flow:checkCode,StepNodeTag:13143951406201,Note:')
                ibox.msg_box(msg='验证失败,重试!',timeout=1.5)
                time.sleep(1)
                # Mouse click
                self.__logger.debug('Flow:checkCode,StepNodeTag:13140738964184,Note:')
                iie.do_click_pos(win_title=r'天津市市场主体',url=r'http://credit.scjg.tj.gov.cn/gsxt*',selector=r'.gt_holder gt_popup gt_show > DIV:nth-of-type(2) > DIV:nth-of-type(2) > DIV:nth-of-type(1) > DIV:nth-of-type(3) > A:nth-of-type(1)',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=100,scroll_view='no')
                time.sleep(1.5)
                # Return
                self.__logger.debug('Flow:checkCode,StepNodeTag:13140556594179,Note:')
                return True
            else:
                # Return
                self.__logger.debug('Flow:checkCode,StepNodeTag:13140620186183,Note:')
                return False
            # Code block (unreachable: both branches above return)
            self.__logger.debug('Flow:checkCode,StepNodeTag:13141700326199,Note:')
            print(existFlg)
        # Process table data
        def dealTableData(self,tableData=None):
            currentCom=None
            currentTableData=None
            currentComName=None
            # Code block
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13161341316275,Note:')
            columns=tableData.columns
            realDataList=tableData.values.tolist()
            # Hotkey input
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13161638010281,Note:')
            ikeyboard.key_send_cs(win_title=r'天津市市场主体',text='^{F4}',waitfor=10)
            # Hotkey input
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13164935964357,Note:')
            ikeyboard.key_send_cs(text='^{F4}',waitfor=10)
            # Message box
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13161731539284,Note:')
            ibox.msg_box(msg='开始处理二级公司数据',timeout=2)
            time.sleep(0.002)
            # For loop
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13161910442289,Note:')
            for i in range(len(realDataList)):
                # Code block
                self.__logger.debug('Flow:dealTableData,StepNodeTag:13162051185291,Note:')
                currentList=realDataList[i]
                if(columns[0]=='有限责任公司本年度是否有股权转让    '):
                    currentCom=currentList[0]
                    currentComName=currentList[0]
                if(columns[0]=='企业是否有股权信息或购买其它公司股权'):
                    currentCom=currentList[0]
                    currentComName=currentList[1]
                if("天津" not in currentCom):
                    continue
                time.sleep(1)
                # Sub-flow: finishCheckCode
                self.__logger.debug('Flow:dealTableData,StepNodeTag:13161503452279,Note:')
                self.finishCheckCode(comName=currentComName)
                # Sub-flow: goToDetail
                self.__logger.debug('Flow:dealTableData,StepNodeTag:13163507635351,Note:')
                currentTableData=self.goToDetail()
                # Code block
                self.__logger.debug('Flow:dealTableData,StepNodeTag:13170553111368,Note:')
                # drop() returns a new DataFrame, so keep the result instead of discarding it
                table0=currentTableData[0].drop(['变更后股权比例','股权变更日期'], axis=1)
                table1=currentTableData[1].drop(['投资设立企业后购买股权企业名称',r'统一社会信用代码/注册号'], axis=1)
                lastTableData0=table0.values.tolist()
                lastTableData1=table1.values.tolist()
                # Insert rows
                self.__logger.debug('Flow:dealTableData,StepNodeTag:13171433395371,Note:')
                iexcel.ins_row(path='C:/Users/Administrator/Desktop/testData.xlsx',data=lastTableData1)
                # Hotkey input
                self.__logger.debug('Flow:dealTableData,StepNodeTag:13165826952364,Note:')
                ikeyboard.key_send_cs(win_title=r'天津市市场主体',text='^{F4}',waitfor=10)
                # Hotkey input
                self.__logger.debug('Flow:dealTableData,StepNodeTag:13165850702366,Note:')
                ikeyboard.key_send_cs(text='^{F4}',waitfor=10)
        # Complete the verification
        def finishCheckCode(self,comName='911200006630613577'):
            # Open website
            self.__logger.debug('Flow:finishCheckCode,StepNodeTag:1308451082917,Note:')
            iie.open_url(url=r'http://credit.scjg.tj.gov.cn/gsxt/')
            # Mouse click
            self.__logger.debug('Flow:finishCheckCode,StepNodeTag:1308451082912,Note:')
            iie.do_click_pos(win_title=r'天津市市场主体',url=r'http://credit.scjg.tj.gov.cn/gsxt/#',selector=r'http-equiv="x-ua-compatible":nth-of-type(1) > DIV:nth-of-type(3) > DIV:nth-of-type(2) > UL:nth-of-type(1) > LI:nth-of-type(2) > A:nth-of-type(1)',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=300,scroll_view='no')
            time.sleep(0.5)
            # Mouse click
            self.__logger.debug('Flow:finishCheckCode,StepNodeTag:1308451082913,Note:')
            iie.do_click_pos(win_title=r'天津市市场主体',url=r'http://credit.scjg.tj.gov.cn/gsxt/#',selector=r'http-equiv="x-ua-compatible":nth-of-type(1) > DIV:nth-of-type(3) > DIV:nth-of-type(2) > UL:nth-of-type(1) > LI:nth-of-type(2) > A:nth-of-type(1)',button=r'left',curson=r'center',offsetY=45,times=1,run_mode=r'unctrl',waitfor=10,scroll_view='no')
            time.sleep(2)
            # Set text
            self.__logger.debug('Flow:finishCheckCode,StepNodeTag:130845108293,Note:')
            iie.set_text(url=r'http://credit.scjg.tj.gov.cn/gsxt/',selector=r'#searchName',text=comName,waitfor=10)
            # Mouse click
            self.__logger.debug('Flow:finishCheckCode,StepNodeTag:130845108292,Note:')
            iie.do_click_pos(win_title=r'天津市市场主体',url=r'http://credit.scjg.tj.gov.cn/gsxt/',selector=r'#entSearchLink',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=10,scroll_view='no')
            time.sleep(2.5)
            # While loop
            self.__logger.debug('Flow:finishCheckCode,StepNodeTag:13135310584131,Note:')
            while True:
                # Sub-flow: checkCode
                self.__logger.debug('Flow:finishCheckCode,StepNodeTag:13140310308161,Note:')
                tvar13140310308161=self.checkCode()
                # IF branch
                self.__logger.debug('Flow:finishCheckCode,StepNodeTag:13135336032135,Note:')
                if tvar13140310308161:
                    pass
                else:
                    # Break
                    self.__logger.debug('Flow:finishCheckCode,StepNodeTag:13135345176138,Note:')
                    break
        # Get shareholder company information
        def getChildCom(self):
            tableData2=None
            tableData1=None
            table2Columns=None
            table1Columns=None
            tableDatas=None
            # Sub-flow: goToDetail
            self.__logger.debug('Flow:getChildCom,StepNodeTag:13162921190325,Note:')
            tableDatas=self.goToDetail()
            # IF branch
            self.__logger.debug('Flow:getChildCom,StepNodeTag:13145633622210,Note:')
            if tableDatas[0].columns[1]=='否':
                pass
            else:
                # Code block
                self.__logger.debug('Flow:getChildCom,StepNodeTag:13163242189338,Note:')
                tableData1=tableDatas[0]
                # Sub-flow: dealTableData
                self.__logger.debug('Flow:getChildCom,StepNodeTag:13163127828333,Note:')
                self.dealTableData(tableData=tableData1)
            # IF branch
            self.__logger.debug('Flow:getChildCom,StepNodeTag:13145822725214,Note:')
            if tableDatas[1].columns[1]=='否':
                pass
            else:
                # Code block
                self.__logger.debug('Flow:getChildCom,StepNodeTag:13163302199339,Note:')
                tableData2=tableDatas[1]
                # Sub-flow: dealTableData
                self.__logger.debug('Flow:getChildCom,StepNodeTag:13163131389335,Note:')
                self.dealTableData(tableData=tableData2)
            # Message box
            self.__logger.debug('Flow:getChildCom,StepNodeTag:13151905124229,Note:')
            ibox.msg_box(msg='当前企业数据处理完毕,下一个。。',timeout=1.5)
            time.sleep(1.5)
        # Go to the detail page
        def goToDetail(self):
            table2Data=None
            table1Data=None
            # Mouse click
            self.__logger.debug('Flow:goToDetail,StepNodeTag:13162716929301,Note:')
            iie.do_click_pos(win_title=r'天津市市场主体',url=r'http://credit.scjg.tj.gov.cn/gsxt/',selector=r'#center_content > DIV:nth-of-type(1) > DIV:nth-of-type(2) > DIV:nth-of-type(2) > UL:nth-of-type(1) > LI:nth-of-type(1) > H1:nth-of-type(1) > A:nth-of-type(1)',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=20,scroll_view='no')
            time.sleep(1.5)
            # Mouse click
            self.__logger.debug('Flow:goToDetail,StepNodeTag:13162716929300,Note:')
            iie.do_click_pos(win_title=r'天津市市场主体',url=r'http://credit.scjg.tj.gov.cn/gsxt/',selector=r'#tabs > DIV:nth-of-type(1) > DIV:nth-of-type(3) > SPAN:nth-of-type(1)',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=30,scroll_view='no')
            # Mouse click
            self.__logger.debug('Flow:goToDetail,StepNodeTag:13162716929299,Note:')
            iie.do_click_pos(win_title=r'天津市市场主体',url=r'http://credit.scjg.tj.gov.cn/gsxt/',selector=r'#tableInfoDiv > DIV:nth-of-type(2) > TABLE:nth-of-type(1) > TBODY:nth-of-type(1) > TR:nth-of-type(3) > TD:nth-of-type(4) > A:nth-of-type(1)',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=30,scroll_view='no')
            # Custom function
            self.__logger.debug('Flow:goToDetail,StepNodeTag:13162716929298,Note:股权转让')
            table1Data = GlobalFun.getTableData('年报详情','#show_alter')
            # Custom function
            self.__logger.debug('Flow:goToDetail,StepNodeTag:13162716929297,Note:是否有狗买')
            table2Data = GlobalFun.getTableData('年报详情','#show_invest')
            # Return
            self.__logger.debug('Flow:goToDetail,StepNodeTag:13162803390322,Note:')
            return table1Data,table2Data
          
        def Main(self):
            # Sub-flow: finishCheckCode
            self.__logger.debug('Flow:Main,StepNodeTag:13165330947360,Note:')
            self.finishCheckCode(comName='911200006630613577')
            # Sub-flow: getChildCom
            self.__logger.debug('Flow:Main,StepNodeTag:13151828292226,Note:')
            self.getChildCom()
            # Message box
            self.__logger.debug('Flow:Main,StepNodeTag:13152147770235,Note:')
            ibox.msg_box(msg='全部数据处理完毕!')
     
    if __name__ == '__main__':
        robot_no = ''
        proc_no = ''
        job_no = ''
        try:
            argv = sys.argv[1:]
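            # long option names must end in '=' (no spaces) to accept a value
            opts, args = getopt.getopt(argv,"hr:p:j:",["robot=","proc=","job="])
        except getopt.GetoptError:
            print ('robot.py -r <robot> -p <proc> -j <job>')
            # exit here, otherwise opts is undefined in the loop below
            sys.exit(2)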
        for opt, arg in opts:
            if opt == '-h':
                print ('robot.py -r <robot> -p <proc> -j <job>')
            elif opt in ("-r", "--robot"):
                robot_no = arg
            elif opt in ("-p", "--proc"):
                proc_no = arg
            elif opt in ("-j", "--job"):
                job_no = arg
        pro = getTjInfo(robot_no=robot_no,proc_no=proc_no,job_no=job_no)
        pro.Main()
    
    

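      The code for the global functions follows. Here we bring in the PIL library to read the images and process their pixels (see get_distance), the pyautogui library to operate on the browser page, mainly to control the mouse drag (see myDo_drag_to), and the pandas library to extract table data from the page (see getTableData):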

    # Build date: 2019-08-12 10:47:48
    # coding=utf-8
    from selenium.webdriver.common.action_chains import ActionChains
    from selenium import webdriver
    import time
    from PIL import Image
    import ubpa.ics as ics
    import pyautogui
    from ubpa import iwin
    import math
    import ubpa.iie as iie
    import re
    import pandas as pd
    
    def getTableData(titleStr,selectorStr):
        table_string = iie.get_html(title=titleStr,selector=selectorStr,waitfor=30)
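        # As listed, both patterns are empty, so the two sub() calls below leave the HTML unchanged
        tb_start = re.compile('')
        tb_end = re.compile('')
        last_str = tb_end.sub('', tb_start.sub('', table_string))
        # Uses pandas' read_html; note header=0, some tables have their header on a different row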
        data = pd.read_html(last_str, flavor="bs4", header=0)[0]
        print(data)
        print(data.columns)
        return data
    
    def get_point_axis(axis_list,distpos,point):
        pos_val_list = []
    
        for i in range(1, 10000):
            if i >= point:
                break
            n = len(axis_list) * (i / (point + 1))
            pos_val = axis_list[int(n)]
            pos_val_list.append(pos_val)
    
        pos_val_list.append(distpos)
    
        return pos_val_list
    
    def get_axis_list(srcpos=(0, 0), distpos=(0, 0)):
        pos_list = []
        x1 = srcpos[0]
        y1 = srcpos[1]
        x2 = distpos[0]
        y2 = distpos[1]
    
        if x1 == x2:
            if y1 > y2:
                for i in range(math.ceil(y2), int(y1) + 1):
                    pos_list.append((x1, i))
                # reverse once, after the loop, so the points run from y1 down to y2
                pos_list.reverse()
            elif y1 < y2:
                for i in range(math.ceil(y1), int(y2) + 1):
                    pos_list.append((x1, i))
            else:
                pos_list = []
        else:
            if y1 == y2:
                if x1 < x2:
                    x1 = math.ceil(x1)
                    x2 = int(x2)
                    length = x2 - x1
                    for i in range(0, length + 1):
                        pos_list.append((x1 + i, y2))
                else:
                    x1 = int(x1)
                    x2 = math.ceil(x2)
                    length = x1 - x2
                    for i in range(0, length + 1):
                        # moving left: step down from x1 towards x2
                        pos_list.append((x1 - i, y2))
            else:
                if x1 < x2:
                    for i in range(math.ceil(x1), int(x2) + 1):
                        if y1 < y2:
                            h = (i - x1) * (y2 - y1) / (x2 - x1)
                            pos_list.append((i, y1 + h))
                        else:
                            h = (i - x1) * (y1 - y2) / (x2 - x1)
                            pos_list.append((i, y1 - h))
                else:
                    for i in range(math.ceil(x2), int(x1) + 1):
                        if y1 < y2:
                            h = (i - x2) * (y2 - y1) / (x1 - x2)
                            pos_list.append((i, y2 - h))
                        else:
                            h = (i - x2) * (y1 - y2) / (x1 - x2)
                            pos_list.append((i, y2 + h))
    
                    pos_list.reverse()
    
        return pos_list
    
    def myDo_drag_to(win_title=None, srcpos=(0,0), distpos=(0,0), point=0, stimes=1, model=pyautogui.easeInOutQuad, waitfor=10):
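        '''
        Drag for the slider CAPTCHA.
            srcpos: start position; srcpos[0] and srcpos[1] are the x and y coordinates
            distpos: end position; distpos[0] and distpos[1] are the x and y coordinates
            point: number of intermediate stops, default 0
            stimes: duration of each move, default 1
            model: tween function, e.g. easeInQuad (slow then fast), easeOutQuad (fast then slow), easeInOutQuad (slow at both ends, fast in the middle), easeInBounce (bounce), easeInElastic (elastic oscillation)
        '''
        try:
            if win_title is not None and win_title.strip() != '':
                # activate the window if it is not already active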
                if not iwin.do_win_is_active(win_title):
                    iwin.do_win_activate(win_title=win_title, waitfor=2)
    
            pyautogui.moveTo(srcpos[0], srcpos[1], 0.5)
    
            pyautogui.mouseDown(button='left', _pause=True)
    
            axis_list = get_axis_list(srcpos, distpos)
            if len(axis_list) > 0:
    
                pos_val_list = get_point_axis(axis_list, distpos, point)
               # print(pos_val_list)
                for index in pos_val_list:
    
                    pyautogui.dragTo(float(index[0]+20), float(index[1]), stimes, model)
                    time.sleep(0.5)
                    pyautogui.dragTo(float(index[0]-5), float(index[1]), stimes, model)
                    time.sleep(0.5)
                    pyautogui.dragTo(float(index[0]), float(index[1]), stimes, model)
    
                time.sleep(0.5)
                pyautogui.mouseUp(button='left', _pause=True)
        except Exception as e:
            raise e
    
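    # Get the offset for GeeTest 2.0
    def get_distance(imageOne,imageTwo):
        '''
        Get the distance the slider has to move.
        :param imageOne: path of the screenshot without the gap
        :param imageTwo: path of the screenshot with the gap
        :return: distance to move, in pixels
        '''
        threshold=150
        left=60
        image1 = Image.open(imageOne)
        image2 = Image.open(imageTwo)
        # load the pixel access objects once instead of once per pixel
        pixels1 = image1.load()
        pixels2 = image2.load()
        for i in range(left,image1.size[0]):
            for j in range(image1.size[1]):
                rgb1=pixels1[i,j]
                rgb2=pixels2[i,j]
                res1=abs(rgb1[0]-rgb2[0])
                res2=abs(rgb1[1]-rgb2[1])
                res3=abs(rgb1[2]-rgb2[2])
                if not (res1 < threshold and res2 < threshold and res3 < threshold):
                    print(i-7)
                    return i-7 # in testing the error was about 7 pixels
        # no differing pixel found: falls back to the last column scanned
        print(i-7)
        return i-7 # in testing the error was about 7 pixels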
    

      That is the code for the whole flow; I have posted all of it here. The offset calculation for cracking the 3.0 CAPTCHA is as follows:

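    # Cracking GeeTest 3.0
    def get_gap(image):
        """
        Get the offset of the gap.
        :param image: PIL image that contains the gap
        :return: x offset of the gap
        """
        # left_list collects every (x, count) candidate that qualifies
        left_list = []
        # Only the x coordinate of the gap is needed, so sampling a few evenly spaced rows is enough
        for i in [10 * i for i in range(1, image.size[1] // 11)]:
            # Start scanning at x = image.size[0]/5.16, because the gap never sits within the first 50 pixels
            for j in range(int(image.size[0] / 5.16), image.size[0] - int(image.size[0] / 8.6)):
                if is_pixel_equal(image, j, i, left_list):
                    break
        # In each (x, z), x is the left edge of the gap and z is the count of consecutive pixels
        # from x onwards whose R, G and B are all below 150. Several rows give several candidates;
        # the one whose z is closest to 40 fits best
        left_list = sorted(left_list, key=lambda x: abs(x[1]-40))
        # Take x from the first element; subtract 7 or 14 from the result, 7 is usually enough
        return left_list[0][0] - 7

    def is_pixel_equal(image, x, y, left_list):
        """
        Check whether a dark run of gap-like width starts at (x, y).
        :param image: PIL image
        :param x: x position
        :param y: y position
        :param left_list: list that receives (x, count) when the run qualifies
        :return: True if a qualifying run was found
        """
        # Get the pixel at this position
        pixels = image.load()
        pixel1 = pixels[x, y]
        threshold = 150
        # count records how many consecutive pixels to the right have R, G and B all below 150
        count = 0
        # If this pixel's R, G and B are all below 150, walk right and count the dark pixels that follow
        if pixel1[0] < threshold and pixel1[1] < threshold and pixel1[2] < threshold:
            for i in range(x + 1, image.size[0]):
                pixel = pixels[i, y]
                if pixel[0] < threshold and pixel[1] < threshold and pixel[2] < threshold:
                    count += 1
                else:
                    break
        if int(image.size[0]/8.6) < count < int(image.size[0]/4.3):
            left_list.append((x, count))
            return True
        else:
            return False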
    

      The code is thoroughly commented, so it is easy to follow if you read it calmly.
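      To see the 3.0 idea in isolation, here is a minimal sketch on synthetic data. It assumes plain lists of (R, G, B) tuples instead of a PIL image, and it simplifies the logic above: it takes only the first dark run of each row and counts the starting pixel as part of the run. The function names and sample rows are illustrative assumptions.

```python
def dark_run_length(row, x, threshold=150):
    """Count consecutive dark pixels in `row` starting at index x."""
    count = 0
    for r, g, b in row[x:]:
        if r < threshold and g < threshold and b < threshold:
            count += 1
        else:
            break
    return count


def pick_gap_x(rows, expected_width=40, threshold=150):
    """Return the x whose dark run is closest in length to expected_width.

    `rows` is a list of pixel rows (lists of (R, G, B) tuples); only the first
    dark run of each row is considered. Returns None if no row has a dark pixel.
    """
    candidates = []
    for row in rows:
        for x in range(len(row)):
            length = dark_run_length(row, x, threshold)
            if length > 0:
                candidates.append((x, length))
                break  # the first dark run in this row is enough
    if not candidates:
        return None
    best = min(candidates, key=lambda c: abs(c[1] - expected_width))
    return best[0]


LIGHT, DARK = (220, 220, 220), (30, 30, 30)
# Row 1: a short dark speck of 5 pixels at x = 10 (noise)
row_noise = [LIGHT] * 10 + [DARK] * 5 + [LIGHT] * 85
# Row 2: a 40-pixel dark run at x = 50 (the gap)
row_gap = [LIGHT] * 50 + [DARK] * 40 + [LIGHT] * 10

gap_left = pick_gap_x([row_noise, row_gap])
print(gap_left)
```

      Ranking candidates by how close their run length is to the expected gap width is what lets the short dark speck lose to the real gap.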
      There is also a handy method for processing tables on a page. It already appears in the code above; here it is again:

    def getTableData(titleStr,selectorStr):
        table_string = iie.get_html(title=titleStr,selector=selectorStr,waitfor=30)
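        # As listed, both patterns are empty, so the two sub() calls below leave the HTML unchanged
        tb_start = re.compile('')
        tb_end = re.compile('')
        last_str = tb_end.sub('', tb_start.sub('', table_string))
        # Uses pandas' read_html; note header=0, some tables have their header on a different row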
        data = pd.read_html(last_str, flavor="bs4", header=0)[0]
        print(data)
        print(data.columns)
        return data
    

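      titleStr is the browser window title; a window is matched as long as its title contains the string passed in. selectorStr is a CSS selector string. CSS selectors are supported natively by the designer and matter a lot for crawling in general; if you are not familiar with them, look them up. iie is a component from the designer's own Python library that can read information directly from a page that is already open. Pass this method the location of a table on the page and it converts the table into a DataFrame. I have to say, pandas really is handy!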
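      To make the table step concrete without the designer, here is a minimal sketch of turning table HTML into rows. It uses only the standard library's html.parser as a simplified stand-in for pandas.read_html, which handles far more cases. The sample HTML, the company row and the class name are made up for illustration.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


SAMPLE_HTML = """
<table id="show_invest">
  <tr><th>Name</th><th>Code</th></tr>
  <tr><td>Company A</td><td>911200000000000001</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE_HTML)
# Treat the first row as the header, similar in spirit to header=0
header, *records = parser.rows
print(header, records)
```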

    Final result

      The CAPTCHA step in action; when verification fails, it retries on its own.
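      The generated while loop in finishCheckCode retries without any limit. Below is a minimal sketch of a capped variant, assuming a hypothetical `attempt` callable that performs one CAPTCHA try and returns True on success (note that checkCode in the listing returns True on failure, so it would need inverting). The cap of 5 is an arbitrary illustration.

```python
def solve_with_retry(attempt, max_attempts=5):
    """Call `attempt` until it reports success or max_attempts is reached.

    `attempt` is a hypothetical callable standing in for one CAPTCHA try;
    it returns True on success. Returns the number of tries used, or None
    if every try failed.
    """
    for tries in range(1, max_attempts + 1):
        if attempt():
            return tries
    return None


# Fake attempt that fails twice, then succeeds (for illustration only)
outcomes = iter([False, False, True])
used = solve_with_retry(lambda: next(outcomes))
print(used)
```

      A cap keeps the robot from looping forever if the page layout changes and the CAPTCHA can no longer be solved.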



      The overall run; in the end it successfully scraped the company tables and wrote the data to Excel.


      Thanks for reading!