CNN vs. CAPTCHA

Author: 山阴少年 | Published 2018-09-25 21:02

    Introduction

      In the world of web scraping, the arms race never ends: as soon as crawlers appeared, so did anti-crawler measures, and as soon as anti-crawler measures appeared, so did counter-countermeasures.
      One of the anti-crawler camp's sharpest weapons is the CAPTCHA. CAPTCHAs come in a dazzling variety, and many people hit one mid-crawl, find it too hard, and go straight from "getting started" to "giving up", which is of course exactly what the site builders intend. Yet CAPTCHAs are not unbreakable: now that deep learning is everywhere, many simple CAPTCHAs barely put up a fight.
      This article shows how to crack one class of CAPTCHA with Python, OpenCV, and a CNN, and hopefully gives you a taste of what deep learning can do.

    Getting the Data

      I collected 346 CAPTCHAs from an account-registration website, shown below:

    The CAPTCHA dataset

    As you can see, these CAPTCHAs consist of uppercase letters and digits, contain a fair amount of noise, and some characters stick together.

    Labeling the Data

      The raw CAPTCHAs alone are not enough to build a model; they need to be preprocessed into a form suitable for modeling.
      The preprocessing method is described in the blog post OpenCV入门之获取验证码的单个字符(二). After preprocessing, every character image is labeled by hand and moved into the folder for its class. Yes, you read that right: each image is labeled one by one, which took me over 3 hours, o(╥﹏╥)o~ (For modeling, this up-front labeling is unavoidable, and it is always a painful process; think of WordNet, ImageNet, and the like.) The labeled folders look like this:

    The labeled folders

      As you can see, there are 31 folders, i.e. 31 target classes; the characters 0, M, W, I, and O do not appear in these CAPTCHAs. This yields 1371 valid character images, i.e. 1371 samples. Taking the letter U as an example, its folder contains images like these:

    Samples for the letter U
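
      As a quick sanity check on the labeled data, you can count the samples per class; this is a minimal sketch, assuming the E://verifycode_data folder layout described above:

    import os

    DATA_DIR = 'E://verifycode_data'   # dataset root used throughout this article
    chars = '123456789ABCDEFGHJKLNPQRSTUVXYZ'

    # print the number of labeled images in each class folder
    for char in chars:
        folder = '%s/%s' % (DATA_DIR, char)
        print(char, len(os.listdir(folder)))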

    Uniform Size

      Labeling alone still doesn't meet the requirements for modeling, because the extracted character images come in different sizes. We therefore need to resize all samples to a common size; after inspecting the data, I settled on 16*20. The Python script:

    import os
    import cv2
    import uuid
    
    def convert(dir, file):
    
        imagepath = dir+'/'+file
        # read the image in grayscale
        image = cv2.imread(imagepath, 0)
        # binarize
        ret, thresh = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
        # resize to 16*20
        img = cv2.resize(thresh, (16, 20), interpolation=cv2.INTER_AREA)
        # save under a fresh unique name and remove the original
        cv2.imwrite('%s/%s.jpg' % (dir, uuid.uuid1()), img)
        os.remove(imagepath)
    
    def main():
        chars = '123456789ABCDEFGHJKLNPQRSTUVXYZ'
        dirs= ['E://verifycode_data/%s'%char for char in chars]
        for dir in dirs:
            for file in os.listdir(dir):
                convert(dir, file)
    
    main()
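
      After running the script, a quick check confirms that every sample really is 16*20. A minimal sketch under the same folder assumptions; note that OpenCV stores an image resized to width 16 and height 20 as an array of shape (20, 16):

    import os
    import cv2

    for char in '123456789ABCDEFGHJKLNPQRSTUVXYZ':
        folder = 'E://verifycode_data/%s' % char
        for file in os.listdir(folder):
            img = cv2.imread(folder + '/' + file, 0)
            assert img.shape == (20, 16)   # (height, width)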
    

    The Sample Dataset

      With the character images at a uniform size, we next convert each image into a vector. The images are black and white, so each one is read as a vector of 0/1 values, and its label (the y value) is the name of the folder it sits in. The Python script:

    import os
    import cv2
    import pandas as pd
    
    table= []
    
    def Read_Data(dir, file):
    
        imagepath = dir+'/'+file
        # read the image in grayscale
        image = cv2.imread(imagepath, 0)
        # binarize
        ret, thresh = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
        # convert pixels to 0/1 values; the label is the folder name
        bin_values = [1 if pixel==255 else 0 for pixel in thresh.ravel()]
        label = dir.split('/')[-1]
        table.append(bin_values+[label])
    
    def main():
        chars = '123456789ABCDEFGHJKLNPQRSTUVXYZ'
        dirs= ['E://verifycode_data/%s'%char for char in chars]
        print(dirs)
        for dir in dirs:
            for file in os.listdir(dir):
                Read_Data(dir, file)
    
        features = ['v'+str(i) for i in range(1, 16*20+1)]
        label = ['label']
        df = pd.DataFrame(table, columns=features+label)
        # print(df.head())
    
        df.to_csv('E://verifycode_data/data.csv', index=False)
    main()
    

      This converts the character images into the vectors and labels stored in data.csv; part of the file looks like this:

    Vectors and labels for the character images
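
      To double-check the result, you can load data.csv back and inspect it; a minimal sketch, assuming the file written above:

    import pandas as pd

    df = pd.read_csv('E://verifycode_data/data.csv')
    print(df.shape)               # expect (1371, 321): 320 pixel features plus 1 label column
    print(df['label'].nunique())  # expect 31 classes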

    CNN vs. CAPTCHA

      With the sample dataset in hand, we can start modeling with a CNN. A typical CNN consists of several convolution layers (Convolution Layer) and pooling layers (Pooling Layer), with a fully connected network producing the final output, as sketched below:

    Schematic of a CNN (convolution, pooling, and fully connected layers)

      The CNN model in this article consists of two convolution layers, each followed by a pooling layer, plus a dropout layer (to prevent overfitting), then a fully connected layer, and finally a softmax output layer. The loss function is the cross-entropy (log loss), minimized with the Adam optimizer. The full Python code (VerifyCodeCNN.py):

    # -*- coding: utf-8 -*-
    import tensorflow as tf
    import logging
    
    # logging setup
    logging.basicConfig(level = logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
    logger = logging.getLogger(__name__)
    
    class CNN:
    
        # Initialization
        # Arguments: epoch: number of training steps
        #            learning_rate: learning rate for the Adam optimizer
        #            save_model_path: absolute path for saving the model
        def __init__(self, epoch, learning_rate, save_model_path):
    
            self.epoch = epoch
            self.learning_rate = learning_rate
            self.save_model_path = save_model_path
    
            """
            第一层 卷积层和池化层
            x_image(batch, 16, 20, 1) -> h_pool1(batch, 8, 10, 10)
            """
            x = tf.placeholder(tf.float32, [None, 320])
            self.x = x
            x_image = tf.reshape(x, [-1, 16, 20, 1])  # the last dimension is the channel count; 3 for RGB
            W_conv1 = self.weight_variable([3, 3, 1, 10])
            b_conv1 = self.bias_variable([10])
    
            h_conv1 = tf.nn.relu(self.conv2d(x_image, W_conv1) + b_conv1)
            h_pool1 = self.max_pool_2x2(h_conv1)
    
            """
            第二层 卷积层和池化层
            h_pool1(batch, 8, 10, 10) -> h_pool2(batch, 4, 5, 20)
            """
            W_conv2 = self.weight_variable([3, 3, 10, 20])
            b_conv2 = self.bias_variable([20])
    
            h_conv2 = tf.nn.relu(self.conv2d(h_pool1, W_conv2) + b_conv2)
            h_pool2 = self.max_pool_2x2(h_conv2)
    
            """
            第三层 全连接层
            h_pool2(batch, 4, 5, 20) -> h_fc1(1, 100)
            """
            W_fc1 = self.weight_variable([4 * 5 * 20, 200])
            b_fc1 = self.bias_variable([200])
    
            h_pool2_flat = tf.reshape(h_pool2, [-1, 4 * 5 * 20])
            h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
    
            """
            第四层 Dropout层
            h_fc1 -> h_fc1_drop, 训练中启用,测试中关闭
            """
            self.keep_prob = tf.placeholder(dtype=tf.float32)
            h_fc1_drop = tf.nn.dropout(h_fc1, self.keep_prob)
    
            """
            第五层 Softmax输出层
            """
            W_fc2 = self.weight_variable([200, 31])
            b_fc2 = self.bias_variable([31])
    
            self.y_conv = tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)
    
            """
            训练和评估模型
            ADAM优化器来做梯度最速下降,feed_dict中加入参数keep_prob控制dropout比例
            """
            self.y_true = tf.placeholder(shape = [None, 31], dtype=tf.float32)
            self.cross_entropy = -tf.reduce_mean(tf.reduce_sum(self.y_true * tf.log(self.y_conv), axis=1))  # cross-entropy loss
    
            # minimize with the Adam optimizer at the configured learning rate
            self.train_model = tf.train.AdamOptimizer(self.learning_rate).minimize(self.cross_entropy)
    
            self.saver = tf.train.Saver()
            logger.info('Initialize the model...')
    
        def train(self, x_data, y_data):
    
            logger.info('Training the model...')
    
            with tf.Session() as sess:
                # initialize all variables
                sess.run(tf.global_variables_initializer())
    
                feed_dict = {self.x: x_data, self.y_true: y_data, self.keep_prob:1.0}  # note: keep_prob=1.0 leaves dropout inactive even during training
                # iterative training
                for i in range(self.epoch + 1):
                    sess.run(self.train_model, feed_dict=feed_dict)
                    if i % int(self.epoch / 50) == 0:
                        # to see the step improvement
                        print('Trained %d steps, loss: %s.' % (i, sess.run(self.cross_entropy, feed_dict=feed_dict)))
    
                # save the trained CNN model
                logger.info('Saving the model...')
                self.saver.save(sess, self.save_model_path)
    
        def predict(self, data):
    
            with tf.Session() as sess:
                logger.info('Restoring the model...')
                self.saver.restore(sess, self.save_model_path)
                predict = sess.run(self.y_conv, feed_dict={self.x: data, self.keep_prob:1.0})
    
            return predict
    
        """
        权重初始化
        初始化为一个接近0的很小的正数
        """
        def weight_variable(self, shape):
            initial = tf.truncated_normal(shape, stddev=0.1)
            return tf.Variable(initial)
    
        def bias_variable(self, shape):
            initial = tf.constant(0.1, shape=shape)
            return tf.Variable(initial)
    
        """
        卷积和池化,使用卷积步长为1(stride size),0边距(padding size)
        池化用简单传统的2x2大小的模板做max pooling
        """
        def conv2d(self, x, W):
            return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
    
        def max_pool_2x2(self, x):
            return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
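
      One caveat about the loss above: applying tf.log to the softmax output by hand can hit log(0) and underflow when a predicted probability saturates. A numerically safer formulation, sketched here on the assumption that the pre-softmax activations tf.matmul(h_fc1_drop, W_fc2) + b_fc2 are kept as logits (this is not part of the original code):

    # numerically stable cross-entropy computed on the raw logits
    logits = tf.matmul(h_fc1_drop, W_fc2) + b_fc2
    cross_entropy = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits_v2(labels=self.y_true, logits=logits))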
    

    Training the Model

      The CNN is trained on the 1371 samples described above: 960 samples form the training set and the remaining 411 the test set. Training runs for 1000 steps, with the Adam learning rate set to 0.0005.
      The training script:

    # -*- coding: utf-8 -*-
    
    """
    数字字母识别
    利用CNN对验证码的数据集进行多分类
    """
    
    from VerifyCodeCNN import CNN
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    from sklearn.preprocessing import LabelBinarizer
    
    CSV_FILE_PATH = 'E://verifycode_data/data.csv'          # path of the CSV file
    df = pd.read_csv(CSV_FILE_PATH)       # read the CSV file
    
    # feature columns of the dataset
    features = ['v'+str(i+1) for i in range(16*20)]
    labels = df['label'].unique()
    # one-hot encode (binarize) the true labels
    lb = LabelBinarizer()
    lb.fit(labels)
    y_true = pd.DataFrame(lb.transform(df['label']), columns=['y'+str(i) for i in range(31)])
    y_bin_columns = list(y_true.columns)
    
    for col in y_bin_columns:
        df[col] = y_true[col]
    
    # split into a training set (70%) and a test set (30%)
    x_train, x_test, y_train, y_test = train_test_split(df[features], df[y_bin_columns], \
                                                        train_size = 0.7, test_size=0.3, random_state=123)
    
    # predict with the CNN
    # build the CNN network
    # path for saving the model
    MODEL_SAVE_PATH = 'E://logs/cnn_verifycode.ckpt'
    # initialize the CNN
    cnn = CNN(1000, 0.0005, MODEL_SAVE_PATH)
    
    # train the CNN
    cnn.train(x_train, y_train)
    # predict on the test set
    y_pred = cnn.predict(x_test)
    
    # decode the predictions: the one-hot columns follow lb.classes_ (sorted label order)
    prediction = []
    for pred in y_pred:
        label = lb.classes_[list(pred).index(max(pred))]
        prediction.append(label)
    
    # compute prediction accuracy
    x_test['prediction'] = prediction
    x_test['label'] = df['label'][y_test.index]
    print(x_test.head())
    accuracy = accuracy_score(x_test['prediction'], x_test['label'])
    print('CNN prediction accuracy: %.2f%%.'%(accuracy*100))
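
      Note that train() feeds the entire training set on every step, and with keep_prob set to 1.0 the dropout layer is effectively inactive during training. That is workable for 960 samples; for larger datasets one would normally switch to mini-batches. A hedged sketch of how the loop inside train() could be adapted (not the author's code):

    import numpy as np

    # hypothetical mini-batch loop replacing the single full-batch feed_dict
    batch_size = 64
    n = len(x_data)
    for i in range(self.epoch + 1):
        idx = np.random.choice(n, batch_size, replace=False)
        sess.run(self.train_model, feed_dict={self.x: x_data.values[idx],
                                              self.y_true: y_data.values[idx],
                                              self.keep_prob: 0.5})   # dropout active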
    
    

    The model took about 75 minutes to train; the output is:

    2018-09-24 11:51:17,784 - INFO: Initialize the model...
    2018-09-24 11:51:17,784 - INFO: Training the model...
    2018-09-24 11:51:17.793631: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
    Trained 0 steps, loss: 3.5277689.
    Trained 20 steps, loss: 3.2297606.
    Trained 40 steps, loss: 2.8372495.
    Trained 60 steps, loss: 1.9687067.
    Trained 80 steps, loss: 0.90995216.
    Trained 100 steps, loss: 0.42356998.
    Trained 120 steps, loss: 0.25189978.
    Trained 140 steps, loss: 0.16736577.
    Trained 160 steps, loss: 0.116674595.
    Trained 180 steps, loss: 0.08325087.
    Trained 200 steps, loss: 0.06060778.
    Trained 220 steps, loss: 0.045051433.
    Trained 240 steps, loss: 0.03401592.
    Trained 260 steps, loss: 0.026168587.
    Trained 280 steps, loss: 0.02056558.
    Trained 300 steps, loss: 0.01649161.
    Trained 320 steps, loss: 0.013489108.
    Trained 340 steps, loss: 0.011219621.
    Trained 360 steps, loss: 0.00946489.
    Trained 380 steps, loss: 0.008093053.
    Trained 400 steps, loss: 0.0069935927.
    Trained 420 steps, loss: 0.006101626.
    Trained 440 steps, loss: 0.0053245267.
    Trained 460 steps, loss: 0.004677901.
    Trained 480 steps, loss: 0.0041349586.
    Trained 500 steps, loss: 0.0036762774.
    Trained 520 steps, loss: 0.003284876.
    Trained 540 steps, loss: 0.0029500276.
    Trained 560 steps, loss: 0.0026618005.
    Trained 580 steps, loss: 0.0024126293.
    Trained 600 steps, loss: 0.0021957452.
    Trained 620 steps, loss: 0.0020071461.
    Trained 640 steps, loss: 0.0018413183.
    Trained 660 steps, loss: 0.001695599.
    Trained 680 steps, loss: 0.0015665392.
    Trained 700 steps, loss: 0.0014519279.
    Trained 720 steps, loss: 0.0013496162.
    Trained 740 steps, loss: 0.001257321.
    Trained 760 steps, loss: 0.0011744777.
    Trained 780 steps, loss: 0.001099603.
    Trained 800 steps, loss: 0.0010316349.
    Trained 820 steps, loss: 0.0009697884.
    Trained 840 steps, loss: 0.00091331534.
    Trained 860 steps, loss: 0.0008617487.
    Trained 880 steps, loss: 0.0008141668.
    Trained 900 steps, loss: 0.0007705136.
    Trained 920 steps, loss: 0.0007302323.
    Trained 940 steps, loss: 0.00069312396.
    Trained 960 steps, loss: 0.0006586343.
    Trained 980 steps, loss: 0.00062668725.
    2018-09-24 13:07:42,272 - INFO: Saving the model...
    Trained 1000 steps, loss: 0.0005970755.
    2018-09-24 13:07:42,538 - INFO: Restoring the model...
    INFO:tensorflow:Restoring parameters from E://logs/cnn_verifycode.ckpt
    2018-09-24 13:07:42,538 - INFO: Restoring parameters from E://logs/cnn_verifycode.ckpt
          v1  v2  v3  v4  v5  v6  v7  v8  v9  v10  ...    v313  v314  v315  v316  \
    657    1   1   1   1   1   1   1   1   1    1  ...       1     1     1     1   
    18     1   1   1   1   1   1   1   1   1    1  ...       1     1     1     1   
    700    1   1   1   1   1   1   1   1   1    1  ...       1     1     1     1   
    221    1   1   1   1   1   1   1   1   1    1  ...       1     1     1     1   
    1219   1   1   1   1   1   1   1   1   1    1  ...       1     1     1     1   
    
          v317  v318  v319  v320  prediction  label  
    657      1     1     1     1           G      G  
    18       1     1     1     1           T      1  
    700      1     1     1     1           H      H  
    221      1     1     1     1           5      5  
    1219     1     1     1     1           V      V  
    
    [5 rows x 322 columns]
    CNN prediction accuracy: 93.45%.
    

    As you can see, the CNN reaches a prediction accuracy of 93.45% on the test set, a decent result. The trained model is saved as E://logs/cnn_verifycode.ckpt.
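
      One detail worth knowing: the .ckpt path is a prefix rather than a single file, since tf.train.Saver writes several files under E://logs/. A quick way to see this (the exact file names below are indicative, not taken from the original run):

    import os

    print(os.listdir('E://logs'))
    # typically includes: checkpoint, cnn_verifycode.ckpt.index,
    # cnn_verifycode.ckpt.meta, cnn_verifycode.ckpt.data-00000-of-00001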

    Predicting New CAPTCHAs

      With the model trained, it's time to see the magic happen!
      I fetched another 60 CAPTCHAs from the same registration website:

    The new CAPTCHAs

      Here is the Python script for predicting them:

    # -*- coding: utf-8 -*-
    
    """
    利用训练好的CNN模型对验证码进行识别
    (共训练960条数据,训练1000次,loss:0.00059, 测试集上的准确率为%93.45.)
    """
    import os
    import cv2
    import pandas as pd
    from VerifyCodeCNN import CNN
    
    def split_picture(imagepath):
    
        # read the image in grayscale
        gray = cv2.imread(imagepath, 0)
    
        # paint the image border white
        height, width = gray.shape
        for i in range(width):
            gray[0, i] = 255
            gray[height-1, i] = 255
        for j in range(height):
            gray[j, 0] = 255
            gray[j, width-1] = 255
    
        # median filtering with a 3*3 kernel
        blur = cv2.medianBlur(gray, 3)
    
        # binarize
        ret,thresh1 = cv2.threshold(blur, 200, 255, cv2.THRESH_BINARY)
    
        # extract individual characters
        chars_list = []
        image, contours, hierarchy = cv2.findContours(thresh1, 2, 2)  # OpenCV 3.x returns three values
        for cnt in contours:
            # bounding rectangle of the contour
            x, y, w, h = cv2.boundingRect(cnt)
            if x != 0 and y != 0 and w*h >= 100:
                chars_list.append((x,y,w,h))
    
        sorted_chars_list = sorted(chars_list, key=lambda x:x[0])
        for i,item in enumerate(sorted_chars_list):
            x, y, w, h = item
            cv2.imwrite('E://test_verifycode/chars/%d.jpg'%(i+1), thresh1[y:y+h, x:x+w])
    
    def remove_edge_picture(imagepath):
    
        image = cv2.imread(imagepath, 0)
        height, width = image.shape
        corner_list = [image[0,0] < 127,
                       image[height-1, 0] < 127,
                       image[0, width-1]<127,
                       image[ height-1, width-1] < 127
                       ]
        if sum(corner_list) >= 3:
            os.remove(imagepath)
    
    def resplit_with_parts(imagepath, parts):
        image = cv2.imread(imagepath, 0)
        os.remove(imagepath)
        height, width = image.shape
    
        file_name = imagepath.split('/')[-1].split(r'.')[0]
        # re-split the image into `parts` pieces
        step = width//parts     # step width
        start = 0             # start position
        for i in range(parts):
            cv2.imwrite('E://test_verifycode/chars/%s.jpg'%(file_name+'-'+str(i)), \
                        image[:, start:start+step])
            start += step
    
    def resplit(imagepath):
    
        image = cv2.imread(imagepath, 0)
        height, width = image.shape
    
        if width >= 64:
            resplit_with_parts(imagepath, 4)
        elif width >= 48:
            resplit_with_parts(imagepath, 3)
        elif width >= 26:
            resplit_with_parts(imagepath, 2)
    
    # rename and convert to 16*20 size
    def convert(dir, file):
    
        imagepath = dir+'/'+file
        # read the image in grayscale
        image = cv2.imread(imagepath, 0)
        # binarize
        ret, thresh = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
        img = cv2.resize(thresh, (16, 20), interpolation=cv2.INTER_AREA)
        # save the resized image
        cv2.imwrite('%s/%s' % (dir, file), img)
    
    # read an image and convert it to 0-1 values
    def Read_Data(dir, file):
    
        imagepath = dir+'/'+file
        # read the image in grayscale
        image = cv2.imread(imagepath, 0)
        # binarize
        ret, thresh = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
        # convert pixels to 0/1 values
        bin_values = [1 if pixel==255 else 0 for pixel in thresh.ravel()]
    
        return bin_values
    
    def main():
    
        VerifyCodePath = 'E://test_verifycode/E224.jpg'
        dir = 'E://test_verifycode/chars'
        files = os.listdir(dir)
    
        # clear out any old files
        if files:
            for file in files:
                os.remove(dir + '/' + file)
    
        split_picture(VerifyCodePath)
    
        files = os.listdir(dir)
        if not files:
            print('The folder is empty!')
        else:
    
            # discard noise images
            for file in files:
                remove_edge_picture(dir + '/' + file)
    
            # re-split images with stuck characters
            for file in os.listdir(dir):
                resplit(dir + '/' + file)
    
            # resize every image to 16*20
            for file in os.listdir(dir):
                convert(dir, file)
    
            # vectors for the extracted character images
            table = [Read_Data(dir, file) for file in os.listdir(dir)]
            test_data = pd.DataFrame(table, columns=['v%d'%i for i in range(1,321)])
    
            # path of the saved model
            MODEL_SAVE_PATH = 'E://logs/cnn_verifycode.ckpt'
            # initialize the CNN
            cnn = CNN(1000, 0.0005, MODEL_SAVE_PATH)
            y_pred = cnn.predict(test_data)
    
            # decode the predictions
            prediction = []
            labels = '123456789ABCDEFGHJKLNPQRSTUVXYZ'
            for pred in y_pred:
                label = labels[list(pred).index(max(pred))]
                prediction.append(label)
    
            print(prediction)
    
    
    main()
    

    Taking the image E224.jpg as an example, the output is:

    2018-09-25 20:50:33,227 - INFO: Initialize the model...
    2018-09-25 20:50:33.238309: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
    2018-09-25 20:50:33,227 - INFO: Restoring the model...
    INFO:tensorflow:Restoring parameters from E://logs/cnn_verifycode.ckpt
    2018-09-25 20:50:33,305 - INFO: Restoring parameters from E://logs/cnn_verifycode.ckpt
    ['E', '2', '2', '4']
    

    The prediction is completely correct. Testing all 60 images, 54 are predicted entirely correctly and the other 6 contain partial errors, for an accuracy of 90%.
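
      For reference, the batch test can be scripted along these lines. This is a hedged sketch that assumes the 60 CAPTCHAs sit in E://test_verifycode, that each file is named after its true text (e.g. E224.jpg), and that the main() logic above has been wrapped into a hypothetical predict_one(imagepath) returning the predicted string:

    import os

    TEST_DIR = 'E://test_verifycode'
    files = [f for f in os.listdir(TEST_DIR) if f.endswith('.jpg')]
    correct = 0
    for file in files:
        expected = file.split('.')[0]                       # true text from the file name
        if predict_one(TEST_DIR + '/' + file) == expected:  # hypothetical helper
            correct += 1
    print('fully correct: %d / %d' % (correct, len(files)))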

    Summary

      CNN models shine at CAPTCHA recognition, and this exercise gives a real feel for the power of deep learning~
      Of course, text CAPTCHAs like these are relatively easy and serve here only as one application of CNNs; harder CAPTCHAs require a more elaborate pipeline. I hope that after reading this, you will go and try your hand at tougher ones~~

    Note: I now run a WeChat official account, 轻松学会Python爬虫 (WeChat ID: easy_web_scrape); feel free to follow it~~
