string(20761) "
<h1>TensorFlow Environment Setup</h1>
<blockquote>
<p>Installing the Windows GPU build</p></blockquote>
<h2>Required packages</h2>
<ul>
<li>TensorFlow 1.5.0/1.6.0</li>
<li><a href="https://developer.nvidia.com/cuda-toolkit-archive" target="_blank" rel="nofollow">CUDA v9.0</a></li>
<li><a href="https://developer.nvidia.com/rdp/cudnn-archive#a-collapse705-9" target="_blank" rel="nofollow">cuDNN v7.0.5 for CUDA 9.0</a></li>
</ul>
<blockquote>
<p>After extracting cuDNN v7.0.5, copy its folders (bin, include, lib) into the CUDA installation directory (NVIDIA GPU Computing Toolkit/CUDA/v9.0)</p></blockquote>
<blockquote>
<p>All of these versions must match each other, otherwise you will hit version-mismatch errors; also make sure to pick the build for the correct operating system</p></blockquote>
<h2>Python environment setup (training/development environment)</h2>
<ul>
<li><a href="https://www.anaconda.com/download/" target="_blank" rel="nofollow">Anaconda</a></li>
<li><a href="https://github.com/google/protobuf/releases/download/v3.4.0/protoc-3.4.0-win32.zip" target="_blank" rel="nofollow">protoc</a></li>
</ul>
<p>Anaconda is recommended for the training environment: it is a popular Python platform for data science that ships with many preinstalled libraries and makes it easy to manage multiple Python versions and switch freely between environments</p>
<p>TensorFlow is built on the gRPC framework and uses Protocol Buffers as its data-interchange format; protoc is the compiler that turns .proto definition files into bindings for multiple languages</p>
<blockquote>
<p>Version 3.4.0 is used here. Newer versions may use different compile commands, so to avoid problems later, it is safest to stick with 3.4.0</p></blockquote>
<p>Installation</p>
<ol>
<li>Download and install <a href="https://www.anaconda.com/download/" target="_blank" rel="nofollow">Anaconda</a></li>
<li>Add <code>InstallDir\Anaconda3;InstallDir\Anaconda3\Scripts;InstallDir\Anaconda3\Library\bin;</code> to PATH (system environment variable);<br>
configure a domestic (China) mirror</li>
</ol>
<pre><code class="python">  # 添加Anaconda的TUNA镜像
  conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
  # 设置搜索时显示通道地址
  conda config --set show_channel_urls yes
</code></pre>
<ol start="3">
<li>Create the Python environment</li>
</ol>
<pre><code># List the Python environments currently on the system
conda info --envs
# Create an environment with a specific Python version
conda create --name py35 python=3.5
# Activate the environment
activate py35
# Deactivate it and return to the previous environment
deactivate
# Delete the environment
conda remove --name py35 --all
</code></pre>
<h2>Installing TensorFlow in the Python 3 environment</h2>
<p>Install TensorFlow:</p>
<pre><code class="shell"># For CPU
pip install tensorflow==1.6
# For GPU
pip install tensorflow-gpu==1.6
</code></pre>
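<p>To confirm the install worked (a quick sanity check, not part of the original steps; for the GPU build it also verifies that CUDA/cuDNN load correctly):</p>
<pre><code class="python">import tensorflow as tf

# Should print 1.6.0 (or 1.5.0); is_gpu_available() returns True
# only if the GPU build found a usable CUDA device
print(tf.__version__)
print(tf.test.is_gpu_available())
</code></pre>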
<p>Install the remaining dependencies with pip:</p>
<pre><code class="shell">pip install Cython
pip install pillow
pip install lxml
pip install jupyter
pip install matplotlib
</code></pre>
<h2>Build preparation for the training project</h2>
<ul>
<li>Protobuf Compilation</li>
</ul>
<pre><code>protoc object_detection/protos/*.proto --python_out=.
</code></pre>
<ul>
<li>Add Libraries to PYTHONPATH</li>
</ul>
<pre><code>1. 在你的Anaconda3安装路/Anaconda3/Lib/site-packages 下新建一个txt文件 
(我这里的安装路径是C:\ProgramData\Anaconda3\Lib\site-packages);如果安装有其他 python 环境,则在对应的环境目录(Anaconda3\envs\py35\Lib\site-packages)下新建一个txt文件 。

2. 在新建的txt文件中写入自己对应的 Tensorflow object_detection 工程的目录路径:
F:\project\project
F:\project\project\slim

3. 将文件名改为 tensorflow_model.pth (注意这里的后缀一定要以pth结尾)
</code></pre>
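<p>How this works: at interpreter startup, the site module reads every *.pth file in site-packages and appends each line to sys.path. A quick check (run inside the py35 environment; the F:\ paths are the example ones above):</p>
<pre><code class="python">import sys

# Both project directories from the .pth file should show up here
print([p for p in sys.path if p.startswith('F:')])

# And the package should now import without error
import object_detection
print(object_detection.__file__)
</code></pre>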
<ul>
<li>Testing the Installation</li>
</ul>
<pre><code># From tensorflow/models/research/
python object_detection/builders/model_builder_test.py
</code></pre>
<hr>
<h2>Model training</h2>
<h3>Sample annotation</h3>
<p>Use the <a href="https://github.com/tzutalin/labelImg" target="_blank" rel="nofollow">labelImg</a> tool to annotate the images and produce Pascal VOC format annotation files</p>
<h3>Generating the TFRecord file TensorFlow consumes</h3>
<p>Working directory layout</p>
<pre><code>|- template
|  |- annotations (annotation xml files)
|  |- images (sample images)
|  |- label_maps
|  |  |- *.pbtxt (label map file; ids start at 1)
</code></pre>
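<p>For reference, a minimal label map entry looks like this (the class names here are placeholders; use the labels from your own annotations):</p>
<pre><code>item {
  id: 1
  name: 'a_hn_101'
}
item {
  id: 2
  name: 'a_hn_102'
}
</code></pre>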
<p>Script: tfrecord_util.py (Python 3 environment)</p>
<pre><code>
import os
import io
import tensorflow as tf

from PIL import Image

from object_detection.utils import dataset_util
from object_detection.utils import label_map_util
from collections import namedtuple
import glob
import pandas as pd
import xml.etree.ElementTree as ET


current_path = 'directory containing the template folder'
train_path = os.path.join(current_path, "template")
# directory of image annotation (xml) files
annotations_dir = os.path.join(train_path, "annotations")
# image directory
images_path = os.path.join(train_path, "images")
# label map file
labels_path = os.path.join(train_path, "label_maps")
labels_file = os.path.join(labels_path, "mscoco_label_map.pbtxt")
# csv file (full path)
csv_file = os.path.join(train_path, "temp_csv_name.csv")
# record file (full path)
tf_record_file = os.path.join(train_path, "tf_record_file.record")
tf_record_file = os.path.join(train_path, "tf_record_file.record")
# ---------------------------------------------------------------------- xml operator

def xml_to_csv(path):
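    # parse every Pascal VOC xml under `path` into one row per annotated object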
    xml_list = []
    for xml_file in glob.glob(path + '/*.xml'):
        tree = ET.parse(xml_file)
        root = tree.getroot()
        for member in root.findall('object'):
            # if member[0].text != 'a_hn_101':
            #     continue

            file_path = root.find('path').text
            filename = file_path.split("/")[-1].split("\\")[-1]
            value = (filename,
                     int(root.find('size')[0].text),
                     int(root.find('size')[1].text),
                     member[0].text,
                     int(member[4][0].text),
                     int(member[4][1].text),
                     int(member[4][2].text),
                     int(member[4][3].text)
                     )

            xml_list.append(value)
    column_name = ['filename', 'width', 'height', 'class', 'xmin', 'ymin', 'xmax', 'ymax']
    xml_df = pd.DataFrame(xml_list, columns=column_name)
    return xml_df

# ---------------------------------------------------------------------- tfrecord operator


classes_num = 100

label_map = label_map_util.load_labelmap(labels_file)
print("success loading label map file["+str(labels_file)+"]")
# print('\n-------------label_map------------------\n')
# print(label_map)
# categories array [{'id':id,'name':name},···]
categories = label_map_util.convert_label_map_to_categories(label_map, max_num_classes=classes_num, use_display_name=True)
# category_index  dic  {id : {'id':id,'name':name}, ···}
# category_index = label_map_util.create_category_index(categories)

# category_index  dic  {name : {'id':id,'name':name}, ···}
category_index = {}
for cat in categories:
    category_index[cat['name']] = cat
print(category_index)
print("success generating categories dic")


def class_text_to_int(row_label):
    if row_label in category_index.keys():
        # print(str(category_index[row_label]['id']))
        return category_index[row_label]['id']
    else:
        # print(row_label)
        return 0


def split(df, group):
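    # group annotation rows by image filename: one (filename, rows) pair per image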
    data = namedtuple('data', ['filename', 'object'])
    gb = df.groupby(group)
    return [data(filename, gb.get_group(x)) for filename, x in zip(gb.groups.keys(), gb.groups)]


def create_tf_example(group, path):

    with tf.gfile.GFile(os.path.join(path, '{}'.format(group.filename)), 'rb') as fid:
        encoded_jpg = fid.read()
    encoded_jpg_io = io.BytesIO(encoded_jpg)
    image = Image.open(encoded_jpg_io)
    width, height = image.size

    filename = group.filename.encode('utf8')
    # image_format = b'jpg'
    if image.format != 'JPEG':
        print(group.filename)
        raise ValueError('Image format not JPEG')
    else:
        image_format = b'jpg'
    xmins = []
    xmaxs = []
    ymins = []
    ymaxs = []
    classes_text = []
    classes = []

    for index, row in group.object.iterrows():

        if class_text_to_int(row['class']) == 0:
            print(group.filename)
            # print(row['class'].encode('utf8'))
            continue
        xmins.append(row['xmin'] / width)
        xmaxs.append(row['xmax'] / width)
        ymins.append(row['ymin'] / height)
        ymaxs.append(row['ymax'] / height)
        classes_text.append(row['class'].encode('utf8'))
        classes.append(class_text_to_int(row['class']))

    tf_example = tf.train.Example(features=tf.train.Features(feature={
        'image/height': dataset_util.int64_feature(height),
        'image/width': dataset_util.int64_feature(width),
        'image/filename': dataset_util.bytes_feature(filename),
        'image/source_id': dataset_util.bytes_feature(filename),
        'image/encoded': dataset_util.bytes_feature(encoded_jpg),
        'image/format': dataset_util.bytes_feature(image_format),
        'image/object/bbox/xmin': dataset_util.float_list_feature(xmins),
        'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs),
        'image/object/bbox/ymin': dataset_util.float_list_feature(ymins),
        'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs),
        'image/object/class/text': dataset_util.bytes_list_feature(classes_text),
        'image/object/class/label': dataset_util.int64_list_feature(classes),
    }))
    return tf_example

# ----------------------------------------------------------------------


def generate_tf_record_file(recreate=True):
    """
    generate the tensorflow record file from the label xml files that belong to the sample images
    :param recreate:  if create a new record file
    :return:  tf_record_file path
    """
    if recreate:
        # 1. read all xml files in the annotations directory and convert them to a csv file
        xml_df = xml_to_csv(annotations_dir)
        xml_df.to_csv(csv_file, index=None)
        print('Successfully converted xml['+str(annotations_dir)+'] to csv['+str(csv_file)+'].')

        print(csv_file)
        # 2. convert the csv file to a record file
        examples = pd.read_csv(csv_file)
        grouped = split(examples, 'filename')

        writer = tf.python_io.TFRecordWriter(tf_record_file)
        for group in grouped:
            try:
                tf_example = create_tf_example(group, images_path)
            except Exception:
                # skip images that fail to open or are not JPEG
                print(group.filename)
                continue
            writer.write(tf_example.SerializeToString())
        writer.close()
        print('Successfully created the TFRecords: {}'.format(tf_record_file))
        return tf_record_file

    else:
        # TODO - look up the existing file
        return None

def main(_):
    my_tf_record_file = generate_tf_record_file()
    print(my_tf_record_file)

if __name__ == '__main__':
    tf.app.run()
</code></pre>
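<p>The configuration below points at separate train and eval record files, while the script above writes a single one. A minimal sketch of splitting the csv by image before generating two record files (the 90/10 ratio and the fixed seed are arbitrary choices, not from the original):</p>
<pre><code class="python">import numpy as np
import pandas as pd

df = pd.read_csv(csv_file)
filenames = df['filename'].unique()

# hold out roughly 10% of the images for evaluation
np.random.seed(42)
eval_names = set(np.random.choice(filenames, size=max(1, len(filenames) // 10), replace=False))

df[~df['filename'].isin(eval_names)].to_csv('train.csv', index=None)
df[df['filename'].isin(eval_names)].to_csv('eval.csv', index=None)
# then run the csv-to-record step once per csv to produce the two .record files
</code></pre>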
<h2>Training configuration</h2>
<p>Configuration file: faster_rcnn_resnet101.config</p>
<pre><code># Faster R-CNN with Resnet-101 (v1) configuration for MSCOCO Dataset.
# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader and
# eval_input_reader. Search for "PATH_TO_BE_CONFIGURED" to find the fields that
# should be configured.

model {
  faster_rcnn {
    num_classes: 23
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 1024
        max_dimension: 1280  
      }
    }
    feature_extractor {
      type: 'faster_rcnn_resnet101'
      first_stage_features_stride: 16
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.6
    first_stage_max_proposals: 400
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 14
    maxpool_kernel_size: 2
    maxpool_stride: 2
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.7
        max_detections_per_class: 100
        max_total_detections: 300
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}

train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0002
          schedule {
            step: 900000
            learning_rate: .00003
          }
          schedule {
            step: 1200000
            learning_rate: .000003
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  # fine_tune_checkpoint: "F:/project/project/faster_rcnn_resnet101_coco_2018_01_28/model.ckpt"
  # from_detection_checkpoint: true
  # Note: uncommenting the line below would cap training at 10K steps, which
  # effectively bypasses the learning rate schedule (the learning rate would
  # never decay). Leave it commented out to train indefinitely.
  #num_steps: 10000
  data_augmentation_options {
    random_adjust_brightness {
      max_delta: 0.1
    }
  }
  data_augmentation_options {
    random_image_scale {
      min_scale_ratio: 0.8
      max_scale_ratio: 1.2
    }
  }
  #data_augmentation_options {
  #  random_crop_to_aspect_ratio {
  #  }
  #}

  #data_augmentation_options {
  #  random_adjust_contrast {
  #      min_delta: 0.5
  #      max_delta: 1.5
  #  }
  #}
  #data_augmentation_options {
  #  random_adjust_saturation {
  #    min_delta: 0.5
  #    max_delta: 1.5
  #  }
  #}
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "D:/Workspace/train_dir/all/tf_record_file_23_3035_20180724.record"
  }
  label_map_path: "D:/Workspace/train_dir/all/mscoco_label_map_23.pbtxt"
  shuffle: true
}

eval_config: {
  # num_examples: 1
  num_visualizations: 200
  # Note: The below line limits the evaluation process to 2 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 2
  visualization_export_dir: "D:/Workspace/train_dir/all/20180724/eval/exportfrcnn"
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "D:/Workspace/train_dir/all/tf_record_file_23_3035_20180724_eval.record"
  }
  label_map_path: "D:/Workspace/train_dir/all/mscoco_label_map_23.pbtxt"
  shuffle: true
  num_readers: 5
  num_epochs: 1
}
</code></pre>
<p>The configuration file has five main parts:</p>
<ul>
<li>model: defines the network architecture and its hyperparameters</li>
<li>train_config: training settings</li>
<li>train_input_reader: training-sample input settings</li>
<li>eval_config: model evaluation settings</li>
<li>eval_input_reader: evaluation-sample input settings</li>
</ul>
<h3>model section</h3>
<ul>
<li>num_classes: the total number of object classes to detect (how many distinct labels your annotations use)</li>
<li>keep_aspect_ratio_resizer.min_dimension and keep_aspect_ratio_resizer.max_dimension control the size of the input image after resizing (see the sketch after this list)</li>
<li>feature_extractor.first_stage_features_stride: feature stride of the first stage. Keeping it at 16 usually works; if the SKUs in your samples are dense, mostly shot from a distance, and therefore small, 16 may train poorly, and reducing it to 8 is worth trying</li>
<li>grid_anchor_generator.height_stride and grid_anchor_generator.width_stride: sliding stride of the anchor boxes during training. Keeping them at 16 usually works; as above, for dense, distant, small SKUs, consider reducing them to 8</li>
<li>first_stage_nms_iou_threshold: IOU threshold for first-stage NMS; lowering it can raise recall at the likely cost of precision; range 0~1</li>
<li>first_stage_max_proposals: number of proposals kept by the first stage; raising it can raise recall at the likely cost of precision</li>
<li>batch_non_max_suppression.iou_threshold: second-stage IOU threshold; lowering it can raise recall at the likely cost of precision; range 0~1</li>
<li>batch_non_max_suppression.max_detections_per_class: maximum number of detections per class</li>
<li>batch_non_max_suppression.max_total_detections: maximum total number of detections across all classes</li>
</ul>
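<p>To make the resizer behavior concrete, here is a plain-Python sketch of the keep-aspect-ratio rule as configured above (min_dimension=1024, max_dimension=1280); it mirrors the standard behavior of scaling the short side to min_dimension unless the long side would then exceed max_dimension:</p>
<pre><code class="python">def resized_shape(height, width, min_dim=1024, max_dim=1280):
    # scale so the short side reaches min_dim...
    scale = min_dim / min(height, width)
    # ...unless that pushes the long side past max_dim
    if max(height, width) * scale > max_dim:
        scale = max_dim / max(height, width)
    return round(height * scale), round(width * scale)

print(resized_shape(2000, 3000))  # -> (853, 1280): capped by max_dimension
print(resized_shape(1100, 1200))  # -> (1024, 1117): short side hits min_dimension
</code></pre>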
<h3>train_config section</h3>
<ul>
<li>initial_learning_rate: initial learning rate; 0.0003 and 0.0002 both work</li>
<li>data_augmentation_options: data augmentation options
<ul>
<li>random_adjust_brightness: randomly adjust brightness</li>
<li>random_image_scale: randomly scale the image</li>
<li>random_crop_to_aspect_ratio: randomly crop to a given aspect ratio<br>
These are the most commonly used augmentations</li>
</ul>
</li>
</ul>
<h3>train_input_reader section</h3>
<ul>
<li>tf_record_input_reader.input_path: path to the tfrecord file</li>
<li>label_map_path: path to the label map file</li>
<li>shuffle: whether to shuffle the samples so they are fed to training in random order</li>
</ul>
<h3>eval_config section</h3>
<ul>
<li>num_visualizations: number of images exported during evaluation; size it to your eval set, it does not need to be large; mainly used to visualize evaluation results</li>
<li>visualization_export_dir: directory where the evaluation images are saved</li>
</ul>
<h3>eval_input_reader section</h3>
<ul>
<li>tf_record_input_reader.input_path: path to the tfrecord file</li>
<li>label_map_path: path to the label map file</li>
<li>shuffle: whether to shuffle the samples so they are fed in random order</li>
<li>num_epochs: how many passes to make over the eval samples; usually left as-is</li>
</ul>
<h2>Running training</h2>
<p>Train:</p>
<pre><code># run the following from the object_detection project directory
python object_detection/train.py  --logtostderr --pipeline_config_path=F:/Workspaces/hongniu3sku/train/faster_rcnn_resnet101_20180530.config  --train_dir=F:/Workspaces/hongniu3sku/train/train_data/train/20180530

# pipeline_config_path: path to the training configuration file
# train_dir: directory where training checkpoints and intermediate files are saved
</code></pre>
<p>Evaluate:</p>
<pre><code># run the following from the object_detection project directory
python object_detection/eval.py --logtostderr  --pipeline_config_path=F:/Workspaces/hongniu3sku/train/faster_rcnn_resnet101_20180530.config  --checkpoint_dir=F:/Workspaces/hongniu3sku/train/train_data/train/20180530  --eval_dir=F:/Workspaces/hongniu3sku/train/train_data/eval/20180530

# pipeline_config_path: path to the training configuration file
# checkpoint_dir: directory holding the checkpoints produced by training
# eval_dir: directory where evaluation output is saved
</code></pre>
<p>Export the model:</p>
<pre><code># run the following from the object_detection project directory
python object_detection/export_inference_graph.py --input_type image_tensor --pipeline_config_path=F:/Workspaces/hongniu3sku/train/faster_rcnn_resnet101_20180530.config  --trained_checkpoint_prefix=F:/Workspaces/hongniu3sku/train/train_data/train/20180530/model.ckpt-157978  --output_directory=F:/Workspaces/hongniu3sku/train/train_data/export/20180530

# pipeline_config_path: path to the training configuration file
# trained_checkpoint_prefix: checkpoint to export; the number in model.ckpt-[step] selects which step's weights go into the final model
# output_directory: directory the final model is exported to
</code></pre>
<p>The export produces the following files:</p>
<pre><code>|- saved_model
|  |- variables
|  |- saved_model.pb   (model file used by TensorFlow Serving)
|- checkpoint (checkpoint bookkeeping file)
|- frozen_inference_graph.pb  (frozen graph for inference)
|- model.ckpt.*  (model data: weights, graph structure, etc.)
</code></pre>
<blockquote>
<p>It is recommended to keep checkpoint, frozen_inference_graph.pb, and model.ckpt.* after every training run, so the model can be refined later</p></blockquote>
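<p>Once exported, frozen_inference_graph.pb can be loaded for inference. A minimal sketch (assumes the standard output tensor names of the Object Detection API export: image_tensor, detection_boxes, detection_scores, detection_classes, num_detections; the image path is a placeholder):</p>
<pre><code class="python">import numpy as np
import tensorflow as tf
from PIL import Image

# load the frozen graph
graph = tf.Graph()
with graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile('frozen_inference_graph.pb', 'rb') as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name='')

with tf.Session(graph=graph) as sess:
    image = np.array(Image.open('test.jpg'))  # HxWx3 uint8
    outputs = [graph.get_tensor_by_name(name + ':0') for name in
               ('detection_boxes', 'detection_scores',
                'detection_classes', 'num_detections')]
    boxes, scores, classes, num = sess.run(
        outputs, feed_dict={'image_tensor:0': image[None, ...]})
    print(int(num[0]), 'detections; top score:', scores[0, 0])
</code></pre>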
"
Tensorflow 模型训练
美文网首页
Tensorflow 模型训练

Tensorflow 模型训练

作者: HansenGuan | 来源:发表于2018-10-12 16:14 被阅读0次

    Tensorflow 环境搭建

    Windows GPU 版安装

    依赖软件包

    cuDNN v7.0.5 解压后将文件(bin、include、lib)拷贝到 CUDA 安装目录(NVIDIA GPU Computing Toolkit/CUDA/v9.0)下

    各个版本需要保持一致,不然会存在版本不一致问题,注意选择正确的系统版本

    python 环境安装(训练环境/开发环境)

    训练环境建议安装 Anaconda , 它是一个流行的进行数据科学研究的 python 平台,预安装了很多库,可以很方便的管理多个版本的 python 环境,实现 python 环境的自由切换

    Tensorflow 底层使用了 gRPC 框架,使用 Protocol Buffers 数据交换协议,protoc 工具是一个编译器,可以很方便将 proto 协议文件编译成供多个语言版本使用

    此处使用 3.4.0 版本,新版本编译命令可能不同,为避免后续出现错误,可以直接使用 3.4.0 版本

    安装

    1. 下载Anaconda并安装
    2. 配置环境变量 安装目录\Anaconda3;安装目录\Anaconda3\Scripts;安装目录\Anaconda3\Library\bin; 到 path( 系统环境变量)中;
      配置国内源
      # 添加Anaconda的TUNA镜像
      conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
      # 设置搜索时显示通道地址
      conda config --set show_channel_urls yes
    
    1. 安装 python 环境
    #查看系统当前已有的Python环境,
    conda info --envs
    #安装指定版本的 python 环境
    conda create --name py35 python=3.5
    #切换 python 环境
    activate py35
    #切回原来的Python环境
    deactivate py35
    #删除环境
    conda remove --name py35 --all
    

    python 3 环境下的 Tensorflow 安装

    install Tensorflow

    # For CPU
    pip install tensorflow==1.6
    # For GPU
    pip install tensorflow-gpu==1.6
    

    users can install dependencies using pip:

    pip install Cython
    pip install pillow
    pip install lxml
    pip install jupyter
    pip install matplotlib
    

    模型训练项目的编译准备

    • Protobuf Compilation
    protoc object_detection/protos/*.proto --python_out=.
    
    • Add Libraries to PYTHONPATH
    1. 在你的Anaconda3安装路/Anaconda3/Lib/site-packages 下新建一个txt文件 
    (我这里的安装路径是C:\ProgramData\Anaconda3\Lib\site-packages);如果安装有其他 python 环境,则在对应的环境目录(Anaconda3\envs\py35\Lib\site-packages)下新建一个txt文件 。
    
    2. 在新建的txt文件中写入自己对应的 Tensorflow object_detection 工程的目录路径:
    F:\project\project
    F:\project\project\slim
    
    3. 将文件名改为 tensorflow_model.pth (注意这里的后缀一定要以pth结尾)
    
    • Testing the Installation
    #From tensorflow/models/research/
    python object_detection/builders/model_builder_test.py
    

    模型训练

    样本标注

    使用 label_images 工具用于标记图片,生成 Pascal voc 格式的 标注文件

    生成 tensorflow 支持的 tfrecord 文件

    工作目录结构

    |- template
    |  |- annotations (标注文件)
    |  |- images (样本图片)
    |  |- label_maps
    |  |  |- *.pbtxt (标注映射文件,id 从 1 开始)
    

    脚本工具 - tfrecord_util.py 【python 3 环境】

    
    import os
    import io
    import tensorflow as tf
    
    from PIL import Image
    
    from object_detection.utils import dataset_util
    from object_detection.utils import label_map_util
    from collections import namedtuple
    import glob
    import pandas as pd
    import xml.etree.ElementTree as ET
    
    
    current_path = 'template所在目录'
    train_path = os.path.join(current_path, "template")
    # 图片标注文件目录
    annotations_dir = os.path.join(train_path, "annotations")
    # 图片目录
    images_path = os.path.join(train_path, "images")
    # 映射文件
    labels_path = os.path.join(train_path, "label_maps")
    labels_file = os.path.join(labels_path, "mscoco_label_map.pbtxt")
    # csv 文件(全路径)
    csv_file = os.path.join(train_path, "temp_csv_name.csv")
    # record 文件(全路径)
    tf_record_file = os.path.join(train_path, "tf_record_file.record")
    # ---------------------------------------------------------------------- xml operator
    
    def xml_to_csv(path):
        xml_list = []
        for xml_file in glob.glob(path + '/*.xml'):
            tree = ET.parse(xml_file)
            root = tree.getroot()
            for member in root.findall('object'):
                # if member[0].text != 'a_hn_101':
                #     continue
    
                file_path = root.find('path').text
                filename = file_path.split("/")[-1].split("\\")[-1]
                value = (filename,
                         int(root.find('size')[0].text),
                         int(root.find('size')[1].text),
                         member[0].text,
                         int(member[4][0].text),
                         int(member[4][1].text),
                         int(member[4][2].text),
                         int(member[4][3].text)
                         )
    
                xml_list.append(value)
        column_name = ['filename', 'width', 'height', 'class', 'xmin', 'ymin', 'xmax', 'ymax']
        xml_df = pd.DataFrame(xml_list, columns=column_name)
        return xml_df
    
    # ---------------------------------------------------------------------- tfrecord operator
    
    
    classes_num = 100
    
    label_map = label_map_util.load_labelmap(labels_file)
    print("success loading label map file["+str(labels_file)+"]")
    # print('\n-------------label_map------------------\n')
    # print(label_map)
    # categories array [{'id':id,'name':name},···]
    categories = label_map_util.convert_label_map_to_categories(label_map, max_num_classes=classes_num, use_display_name=True)
    # category_index  dic  {id : {'id':id,'name':name}, ···}
    # category_index = label_map_util.create_category_index(categories)
    
    # category_index  dic  {name : {'id':id,'name':name}, ···}
    category_index = {}
    for cat in categories:
        category_index[cat['name']] = cat
    print(category_index)
    print("success generating categories dic")
    
    
    def class_text_to_int(row_label):
        if row_label in category_index.keys():
            # print(str(category_index[row_label]['id']))
            return category_index[row_label]['id']
        else:
            # print(row_label)
            return 0
    
    
    def split(df, group):
        data = namedtuple('data', ['filename', 'object'])
        gb = df.groupby(group)
        return [data(filename, gb.get_group(x)) for filename, x in zip(gb.groups.keys(), gb.groups)]
    
    
    def create_tf_example(group, path):
    
        with tf.gfile.GFile(os.path.join(path, '{}'.format(group.filename)), 'rb') as fid:
            encoded_jpg = fid.read()
        encoded_jpg_io = io.BytesIO(encoded_jpg)
        image = Image.open(encoded_jpg_io)
        width, height = image.size
    
        filename = group.filename.encode('utf8')
        # image_format = b'jpg'
        if image.format != 'JPEG':
            print(group.filename)
            raise ValueError('Image format not JPEG')
        else:
            image_format = b'jpg'
        xmins = []
        xmaxs = []
        ymins = []
        ymaxs = []
        classes_text = []
        classes = []
    
        for index, row in group.object.iterrows():
    
            if class_text_to_int(row['class']) == 0:
                print(group.filename)
                # print(row['class'].encode('utf8'))
                continue
            xmins.append(row['xmin'] / width)
            xmaxs.append(row['xmax'] / width)
            ymins.append(row['ymin'] / height)
            ymaxs.append(row['ymax'] / height)
            classes_text.append(row['class'].encode('utf8'))
            classes.append(class_text_to_int(row['class']))
    
        tf_example = tf.train.Example(features=tf.train.Features(feature={
            'image/height': dataset_util.int64_feature(height),
            'image/width': dataset_util.int64_feature(width),
            'image/filename': dataset_util.bytes_feature(filename),
            'image/source_id': dataset_util.bytes_feature(filename),
            'image/encoded': dataset_util.bytes_feature(encoded_jpg),
            'image/format': dataset_util.bytes_feature(image_format),
            'image/object/bbox/xmin': dataset_util.float_list_feature(xmins),
            'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs),
            'image/object/bbox/ymin': dataset_util.float_list_feature(ymins),
            'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs),
            'image/object/class/text': dataset_util.bytes_list_feature(classes_text),
            'image/object/class/label': dataset_util.int64_list_feature(classes),
        }))
        return tf_example
    
    # ----------------------------------------------------------------------
    
    
    def generate_tf_record_file(recreate=True):
        """
        generate the tensorflow record file from label xml files which belongs sample images
        :param recreate:  if create a new record file
        :return:  tf_record_file path
        """
        if recreate:
            # 1. 读取图片标注文件目录下的所有 xml 文件,并转化为 csv 文件
            xml_df = xml_to_csv(annotations_dir)
            xml_df.to_csv(csv_file, index=None)
            print('Successfully converted xml['+str(annotations_dir)+'] to csv['+str(csv_file)+'].')
    
            print(csv_file)
            # 2. 将 csv 文件转 record 文件
            examples = pd.read_csv(csv_file)
            grouped = split(examples, 'filename')
    
            writer = tf.python_io.TFRecordWriter(tf_record_file)
            for group in grouped:
                try:
                    tf_example = create_tf_example(group, images_path)
                except:
                    print(group.filename)
                    continue
                writer.write(tf_example.SerializeToString())
            writer.close()
            print('Successfully created the TFRecords: {}'.format(tf_record_file))
            return tf_record_file
    
        else:
            # TODO - look up the exist file
            return None
    
    def main(_):
        my_tf_record_file = generate_tf_record_file()
        print(my_tf_record_file)
    
    if __name__ == '__main__':
        tf.app.run()
    

    模型训练相关配置

    配置文件 faster_rcnn_resnet101.config

    # Faster R-CNN with Resnet-101 (v1) configuration for MSCOCO Dataset.
    # Users should configure the fine_tune_checkpoint field in the train config as
    # well as the label_map_path and input_path fields in the train_input_reader and
    # eval_input_reader. Search for "PATH_TO_BE_CONFIGURED" to find the fields that
    # should be configured.
    
    model {
      faster_rcnn {
        num_classes: 23
        image_resizer {
          keep_aspect_ratio_resizer {
            min_dimension: 1024
            max_dimension: 1280  
          }
        }
        feature_extractor {
          type: 'faster_rcnn_resnet101'
          first_stage_features_stride: 16
        }
        first_stage_anchor_generator {
          grid_anchor_generator {
            scales: [0.25, 0.5, 1.0, 2.0]
            aspect_ratios: [0.5, 1.0, 2.0]
            height_stride: 16
            width_stride: 16
          }
        }
        first_stage_box_predictor_conv_hyperparams {
          op: CONV
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            truncated_normal_initializer {
              stddev: 0.01
            }
          }
        }
        first_stage_nms_score_threshold: 0.0
        first_stage_nms_iou_threshold: 0.6
        first_stage_max_proposals: 400
        first_stage_localization_loss_weight: 2.0
        first_stage_objectness_loss_weight: 1.0
        initial_crop_size: 14
        maxpool_kernel_size: 2
        maxpool_stride: 2
        second_stage_box_predictor {
          mask_rcnn_box_predictor {
            use_dropout: false
            dropout_keep_probability: 1.0
            fc_hyperparams {
              op: FC
              regularizer {
                l2_regularizer {
                  weight: 0.0
                }
              }
              initializer {
                variance_scaling_initializer {
                  factor: 1.0
                  uniform: true
                  mode: FAN_AVG
                }
              }
            }
          }
        }
        second_stage_post_processing {
          batch_non_max_suppression {
            score_threshold: 0.0
            iou_threshold: 0.7
            max_detections_per_class: 100
            max_total_detections: 300
          }
          score_converter: SOFTMAX
        }
        second_stage_localization_loss_weight: 2.0
        second_stage_classification_loss_weight: 1.0
      }
    }
    
    train_config: {
      batch_size: 1
      optimizer {
        momentum_optimizer: {
          learning_rate: {
            manual_step_learning_rate {
              initial_learning_rate: 0.0002
              schedule {
                step: 900000
                learning_rate: .00003
              }
              schedule {
                step: 1200000
                learning_rate: .000003
              }
            }
          }
          momentum_optimizer_value: 0.9
        }
        use_moving_average: false
      }
      gradient_clipping_by_norm: 10.0
      # fine_tune_checkpoint: "F:/project/project/faster_rcnn_resnet101_coco_2018_01_28/model.ckpt"
      # from_detection_checkpoint: true
      # Note: The below line limits the training process to 200K steps, which we
      # empirically found to be sufficient enough to train the pets dataset. This
      # effectively bypasses the learning rate schedule (the learning rate will
      # never decay). Remove the below line to train indefinitely.
      #num_steps: 10000
      data_augmentation_options {
        random_adjust_brightness {
          max_delta: 0.1
        }
      }
      data_augmentation_options {
        random_image_scale {
          min_scale_ratio: 0.8
          max_scale_ratio: 1.2
        }
      }
      #data_augmentation_options {
      #  random_crop_to_aspect_ratio {
      #  }
      #}
    
      #data_augmentation_options {
      #  random_adjust_contrast {
      #      min_delta: 0.5
      #      max_delta: 1.5
      #  }
      #}
      #data_augmentation_options {
      #  random_adjust_saturation {
      #    min_delta: 0.5
      #    max_delta: 1.5
      #  }
      #}
    }
    
    train_input_reader: {
      tf_record_input_reader {
        input_path: "D:/Workspace/train_dir/all/tf_record_file_23_3035_20180724.record"
      }
      label_map_path: "D:/Workspace/train_dir/all/mscoco_label_map_23.pbtxt"
      shuffle: true
    }
    
    eval_config: {
      # num_examples: 1
      num_visualizations: 200
      # Note: The below line limits the evaluation process to 10 evaluations.
      # Remove the below line to evaluate indefinitely.
      max_evals: 2
      visualization_export_dir: "D:/Workspace/train_dir/all/20180724/eval/exportfrcnn"
    }
    
    eval_input_reader: {
      tf_record_input_reader {
        input_path: "D:/Workspace/train_dir/all/tf_record_file_23_3035_20180724_eval.record"
      }
      label_map_path: "D:/Workspace/train_dir/all/mscoco_label_map_23.pbtxt"
      shuffle: true
      num_readers: 5
      num_epochs: 1
    }
    

    配置文件主要分为 5 个部分:

    • model :定义 神经网络模型结构,及相关超参数
    • train_config: 训练相关配置
    • train_input_reader: 训练样本输入相关配置
    • eval_config: 模型评估相关配置
    • eval_input_reader:评估样本输入相关配置

    model 部分

    • num_classes 对应待检测物体的总数(一共有多少个标注样本)
    • keep_aspect_ratio_resizer.min_dimension、keep_aspect_ratio_resizer.max_dimension 控制样本输入缩放后的大小
    • feature_extractor.first_stage_features_stride 第一阶段特征提取步长,训练时可以保持 16 不变,如果样本中 sku 比较密集,多是远拍,sku 比较小,16 的情况下的训练效果不佳,可以考虑减小该值为 8
    • grid_anchor_generator.height_stride、grid_anchor_generator.width_stride 物体框训练时的滑动步长,训练时可以保持 16 不变,如果样本中 sku 比较密集,多是远拍,sku 比较小,如果样本中 sku 比较密集,多是远拍,sku 比较小,16 的情况下的训练效果不佳,可以考虑减小该值为 8
    • first_stage_nms_iou_threshold 第一阶段框 IOU 阈值,可以适当减小来增大查全率,但相应准确率可能降低,范围 0~1
    • first_stage_max_proposals 第一阶段选取得推荐框的个数,可以适当增大来增大查全率,但相应准确率可能降低
    • batch_non_max_suppression.iou_threshold 第二阶段 IOU 阈值,可以适当减小来增大查全率,但相应准确率可能降低,范围 0~1
    • batch_non_max_suppression.max_detections_per_class 每类样本的最大检测数量
    • batch_non_max_suppression.max_total_detections 所有样本的最大检测数量

    train_config 部分

    • initial_learning_rate 初始学习率 , 0.0003、0.0002都可以
    • data_augmentation_options 数据增强选项
      • random_adjust_brightness 随机调节亮度
      • random_image_scale 随机缩放图片大小
      • random_crop_to_aspect_ratio 随机裁剪到指定比例大小
        以上几类增强比较常用

    train_input_reader 部分

    • tf_record_input_reader.input_path 指定 tfrecord 文件路径
    • label_map_path 指定标注映射文件路径
    • shuffle 是否打乱样本原有顺序,随机输入训练

    eval_config 部分

    • num_visualizations 评估导出图片数量,根据评估输入样本决定,不用太大,主要用于评估结果的可视化
    • visualization_export_dir 指定评估图片的保存路径

    eval_input_reader 部分

    • tf_record_input_reader.input_path 指定 tfrecord 文件路径
    • label_map_path 指定标注映射文件路径
    • shuffle 是否打乱样本原有顺序,随机输入训练
    • num_epochs 评估样本几次,一般不用改

    模型训练

    训练:

    # object_detection 工程所在目录下,执行如下命令
    python object_detection/train.py  --logtostderr --pipeline_config_path=F:/Workspaces/hongniu3sku/train/faster_rcnn_resnet101_20180530.config  --train_dir=F:/Workspaces/hongniu3sku/train/train_data/train/20180530
    
    # pipeline_config_path :训练配置文件所在路径
    # train_dir : 训练所产生的中间文件保存目录
    

    评估:

    # object_detection 工程所在目录下,执行如下命令
    python object_detection/eval.py --logtostderr  --pipeline_config_path=F:/Workspaces/hongniu3sku/train/faster_rcnn_resnet101_20180530.config  --checkpoint_dir=F:/Workspaces/hongniu3sku/train/train_data/train/20180530  --eval_dir=F:/Workspaces/hongniu3sku/train/train_data/eval/20180530
    
    # pipeline_config_path :训练配置文件所在路径
    # checkpoint_dir : 指定训练时所产生的中间文件的保存目录
    # eval_dir: 评估时所产生的中间文件保存目录
    

    导出模型:

    # object_detection 工程所在目录下,执行如下命令
    python object_detection/export_inference_graph.py --input_type image_tensor --pipeline_config_path=F:/Workspaces/hongniu3sku/train/faster_rcnn_resnet101_20180530.config  --trained_checkpoint_prefix=F:/Workspaces/hongniu3sku/train/train_data/train/20180530/model.ckpt-157978  --output_directory=F:/Workspaces/hongniu3sku/train/train_data/export/20180530
    
    # pipeline_config_path :训练配置文件所在路径
    # trained_checkpoint_prefix:指定模型导出使用的中间文件 ,model.ckpt-【数字】 对应导出哪一步的参数到最终模型中
    # output_directory:指定模型最终的导出目录
    

    最终导出的文件有:

    |- saved_model
    |  |- variables
    |  |- saved_model.pb   (tensorflow serving 使用的模型文件)
    |- checkpoint (检查点临时文件)
    |- frozen_inference_graph.pb  (冻结参数的用于推理的图文件)
    |- model.ckpt.*  (模型数据,参数、结构等)
    

    建议每次训练后 checkpoint、frozen_inference_graph.pb、model.ckpt.* 都保存,方便后续对该模型进行优化

    相关文章

      网友评论

          本文标题:Tensorflow 模型训练

          本文链接:https://www.haomeiwen.com/subject/mupvoftx.html