BERT Series: BERT Source Code Walkthrough, Fine-Tuning on the MRPC Text Classification Task

Author: xiaogp | Published: 2023-08-15 14:13

Keywords: BERT, pre-trained models, fine-tuning

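Summary

Overview of the BERT source repository
Introduction to the MRPC task
Input layer and data format requirements
BERT model layer and transformer structure
Downstream fine-tuning network
Transferring pre-trained model parameters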

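Overview of the BERT source repository

The project lives in the GitHub repository google-research/bert and contains several Python scripts, including:

modeling.py: defines the BERT network structure, mainly the transformer, embedding and pooling modules
run_classifier.py: launches a text classification task on top of the BERT network; if a pre-trained model is specified, training continues from its parameters as fine-tuning
run_pretraining.py: the pre-training part of BERT, covering the NSP and MLM tasks
create_pretraining_data.py: builds the pre-training data
tokenization.py: sentence-processing utilities, including tokenization, punctuation handling and text normalization
run_squad.py: configures and launches BERT-based question answering on the SQuAD dataset
extract_features.py: computes sentence vectors with BERT
optimization.py: defines the optimizer used for model training

The previous article introduced the basic concepts of BERT, the Transformer, pre-trained models and fine-tuning, and how they relate. This article reads through BERT's official source code, starting with the easiest path: fine-tuning a pre-trained BERT model on the MRPC task, with run_classifier.py as the entry point.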

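Introduction to the MRPC task

The goal of MRPC is, given two sentences, to decide whether they express the same meaning. In other words, it is binary classification over a sentence pair. A sample of the data looks like this: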

Quality #1 ID   #2 ID   #1 String       #2 String
1       702876  702977  Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence . Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .

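The first column is the label y: 1 means the two sentences have the same meaning and 0 means they do not. The remaining columns are the ID of sentence 1, the ID of sentence 2, the text of sentence 1 and the text of sentence 2. The model therefore takes a sentence pair as input. BERT's pre-training also takes a sentence pair as input, so the two input formats are essentially the same: pre-training learns semantic knowledge from the input without labels, while fine-tuning transfers the pre-trained parameters and uses the same kind of input for classification.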
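To make the column layout concrete, here is a minimal sketch that splits one row into its five fields. It assumes the file is tab-separated; the names MrpcRow and parse_mrpc_line are made up for this illustration and do not come from the BERT repository, and the sample sentences are shortened.

```python
from collections import namedtuple

# Hypothetical container for one MRPC row; the field names are illustrative only.
MrpcRow = namedtuple("MrpcRow", ["label", "id_a", "id_b", "text_a", "text_b"])


def parse_mrpc_line(line):
    """Split one tab-separated MRPC line into label, two ids and two sentences."""
    quality, id_a, id_b, text_a, text_b = line.rstrip("\n").split("\t")
    return MrpcRow(quality, id_a, id_b, text_a, text_b)


# Shortened sample row (assumed to be tab-separated, as in the GLUE download).
sample = "1\t702876\t702977\tAmrozi accused his brother .\tReferring to him as only the witness .\n"
row = parse_mrpc_line(sample)
print(row.label, "|", row.text_a, "|", row.text_b)
```

The label is kept as a string here because a processor would typically map it to an integer id later through its label list.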
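Download the pre-trained model uncased_L-2_H-128_A-2.zip (2 Transformer layers, 128-dimensional embeddings, i.e. BERT-Tiny) and the matching MRPC data, then run the following command to go through training and evaluation end to end.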

python run_classifier.py --task_name=MRPC --do_train=true --do_eval=true --data_dir=./bert-master/GLUE_MRPC --vocab_file=./bert-master/bert_base_model/vocab.txt --bert_config_file=./bert-master/bert_base_model/bert_config.json  --max_seq_length=128 --train_batch_size=32  --init_checkpoint=./bert-master/bert_base_model/bert_model.ckpt --learning_rate=2e-5 --num_train_epochs=3 --output_dir=/tmp/mrpc_output

When the run finishes, the results on the dev set are as follows; dev accuracy reaches about 0.711.

I0814 21:25:02.392750 140513357039424 run_classifier.py:993] ***** Eval results *****
INFO:tensorflow:  eval_accuracy = 0.7107843
INFO:tensorflow:  eval_loss = 0.56687844
INFO:tensorflow:  global_step = 343
INFO:tensorflow:  loss = 0.56687844

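Input layer and data format requirements

Open run_classifier.py and start from the main entry point. The first step builds the data by instantiating MrpcProcessor, the data-processing class for MRPC. Its job is to read the training, dev and test data and to load the label y, sentence 1 and sentence 2 into in-memory Python collections.

# Instantiate MrpcProcessor
processor = processors[task_name]()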

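Next a tokenizer is instantiated. During example construction below it provides whitespace tokenization, WordPiece splitting, and conversion from tokens to ids and from ids to tokens.

# Instantiate the tokenizer
tokenizer = tokenization.FullTokenizer(
    # do_lower_case is True here
    vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)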
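The WordPiece step can be illustrated with a small greedy longest-match-first splitter. This is a simplified sketch and not the code in tokenization.py: the function name wordpiece and the toy vocabulary are assumptions made for the example, and lowercasing and punctuation splitting are left out.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into sub-tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest remaining substring first and shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                current = candidate
                break
            end -= 1
        if current is None:
            return [unk_token]  # no piece matched, so the whole word is unknown
        pieces.append(current)
        start = end
    return pieces


# Toy vocabulary, assumed for illustration only.
toy_vocab = {"foods", "##er", "##vic", "##e", "pie"}
print(wordpiece("foodservice", toy_vocab))
```

With this toy vocabulary, "foodservice" splits into foods, ##er, ##vic, ##e, the same pieces that appear in the training log shown later in this article.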

Next, the sample data is read into an in-memory Python collection, returning a list of InputExample objects.

train_examples = processor.get_train_examples(FLAGS.data_dir)

Each InputExample holds a global example id, sentence 1, sentence 2 and the label y.

class InputExample(object):
    def __init__(self, guid, text_a, text_b=None, label=None):
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label

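The next step converts the in-memory examples into a tfrecord file on disk. When the dataset is large it cannot all be loaded into memory at once, and data I/O during training then hurts training throughput, so the data is first converted to tfrecord to improve efficiency.

train_file = os.path.join(FLAGS.output_dir, "train.tf_record")
# Convert the list of InputExample to tfrecord and write it to /tmp/mrpc_output/train.tf_record
file_based_convert_examples_to_features(
    train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)

The in-memory examples are traversed one by one, and each is turned into key:value features.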

feature = convert_single_example(ex_index, example, label_list,
                                 max_seq_length, tokenizer)
def create_int_feature(values):
    f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
    return f

# Five features, i.e. five columns
features = collections.OrderedDict()
features["input_ids"] = create_int_feature(feature.input_ids)
features["input_mask"] = create_int_feature(feature.input_mask)
features["segment_ids"] = create_int_feature(feature.segment_ids)
# The scalar label is wrapped in a list
features["label_ids"] = create_int_feature([feature.label_id])
features["is_real_example"] = create_int_feature(
    [int(feature.is_real_example)])

tf_example = tf.train.Example(features=tf.train.Features(feature=features))
# Written one record at a time
writer.write(tf_example.SerializeToString())

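The logic for converting a single InputExample into features is as follows. First, the token sequences are truncated so that the total length does not exceed the maximum of 128.

if tokens_b:
    # If tokens_a and tokens_b together are too long, truncate them so that the total
    # is at most max_seq_length - 3, leaving room for [CLS], [SEP], [SEP]
    _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
else:
    # Account for [CLS] and [SEP] with "- 2"
    if len(tokens_a) > max_seq_length - 2:
        tokens_a = tokens_a[0:(max_seq_length - 2)]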

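When BERT's input is a single sentence there are two special tokens, [CLS] and [SEP]; when it is a sentence pair there are three, [CLS], [SEP] and [SEP]. The maximum token length is therefore reduced by 2 or 3 accordingly. Excess tokens are dropped from the right; for a sentence pair, tokens are removed from whichever sequence is currently longer.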
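The "drop from whichever sequence is longer" rule can be sketched in a few lines. This is an illustrative re-implementation of the behaviour described above, not a copy of the repository's _truncate_seq_pair; the function name truncate_seq_pair and the toy token lists are assumptions.

```python
def truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Pop tokens from the end of the longer list until the pair fits."""
    while len(tokens_a) + len(tokens_b) > max_length:
        # On a tie, the second sequence is shortened.
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()


a = list("abcdefg")  # 7 toy tokens
b = list("xyz")      # 3 toy tokens
truncate_seq_pair(a, b, 8)
print(a, b)
```

Both lists are modified in place, so the shorter sentence is left untouched until the two have the same length.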
Next, the token ids and type_ids are built in the format BERT expects. The authors' comment in the code reads:

    # The convention in BERT is:
    # (a) For sequence pairs:
    #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
    #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1
    # (b) For single sequences:
    #  tokens:   [CLS] the dog is hairy . [SEP]
    #  type_ids: 0     0   0   0  0     0 0

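Single sentences and sentence pairs were covered above. The new type_ids field marks which sentence a token belongs to: 0 for every token in the first sentence and 1 for every token in the second.
The authors insert [CLS] and [SEP] into the token list and then convert all tokens to integer ids.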

input_ids = tokenizer.convert_tokens_to_ids(tokens)

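The mask is built on top of this: real tokens get 1, and padding positions are filled with 0.

input_mask = [1] * len(input_ids)

# Zero-pad up to the sequence length.
# Sequences were already truncated to at most 128 above, so a full-length one never enters the loop
while len(input_ids) < max_seq_length:
    # Every padding value is 0, which is also the id of [PAD]
    input_ids.append(0)
    input_mask.append(0)
    segment_ids.append(0)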
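Putting the pieces together, the following sketch assembles input_ids, input_mask and segment_ids for one sentence pair. It is a simplified stand-in for convert_single_example: build_features and toy_vocab are hypothetical names, the inputs are assumed to be tokenized and truncated already, and the ids are taken from the log shown below.

```python
def build_features(tokens_a, tokens_b, max_seq_length, vocab):
    """Assemble padded input_ids, input_mask and segment_ids for a sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    input_ids = [vocab[token] for token in tokens]
    input_mask = [1] * len(input_ids)  # 1 marks a real token
    pad = max_seq_length - len(input_ids)
    input_ids += [0] * pad    # 0 is the id of [PAD]
    input_mask += [0] * pad   # 0 marks padding
    segment_ids += [0] * pad
    return input_ids, input_mask, segment_ids


# Toy vocabulary, assumed for illustration only.
toy_vocab = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102, "he": 2002, "said": 2056, "the": 1996}
ids, mask, segs = build_features(["he", "said"], ["the"], 8, toy_vocab)
print(ids)
print(mask)
print(segs)
```

Note that padding positions receive segment id 0, the same value as the first sentence; it is the mask, not the segment id, that tells the model which positions to ignore.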

Once the data is built, the log prints 5 examples. One of them is shown below; its features include tokens, input_ids, input_mask and segment_ids.

INFO:tensorflow:guid: dev-1
INFO:tensorflow:tokens: [CLS] he said the foods ##er ##vic ##e pie business doesn ' t fit the company ' s long - term growth strategy . [SEP] " the foods ##er ##vic ##e pie business does not fit our long - term growth strategy . [SEP]
INFO:tensorflow:input_ids: 101 2002 2056 1996 9440 2121 7903 2063 11345 2449 2987 1005 1056 4906 1996 2194 1005 1055 2146 1011 2744 3930 5656 1012 102 1000 1996 9440 2121 7903 2063 11345 2449 2515 2025 4906 2256 2146 1011 2744 3930 5656 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:label: 1 (id = 1)

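BERT model layer and transformer structure

First, the configurable hyperparameters of the pre-trained BERT model are read from the JSON config file; its contents form a dictionary.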

bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)

{"hidden_size": 128, "hidden_act": "gelu", "initializer_range": 0.02, "vocab_size": 30522, "hidden_dropout_prob": 0.1, "num_attention_heads": 2, "type_vocab_size": 2, "max_position_embeddings": 512, "num_hidden_layers": 2, "intermediate_size": 512, "attention_probs_dropout_prob": 0.1}

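The main settings are the hidden size (the dimension of the vocabulary embeddings and of each token's representation after multi-head self-attention), the hidden-layer activation function, the vocabulary size and the dropout rates.
The next step computes the total number of training steps and how many of them are used for learning-rate warmup.

# Total number of steps = 343
num_train_steps = int(
    # train_batch_size = 32
    # num_train_epochs = 3
    len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs)
# warmup_proportion = 0.1: start with a small learning rate, then ramp up to the configured one; 34 steps here
num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)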

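warmup_proportion=0.1 means that during the first 10% of the steps the learning rate grows gradually from a very small value up to the configured learning rate. The model therefore takes small steps early in training, which guards against failing to converge when a large learning rate meets initial parameters that do not yet match the new task.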
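The step arithmetic can be checked with plain Python. This sketch assumes the MRPC training set has 3668 examples, a figure that is not stated in this article; it reproduces the 343 total steps and 34 warmup steps noted in the code comments. The function name training_schedule is made up for the example.

```python
def training_schedule(num_examples, batch_size, num_epochs, warmup_proportion):
    """Return (total training steps, warmup steps), both truncated to integers."""
    num_train_steps = int(num_examples / batch_size * num_epochs)
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    return num_train_steps, num_warmup_steps


# 3668 is an assumed training-set size for MRPC.
steps, warmup = training_schedule(3668, 32, 3, 0.1)
print(steps, warmup)
```

The total matches the global_step = 343 reported in the evaluation log.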
The next step builds the model function.

model_fn = model_fn_builder(
        bert_config=bert_config,
        num_labels=len(label_list),
        init_checkpoint=FLAGS.init_checkpoint,
        learning_rate=FLAGS.learning_rate,
        num_train_steps=num_train_steps,
        num_warmup_steps=num_warmup_steps,
        use_tpu=FLAGS.use_tpu,
        use_one_hot_embeddings=FLAGS.use_tpu)

Inside model_fn, the authors create the BERT model, feed it the features read from the tfrecord file, and run a forward pass against the binary classification label to obtain the loss.

# Forward pass
(total_loss, per_example_loss, logits, probabilities) = create_model(
    bert_config, is_training, input_ids, input_mask, segment_ids, label_ids,
    num_labels, use_one_hot_embeddings)

Stepping into create_model, this function defines the BERT model.

model = modeling.BertModel(
        config=bert_config,
        is_training=is_training,
        input_ids=input_ids,
        input_mask=input_mask,
        token_type_ids=segment_ids,
        use_one_hot_embeddings=use_one_hot_embeddings)

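Stepping into BertModel, we reach the network structure itself. The authors first build an embedding lookup table over the whole vocabulary with dimension 128, then look up input_ids in it to obtain dense vectors.

            with tf.variable_scope("embeddings"):
                # Perform embedding lookup on the word ids.
                # Returns the embeddings of the input ids together with the lookup table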
                (self.embedding_output, self.embedding_table) = embedding_lookup(
                    input_ids=input_ids,
                    vocab_size=config.vocab_size,
                    embedding_size=config.hidden_size,
                    initializer_range=config.initializer_range,
                    word_embedding_name="word_embeddings",
                    use_one_hot_embeddings=use_one_hot_embeddings)

The embedding output is then post-processed: the segment (token type) embeddings and the position embeddings are added to it, followed by layer norm and dropout.

self.embedding_output = embedding_postprocessor(
                    input_tensor=self.embedding_output,
                    use_token_type=True,
                    token_type_ids=token_type_ids,
                    token_type_vocab_size=config.type_vocab_size,
                    token_type_embedding_name="token_type_embeddings",
                    use_position_embeddings=True,
                    position_embedding_name="position_embeddings",
                    initializer_range=config.initializer_range,
                    max_position_embeddings=config.max_position_embeddings,
                    dropout_prob=config.hidden_dropout_prob)

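When building the position embeddings, the authors simply slice the first 128 rows out of the full position embedding table (a trainable variable with max_position_embeddings=512 rows), because the input length is capped at 128 in this example.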

position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                           [seq_length, -1])
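A toy version of the three-way sum, for a single sequence and using plain lists, may help to see what the slice does. This is only a sketch: add_embeddings and the small integer tables are invented for the illustration, and the layer norm and dropout applied afterwards are omitted.

```python
def add_embeddings(word_emb, segment_table, position_table, segment_ids):
    """Sum word, segment and position vectors at every position of one sequence."""
    seq_length = len(word_emb)
    # Keep only the first seq_length rows, the same idea as the tf.slice call above.
    position_emb = position_table[:seq_length]
    output = []
    for pos in range(seq_length):
        seg_vec = segment_table[segment_ids[pos]]
        output.append([w + s + p for w, s, p in zip(word_emb[pos], seg_vec, position_emb[pos])])
    return output


# Toy tables with hidden size 2; real values would be learned floats.
word_emb = [[100, 100], [200, 200], [300, 300]]       # 3 tokens
segment_table = [[0, 0], [10, 10]]                    # type_vocab_size = 2
position_table = [[1, 1], [2, 2], [3, 3], [4, 4]]     # 4 positions available
summed = add_embeddings(word_emb, segment_table, position_table, [0, 0, 1])
print(summed)
```

The fourth row of the position table is never used because the sequence has only three tokens, just as rows 128 to 511 go unused when max_seq_length is 128.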

After the three embeddings are summed, the input is ready and enters the encoder. The encoder matches the Transformer encoder, so it is not expanded on here.

self.all_encoder_layers = transformer_model(
                    input_tensor=self.embedding_output,
                    attention_mask=attention_mask,
                    hidden_size=config.hidden_size,  # 128
                    num_hidden_layers=config.num_hidden_layers,  # 2
                    num_attention_heads=config.num_attention_heads,  # 2
                    intermediate_size=config.intermediate_size,  # 512
                    intermediate_act_fn=get_activation(config.hidden_act),  # gelu activation
                    hidden_dropout_prob=config.hidden_dropout_prob,  # 0.1
                    attention_probs_dropout_prob=config.attention_probs_dropout_prob,  # 0.1
                    initializer_range=config.initializer_range,  # 0.02
                    do_return_all_layers=True)

The output self.all_encoder_layers is a list holding the result of every transformer block. The authors take the result of the last block as the output of the whole encoder.

self.sequence_output = self.all_encoder_layers[-1]

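Finally, the authors take the output vector at the first position, the [CLS] token, and apply one extra fully connected layer to it.

            with tf.variable_scope("pooler"):
                # 0:1 selects the first token => [None, 1, 128] => [None, 128]
                first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
                # [None, 128]
                # Apply a dense layer to the first token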
                self.pooled_output = tf.layers.dense(
                    first_token_tensor,
                    config.hidden_size,  # 128
                    activation=tf.tanh,
                    kernel_initializer=create_initializer(config.initializer_range))

The resulting self.pooled_output is a tensor of shape [None, 128].
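The pooling step amounts to "take position 0, apply a dense layer, apply tanh". Here is a minimal single-example sketch with plain lists; pool_first_token and the toy weights are assumptions made for illustration, not the repository's code.

```python
import math


def pool_first_token(sequence_output, kernel, bias):
    """Take the vector at position 0 ([CLS]) and apply a dense layer with tanh."""
    first = sequence_output[0]
    pooled = []
    for j in range(len(bias)):
        total = sum(first[i] * kernel[i][j] for i in range(len(first))) + bias[j]
        pooled.append(math.tanh(total))
    return pooled


# Toy encoder output for 2 tokens with hidden size 2, and an identity kernel.
sequence_output = [[0.0, 1.0], [5.0, 5.0]]
kernel = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]
pooled = pool_first_token(sequence_output, kernel, bias)
print(pooled)
```

Only the first row of sequence_output influences the result; the other token vectors reach the classifier indirectly, through self-attention inside the encoder.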

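Downstream fine-tuning network

The downstream task is MRPC classification: deciding whether two sentences mean the same thing. The authors take the pooled output self.pooled_output, add one fully connected layer that maps it onto the labels, and compute the loss.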

output_layer = model.get_pooled_output()

    hidden_size = output_layer.shape[-1].value

    # output_weights and output_bias are the only variables not restored from the ckpt
    output_weights = tf.get_variable(
        "output_weights", [num_labels, hidden_size],
        initializer=tf.truncated_normal_initializer(stddev=0.02))

    output_bias = tf.get_variable(
        "output_bias", [num_labels], initializer=tf.zeros_initializer())

    with tf.variable_scope("loss"):
        if is_training:
            # I.e., 0.1 dropout
            output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)

        # Hand-written negative log-likelihood (cross-entropy) loss
        # The pooled output is projected directly onto the labels
        logits = tf.matmul(output_layer, output_weights, transpose_b=True)
        logits = tf.nn.bias_add(logits, output_bias)
        probabilities = tf.nn.softmax(logits, axis=-1)
        log_probs = tf.nn.log_softmax(logits, axis=-1)

        one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)

        per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
        loss = tf.reduce_mean(per_example_loss)
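The same loss can be written with the standard library to see what each line computes. This is a sketch for intuition: softmax_cross_entropy and the toy logits are invented for the example, and a numerically stable log-softmax is used in place of tf.nn.log_softmax.

```python
import math


def softmax_cross_entropy(logits, label, num_labels):
    """Negative log-probability of the true label under softmax(logits)."""
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    log_probs = [x - log_sum for x in logits]
    one_hot = [1.0 if i == label else 0.0 for i in range(num_labels)]
    return -sum(o * lp for o, lp in zip(one_hot, log_probs))


# Toy batch of two examples with two labels each.
batch_logits = [[0.0, 0.0], [2.0, 0.0]]
batch_labels = [1, 0]
per_example_loss = [softmax_cross_entropy(l, y, 2) for l, y in zip(batch_logits, batch_labels)]
loss = sum(per_example_loss) / len(per_example_loss)
print(per_example_loss, loss)
```

With equal logits the model is maximally unsure and the loss is ln 2, about 0.693, which is a useful reference point when reading the eval_loss values in the logs.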

In training mode, the authors also build the training op by hand.

        if mode == tf.estimator.ModeKeys.TRAIN:

            # Gradients are clipped once
            # Backward pass
            train_op = optimization.create_optimizer(
                total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)

            output_spec = tf.contrib.tpu.TPUEstimatorSpec(
                mode=mode,
                loss=total_loss,
                train_op=train_op,
                scaffold_fn=scaffold_fn)

This train_op includes the learning-rate warmup logic.

    if num_warmup_steps:
        global_steps_int = tf.cast(global_step, tf.int32)
        warmup_steps_int = tf.constant(num_warmup_steps, dtype=tf.int32)

        global_steps_float = tf.cast(global_steps_int, tf.float32)
        warmup_steps_float = tf.cast(warmup_steps_int, tf.float32)
        # As global_steps_float grows, warmup_learning_rate grows with it
        warmup_percent_done = global_steps_float / warmup_steps_float
        warmup_learning_rate = init_lr * warmup_percent_done

        is_warmup = tf.cast(global_steps_int < warmup_steps_int, tf.float32)
        # Selects either learning_rate or warmup_learning_rate
        learning_rate = (
                (1.0 - is_warmup) * learning_rate + is_warmup * warmup_learning_rate)
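The resulting schedule can be sketched as a plain function of the step number. The warmup branch follows the excerpt above. The linear decay to zero after warmup is an assumption based on how create_optimizer is understood to decay the learning rate, and it is not shown in the excerpt. The function name learning_rate_at is made up.

```python
def learning_rate_at(step, init_lr, num_train_steps, num_warmup_steps):
    """Learning rate at a given step: linear warmup, then (assumed) linear decay to 0."""
    # Assumed linear decay from init_lr down to 0 over the whole run.
    decayed = init_lr * (1.0 - min(step, num_train_steps) / num_train_steps)
    if num_warmup_steps and step < num_warmup_steps:
        # During warmup the rate is init_lr scaled by the fraction of warmup done.
        return init_lr * step / num_warmup_steps
    return decayed


for step in (0, 17, 34, 343):
    print(step, learning_rate_at(step, 2e-5, 343, 34))
```

Under these assumptions the rate rises from 0 to roughly the configured 2e-5 over the first 34 steps and then falls back to 0 by step 343.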

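The authors also wrote their own optimizer, AdamWeightDecayOptimizer, which handles weight decay and the iterative update of the parameters; the learning-rate decay itself is applied in create_optimizer before the optimizer is constructed.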

optimizer = AdamWeightDecayOptimizer(
        learning_rate=learning_rate,
        weight_decay_rate=0.01,
        beta_1=0.9,
        beta_2=0.999,
        epsilon=1e-6,
        exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])
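A scalar sketch of one update step shows how weight decay enters the update separately from the gradient. It is a simplified illustration under stated assumptions, not the repository's implementation: it works on single floats, applies no bias correction, and the names adam_weight_decay_step and uses_weight_decay are invented. The name-based exclusion uses a plain substring test.

```python
import math


def uses_weight_decay(name, exclude=("LayerNorm", "layer_norm", "bias")):
    """Return False for variables whose name contains an excluded pattern."""
    return not any(pattern in name for pattern in exclude)


def adam_weight_decay_step(param, grad, m, v, lr, weight_decay=0.01,
                           beta_1=0.9, beta_2=0.999, epsilon=1e-6, decay=True):
    """One Adam-style step with decoupled weight decay on a scalar parameter."""
    m = beta_1 * m + (1.0 - beta_1) * grad           # first-moment estimate
    v = beta_2 * v + (1.0 - beta_2) * grad * grad    # second-moment estimate
    update = m / (math.sqrt(v) + epsilon)
    if decay:
        update += weight_decay * param               # decay is added to the update, not to the loss
    return param - lr * update, m, v


decayed, _, _ = adam_weight_decay_step(1.0, 0.0, 0.0, 0.0, lr=0.1)
kept, _, _ = adam_weight_decay_step(1.0, 0.0, 0.0, 0.0, lr=0.1, decay=False)
print(decayed, kept)
print(uses_weight_decay("bert/pooler/dense/kernel"), uses_weight_decay("bert/pooler/dense/bias"))
```

With a zero gradient, the decayed parameter still shrinks slightly while the excluded one stays put, which is why LayerNorm parameters and biases are listed in exclude_from_weight_decay.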

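Next the estimator is constructed. Note that if the model_dir in run_config already contains a trained model, evaluation (mode EVAL) reads that trained model directly.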

estimator = tf.contrib.tpu.TPUEstimator(
        # False
        use_tpu=FLAGS.use_tpu,
        model_fn=model_fn,
        config=run_config,
        train_batch_size=FLAGS.train_batch_size,  # 32
        eval_batch_size=FLAGS.eval_batch_size,  # 8
        predict_batch_size=FLAGS.predict_batch_size)  # 8

Finally, the tfrecord file on disk is read to build the input function, which is passed to the estimator for training.

train_input_fn = file_based_input_fn_builder(
            input_file=train_file,
            seq_length=FLAGS.max_seq_length,
            is_training=True,
            drop_remainder=True)
        estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)

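Transferring pre-trained model parameters

After building the model, the authors call tf.train.init_from_checkpoint to copy the pre-trained parameters from the ckpt checkpoint, unchanged, into the identically named variables of the newly built BERT network.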

        if init_checkpoint:
            (assignment_map, initialized_variable_names
             ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
            if use_tpu:

                def tpu_scaffold():
                    tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
                    return tf.train.Scaffold()

                scaffold_fn = tpu_scaffold
            else:
                # Restore the values of identically named variables from the ckpt into the new network
                tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
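The name-matching idea behind the assignment map can be sketched with plain dictionaries. This is an illustration of the concept only: build_assignment_map and the variable-name lists are made up here, and the real helper reads the names from the checkpoint file.

```python
def build_assignment_map(model_var_names, checkpoint_var_names):
    """Map each checkpoint variable onto the identically named model variable."""
    # Model variable names end in ":0"; checkpoint names do not.
    model_names = {name.split(":")[0] for name in model_var_names}
    assignment_map = {}
    initialized = set()
    for name in checkpoint_var_names:
        if name in model_names:
            assignment_map[name] = name
            initialized.add(name + ":0")
    return assignment_map, initialized


# Hypothetical name lists for illustration.
model_vars = ["bert/pooler/dense/kernel:0", "output_weights:0", "output_bias:0"]
ckpt_vars = ["bert/pooler/dense/kernel", "cls/predictions/output_bias"]
assignment_map, initialized = build_assignment_map(model_vars, ckpt_vars)
print(assignment_map)
print(sorted(initialized))
```

Variables that exist only in the new network, such as output_weights, are left out of the map and keep their fresh initialization; checkpoint variables with no counterpart in the new network are simply ignored.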

The parameters and their transfer status are listed below.

Variable name	Shape	Initialized from ckpt
bert/embeddings/word_embeddings:0,	(30522,128),	INIT_FROM_CKPT
bert/embeddings/token_type_embeddings:0,	(2,128),	INIT_FROM_CKPT
bert/embeddings/position_embeddings:0,	(512,128),	INIT_FROM_CKPT
bert/embeddings/LayerNorm/beta:0,	(128),	INIT_FROM_CKPT
bert/embeddings/LayerNorm/gamma:0,	(128),	INIT_FROM_CKPT
bert/encoder/layer_0/attention/self/query/kernel:0,	(128,128),	INIT_FROM_CKPT
bert/encoder/layer_0/attention/self/query/bias:0,	(128),	INIT_FROM_CKPT
bert/encoder/layer_0/attention/self/key/kernel:0,	(128,128),	INIT_FROM_CKPT
bert/encoder/layer_0/attention/self/key/bias:0,	(128),	INIT_FROM_CKPT
bert/encoder/layer_0/attention/self/value/kernel:0,	(128,128),	INIT_FROM_CKPT
bert/encoder/layer_0/attention/self/value/bias:0,	(128),	INIT_FROM_CKPT
bert/encoder/layer_0/attention/output/dense/kernel:0,	(128,128),	INIT_FROM_CKPT
bert/encoder/layer_0/attention/output/dense/bias:0,	(128),	INIT_FROM_CKPT
bert/encoder/layer_0/attention/output/LayerNorm/beta:0,	(128),	INIT_FROM_CKPT
bert/encoder/layer_0/attention/output/LayerNorm/gamma:0,	(128),	INIT_FROM_CKPT
bert/encoder/layer_0/intermediate/dense/kernel:0,	(128,512),	INIT_FROM_CKPT
bert/encoder/layer_0/intermediate/dense/bias:0,	(512),	INIT_FROM_CKPT
bert/encoder/layer_0/output/dense/kernel:0,	(512,128),	INIT_FROM_CKPT
bert/encoder/layer_0/output/dense/bias:0,	(128),	INIT_FROM_CKPT
bert/encoder/layer_0/output/LayerNorm/beta:0,	(128),	INIT_FROM_CKPT
bert/encoder/layer_0/output/LayerNorm/gamma:0,	(128),	INIT_FROM_CKPT
bert/encoder/layer_1/attention/self/query/kernel:0,	(128,128),	INIT_FROM_CKPT
bert/encoder/layer_1/attention/self/query/bias:0,	(128),	INIT_FROM_CKPT
bert/encoder/layer_1/attention/self/key/kernel:0,	(128,128),	INIT_FROM_CKPT
bert/encoder/layer_1/attention/self/key/bias:0,	(128),	INIT_FROM_CKPT
bert/encoder/layer_1/attention/self/value/kernel:0,	(128,128),	INIT_FROM_CKPT
bert/encoder/layer_1/attention/self/value/bias:0,	(128),	INIT_FROM_CKPT
bert/encoder/layer_1/attention/output/dense/kernel:0,	(128,128),	INIT_FROM_CKPT
bert/encoder/layer_1/attention/output/dense/bias:0,	(128),	INIT_FROM_CKPT
bert/encoder/layer_1/attention/output/LayerNorm/beta:0,	(128),	INIT_FROM_CKPT
bert/encoder/layer_1/attention/output/LayerNorm/gamma:0,	(128),	INIT_FROM_CKPT
bert/encoder/layer_1/intermediate/dense/kernel:0,	(128,512),	INIT_FROM_CKPT
bert/encoder/layer_1/intermediate/dense/bias:0,	(512),	INIT_FROM_CKPT
bert/encoder/layer_1/output/dense/kernel:0,	(512,128),	INIT_FROM_CKPT
bert/encoder/layer_1/output/dense/bias:0,	(128),	INIT_FROM_CKPT
bert/encoder/layer_1/output/LayerNorm/beta:0,	(128),	INIT_FROM_CKPT
bert/encoder/layer_1/output/LayerNorm/gamma:0,	(128),	INIT_FROM_CKPT
bert/pooler/dense/kernel:0,	(128,128),	INIT_FROM_CKPT
bert/pooler/dense/bias:0,	(128),	INIT_FROM_CKPT
output_weights:0,	(2, 128)
output_bias:0,	(2,)

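Only the fully connected weights and bias of the fine-tuning layer are initialized by the new model; every other parameter is transferred from the pre-trained model, and all of them are then trained together. The pre-trained parameters cover the word embeddings, position embeddings, segment embeddings, the Q/K/V parameters of each Transformer layer and the dense-layer parameters. There are 4,386,178 parameters in total (about 4.4 million), of which the word embeddings alone account for about 3.9 million, or 89%. This suggests that the better the word embeddings learned during pre-training, the easier fine-tuning becomes, since the word embeddings make up by far the largest share of what fine-tuning adjusts.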
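The parameter total can be reproduced from the shapes in the table. The sketch below simply adds up the listed shapes using the values from bert_config.json; the function name count_bert_params is made up for the example.

```python
def count_bert_params(vocab_size, hidden, layers, intermediate,
                      max_positions, type_vocab, num_labels):
    """Count trainable parameters from the variable shapes listed above."""
    # Word, segment and position tables plus the embedding LayerNorm (beta and gamma).
    embeddings = (vocab_size + type_vocab + max_positions) * hidden + 2 * hidden
    # Query, key, value and attention-output dense layers plus one LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + 2 * hidden
    # Intermediate and output dense layers plus one LayerNorm.
    feed_forward = (hidden * intermediate + intermediate) \
        + (intermediate * hidden + hidden) + 2 * hidden
    pooler = hidden * hidden + hidden
    classifier = num_labels * hidden + num_labels
    total = embeddings + layers * (attention + feed_forward) + pooler + classifier
    return total, vocab_size * hidden


total, word_embeddings = count_bert_params(
    vocab_size=30522, hidden=128, layers=2, intermediate=512,
    max_positions=512, type_vocab=2, num_labels=2)
print(total, word_embeddings, round(word_embeddings / total, 2))
```

The word embedding table (30522 x 128) dominates because the vocabulary is large while this model is only 2 layers deep and 128 wide; in a larger BERT the encoder layers would take a much bigger share.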
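If init_checkpoint is None here, the pre-trained model is not used and training starts from the new network's randomly initialized embeddings. The run looks like this: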

python run_classifier.py --task_name=MRPC --do_train=true --do_eval=true --data_dir=./bert-master/GLUE_MRPC --vocab_file=./bert-master/bert_base_model/vocab.txt --bert_config_file=./bert-master/bert_base_model/bert_config.json  --max_seq_length=128 --train_batch_size=32 --learning_rate=2e-5 --num_train_epochs=3 --output_dir=/tmp/mrpc_output

Dev-set accuracy is shown below; it is only about 0.684.

I0814 21:21:21.268342 140291776911168 run_classifier.py:993] ***** Eval results *****
INFO:tensorflow:  eval_accuracy = 0.6838235
INFO:tensorflow:  eval_loss = 0.6237707
INFO:tensorflow:  global_step = 343
INFO:tensorflow:  loss = 0.6237707

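That is roughly 3 points lower than the 0.71 obtained with the pre-trained model, which illustrates the advantage of pre-training on large data and then fine-tuning over training a model from scratch on the task data alone.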
