
BERT in Practice (Part 1)

Author: 又双叒叕苟了一天 | Published 2023-04-28 16:15

    This first part covers how to extract embeddings from a pretrained BERT model; the second part will cover how to fine-tune BERT for downstream tasks.

    1. Pretrained BERT models

    Training the roughly 110M-parameter BERT-base model from scratch on its ~16 GB pretraining corpus is computationally expensive, so Google has released pretrained BERT models in a variety of configurations that can simply be fine-tuned for downstream tasks. L denotes the number of Transformer encoder layers and H the hidden size.


    (Table: BERT model configurations)
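
    For reference, the configurations Google has released include (parameter counts as commonly cited):

    • BERT-tiny: L=2, H=128
    • BERT-mini: L=4, H=256
    • BERT-small: L=4, H=512
    • BERT-medium: L=8, H=512
    • BERT-base: L=12, H=768 (about 110M parameters)
    • BERT-large: L=24, H=1024 (about 340M parameters)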

    2. Extracting embeddings from a pretrained BERT model

    1. Token-level (word-level) features
    2. Sentence-level features. Rather than using only the feature produced at the [CLS] token, it is usually better to average or otherwise pool the features of all tokens (a mean-pooling sketch follows this list)
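
    As a quick illustration of sentence-level pooling, here is a minimal sketch (it assumes the model and tokenizer loaded in section 2.2.1.1 below) that mean-pools the token embeddings while ignoring [PAD] positions via the attention mask:

    import torch

    inputs = tokenizer('I love China', return_tensors='pt')
    outputs = model(**inputs)
    token_embeddings = outputs['last_hidden_state']        # [1, seq_len, 768]
    # Zero out [PAD] positions, then average over the real tokens
    mask = inputs['attention_mask'].unsqueeze(-1).float()  # [1, seq_len, 1]
    sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
    print(sentence_embedding.shape)
    # torch.Size([1, 768])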

    2.1 Installing the Hugging Face Transformers library

    pip install transformers

    (The examples below also use PyTorch, which can be installed with pip install torch.)
    

    2.2 Generating BERT embeddings

    1. Preprocess the sentence
    2. Feed it to the model to obtain embeddings

    2.2.1 Preprocessing the sentence

    2.2.1.1 Loading the model and tokenizer

    import torch
    from transformers import BertModel, BertTokenizer

    # Load the pretrained BERT model
    model = BertModel.from_pretrained('bert-base-uncased')
    # Load the matching tokenizer
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    

    2.2.1.2 Preprocessing a sentence manually

    Steps:

    1. Tokenize: tokenizer.tokenize(sentence)
    2. Add the [CLS] and [SEP] tokens
    3. Pad with [PAD] tokens
    4. Create the attention mask: the [PAD] positions should not be attended to
    5. Convert tokens to token ids: tokenizer.convert_tokens_to_ids(tokens) / decode token ids back to tokens: tokenizer.decode(input_ids)
    6. Convert the token ids and the attention mask to tensors
    sentence = 'I love China'
    print('Sentence: {}'.format(sentence))
    # Sentence: I love China

    tokens = tokenizer.tokenize(sentence)
    print('Tokens: {}'.format(tokens))
    # Tokens: ['i', 'love', 'china']

    tokens = ['[CLS]'] + tokens + ['[SEP]']
    print('With [CLS] and [SEP] added: {}'.format(tokens))
    # With [CLS] and [SEP] added: ['[CLS]', 'i', 'love', 'china', '[SEP]']

    tokens = tokens + ['[PAD]'] + ['[PAD]']
    print('Padded with [PAD]: {}'.format(tokens))
    # Padded with [PAD]: ['[CLS]', 'i', 'love', 'china', '[SEP]', '[PAD]', '[PAD]']

    attention_mask = [1 if i != '[PAD]' else 0 for i in tokens]
    print('Attention mask: {}'.format(attention_mask))
    # Attention mask: [1, 1, 1, 1, 1, 0, 0]

    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    print('Token ids: {}'.format(input_ids))
    # Token ids: [101, 1045, 2293, 2859, 102, 0, 0]
    decode_ids = tokenizer.decode(input_ids)
    print('Token ids decoded back to text: {}'.format(decode_ids))
    # Token ids decoded back to text: [CLS] i love china [SEP] [PAD] [PAD]

    attention_mask = torch.tensor(attention_mask).unsqueeze(0)
    input_ids = torch.tensor(input_ids).unsqueeze(0)
    print('Attention mask tensor: {}'.format(attention_mask))
    # Attention mask tensor: tensor([[1, 1, 1, 1, 1, 0, 0]])
    print('Token id tensor: {}'.format(input_ids))
    # Token id tensor: tensor([[ 101, 1045, 2293, 2859,  102,    0,    0]])
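
    With the tensors prepared, they can be fed to the model in the same way as in section 2.2.2 below; a minimal sketch (shapes assume bert-base-uncased):

    outputs = model(input_ids, attention_mask=attention_mask)
    print('last_hidden_state shape: {}'.format(outputs['last_hidden_state'].shape))
    # last_hidden_state shape: torch.Size([1, 7, 768])
    print('pooler_output shape: {}'.format(outputs['pooler_output'].shape))
    # pooler_output shape: torch.Size([1, 768])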
    

    2.2.1.3 Encoding a sentence with the tokenizer

    • tokenizer(sentence)
    sentence = 'I love China'
    inputs = tokenizer(sentence)
    print('Sentence: {}'.format(sentence))
    # Sentence: I love China
    print('input_ids: {}'.format(inputs['input_ids']))
    # input_ids: [101, 1045, 2293, 2859, 102]
    print('attention_mask: {}'.format(inputs['attention_mask']))
    # attention_mask: [1, 1, 1, 1, 1]
    print('token_type_ids: {}'.format(inputs['token_type_ids']))
    # token_type_ids: [0, 0, 0, 0, 0]
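
    Note that the tokenizer can also return PyTorch tensors directly by passing return_tensors='pt', which avoids the manual torch.tensor(...).unsqueeze(0) step; a small sketch:

    inputs = tokenizer('I love China', return_tensors='pt')
    print('input_ids: {}'.format(inputs['input_ids']))
    # input_ids: tensor([[ 101, 1045, 2293, 2859,  102]])
    print('attention_mask: {}'.format(inputs['attention_mask']))
    # attention_mask: tensor([[1, 1, 1, 1, 1]])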
    

    2.2.1.4 Encoding two sentences as a padded batch

    • tokenizer([sentence_a, sentence_b], padding=True)
    sentence_a = 'This is a short sentence.'
    sentence_b = 'This is a rather long sequence. It is at least longer than the sequence A.'
    print('Sentence a: {}'.format(sentence_a))
    print('Sentence b: {}'.format(sentence_b))
    
    outputs = tokenizer([sentence_a, sentence_b], padding=True)
    print('input_ids: {}'.format(outputs['input_ids']))
    # input_ids: [[101, 2023, 2003, 1037, 2460, 6251, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 2023, 2003, 1037, 2738, 2146, 5537, 1012, 2009, 2003, 2012, 2560, 2936, 2084, 1996, 5537, 1037, 1012, 102]]
    print('attention_mask: {}'.format(outputs['attention_mask']))
    # attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
    print('token_type_ids: {}'.format(outputs['token_type_ids']))
    # token_type_ids: [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
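
    When batching, it is also common to cap the sequence length with the standard truncation and max_length arguments; a small sketch (lengths assume the bert-base-uncased tokenizer):

    outputs = tokenizer([sentence_a, sentence_b], padding=True, truncation=True, max_length=10)
    print('lengths: {}'.format([len(ids) for ids in outputs['input_ids']]))
    # lengths: [10, 10]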
    

    2.2.1.5 Encoding a sentence pair as one sequence

    • tokenizer(sentence_a, sentence_b)
    sentence_a = 'This is a short sentence.'
    sentence_b = 'This is a rather long sequence. It is at least longer than the sequence A.'
    print('Sentence a: {}'.format(sentence_a))
    print('Sentence b: {}'.format(sentence_b))
    
    inputs = tokenizer(sentence_a, sentence_b)
    print('input_ids: {}'.format(inputs['input_ids']))
    # input_ids: [101, 2023, 2003, 1037, 2460, 6251, 1012, 102, 2023, 2003, 1037, 2738, 2146, 5537, 1012, 2009, 2003, 2012, 2560, 2936, 2084, 1996, 5537, 1037, 1012, 102]
    print('attention_mask: {}'.format(inputs['attention_mask']))
    # attention_mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    print('token_type_ids: {}'.format(inputs['token_type_ids']))
    # token_type_ids: [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
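
    The token_type_ids (segment ids) mark which tokens belong to which sentence; decoding the ids makes the concatenated layout visible (output spacing may differ slightly):

    print(tokenizer.decode(inputs['input_ids']))
    # [CLS] this is a short sentence. [SEP] this is a rather long sequence. it is at least longer than the sequence a. [SEP]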
    

    2.2.2 Calling the model to obtain embeddings

    Calling the model:

    1. model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)

    Outputs:

    1. outputs['pooler_output']: a sentence-level embedding, used for the next sentence prediction (NSP) task during pretraining. It is obtained by passing the final [CLS] representation through a feed-forward layer with a tanh activation (see the sketch after the code below)
    2. outputs['last_hidden_state']: token-level embeddings, used for the masked language modeling (MLM) predictions during pretraining
    sentence_a = 'This is a short sentence.'
    sentence_b = 'This is a rather long sequence. It is at least longer than the sequence A.'
    inputs = tokenizer(sentence_a, sentence_b)
    input_ids = torch.tensor(inputs['input_ids']).unsqueeze(0)
    attention_mask = torch.tensor(inputs['attention_mask']).unsqueeze(0)
    token_type_ids = torch.tensor(inputs['token_type_ids']).unsqueeze(0)
    
    outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
    pooler_output = outputs['pooler_output'] 
    last_hidden_state = outputs['last_hidden_state']  
    print('pooler_output shape: {}'.format(pooler_output.shape))  # [batch_size, hidden_size]
    # pooler_output shape: torch.Size([1, 768])
    print('last_hidden_state shape: {}'.format(last_hidden_state.shape))  # [batch_size, sequence_length, hidden_size]
    # last_hidden_state shape: torch.Size([1, 26, 768])
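
    To see where pooler_output comes from, the pooler head (a linear layer plus tanh applied at the [CLS] position, exposed as model.pooler in current versions of transformers) can be applied to the final hidden states directly; a minimal sketch:

    # Recompute the pooled output from the final hidden states
    recomputed = model.pooler(last_hidden_state)
    print(torch.allclose(recomputed, pooler_output))
    # True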
    

    3. Extracting embeddings from all of BERT's encoder layers

    Why: the features of a single hidden layer are not always the best choice for a downstream task. The BERT authors ran experiments on a named entity recognition task comparing the F1 scores obtained with features from different layers; concatenating the last four hidden layers gave the best F1 score.

    (Figure: F1 scores for embeddings taken from different layers)

    3.1 How to extract them

    • BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True): setting output_hidden_states=True makes the model return the hidden states of every layer

    Outputs:

    1. outputs['hidden_states']: a tuple of 13 tensors, each of shape [batch_size, sequence_length, hidden_size]
    2. hidden_states[0]: the output of the input embedding layer
    3. hidden_states[-1]: the output of the last encoder layer (the same as last_hidden_state)
    model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    
    sentence_a = 'This is a short sentence.'
    sentence_b = 'This is a rather long sequence. It is at least longer than the sequence A.'
    print('Sentence a: {}'.format(sentence_a))
    print('Sentence b: {}'.format(sentence_b))
    outputs = tokenizer(sentence_a, sentence_b)
    input_ids = torch.tensor(outputs['input_ids']).unsqueeze(0)
    attention_mask = torch.tensor(outputs['attention_mask']).unsqueeze(0)
    token_type_ids = torch.tensor(outputs['token_type_ids']).unsqueeze(0)
    
    outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
    pooler_output = outputs['pooler_output']
    last_hidden_state = outputs['last_hidden_state']
    hidden_states = outputs['hidden_states']
    print('pooler_output shape: {}'.format(pooler_output.shape))  # [batch_size, hidden_size]
    # pooler_output shape: torch.Size([1, 768])
    print('last_hidden_state shape: {}'.format(last_hidden_state.shape))  # [batch_size, sequence_length, hidden_size]
    # last_hidden_state shape: torch.Size([1, 26, 768])
    print('hidden_states length: {}'.format(len(hidden_states)))  # 13
    # hidden_states length: 13
    print('hidden_states[0] shape: {}'.format(hidden_states[0].shape))  # [batch_size, sequence_length, hidden_size]
    # hidden_states[0] shape: torch.Size([1, 26, 768])
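
    Following the NER result above (concatenating the last four hidden layers worked best), a minimal sketch of how those layers could be combined into a single feature per token:

    # Concatenate the last four encoder layers along the hidden dimension
    concat_last_four = torch.cat(hidden_states[-4:], dim=-1)
    print('concat shape: {}'.format(concat_last_four.shape))  # [batch_size, sequence_length, 4 * hidden_size]
    # concat shape: torch.Size([1, 26, 3072])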
    

