Sentence-Transformer

Author: 三方斜阳 | Published 2021-02-10 21:59

    For detailed usage, see the official docs: https://www.sbert.net/index.html
    Most of these notes are examples taken from the official docs, copied here as a record of a few common usages.

    Introduction:

    Sentence-Transformer is a Python framework for sentence and text embeddings.
    The initial work is described in the paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.
    The framework can compute sentence and text embeddings for more than 100 languages, which can be used for downstream tasks such as semantic textual similarity, semantic search, and paraphrase mining.
    It is built on PyTorch and Transformers, provides a large collection of pre-trained models, and makes it easy to fine-tune models of your own.

    The following uses a few of the officially provided tasks as examples to briefly show how to use sentence embeddings.
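
    To run the examples below, the package can be installed from PyPI:

    pip install -U sentence-transformers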

    1. Semantic Textual Similarity

    Compute the similarity of two pieces of text. The example here computes the cosine similarity between each pair of corresponding sentences in two lists:

    from sentence_transformers import SentenceTransformer, util
    model = SentenceTransformer('paraphrase-distilroberta-base-v1',device='cuda')
    
    # Two lists of sentences
    sentences1 = ['The cat sits outside',
                 'A man is playing guitar',
                 'The new movie is awesome']
    
    sentences2 = ['The dog plays in the garden',
                  'A woman watches TV',
                  'The new movie is so great']
    
    #Compute embedding for both lists
    embeddings1 = model.encode(sentences1, convert_to_tensor=True)
    embeddings2 = model.encode(sentences2, convert_to_tensor=True)
    print(embeddings1)
    #Compute cosine similarity
    cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings2)
    print(cosine_scores)
    #Output the pairs with their score
    for i in range(len(sentences1)):
        print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[i], cosine_scores[i][i]))
    >>
    tensor([[ 0.0111,  0.1261,  0.2388,  ..., -0.0673,  0.1713,  0.0163],
            [-0.1193,  0.1150, -0.1560,  ..., -0.2504, -0.0789, -0.1212],
            [ 0.0114,  0.1248, -0.0231,  ..., -0.2252,  0.3014,  0.1654]])
    tensor([[ 0.4579,  0.1059,  0.1447],
            [ 0.1239,  0.1759, -0.0344],
            [ 0.1696,  0.1313,  0.9283]])
    The cat sits outside         The dog plays in the garden         Score: 0.4579
    A man is playing guitar          A woman watches TV          Score: 0.1759
    The new movie is awesome         The new movie is so great       Score: 0.9283
    >>
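
    Note that cosine_scores is the full 3x3 matrix of similarities between every sentence in sentences1 and every sentence in sentences2; the loop above only reads its diagonal. A minimal sketch, continuing from the variables above, of using the whole matrix to find the best match in sentences2 for each sentence in sentences1:

    import torch

    # cosine_scores[i][j] is the similarity of sentences1[i] and sentences2[j]
    best_match = torch.argmax(cosine_scores, dim=1)
    for i, j in enumerate(best_match.tolist()):
        print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[j], cosine_scores[i][j]))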
    

    Next: given a query sentence, retrieve the most similar sentences from a passage of text. This requires computing sentence embeddings for every sentence in the text and then ranking by cosine similarity; the code is straightforward:

    from sentence_transformers import SentenceTransformer,util
    model = SentenceTransformer('roberta-large-nli-stsb-mean-tokens',device='cuda')
    sentences = ['The cat sits outside',
          'A man is playing guitar',
          'The new movie is awesome',
          'The new opera is nice']
    sentence_embeddings = model.encode(sentences,convert_to_tensor=True)
    
    for sentence, embedding in zip(sentences, sentence_embeddings):
        print("Sentence:", sentence)
        print("Embedding:", embedding)
        print("")
    
    query =  'The new movie is so great'  # A query sentence used to search for semantically similar sentences.
    queries = [query]
    query_embeddings = model.encode(queries,convert_to_tensor=True)
    
    print("Semantic Search Results")
    number_top_matches = 2
    for query, query_embedding in zip(queries, query_embeddings):
        cosine_scores = util.pytorch_cos_sim(query_embedding,sentence_embeddings)[0]
        results = zip(range(len(cosine_scores)), cosine_scores)
        results = sorted(results, key=lambda x: x[1],reverse=True)
        for i, j in results:
            print(i, j)
        print("Query:", query)
        print("\nTop {} most similar sentences in corpus:".format(number_top_matches))
    
        for idx, distance in results[0:number_top_matches]:
            print(sentences[idx].strip(), "(Cosine Score: %.4f)" % distance)
    >> Output
    Sentence: The cat sits outside
    Embedding: tensor([-0.6349,  0.3843, -0.4646,  ..., -0.3325, -0.7107, -0.0827])
    
    Sentence: A man is playing guitar
    Embedding: tensor([-0.1785,  0.6163, -0.1034,  ...,  1.2210, -1.2130, -0.4310])
    
    Sentence: The new movie is awesome
    Embedding: tensor([ 0.8274,  0.5452, -0.1739,  ...,  0.7432, -2.1740,  1.8347])
    
    Sentence: The new opera is nice
    Embedding: tensor([ 1.4234,  0.9776, -0.4403,  ...,  0.5330, -0.8313,  1.5077])
    
    Semantic Search Results
    2 tensor(0.9788)
    3 tensor(0.6040)
    1 tensor(0.0189)
    0 tensor(-0.0109)
    Query: The new movie is so great
    
    Top 2 most similar sentences in corpus:
    The new movie is awesome (Cosine Score: 0.9788)
    The new opera is nice (Cosine Score: 0.6040)
    >>
    

    util.pytorch_cos_sim()

    1. The two examples above both use util.pytorch_cos_sim() to compute cosine similarity. It accepts two 2-D tensors (one of them may also be a single 1-D embedding), each holding the sentence embeddings of one text. Every embedding in the first argument is compared against every embedding in the second, and the result is a matrix whose entry (i, j) is the cosine similarity between sentence i of the first text and sentence j of the second.
    2. For the Semantic Textual Similarity task there are many pre-trained models that perform well; roberta-large-nli-stsb-mean-tokens and paraphrase-distilroberta-base-v1 used in the examples are just two of them.
    3. The above is a simplified way of finding similar sentences in a list of text sentences; for larger sentence collections, the library provides a more efficient function, paraphrase_mining.
    4. There is also util.semantic_search(), which finds the most similar sentences directly, can run on GPU, and takes a top-k parameter, as well as approximate algorithms that speed up computation over much larger corpora: see the API examples, and the sketch right after this list.
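
    A rough sketch of those two utilities (exact signatures may vary across sentence-transformers versions, so treat this as an illustration):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer('paraphrase-distilroberta-base-v1')
    corpus = ['The cat sits outside',
              'A man is playing guitar',
              'The new movie is awesome',
              'The new opera is nice']
    corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
    query_embedding = model.encode(['The new movie is so great'], convert_to_tensor=True)

    # semantic_search returns, per query, a list of {'corpus_id': ..., 'score': ...}
    # dicts sorted by score; top_k caps the number of hits per query
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
    for hit in hits[0]:
        print(corpus[hit['corpus_id']], "(Score: %.4f)" % hit['score'])

    # paraphrase_mining compares all sentences in one list against each other
    # and returns (score, i, j) triples for the most similar pairs
    pairs = util.paraphrase_mining(model, corpus)
    for score, i, j in pairs[:3]:
        print(corpus[i], "<->", corpus[j], "Score: %.4f" % score)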

    2. Clustering

    Cluster a handful of sentences with a simple k-means:

    """
    This is a simple application for sentence embeddings: clustering
    Sentences are mapped to sentence embeddings and then k-mean clustering is applied.
    """
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans
    
    embedder = SentenceTransformer('paraphrase-distilroberta-base-v1',device='cuda')
    
    # Corpus with example sentences
    corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'A man is eating pasta.',
          'The girl is carrying a baby.',
          'The baby is carried by the woman',
          'A man is riding a horse.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'Someone in a gorilla costume is playing a set of drums.',
          'A cheetah is running behind its prey.',
          'A cheetah chases prey on across a field.'
          ]
    corpus_embeddings = embedder.encode(corpus)
    
    # Perform kmean clustering
    num_clusters = 5
    clustering_model = KMeans(n_clusters=num_clusters)
    clustering_model.fit(corpus_embeddings)
    cluster_assignment = clustering_model.labels_
    # print(cluster_assignment) -> e.g. [1 1 1 0 0 3 3 4 4 2 2]: the cluster id assigned to each sentence
    
    clustered_sentences = [[] for i in range(num_clusters)]
    for sentence_id, cluster_id in enumerate(cluster_assignment):
        clustered_sentences[cluster_id].append(corpus[sentence_id])
    
    for i, cluster in enumerate(clustered_sentences):
        print("Cluster ", i+1)
        print(cluster)
        print("")
    >>
    Cluster  1
    ['A cheetah is running behind its prey.', 'A cheetah chases prey on across a field.']
    
    Cluster  2
    ['A man is riding a horse.', 'A man is riding a white horse on an enclosed ground.']
    
    Cluster  3
    ['The girl is carrying a baby.', 'The baby is carried by the woman']
    
    Cluster  4
    ['A monkey is playing drums.', 'Someone in a gorilla costume is playing a set of drums.']
    
    Cluster  5
    ['A man is eating food.', 'A man is eating a piece of bread.', 'A man is eating pasta.']
    >>
    

    There are also two other clustering approaches, Agglomerative Clustering and Fast Clustering; see the official docs for the details: cluster
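
    As a minimal sketch of the agglomerative variant, reusing corpus and corpus_embeddings from the k-means example above (scikit-learn is assumed, and the distance threshold 1.5 is only an illustrative value; the Fast Clustering variant is available as util.community_detection):

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    # Normalize embeddings so that euclidean distance behaves like cosine distance
    normed = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)

    # n_clusters=None plus distance_threshold lets the algorithm choose the number of clusters
    clustering_model = AgglomerativeClustering(n_clusters=None, distance_threshold=1.5)
    clustering_model.fit(normed)

    clusters = {}
    for sentence_id, cluster_id in enumerate(clustering_model.labels_):
        clusters.setdefault(cluster_id, []).append(corpus[sentence_id])
    for cluster_id, members in clusters.items():
        print("Cluster", cluster_id, members)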

    3. Training your own embeddings

    Use sentence-transformers to fine-tune your own sentence / text embeddings. The most basic network structure for training embeddings:


    from sentence_transformers import SentenceTransformer, models
    
    word_embedding_model = models.Transformer('bert-base-uncased', max_seq_length=256)
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
    
    model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
    
    1. The sentence is fed into a network layer such as BERT, which outputs an embedding for every token; the output then passes through a pooling layer. The simplest choice is mean pooling, which takes the mean of all token embeddings, so the result is a fixed-length sentence embedding (768-dim) independent of the input sentence's length (a minimal sketch of mean pooling follows the fit example below). The pre-trained models called directly in the earlier examples are packaged exactly this way: a BERT layer followed by a pooling layer.
    2. Training again on the Semantic Textual Similarity task, the model is structured as follows:

      The input is a pair of sentences, and the label is their similarity score. The sentences are converted into embeddings u and v, the cosine similarity of the two vectors is computed and compared against the gold similarity score supplied as input to obtain the loss, and the model parameters are then fine-tuned. Call model.fit() to tune the assembled model; for the .fit() parameters and model details see the training overview.
    from torch.utils.data import DataLoader
    from sentence_transformers import InputExample, losses

    #Define your train examples. You need more than just two examples...
    train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
        InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
    #Define your train dataset, the dataloader and the train loss
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
    train_loss = losses.CosineSimilarityLoss(model)
    
    #Tune the model
    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
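
    For intuition, the mean pooling in step 1 is just an attention-mask-weighted average of BERT's token embeddings. A minimal sketch in plain Hugging Face transformers (an illustration of the idea, not the library's actual implementation):

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
    bert = AutoModel.from_pretrained('bert-base-uncased')

    encoded = tokenizer(['The new movie is awesome'], padding=True, return_tensors='pt')
    with torch.no_grad():
        token_embeddings = bert(**encoded).last_hidden_state  # (batch, seq_len, 768)

    # Average only over real tokens; padding positions are masked out
    mask = encoded['attention_mask'].unsqueeze(-1).float()    # (batch, seq_len, 1)
    sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
    print(sentence_embedding.shape)  # torch.Size([1, 768])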
    
    3. Use the built-in evaluators to measure the model's performance, passing the evaluator to the .fit() function:
    from sentence_transformers import evaluation
    sentences1 = ['This list contains the first column', 'With your sentences', 'You want your model to evaluate on']
    sentences2 = ['Sentences contains the other column', 'The evaluator matches sentences1[i] with sentences2[i]', 'Compute the cosine similarity and compares it to scores[i]']
    scores = [0.3, 0.6, 0.2]
    evaluator = evaluation.EmbeddingSimilarityEvaluator(sentences1, sentences2, scores)
    # ... Your other code to load training data
    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100, evaluator=evaluator, evaluation_steps=500)
    

    Example:

    The data looks like this:

    OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1205'), ('score', '3.20'), ('sentence1', 'Israel Forces Kill 2 Palestinian Militants'), ('sentence2', 'Israeli army kills Palestinian militant in raid')])
    OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1207'), ('score', '1.20'), ('sentence1', "Death toll 'rises to 17' after typhoon strikes Japan"), ('sentence2', 'Death Toll Rises to 84 in Pakistan Floods')])
    OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1223'), ('score', '5.00'), ('sentence1', 'Protests continue in tense Ukraine capital'), ('sentence2', "Protests Continue In Ukraine's Capital")])
    OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1229'), ('score', '4.60'), ('sentence1', 'Two French journalists killed after Mali kidnapping'), ('sentence2', 'Two French journalists abducted, killed in Northern Mali')])
    OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1231'), ('score', '2.00'), ('sentence1', 'Headlines in major Iranian newspapers on Dec 14'), ('sentence2', 'Headlines in major Iranian newspapers on July 29')])
    OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1233'), ('score', '1.80'), ('sentence1', 'Iran warns of spillover of possible war on Syria'), ('sentence2', 'Iranian Delegation Heads to Lebanon, Syria')])
    OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1242'), ('score', '3.00'), ('sentence1', 'Former Pakistan President Pervez Musharraf arrested again'), ('sentence2', 'Former Pakistan military ruler Pervez Musharraf granted bail')])
    OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1263'), ('score', '3.80'), ('sentence1', "US drone strike 'kills 4 militants in Pakistan'"), ('sentence2', 'US drone strike kills 10 in Pakistan')])
    OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1271'), ('score', '4.00'), ('sentence1', "UK's Ex-Premier Margaret Thatcher Dies At 87"), ('sentence2', 'Former British PM Margaret Thatcher dies')])
    OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1303'), ('score', '4.80'), ('sentence1', "Polar bear DNA 'may help fight obesity'"), ('sentence2', 'Polar bear study may boost fight against obesity')])
    OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1311'), ('score', '4.80'), ('sentence1', "'Fast & Furious' star Paul Walker dies in car crash"), ('sentence2', 'Paul Walker dead: Fast and Furious star, 40, killed in car crash')])
    OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1316'), ('score', '2.80'), ('sentence1', "Air strike kills one man in Syria's Hama"), ('sentence2', 'US drone strike kills eleven in Pakistan')])
    OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1321'), ('score', '2.60'), ('sentence1', 'Turkish PM, president clash over reply to protests'), ('sentence2', 'Turkish president calls for calm amid nationwide protests')])
    OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1351'), ('score', '4.40'), ('sentence1', 'Strong new quake hits shattered Pak region'), ('sentence2', '6.8 quake in shattered Pakistan region')])
    OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1383'), ('score', '4.60'), ('sentence1', 'Floods in central Europe continue to create havoc'), ('sentence2', 'Europe floods continue to create havoc')])
    OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1403'), ('score', '2.20'), ('sentence1', 'Luxembourg PM quits amid spying scandal'), ('sentence2', 'Luxembourg votes after spying row')])
    OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1425'), ('score', '0.40'), ('sentence1', '3 dead, 4 missing in central China construction accident'), ('sentence2', 'One dead, 8 missing in Vietnam boat accident')])
    
    1. First read in the data and split it into training, test, and dev sets. When fine-tuning, Sentence-Transformer requires the data to be stored in a list whose elements are InputExample() objects, a class defined by the author of the Sentence-Transformer library. InputExample() takes two arguments, texts and label: texts is itself a list holding a sentence pair, and label must be a float indicating how similar the pair is.
    import csv
    import gzip
    from torch.utils.data import DataLoader
    from sentence_transformers import InputExample

    train_samples = []
    dev_samples = []
    test_samples = []
    # sts_dataset_path points to the STS benchmark file (stsbenchmark.tsv.gz)
    with gzip.open(sts_dataset_path, 'rt', encoding='utf8') as fIn:
        reader = csv.DictReader(fIn, delimiter='\t', quoting=csv.QUOTE_NONE)
        for row in reader:
            score = float(row['score']) / 5.0  # Normalize score to range 0 ... 1
            inp_example = InputExample(texts=[row['sentence1'], row['sentence2']], label=score)
    
            if row['split'] == 'dev':
                dev_samples.append(inp_example)
            elif row['split'] == 'test':
                test_samples.append(inp_example)
            else:
                train_samples.append(inp_example)
    
    
    train_batch_size = 16  # batch size used by the official example script
    train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=train_batch_size)
    
    2. Build the network:
    model_name = 'distilbert-base-uncased'
    word_embedding_model = models.Transformer(model_name)
    # Apply mean pooling to get one fixed sized sentence vector
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                                   pooling_mode_mean_tokens=True,
                                   pooling_mode_cls_token=False,
                                   pooling_mode_max_tokens=False)
    
    model = SentenceTransformer(modules=[word_embedding_model, pooling_model],device='cuda')
    train_loss = losses.CosineSimilarityLoss(model=model)
    
    evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')
    
    3. Train the model:
    import math

    # Train the model
    num_epochs = 4
    # warm up over 10% of the training steps, as in the official example
    warmup_steps = math.ceil(len(train_dataloader) * num_epochs * 0.1)
    model_save_path = 'output/training_stsbenchmark'  # any output directory
    model.fit(train_objectives=[(train_dataloader, train_loss)],
              evaluator=evaluator,
              epochs=num_epochs,
              evaluation_steps=1000,
              warmup_steps=warmup_steps,
              output_path=model_save_path)
    
    4. Evaluate on the test set:
    # Load the stored model and evaluate its performance on STS benchmark dataset
    model = SentenceTransformer(model_save_path)
    test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
    test_evaluator(model, output_path=model_save_path)
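
    The reloaded model behaves exactly like the pre-trained models used earlier; a quick sanity check (a made-up usage example, not part of the official script):

    from sentence_transformers import util

    emb = model.encode(['The new movie is awesome', 'The new movie is so great'],
                       convert_to_tensor=True)
    print(util.pytorch_cos_sim(emb[0], emb[1]))  # a single similarity score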
    

    The official docs provide a complete training-and-evaluation walkthrough of creating a SentenceTransformer model from a Hugging Face pre-trained transformer model plus a pooling layer: code

    References and recommended reading:

    Sentence-Transformer Semantic Textual Similarity
    Sentence-Transformer usage and fine-tuning tutorial
    Sentence Bert
    Sentence-BERT: a Siamese network for fast sentence-similarity computation
