NLP Utilities

Author: WritingHere | Published 2022-01-01 22:13

    A personal backup of NLP scripts for everyday use.

    Machine Translation

    • Use the interface provided by Huggingface together with the pretrained models released by Helsinki-NLP to get machine translation up and running quickly;
    • To make batch processing convenient, the server side exposes an API built with Flask, and the client sends requests with requests (a chunked-request sketch follows the run instructions below).
      The server code api.py is as follows:
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import torch
    import json
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    from flask import Flask, request
    
    app = Flask(__name__)
    
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    
    # Load the Helsinki-NLP English-to-Chinese OPUS-MT tokenizer and model
    # (downloaded from the Hugging Face Hub on first use) and move the model to the device.
    tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-zh")
    model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-zh").to(device)
    
    @app.route("/", methods=['POST'])
    def index():
        text = request.get_json()['text']
        batch = tokenizer.prepare_seq2seq_batch(src_texts=text)
        for k, v in batch.items():
            batch[k] = torch.tensor([w[:512] for w in v]).to(device)
    
        translation = model.generate(**batch)
        result = tokenizer.batch_decode(translation, skip_special_tokens=True)
        return json.dumps({'result': result}, ensure_ascii=False)
    
    if __name__ == '__main__':
        app.run(host='0.0.0.0', port=9100)
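
    Before wiring everything through HTTP, the model can also be exercised directly as a quick sanity check. The sketch below is a minimal, standalone variant of the same pipeline (it assumes the same opus-mt-en-zh checkpoint and that torch and transformers are installed):

    #!/usr/bin/env python
    # Minimal offline check of the translation model, without the Flask layer.
    import torch
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-zh")
    model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-zh").to(device)

    sentences = ["I love you.", "The plane is gone."]
    batch = tokenizer(sentences, return_tensors="pt", padding=True,
                      truncation=True, max_length=512).to(device)
    outputs = model.generate(**batch)
    print(tokenizer.batch_decode(outputs, skip_special_tokens=True))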
    
    

    The client code main.py is as follows:

    #!/usr/bin/env python

    import json
    import requests as rq

    # A small batch of English sentences to translate.
    text = ['Oh, god, this is great! The plane is gone, so it looks like I\'m stuck here with you guys.', 'I love you.']
    headers = {'Content-Type': 'application/json', 'Accept': 'application/json'}
    data = {'text': text}
    # Post the batch to the local translation API started by api.py.
    a = rq.post('http://127.0.0.1:9100', data=json.dumps(data), headers=headers)
    print(a.text)
    
    • Run the server: python api.py
      When a line like * Running on http://0.0.0.0:9100/ (Press CTRL+C to quit) appears, the server has started successfully.
    • Run the client: python main.py
      The output should be: {"result": ["飞机没了 看来我跟你们困在这里了", "我爱你"]}
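
    When there are many sentences to translate, the input can be split into chunks so that each request stays small. This is a minimal sketch: the chunk size of 32 is an arbitrary choice, and translate_chunk is just an illustrative helper name:

    import json
    import requests as rq

    URL = 'http://127.0.0.1:9100'
    HEADERS = {'Content-Type': 'application/json', 'Accept': 'application/json'}

    def translate_chunk(sentences):
        # Send one chunk of sentences to the translation API and return its results.
        resp = rq.post(URL, data=json.dumps({'text': sentences}), headers=HEADERS)
        return json.loads(resp.text)['result']

    sentences = ['I love you.'] * 100   # stand-in for a long document
    chunk_size = 32                     # arbitrary; tune to the available GPU memory
    results = []
    for i in range(0, len(sentences), chunk_size):
        results.extend(translate_chunk(sentences[i:i + chunk_size]))
    print(len(results), results[:2])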

    TopK Algorithm

    • Sometimes we need the top-k elements of a sequence. Sorting the whole sequence costs O(N log N); instead, we maintain a min-heap of size K, scan the array once, and push each element into the heap (evicting the current minimum when the heap is full), which brings the cost down to O(N log K). A Python implementation follows:
    • TopK code
    
    import heapq
    
    class PriorityQueueTopK:
      
        def __init__(self, k=10):
            """[summary]
    
            Args:
                k (int, optional): Max number of the queue. Defaults to 10.
            """
            self._queue = []
            self._index = 0
            self.k = k
    
        def push(self, item, priority=None):
            # Store (priority, index, item); the running index breaks ties so the
            # items themselves never need to be comparable.
            if priority is None: priority = item
            if len(self._queue) < self.k:
                heapq.heappush(self._queue, (priority, self._index, item))
                self._index += 1
            elif priority > self._queue[0][0]:
                # The new element beats the current minimum: replace the heap root.
                heapq.heapreplace(self._queue, (priority, self._index, item))
                self._index += 1
    
        def pop(self):
            return heapq.heappop(self._queue)[-1]
        
        def topk(self):
            # Return the stored items; note they are not sorted by priority.
            return [w[-1] for w in self._queue]
    
    • TopK test code:
    import random

    k = 5
    items = [random.randint(1, 10) for i in range(10)]
    print(items)
    pq = PriorityQueueTopK(k)
    for i in range(len(items)):
        pq.push(items[i])
    res = pq.topk()
    print(res)
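
    Continuing from the test snippet above, the result can be cross-checked against the standard library's heapq.nlargest, which returns the k largest elements in a single call. Since topk() does not return its elements in sorted order, both sides are sorted before comparing:

    import heapq

    # heapq.nlargest returns the k largest elements of items in descending order.
    baseline = heapq.nlargest(k, items)
    assert sorted(res) == sorted(baseline)
    print(baseline)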
    
