Model Inference with the ChatGLM API on the PAI-EAS Platform

Author: 梅西爱骑车 | Published 2024-03-08 22:11

    ChatGLM deployment was covered in the previous article, Deploying ChatGLM with PAI-EAS. After deployment, the langchain-ChatGLM service is accessed through a web page; below is an example of calling it through the API instead.

    1. Obtaining the Service Access Address and Token

    1. Go to the PAI-EAS model online service page; for details, see Deploying ChatGLM with PAI-EAS.
    2. On that page, click the target service name to open the "Service Details" page.
    3. In the "Basic Information" section, click "View Invocation Information" and, on the "Public Endpoint Invocation" tab, obtain the service Token and access address.


      [Figure: invocation information]
      Since I call the service from a local command line, only public-endpoint invocation is relevant here; VPC invocation is not covered.
      [Figure: public-endpoint invocation information]

    2. Calling the API for Model Inference

    2.1 Calling the Service over HTTP

    2.1.1 Non-streaming Calls

    The client uses the standard HTTP format. When calling the service with curl, the following two types of requests are supported:

    1. Send a String-type request
    curl $host -H "Authorization: $authorization" --data-binary @chatllm_data.txt -v
    

    Here $authorization must be replaced with the service Token, $host with the service access address, and chatllm_data.txt is a plain-text file containing the question.
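
    For example (the endpoint and Token below are hypothetical placeholders; substitute the values obtained from the invocation information page):

    # hypothetical endpoint and Token, for illustration only
    host=http://chatllm-demo.****.cn-hangzhou.pai-eas.aliyuncs.com/
    authorization=YOUR_EAS_TOKEN
    echo "How to install it?" > chatllm_data.txt
    curl $host -H "Authorization: $authorization" --data-binary @chatllm_data.txt -v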

    2. Send a structured request
    
    curl $host -H "Authorization: $authorization" -H "Content-Type: application/json" --data-binary @chatllm_data.json -v -H "Connection: close"
    
    

    The chatllm_data.json file sets the inference parameters; its content has the following format:

    {
        "max_new_tokens": 4096,
        "use_stream_chat": false,
        "prompt": "How to install it?",
        "system_prompt": "Act like you are a programmer with 5+ years of experience.",
        "history": [
            [
                "Can you tell me what's the bladellm?",
                "BladeLLM is a framework for LLM serving, integrated with acceleration techniques like quantization, ai compilation, etc., and supporting popular LLMs like OPT, Bloom, LLaMA, etc."
            ]
        ],
        "temperature": 0.8,
        "top_k": 10,
        "top_p": 0.8,
        "do_sample": true,
        "use_cache": true
    }
    

    The parameters are described below; add or remove them as appropriate:

    - prompt: the user's question.
    - system_prompt: the system prompt that sets the model's role.
    - history: the conversation history as a list of [question, answer] pairs, used for multi-turn chat.
    - max_new_tokens: the maximum number of tokens to generate.
    - use_stream_chat: whether the output is returned in streaming mode.
    - temperature: the sampling temperature; higher values produce more random output.
    - top_k: the number of highest-probability candidate tokens kept when sampling.
    - top_p: the cumulative-probability threshold for nucleus sampling.
    - do_sample: whether to use sampling rather than greedy decoding.
    - use_cache: whether to use the KV cache to speed up decoding.

    You can also implement your own client based on Python's requests package. Example code:

    import argparse
    import json
    from typing import List, Tuple
    
    import requests
    
    def post_http_request(prompt: str,
                          system_prompt: str,
                          history: list,
                          host: str,
                          authorization: str,
                          max_new_tokens: int = 2048,
                          temperature: float = 0.95,
                          top_k: int = 1,
                          top_p: float = 0.8,
                          langchain: bool = False,
                          use_stream_chat: bool = False) -> requests.Response:
        headers = {
            "User-Agent": "Test Client",
            "Authorization": f"{authorization}"
        }
        if not history:
            history = [
                (
                    "San Francisco is a",
                    "city located in the state of California in the United States. \
                    It is known for its iconic landmarks, such as the Golden Gate Bridge \
                    and Alcatraz Island, as well as its vibrant culture, diverse population, \
                    and tech industry. The city is also home to many famous companies and \
                    startups, including Google, Apple, and Twitter."
                )
            ]
        pload = {
            "prompt": prompt,
            "system_prompt": system_prompt,
            "top_k": top_k,
            "top_p": top_p,
            "temperature": temperature,
            "max_new_tokens": max_new_tokens,
            "use_stream_chat": use_stream_chat,
            "history": history
        }
        if langchain:
            print(langchain)
            pload["langchain"] = langchain
        response = requests.post(host, headers=headers,
                                 json=pload, stream=use_stream_chat)
        return response
    
    # The server returns JSON containing the inference result and the conversation history
    def get_response(response: requests.Response) -> Tuple[str, List]:
        data = json.loads(response.content)
        output = data["response"]
        history = data["history"]
        return output, history
    
    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("--top-k", type=int, default=4)
        parser.add_argument("--top-p", type=float, default=0.8)
        parser.add_argument("--max-new-tokens", type=int, default=2048)
        parser.add_argument("--temperature", type=float, default=0.95)
        parser.add_argument("--prompt", type=str, default="How can I get there?")
        parser.add_argument("--langchain", action="store_true")
    
        args = parser.parse_args()
    
        prompt = args.prompt
        top_k = args.top_k
        top_p = args.top_p
        use_stream_chat = False
        temperature = args.temperature
        langchain = args.langchain
        max_new_tokens = args.max_new_tokens
    
        host = "EAS服务公网地址"
        authorization = "EAS服务公网Token"
    
        print(f"Prompt: {prompt!r}\n", flush=True)
        # The system prompt for the language model can be set in the client request
        system_prompt = "Act like you are a programmer with \
                    5+ years of experience."
    
        # Conversation history can also be set in the client request. The client maintains the current user's dialogue record to support multi-turn conversations; normally the history returned by the previous round is reused. Its format is List[Tuple[str, str]].
        history = []
        response = post_http_request(
            prompt, system_prompt, history,
            host, authorization,
            max_new_tokens, temperature, top_k, top_p,
            langchain=langchain, use_stream_chat=use_stream_chat)
        output, history = get_response(response)
        print(f" --- output: {output} \n --- history: {history}", flush=True)
    
    

    Where:
    host: set to the service access address.
    authorization: set to the service Token.
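
    Assuming the script above is saved as chatllm_client.py (a hypothetical filename), it can be run like this, with the sampling parameters passed as command-line flags:

    python chatllm_client.py --prompt "How to install it?" --temperature 0.8 --top-k 10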

    2.1.2 Streaming Calls

    Streaming calls use HTTP SSE; the other settings are the same as for non-streaming calls.
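
    Streaming can also be checked quickly with curl. This is a sketch assuming "use_stream_chat": true has been set in chatllm_data.json; the -N flag disables curl's output buffering so that chunks are printed as they arrive:

    # set "use_stream_chat": true in chatllm_data.json before running this
    curl -N $host -H "Authorization: $authorization" -H "Content-Type: application/json" --data-binary @chatllm_data.json

    The Python reference code: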

    import argparse
    import json
    from typing import Iterable, List, Tuple
    
    import requests
    
    
    def clear_line(n: int = 1) -> None:
        LINE_UP = '\033[1A'
        LINE_CLEAR = '\x1b[2K'
        for _ in range(n):
            print(LINE_UP, end=LINE_CLEAR, flush=True)
    
    
    def post_http_request(prompt: str,
                          system_prompt: str,
                          history: list,
                          host: str,
                          authorization: str,
                          max_new_tokens: int = 2048,
                          temperature: float = 0.95,
                          top_k: int = 1,
                          top_p: float = 0.8,
                          langchain: bool = False,
                          use_stream_chat: bool = False) -> requests.Response:
        headers = {
            "User-Agent": "Test Client",
            "Authorization": f"{authorization}"
        }
        if not history:
            history = [
                (
                    "San Francisco is a",
                    "city located in the state of California in the United States. \
                    It is known for its iconic landmarks, such as the Golden Gate Bridge \
                    and Alcatraz Island, as well as its vibrant culture, diverse population, \
                    and tech industry. The city is also home to many famous companies and \
                    startups, including Google, Apple, and Twitter."
                )
            ]
        pload = {
            "prompt": prompt,
            "system_prompt": system_prompt,
            "top_k": top_k,
            "top_p": top_p,
            "temperature": temperature,
            "max_new_tokens": max_new_tokens,
            "use_stream_chat": use_stream_chat,
            "history": history
        }
        if langchain:
            print(langchain)
            pload["langchain"] = langchain
        response = requests.post(host, headers=headers,
                                 json=pload, stream=use_stream_chat)
        return response
    
    
    def get_streaming_response(response: requests.Response) -> Iterable[Tuple[str, List]]:
        for chunk in response.iter_lines(chunk_size=8192,
                                         decode_unicode=False,
                                         delimiter=b"\0"):
            if chunk:
                data = json.loads(chunk.decode("utf-8"))
                output = data["response"]
                history = data["history"]
                yield output, history
    
    
    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("--top-k", type=int, default=4)
        parser.add_argument("--top-p", type=float, default=0.8)
        parser.add_argument("--max-new-tokens", type=int, default=2048)
        parser.add_argument("--temperature", type=float, default=0.95)
        parser.add_argument("--prompt", type=str, default="How can I get there?")
        parser.add_argument("--langchain", action="store_true")
        args = parser.parse_args()
    
        prompt = args.prompt
        top_k = args.top_k
        top_p = args.top_p
        use_stream_chat = True
        temperature = args.temperature
        langchain = args.langchain
        max_new_tokens = args.max_new_tokens
    
        host = ""
        authorization = ""
    
        print(f"Prompt: {prompt!r}\n", flush=True)
        system_prompt = "Act like you are programmer with \
                    5+ years of experience."
        history = []
        response = post_http_request(
            prompt, system_prompt, history,
            host, authorization,
            max_new_tokens, temperature, top_k, top_p,
            langchain=langchain, use_stream_chat=use_stream_chat)
    
        for h, history in get_streaming_response(response):
            print(
                f" --- stream line: {h} \n --- history: {history}", flush=True)
    

    2.1.3 Configuring More Parameters

    The command-line parameters supported by the scripts above are as follows:

    - --top-k: top-k sampling value (default 4).
    - --top-p: top-p (nucleus) sampling value (default 0.8).
    - --max-new-tokens: maximum number of tokens to generate (default 2048).
    - --temperature: sampling temperature (default 0.95).
    - --prompt: the question to send (default "How can I get there?").
    - --langchain: flag that enables langchain integration.
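
    For example, to run the streaming client above (saved under a hypothetical name, chatllm_stream_client.py):

    python chatllm_stream_client.py --prompt "Please introduce yourself." --max-new-tokens 1024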
    Source: https://www.haomeiwen.com/subject/hhtewdtx.html