Using and Deploying the Open-Source ERNIE 4.5 (文心4.5) Models

夜雨飘零1 · Published 2025-06-30 22:59:47

Preface

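ERNIE 4.5 was open-sourced today, and not just a single model: the entire series has been released. It came out of nowhere, and I was honestly stunned. The open-source ERNIE 4.5 series comprises 10 models: mixture-of-experts (MoE) models with 47B and 3B activated parameters (the largest has 424B total parameters), plus a 0.3B dense model. This post shows how to quickly run inference with ERNIE 4.5 and how to deploy an API that clients such as Android apps and WeChat mini programs can call. Note that it covers only the text models; the ERNIE 4.5 series also includes multimodal models.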

Environment:

PaddlePaddle 3.1.0
Python 3.11
CUDA 12.6
GPU: RTX 4090, 24 GB
Ubuntu 22.04

Setting Up the Environment

First install PaddlePaddle; skip this step if it is already installed.

python -m pip install paddlepaddle-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/

Then install the FastDeploy tool.

python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

Install aistudio-sdk, which is used to download the models.

pip install --upgrade aistudio-sdk
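
Before moving on, it can help to confirm that the three packages actually landed in the active Python environment. The following is a minimal sketch using only the standard library; the distribution names are taken from the pip commands above, and the helper name installed_versions is hypothetical, not part of any of these packages.

```python
from importlib import metadata

# Distribution names taken from the pip commands above.
PACKAGES = ["paddlepaddle-gpu", "fastdeploy-gpu", "aistudio-sdk"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

This only checks that pip registered the distributions; it does not verify that the GPU build can see the CUDA device.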

Quick Start

The Python code below gets a chat running quickly. I start with the smallest model; larger ones are also available, such as the following:

ERNIE-4.5-0.3B-Paddle
ERNIE-4.5-21B-A3B-Paddle
ERNIE-4.5-300B-A47B-Paddle

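Running the code below downloads the model automatically and then starts a chat in the terminal. The quantization parameter sets the quantization type; wint4 and wint8 are supported.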

from aistudio_sdk.snapshot_download import snapshot_download
from fastdeploy import LLM, SamplingParams

# Model name
model_name = "PaddlePaddle/ERNIE-4.5-0.3B-Paddle"
save_path = "./models/ERNIE-4.5-0.3B-Paddle/"
# Download the model
res = snapshot_download(repo_id=model_name, revision='master', local_dir=save_path)
# Sampling parameters for the chat
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Load the model; set quantization to "wint4" or "wint8" to enable quantization
llm = LLM(model=save_path, max_model_len=32768, quantization=None)

# Conversation history, passed to the model in full on every turn
messages = []

while True:
    prompt = input("Enter your question: ")
    # Type "exit" to quit
    if prompt == 'exit':
        break
    messages.append({"role": "user", "content": prompt})
    output = llm.chat(messages, sampling_params)[0]
    text = output.outputs.text
    messages.append({"role": "assistant", "content": text})
    print(text)

The log output is as follows:

INFO     2025-07-01 14:20:26,232 4785  engine.py[line:206] Waitting worker processes ready...
Loading Weights: 100%|█████████████████████████████████| 100/100 [00:03<00:00, 33.26it/s]
Loading Layers: 100%|██████████████████████████████████| 100/100 [00:01<00:00, 66.54it/s]
INFO     2025-07-01 14:20:36,753 4785  engine.py[line:276] Worker processes are launched with 12.627224445343018 seconds.
Enter your question: Hello, what's your name?
Processed prompts: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.12it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Hello! I'm **Xiaotian**, nice to meet you! Is there anything I can help you with?
Enter your question: What can you do?
Processed prompts: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.44s/it, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
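I can do lots of things! I'm good at **organizing knowledge trees**, analyzing historical events, and explaining scientific principles. I can also help you quickly work through **brain teasers** or **creative little inventions**, or tell you funny jokes out loud. Want to give it a try?
Enter your question: What did I just ask you?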
Processed prompts: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.49it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Sure! What would you like to ask? Is it about my name, my hobbies, or some other fun topic?
Enter your question: 
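
When picking a model and a quantization setting for a given card, a back-of-the-envelope estimate of weight memory is a useful first filter. The sketch below assumes that unquantized weights take 2 bytes each (16-bit), wint8 takes 1 byte, and wint4 takes half a byte; these per-weight costs are illustrative assumptions, not values read from FastDeploy, and the estimate ignores quantization scales, the KV cache, and activations. For the MoE models it uses the total parameter count, on the assumption that all experts must be resident in memory. The helper name weight_gb is hypothetical.

```python
# Assumed storage cost per weight, in bytes. These are illustrative
# assumptions, not values read from FastDeploy.
BYTES_PER_WEIGHT = {None: 2.0, "wint8": 1.0, "wint4": 0.5}

# Total parameter counts in billions, read off the model names above.
MODELS = [
    ("ERNIE-4.5-0.3B", 0.3),
    ("ERNIE-4.5-21B-A3B", 21),
    ("ERNIE-4.5-300B-A47B", 300),
]


def weight_gb(params_billion, quantization=None):
    """Approximate weight storage in GB (10^9 bytes) for a size given in billions of parameters."""
    # (params_billion * 1e9 weights) * (bytes per weight) / 1e9 bytes per GB
    return params_billion * BYTES_PER_WEIGHT[quantization]


if __name__ == "__main__":
    for name, size in MODELS:
        row = ", ".join(
            f"{q or 'none'}: {weight_gb(size, q):.1f} GB" for q in BYTES_PER_WEIGHT
        )
        print(f"{name:22s} {row}")
```

Under these assumptions the 21B model needs roughly 10.5 GB for weights at wint4 and 21 GB at wint8, which suggests that wint4 is the more realistic choice on a 24 GB card once the KV cache is accounted for. Treat this as a rough guide only.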

Deploying the API

First download the model; you can swap in whichever model you need.

aistudio download --model PaddlePaddle/ERNIE-4.5-0.3B-Paddle --local_dir ./models/ERNIE-4.5-0.3B-Paddle/

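Once the model is downloaded, run the command below to start the service. The port is 8180; max-model-len sets the maximum context length supported for inference; max-num-seqs is the maximum concurrency during the decode phase; and specifying quantization enables quantization. See the parameter documentation for more options: https://paddlepaddle.github.io/FastDeploy/parameters/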

python -m fastdeploy.entrypoints.openai.api_server \
       --model ./models/ERNIE-4.5-0.3B-Paddle/ \
       --port 8180 \
       --quantization wint8 \
       --max-model-len 32768 \
       --max-num-seqs 32

The log output is as follows:

INFO     2025-07-01 14:25:22,033 5239  engine.py[line:206] Waitting worker processes ready...
Loading Weights: 100%|█████████████████████████████████| 100/100 [00:03<00:00, 33.26it/s]
Loading Layers: 100%|██████████████████████████████████| 100/100 [00:02<00:00, 49.91it/s]
INFO     2025-07-01 14:25:33,060 5239  engine.py[line:276] Worker processes are launched with 16.20948576927185 seconds.
INFO     2025-07-01 14:25:33,061 5239  api_server.py[line:91] Launching metrics service at http://0.0.0.0:8001/metrics
INFO     2025-07-01 14:25:33,061 5239  api_server.py[line:94] Launching chat completion service at http://0.0.0.0:8180/v1/chat/completions
INFO     2025-07-01 14:25:33,061 5239  api_server.py[line:97] Launching completion service at http://0.0.0.0:8180/v1/completions
INFO:     Started server process [5239]
INFO:     Waiting for application startup.
[2025-07-01 14:25:34,089] [    INFO] - Loading configuration file ./models/ERNIE-4.5-0.3B-Paddle/generation_config.json
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8180 (Press CTRL+C to quit)
INFO:     127.0.0.1:53716 - "POST /v1/chat/completions HTTP/1.1" 200 OK
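
As the log shows, the worker processes take several seconds to launch before Uvicorn starts listening, so a script that starts the server and immediately sends requests may fail. One way to handle this, sketched here with only the standard library, is to poll the TCP port until it accepts connections. The helper name wait_for_port and its default timeouts are hypothetical; the port 8180 comes from the command above.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A caller on the same machine could run wait_for_port("127.0.0.1", 8180) and only start sending chat requests once it returns True. An open port shows that the HTTP server is up; it does not by itself prove that every request will succeed.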

Calling the API

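The service is compatible with the OpenAI API, so from Python you can call it with the openai library. There is no need to supply a real model name or api_key; placeholder values are enough.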

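import openai

# Address of the machine running the service; replace with your own
host = "192.168.0.100"
port = "8180"
# A real api_key is not needed, so a placeholder is passed
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

# Conversation history, resent with every request
messages = []

while True:
    prompt = input("Enter your question: ")
    if prompt == 'exit':
        break
    messages.append({"role": "user", "content": prompt})
    response = client.chat.completions.create(
        model="null",
        messages=messages,
        stream=True,
    )
    output = ""
    for chunk in response:
        # Some chunks carry no choices or no text (content is None); skip them
        if not chunk.choices:
            continue
        content = chunk.choices[0].delta.content
        if content:
            print(content, end='', flush=True)
            output += content
    print()
    messages.append({"role": "assistant", "content": output})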

The output is as follows:

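Enter your question: Hello
Hello! 😊 I'm glad to help~ Is there anything I can help you with? Whether it's a question about your studies or a small worry in everyday life, I'm here for you! 🧐
Enter your question: 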
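
Clients that cannot use the openai library, such as the Android or WeChat mini program clients mentioned in the preface, would talk to the same endpoint over plain HTTP. The sketch below shows the shape of such a request using only the Python standard library. The path /v1/chat/completions comes from the server log above; the request and response fields follow the OpenAI chat completion format, which is assumed to apply here because the service is described as OpenAI-compatible. The helper names build_chat_request, extract_reply, and ask are hypothetical.

```python
import json
import urllib.request


def build_chat_request(host, port, messages, model="null"):
    """Build a non-streaming POST request for the /v1/chat/completions endpoint."""
    payload = {"model": model, "messages": messages, "stream": False}
    return urllib.request.Request(
        url=f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    # Assumes the OpenAI response layout: choices[0].message.content
    return json.loads(response_body)["choices"][0]["message"]["content"]


def ask(host, port, messages, timeout=60):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(host, port, messages)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(response.read())
```

With the server running, ask("192.168.0.100", 8180, [{"role": "user", "content": "Hello"}]) would return the reply as a string. The same JSON body can be sent from any HTTP client on other platforms.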
