TensorRT 10: converting a YOLOv8 model to an engine and running inference

What I already know: there are quite a few ways to convert a model to TensorRT.

Sample Support Guide :: NVIDIA Deep Learning TensorRT Documentation

Ways to convert to TensorRT

1. PyTorch -> ONNX -> TensorRT, or TensorFlow -> ONNX -> TensorRT; see

the samples directory of the TensorRT GitHub repository:

TensorRT/samples/python at release/10.0 · NVIDIA/TensorRT (github.com), which contains many examples of converting ONNX models to TensorRT that you can modify directly.

2. Using the torch2trt library

https://github.com/NVIDIA-AI-IOT/torch2trt

3. Using the trtexec.exe executable from the TensorRT install directory and building the engine from the command line:

TensorRT/samples/trtexec at release/10.0 · NVIDIA/TensorRT (github.com). The documentation is at

Developer Guide :: NVIDIA Deep Learning TensorRT Documentation. Below is an example from that documentation; --onnx specifies the ONNX file and --memPoolSize specifies the maximum workspace/memory pool size.

./trtexec --onnx=model.onnx --minShapes=input:1x3x244x244 --optShapes=input:16x3x244x244 --maxShapes=input:32x3x244x244 --shapes=input:5x3x244x244

The most advanced approach: build the graph yourself with the TensorRT API and then fill in the weights, so no ONNX conversion is needed at all. Concrete examples can be found in

NVIDIA/TensorRT: NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT. (github.com), in the samples directory.

A concrete example is

TensorRT/samples/sampleCharRNN at main · NVIDIA/TensorRT (github.com); you can also build BERT or diffusion networks yourself, see:

TensorRT/demo/Diffusion at main · NVIDIA/TensorRT (github.com)

An example of building the graph with the TensorRT Python API and filling in the weights:

TensorRT/samples/python/network_api_pytorch_mnist at main · NVIDIA/TensorRT (github.com)

Documentation on building the network from scratch:

create-network-def-scratch - Developer Guide :: NVIDIA Deep Learning TensorRT Documentation. As the excerpt below explains, you can construct the network and fill in the weights directly with the TensorRT API, without a parser such as the ONNX parser.

Instead of using a parser, you can also define the network directly to TensorRT using the Network Definition API. This scenario assumes that the per-layer weights are ready in host memory to pass to TensorRT during the network creation.
The following examples create a simple network with Input, Convolution, Pooling, MatrixMultiply, Shuffle, Activation, and SoftMax layers.
For more information regarding layers, refer to the NVIDIA TensorRT Operator’s Reference.

--onnx=<model>: Specify the input ONNX model.
If the input model is in ONNX format, use the --minShapes, --optShapes, and --maxShapes flags to control the range of input shapes including batch size.
--minShapes=<shapes>, --optShapes=<shapes>, and --maxShapes=<shapes>: Specify the range of the input shapes to build the engine with. Only required if the input model is in ONNX format.
--memPoolSize=<pool_spec>: Specify the maximum size of the workspace that tactics are allowed to use, as well as the sizes of the memory pools that DLA will allocate per loadable. Supported pool types include workspace, dlaSRAM, dlaLocalDRAM, dlaGlobalDRAM, and tacticSharedMem.
--saveEngine=<file>: Specify the path to save the engine to.
--fp16, --bf16, --int8, --fp8, --noTF32, and --best: Specify network-level precision.
--stronglyTyped: Create a strongly typed network.
--sparsity=[disable|enable|force]: Specify whether to use tactics that support structured sparsity.
disable: Disable all tactics using structured sparsity. This is the default.
enable: Enable tactics using structured sparsity. Tactics will only be used if the weights in the ONNX file meet the requirements for structured sparsity.
force: Enable tactics using structured sparsity and allow trtexec to overwrite the weights in the ONNX file to enforce them to have structured sparsity patterns. Note that the accuracy is not preserved, so this is to get inference performance only.
Note: This has been deprecated. Use Polygraphy (polygraphy surgeon prune) to rewrite the weights of ONNX models to structured-sparsity pattern and then run with --sparsity=enable.
--timingCacheFile=<file>: Specify the timing cache to load from and save to.
--noCompilationCache: Disable compilation cache in builder, and the cache is part of timing cache (default is to enable compilation cache).
--verbose: Turn on verbose logging.
--skipInference: Build and save the engine without running inference.
--profilingVerbosity=[layer_names_only|detailed|none]: Specify the profiling verbosity to build the engine with.
--dumpLayerInfo, --exportLayerInfo=<file>: Print/Save the layer information of the engine.
--precisionConstraints=spec: Control precision constraint setting.
none: No constraints.
prefer: Meet precision constraints set by --layerPrecisions/--layerOutputTypes if possible.
obey: Meet precision constraints set by --layerPrecisions/--layerOutputTypes or fail otherwise.
--layerPrecisions=spec: Control per-layer precision constraints. Effective only when precisionConstraints is set to obey or prefer. The specs are read left to right, and later ones override earlier ones. "*" can be used as a layerName to specify the default precision for all the unspecified layers.
For example: --layerPrecisions=*:fp16,layer_1:fp32 sets the precision of all layers to FP16 except for layer_1, which will be set to FP32.
--layerOutputTypes=spec: Control per-layer output type constraints. Effective only when precisionConstraints is set to obey or prefer. The specs are read left to right, and later ones override earlier ones. "*" can be used as a layerName to specify the default precision for all the unspecified layers. If a layer has more than one output, then multiple types separated by "+" can be provided for this layer.
For example: --layerOutputTypes=*:fp16,layer_1:fp32+fp16 sets the precision of all layer outputs to FP16 except for layer_1, whose first output will be set to FP32 and whose second output will be set to FP16.
--layerDeviceTypes=spec: Explicitly set per-layer device type to either GPU or DLA. The specs are read left to right, and later ones override earlier ones.
--useDLACore=N: Use the specified DLA core for layers that support DLA.
--allowGPUFallback: Allow layers unsupported on DLA to run on GPU instead.
--versionCompatible, --vc: Enable version compatible mode for engine build and inference. Any engine built with this flag enabled is compatible with newer versions of TensorRT on the same host OS when run with TensorRT's dispatch and lean runtimes. Only supported with explicit batch mode.
--excludeLeanRuntime: When --versionCompatible is enabled, this flag indicates that the generated engine should not include an embedded lean runtime. If this is set, you must explicitly specify a valid lean runtime to use when loading the engine. Only supported with explicit batch and weights within the engine.
--tempdir=<dir>: Overrides the default temporary directory TensorRT will use when creating temporary files. Refer to the IRuntime::setTemporaryDirectory API documentation for more information.
--tempfileControls=controls: Controls what TensorRT is allowed to use when creating temporary executable files. Should be a comma-separated list with entries in the format [in_memory|temporary]:[allow|deny].
Options include:
in_memory: Controls whether TensorRT is allowed to create temporary in-memory executable files.
temporary: Controls whether TensorRT is allowed to create temporary executable files in the filesystem (in the directory given by --tempdir).
Example usage: --tempfileControls=in_memory:allow,temporary:deny
--dynamicPlugins=<file>: Load the plugin library dynamically and serialize it with the engine when it is included in --setPluginsToSerialize (can be specified multiple times).
--setPluginsToSerialize=<file>: Set the plugin library to be serialized with the engine (can be specified multiple times).
--builderOptimizationLevel=N: Set the builder optimization level to build the engine with. Higher level allows TensorRT to spend more building time for more optimization options.
--maxAuxStreams=N: Set maximum number of auxiliary streams per inference stream that TRT is allowed to use to run kernels in parallel if the network contains ops that can run in parallel, with the cost of more memory usage. Set this to 0 for optimal memory usage. Refer to the Within-Inference Multi-Streaming section for more information.
--stripWeights: Strip weights from plan. This flag works with either refit or refit with identical weights. Defaults to refit with identical weights, however, you can switch to refit by enabling both --stripWeights and --refit at the same time.
--markDebug: Specify a list of tensor names to be marked as debug tensors. Separate names with a comma.
--allowWeightStreaming: Enables an engine that can stream its weights. Must be specified with --stronglyTyped. TensorRT will automatically choose the appropriate weight streaming budget at runtime to ensure model execution. A specific amount can be set with --weightStreamingBudget.
Flags for the Inference Phase
--loadEngine=<file>: Load the engine from a serialized plan file instead of building it from the input ONNX model.
If the input model is in ONNX format or if the engine is built with explicit batch dimension, use --shapes instead.
--shapes=<shapes>: Specify the input shapes to run the inference with.
--loadInputs=<specs>: Load input values from files. Default is to generate random inputs.
--warmUp=<duration in ms>, --duration=<duration in seconds>, --iterations=<N>: Specify the minimum duration of the warm-up runs, the minimum duration for the inference runs, and the minimum iterations of the inference runs. For example, setting --warmUp=0 --duration=0 --iterations=N allows you to control exactly how many iterations to run the inference for.
--useCudaGraph: Capture the inference to a CUDA graph and run inference by launching the graph. This argument may be ignored when the built TensorRT engine contains operations that are not permitted under CUDA graph capture mode.
--noDataTransfers: Turn off host to device and device-to-host data transfers.
--useSpinWait: Actively synchronize on GPU events. This option makes latency measurement more stable but increases CPU usage and power.
--infStreams=<N>: Run inference with multiple cross-inference streams in parallel. Refer to the Cross-Inference Multi-Streaming section for more information.
--verbose: Turn on verbose logging.
--dumpProfile, --exportProfile=<file>: Print/Save the per-layer performance profile.
--dumpLayerInfo, --exportLayerInfo=<file>: Print layer information of the engine.
--profilingVerbosity=[layer_names_only|detailed|none]: Specify the profiling verbosity to run the inference with.
--useRuntime=[full|lean|dispatch]: TensorRT runtime to execute engine. lean and dispatch require --versionCompatible to be enabled and are used to load a VC engine. All engines (VC or not) must be built with full runtime.
--leanDLLPath=<file>: External lean runtime DLL to use in version compatible mode. Requires --useRuntime=[lean|dispatch].
--dynamicPlugins=<file>: Load the plugin library dynamically when the library is not included in the engine plan file (can be specified multiple times).
--getPlanVersionOnly: Print TensorRT version when loaded plan was created. Works without deserialization of the plan. Use together with --loadEngine. Supported only for engines created with 8.6 and later.
--saveDebugTensors: Specify list of tensor names to turn on the debug state and filename to save raw outputs to. These tensors must be specified as debug tensors during build time.
--allocationStrategy: Specify how the internal device memory for inference is allocated. You can choose from static, profile, and runtime. The first option is the default behavior that pre-allocates enough size for all profiles and input shapes. The second option enables trtexec to only allocate what’s required for the profile to use. The third option enables trtexec to only allocate what’s required for the actual input shapes.
--weightStreamingBudget: Manually set the weight streaming budget. Base-2 unit suffixes are supported: B (Bytes), G (Gibibytes), K (Kibibytes), M (Mebibytes). A value of 0 will choose the minimum possible budget if the weights don’t fit on the device. A value of -1 will disable weight streaming at runtime.
Refer to trtexec --help for all the supported flags and detailed explanations.

Refer to the GitHub: trtexec/README.md file for detailed information about how to build this tool and examples of its usage.
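
As an illustration of how these flags combine, below is a hedged sketch that builds an FP16 YOLOv8 engine by calling trtexec from Python; the file paths, the 4096 MiB workspace limit and the timing-cache name are assumptions, not values taken from the documentation above.

import subprocess

# hypothetical trtexec invocation combining a few of the documented flags:
# --onnx / --saveEngine for the input and output files, --fp16 for half precision,
# --memPoolSize to cap the builder workspace, --timingCacheFile to reuse tactic timings
cmd = [
    "trtexec",
    "--onnx=yolov8n.onnx",
    "--saveEngine=yolov8n.engine",
    "--fp16",
    "--memPoolSize=workspace:4096M",
    "--timingCacheFile=timing.cache",
]
subprocess.run(cmd, check=True)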

ONNX operator fusion

TensorRT/samples/sampleNamedDimensions/create_model.py at release/10.0 · NVIDIA/TensorRT (github.com)

When converting the ONNX model from approach 1, some operators may turn out to be unsupported; in that case you need operator fusion.

ONNX may not support certain operators, such as LayerNorm. In that case operator fusion is required, i.e. the unsupported operators are fused together into a single op. For details, see

onnx-graphsurgeon

https://docs.nvidia.com/deeplearning/tensorrt/onnx-graphsurgeon/docs/index.html
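
A rough onnx-graphsurgeon sketch of the idea (the model path, the tensor names ln_in/ln_out and the fused op type "LayerNormalization" are made-up placeholders, not taken from a real model): cut the decomposed subgraph out between two known tensors and drop a single fused node in its place.

import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("model.onnx"))   # assumed input model
tensors = graph.tensors()

# assumed tensor names bounding the decomposed LayerNorm subgraph
ln_in, ln_out = tensors["ln_in"], tensors["ln_out"]

# detach the old subgraph: clear the consumers of its input and the producers of its output
ln_in.outputs.clear()
ln_out.inputs.clear()

# insert one fused node between them; the now-dangling nodes are removed by cleanup()
graph.nodes.append(gs.Node(op="LayerNormalization", inputs=[ln_in], outputs=[ln_out]))

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "model_fused.onnx")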

Motivation

I needed to convert the yolov8x.pt model from

GitHub - ultralytics/ultralytics: NEW - YOLOv8 in PyTorch > ONNX > OpenVINO > CoreML > TFLite to a TensorRT model and then run inference, to lower the inference time and raise the frames per second. The results were quite good: TensorRT really does accelerate inference. With FP16 enabled the model is converted to float16, the inputs and outputs are float16 as well, and the inference time drops to less than half of the original, i.e. the inference speed more than doubles. If needed, the model can also be converted to INT8.

While converting yolov8x.pt to a half (float16) engine, i.e. while debugging, I found that half=True had no effect: the input was FLOAT rather than HALF. Even with half=True, the exported engine's input and output were still float32, so I started modifying and debugging. The export log showed

TensorRT: input "images" with shape(1, 3, 640, 640) DataType.FLOAT
TensorRT: output "output0" with shape(1, 132, 8400) DataType.FLOAT

instead of the expected

TensorRT: input "images" with shape(1, 3, 640, 640) DataType.HALF
TensorRT: output "output0" with shape(1, 132, 8400) DataType.HALF

The root cause is that the ONNX model's input was float; the ONNX input has to be half for this to work.

Line 240 of engine\exporter.py was changed to if self.args.half and (onnx or engine) and self.device.type != "cpu":, otherwise the exported engine can never have float16 inputs and outputs.
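
A quick way to confirm that the fix worked is to inspect the input dtype of the exported ONNX before building the engine (a small sketch; the file name is an assumption). elem_type 10 means FLOAT16, 1 means FLOAT.

import onnx

model = onnx.load("yolov8n.onnx")  # assumed path of the exported ONNX file
elem_type = model.graph.input[0].type.tensor_type.elem_type
print(elem_type, onnx.TensorProto.DataType.Name(elem_type))  # expect 10 FLOAT16 when half=True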

Below is the pull request that fixes the bug and adapts the API:

Add support to exporting or inference with TensorRT 10.0.0b6, and fix a bug when exporting a tensorrt file .engine with flag half=True by ZouJiu1 · Pull Request #9840 · ultralytics/ultralytics (github.com)

Converting the YOLOv8 model

Environment: Windows 10

The GitHub repository used is the official one:

https://github.com/ultralytics/ultralytics

After installing TensorRT by following

tensorrt 10.0.06在win10安装以及版本的api变更 - 知乎 (zhihu.com), you still need to install a few other libraries such as onnx and onnxsim; these are usually installed automatically. The include and lib files from the TensorRT package (.dll, .lib, .hpp, .h) also have to be copied into the corresponding CUDA install directories.
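
A quick way to check the installation (nothing YOLO-specific, just that the Python bindings and CUDA are visible):

import tensorrt as trt
import torch

print(trt.__version__)            # should report the installed TensorRT version, e.g. 10.x
print(torch.cuda.is_available())  # must be True, since the engine export runs on the GPU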

At the moment, ultralytics does not support TensorRT 10, so if you want to use it as-is without any API changes, install a TensorRT version below 8.3. Alternatively, you can follow this pull request

Add support to exporting or inference with TensorRT 10.0.0b6, and fix a bug when exporting a tensorrt file .engine with flag half=True by ZouJiu1 · Pull Request #9840 · ultralytics/ultralytics (github.com) and adapt the code to TensorRT 10.

The code below converts the model to an engine, after which inference works normally.

import os
import gc
import sys
sys.path.append(r'E:\work\codeRepo\deploy\jz\ultralytics')
from ultralytics import YOLO # newest version from "git clone and git pull"
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

'''
Platform: Window11

Ultralytics YOLOv8.1.44   Python-3.9.18 torch-2.2.1+cu118 CUDA:0 (NVIDIA GeForce RTX 4070 Ti, 12282MiB)

onnx 1.16.0 opset 17

TensorRT 10.0.0b6:  
https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.0.0/zip/TensorRT-10.0.0.6.Windows10.win10.cuda-11.8.zip

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:41:10_Pacific_Daylight_Time_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
'''
if __name__ == '__main__':
    model = YOLO(r'yolov8n.pt')  # load a pretrained model (recommended for training)
    # results = model.export(format='onnx', simplify=True, half=True, device='cuda:0')    # onnx,engine
    results0 = model.export(format='engine', simplify=True, half=True, device='cuda:0')    # onnx,engine
    del model
    gc.collect()
    model = YOLO(r"E:\work\yolov8n.engine")
    result = model.predict('https://ultralytics.com/images/bus.jpg', save=True)

Below I've pulled out the more important parts of the export code and added some comments.

ultralytics/ultralytics/engine/exporter.py at 04f5ba4da051f5684afd60c681f1f6a71e1eb058 · ultralytics/ultralytics (github.com)

# when exporting with half set and the target format is onnx or engine, cast the input and the model to float16
        if self.args.half and (onnx or engine) and self.device.type != "cpu": 
            im, model = im.half(), model.half()  # to FP16

ultralytics/ultralytics/engine/exporter.py at 04f5ba4da051f5684afd60c681f1f6a71e1eb058 · ultralytics/ultralytics (github.com)

The TensorRT export function, with comments:

    def export_engine(self, prefix=colorstr("TensorRT:")):
        """YOLOv8 TensorRT export https://developer.nvidia.com/tensorrt."""
        assert self.im.device.type != "cpu", "export running on CPU but must be on GPU, i.e. use 'device=0'"
        f_onnx, _ = self.export_onnx()  # run before trt import https://github.com/ultralytics/ultralytics/issues/7016

        try:  # load the tensorrt library
            import tensorrt as trt  # noqa
        except ImportError:
            if LINUX:
                check_requirements("nvidia-tensorrt", cmds="-U --index-url https://pypi.ngc.nvidia.com")
            import tensorrt as trt  # noqa

        # require tensorrt>=7.0.0; only the lower bound (>=7) is checked, there is no upper bound <=8, so sad
        check_version(trt.__version__, "7.0.0", hard=True)
        self.trt_version = trt.__version__.split(".")[0]  # get the major version: 10, 8 or 7

        self.args.simplify = True
       
        LOGGER.info(f"\n{prefix} starting export with TensorRT {trt.__version__}...")
        assert Path(f_onnx).exists(), f"failed to export ONNX file: {f_onnx}"
        f = self.file.with_suffix(".engine")  # TensorRT engine file
        logger = trt.Logger(trt.Logger.INFO)   # TensorRT logger
        if self.args.verbose:
            logger.min_severity = trt.Logger.Severity.VERBOSE  # set the logging verbosity

        builder = trt.Builder(logger)  # create the builder
        config = builder.create_builder_config() # create the builder config
        if self.trt_version in ["7", "8"]: # TensorRT 7 or 8
            config.max_workspace_size = int(self.args.workspace * (1 << 30)) # maximum workspace memory
        elif self.trt_version == "10":     # TensorRT 10
            config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, self.args.workspace * 1 << 30) # maximum workspace memory
        flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
        network = builder.create_network(flag)  # create the network
        parser = trt.OnnxParser(network, logger) # create the ONNX parser
        if not parser.parse_from_file(f_onnx):   # parse the ONNX file
            raise RuntimeError(f"failed to load ONNX file: {f_onnx}")

        inputs = [network.get_input(i) for i in range(network.num_inputs)] # network inputs
        outputs = [network.get_output(i) for i in range(network.num_outputs)] # network outputs
        for inp in inputs:
            LOGGER.info(f'{prefix} input "{inp.name}" with shape{inp.shape} {inp.dtype}')
        for out in outputs:
            LOGGER.info(f'{prefix} output "{out.name}" with shape{out.shape} {out.dtype}')

        if self.args.dynamic:  # dynamic input shapes
            shape = self.im.shape
            if shape[0] <= 1:
                LOGGER.warning(f"{prefix} WARNING ⚠️ 'dynamic=True' model requires max batch size, i.e. 'batch=16'")
            profile = builder.create_optimization_profile()
            for inp in inputs:
                profile.set_shape(inp.name, (1, *shape[1:]), (max(1, shape[0] // 2), *shape[1:]), shape)
            config.add_optimization_profile(profile)

        LOGGER.info(
            f"{prefix} building FP{16 if builder.platform_has_fast_fp16 and self.args.half else 32} engine as {f}"
        )
        # enable half precision (FP16), i.e. the half=True option
        if builder.platform_has_fast_fp16 and self.args.half:
            config.set_flag(trt.BuilderFlag.FP16)

        del self.model
        torch.cuda.empty_cache()   # free unused GPU memory

        # Write file
        if self.trt_version in ["7", "8"]: # TensorRT 7 or 8
            with builder.build_engine(network, config) as engine, open(f, "wb") as t:   # build the engine
                # Metadata
                meta = json.dumps(self.metadata) # dump the metadata
                t.write(len(meta).to_bytes(4, byteorder="little", signed=True)) # write the metadata length
                t.write(meta.encode()) # write the metadata
                # Model
                t.write(engine.serialize()) # serialize the engine and write it; export finished
        elif self.trt_version == "10":   # TensorRT 10
            with builder.build_serialized_network(network, config) as engine, open(f, "wb") as t: # build and serialize the engine
                # Metadata
                meta = json.dumps(self.metadata) # dump the metadata
                t.write(len(meta).to_bytes(4, byteorder="little", signed=True)) # write the metadata length
                t.write(meta.encode()) # write the metadata
                # Model
                t.write(engine) # write the already-serialized engine; export finished
        return f, None
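
For reference, the file written above can be read back the same way at load time (a minimal sketch, assuming the engine was exported as yolov8n.engine; this mirrors the loading code in autobackend.py shown further below): a 4-byte little-endian metadata length, the JSON metadata, then the serialized engine.

import json
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
with open("yolov8n.engine", "rb") as f, trt.Runtime(logger) as runtime:
    meta_len = int.from_bytes(f.read(4), byteorder="little")  # metadata length
    metadata = json.loads(f.read(meta_len).decode("utf-8"))   # ultralytics metadata (a small JSON dict)
    engine = runtime.deserialize_cuda_engine(f.read())        # the actual TensorRT engine
print(metadata)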

Converting other models

PyTorch -> ONNX -> TensorRT, or TensorFlow -> ONNX -> TensorRT; see

the samples directory of the TensorRT GitHub repository:

TensorRT/samples/python at release/10.0 · NVIDIA/TensorRT (github.com), which contains many examples of converting ONNX models to TensorRT that you can modify directly, such as the ONNX samples.

An example of building the graph with the TensorRT Python API and filling in the weights:

TensorRT/samples/python/network_api_pytorch_mnist at main · NVIDIA/TensorRT (github.com)

For details, see the section at the top of this article: Ways to convert to TensorRT.

Building the graph with the API and filling in the weights

The most advanced approach: build the graph yourself with the TensorRT API and then fill in the weights, so no ONNX conversion is needed at all. Concrete examples can be found in

NVIDIA/TensorRT: NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT. (github.com), in the samples directory.

A concrete example is

TensorRT/samples/sampleCharRNN at main · NVIDIA/TensorRT (github.com); you can also build BERT or diffusion networks yourself, see:

TensorRT/demo/Diffusion at main · NVIDIA/TensorRT (github.com)

An example of building the graph with the TensorRT Python API and filling in the weights:

TensorRT/samples/python/network_api_pytorch_mnist at main · NVIDIA/TensorRT (github.com)

Documentation on building the network from scratch:

create-network-def-scratch - Developer Guide :: NVIDIA Deep Learning TensorRT Documentation. As the excerpt below explains, you can construct the network and fill in the weights directly with the TensorRT API, without a parser such as the ONNX parser.

Instead of using a parser, you can also define the network directly to TensorRT using the Network Definition API. This scenario assumes that the per-layer weights are ready in host memory to pass to TensorRT during the network creation. The following examples create a simple network with Input, Convolution, Pooling, MatrixMultiply, Shuffle, Activation, and SoftMax layers. For more information regarding layers, refer to the NVIDIA TensorRT Operator’s Reference.
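
As a minimal sketch of what this excerpt describes (all layer sizes, names and the random weights below are made up for illustration, and only a couple of layer types are shown), a network can be assembled directly with the TensorRT Python API, with the weights supplied from host memory as NumPy arrays:

import numpy as np
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

# input tensor (NCHW); the shape is only an example
data = network.add_input("images", trt.float32, (1, 3, 32, 32))

# convolution whose weights already live in host memory as numpy arrays
conv_w = np.random.randn(16, 3, 3, 3).astype(np.float32)
conv_b = np.zeros(16, dtype=np.float32)
conv = network.add_convolution_nd(data, 16, (3, 3), trt.Weights(conv_w), trt.Weights(conv_b))
conv.stride_nd = (1, 1)

relu = network.add_activation(conv.get_output(0), trt.ActivationType.RELU)
pool = network.add_pooling_nd(relu.get_output(0), trt.PoolingType.MAX, (2, 2))
pool.stride_nd = (2, 2)

network.mark_output(pool.get_output(0))

config = builder.create_builder_config()
engine_bytes = builder.build_serialized_network(network, config)  # serialized engine (TensorRT 10 API)
with open("scratch.engine", "wb") as f:
    f.write(engine_bytes)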

YOLOv8 inference

import os
import gc
import sys
sys.path.append(r'E:\work\codeRepo\deploy\jz\ultralytics')
from ultralytics import YOLO # newest version from "git clone and git pull"
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

'''
Platform: Window11

Ultralytics YOLOv8.1.44   Python-3.9.18 torch-2.2.1+cu118 CUDA:0 (NVIDIA GeForce RTX 4070 Ti, 12282MiB)

onnx 1.16.0 opset 17

TensorRT 10.0.0b6:  
https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.0.0/zip/TensorRT-10.0.0.6.Windows10.win10.cuda-11.8.zip

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:41:10_Pacific_Daylight_Time_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
'''
if __name__ == '__main__':
    model = YOLO(r'yolov8n.pt')  # load a pretrained model (recommended for training)
    # results = model.export(format='onnx', simplify=True, half=True, device='cuda:0')    # onnx,engine
    results0 = model.export(format='engine', simplify=True, half=True, device='cuda:0')    # onnx,engine
    del model
    gc.collect()
    model = YOLO(r"E:\work\yolov8n.engine")
    result = model.predict('https://ultralytics.com/images/bus.jpg', save=True)

Below is the main engine-loading code used at inference time, with comments added:

ultralytics/ultralytics/nn/autobackend.py at 04f5ba4da051f5684afd60c681f1f6a71e1eb058 · ultralytics/ultralytics (github.com)

            self.trt_version = trt.__version__.split(".")[0]
            if self.trt_version in ["7", "8"]: # TensorRT 7 or 8
                if device.type == "cpu":
                    device = torch.device("cuda:0")
                Binding = namedtuple("Binding", ("name", "dtype", "shape", "data", "ptr")) # one record per binding
                logger = trt.Logger(trt.Logger.INFO) # logger
                # Read file
                with open(w, "rb") as f, trt.Runtime(logger) as runtime: # read the engine file
                    meta_len = int.from_bytes(f.read(4), byteorder="little")  # read metadata length
                    metadata = json.loads(f.read(meta_len).decode("utf-8"))  # read metadata
                    model = runtime.deserialize_cuda_engine(f.read())  # read engine
                context = model.create_execution_context() # execution context
                bindings = OrderedDict()
                output_names = []
                fp16 = False  # default updated below
                dynamic = False
                for i in range(model.num_bindings):
                    name = model.get_binding_name(i) # input/output name
                    dtype = trt.nptype(model.get_binding_dtype(i)) # input/output dtype
                    if model.binding_is_input(i): # is this an input?
                        if -1 in tuple(model.get_binding_shape(i)):  # dynamic
                            dynamic = True
                            context.set_binding_shape(i, tuple(model.get_profile_shape(0, i)[2]))
                        if dtype == np.float16:
                            fp16 = True
                    else:  # output
                        output_names.append(name)
                    shape = tuple(context.get_binding_shape(i))
                    im = torch.from_numpy(np.empty(shape, dtype=dtype)).to(device) # allocate the tensor on the GPU
                    bindings[name] = Binding(name, dtype, shape, im, int(im.data_ptr())) # keep its GPU address
                binding_addrs = OrderedDict((n, d.ptr) for n, d in bindings.items()) # map names to addresses
                batch_size = bindings["images"].shape[0]  # if dynamic, this is instead max batch size
                self.output_names = output_names
                self.fp16 = fp16
                self.binding_addrs = binding_addrs
                self.bindings = bindings
                self.dynamic = dynamic
                self.context = context
            elif self.trt_version == "10":  # TensorRT 10
                if device.type == "cpu":
                    device = torch.device("cuda:0")
                Binding = namedtuple("Binding", ("name", "dtype", "shape", "data", "ptr"))
                logger = trt.Logger(trt.Logger.INFO)
                # Read file
                with open(w, "rb") as f, trt.Runtime(logger) as runtime:  # read the engine file
                    meta_len = int.from_bytes(f.read(4), byteorder="little")  # read metadata length
                    metadata = json.loads(f.read(meta_len).decode("utf-8"))  # read metadata
                    model = runtime.deserialize_cuda_engine(f.read())  # read engine
                context = model.create_execution_context() # execution context
                bindings = OrderedDict()
                output_names = []
                fp16 = False  # default updated below
                dynamic = False
                for i in range(model.num_io_tensors):
                    name = model.get_tensor_name(i) # input/output name
                    dtype = trt.nptype(model.get_tensor_dtype(name)) # input/output dtype
                    mode = model.get_tensor_mode(name) # TensorIOMode: 1 = input, 2 = output
                    if mode.value == 1:
                        if -1 in tuple(model.get_tensor_shape(name)):  # dynamic shape
                            dynamic = True
                            context.set_input_shape(name, tuple(model.get_tensor_profile_shape(name, i)[2]))
                        if dtype == np.float16:
                            fp16 = True
                    else:  # output
                        output_names.append(name)
                    shape = tuple(context.get_tensor_shape(name))
                    im = torch.from_numpy(np.empty(shape, dtype=dtype)).to(device) # allocate the tensor on the GPU
                    bindings[name] = Binding(name, dtype, shape, im, int(im.data_ptr())) # keep its GPU address
                binding_addrs = OrderedDict((n, d.ptr) for n, d in bindings.items()) # map names to addresses
                batch_size = bindings["images"].shape[0]  # if dynamic, this is instead max batch size
                self.output_names = output_names
                self.fp16 = fp16
                self.binding_addrs = binding_addrs
                self.bindings = bindings
                self.dynamic = dynamic
                self.context = context

Below is the main inference code:

ultralytics/ultralytics/nn/autobackend.py at 04f5ba4da051f5684afd60c681f1f6a71e1eb058 · ultralytics/ultralytics (github.com)

        elif self.engine: 
            if self.trt_version in ["7", "8"]: # TensorRT 7 or 8
                if self.dynamic and im.shape != self.bindings["images"].shape: # dynamic input shape
                    i = self.model.get_binding_index("images")
                    self.context.set_binding_shape(i, im.shape)  # reshape if dynamic
                    self.bindings["images"] = self.bindings["images"]._replace(shape=im.shape)
                    for name in self.output_names:
                        i = self.model.get_binding_index(name)
                        self.bindings[name].data.resize_(tuple(self.context.get_binding_shape(i)))
                s = self.bindings["images"].shape
                assert (
                    im.shape == s
                ), f"input size {im.shape} {'>' if self.dynamic else 'not equal to'} max model size {s}"
                self.binding_addrs["images"] = int(im.data_ptr()) # GPU address of the input image
                self.context.execute_v2(list(self.binding_addrs.values())) # run inference on the input/output addresses
                y = [self.bindings[x].data for x in sorted(self.output_names)] # inference done, collect the outputs
            elif self.trt_version == "10": # TensorRT 10
                if self.dynamic and im.shape != self.bindings["images"].shape: # dynamic input shape
                    self.context.set_input_shape("images", im.shape)  # reshape if dynamic
                    self.bindings["images"] = self.bindings["images"]._replace(shape=im.shape)
                    for name in self.output_names:
                        self.bindings[name].data.resize_(tuple(self.context.get_tensor_shape(name))) #resize
                s = self.bindings["images"].shape
                assert (
                    im.shape == s
                ), f"input size {im.shape} {'>' if self.dynamic else 'not equal to'} max model size {s}"
                self.binding_addrs["images"] = int(im.data_ptr()) # GPU address of the input image
                self.context.execute_v2(list(self.binding_addrs.values())) # run inference on the input/output addresses
                y = [self.bindings[x].data for x in sorted(self.output_names)] # inference done, collect the outputs

Inference for other models

For inference with other models, see the examples in the official TensorRT samples,

https://github.com/NVIDIA/TensorRT/tree/main/samples

or the YOLOv8 inference code shown above:

ultralytics/ultralytics/nn/autobackend.py at 04f5ba4da051f5684afd60c681f1f6a71e1eb058 · ultralytics/ultralytics (github.com)
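
Stripped of the ultralytics wrapper, the TensorRT 10 loading-and-inference pattern from autobackend.py boils down to something like the following sketch (the engine path model.engine, the input tensor name "images" and the absence of the metadata header are assumptions):

import numpy as np
import torch
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
with open("model.engine", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

device = torch.device("cuda:0")
buffers, addrs, output_names = {}, [], []
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    dtype = trt.nptype(engine.get_tensor_dtype(name))
    shape = tuple(context.get_tensor_shape(name))
    t = torch.from_numpy(np.empty(shape, dtype=dtype)).to(device)  # one device buffer per IO tensor
    buffers[name] = t
    addrs.append(int(t.data_ptr()))
    if engine.get_tensor_mode(name) == trt.TensorIOMode.OUTPUT:
        output_names.append(name)

buffers["images"].copy_(torch.zeros_like(buffers["images"]))  # put a preprocessed image here
context.execute_v2(addrs)                                     # run inference on the bound addresses
outputs = [buffers[n] for n in output_names]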

Converting to an INT8 model

Converting to INT8 requires calibration, which makes it more involved, and accuracy may drop a little, so I haven't done it yet. For details, see the examples in the official TensorRT repository.

EfficientNet, for example, has an INT8 example:

TensorRT/samples/python/efficientnet/build_engine.py at main · NVIDIA/TensorRT (github.com)

        if precision == "fp16":
            if not self.builder.platform_has_fast_fp16:
                log.warning("FP16 is not supported natively on this platform/device")
            else:
                self.config.set_flag(trt.BuilderFlag.FP16)
        elif precision == "int8":
            if not self.builder.platform_has_fast_int8:
                log.warning("INT8 is not supported natively on this platform/device")
            else:
                self.config.set_flag(trt.BuilderFlag.INT8)
                self.config.int8_calibrator = EngineCalibrator(calib_cache)
                if not os.path.exists(calib_cache):
                    calib_shape = [calib_batch_size] + list(inputs[0].shape[1:])
                    calib_dtype = trt.nptype(inputs[0].dtype)
                    self.config.int8_calibrator.set_image_batcher(
                        ImageBatcher(
                            calib_input,
                            calib_shape,
                            calib_dtype,
                            max_num_images=calib_num_images,
                            exact_batches=True,
                            preprocessor=calib_preprocessor,
                        )
                    )
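
The EngineCalibrator and ImageBatcher used above are defined elsewhere in that sample. As a rough, assumption-heavy sketch of what such a calibrator has to provide (this is not the sample's actual class; it uses pycuda for the device buffer and a user-supplied iterator of preprocessed float32 batches), a subclass of trt.IInt8EntropyCalibrator2 needs only four methods:

import os
import numpy as np
import pycuda.autoinit  # noqa: F401  creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

class SimpleCalibrator(trt.IInt8EntropyCalibrator2):
    # batches: an iterator yielding preprocessed float32 arrays of shape batch_shape
    def __init__(self, batches, batch_shape, cache_file="calib.cache"):
        super().__init__()
        self.batches = batches
        self.cache_file = cache_file
        self.batch_size = batch_shape[0]
        self.d_input = cuda.mem_alloc(int(np.prod(batch_shape)) * np.float32().nbytes)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = next(self.batches)
        except StopIteration:
            return None  # no more data, calibration is finished
        cuda.memcpy_htod(self.d_input, np.ascontiguousarray(batch, dtype=np.float32))
        return [int(self.d_input)]

    def read_calibration_cache(self):
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()
        return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

# usage sketch: config.set_flag(trt.BuilderFlag.INT8)
#               config.int8_calibrator = SimpleCalibrator(iter(my_batches), (8, 3, 640, 640))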

https://zhuanlan.zhihu.com/p/691159516
