Converting a YOLOv8 model to a TensorRT 10 engine and running inference
What I know so far: there are several ways to convert a model to TensorRT, e.g. via ONNX, via trtexec.exe, or directly from PyTorch.
See also: Sample Support Guide :: NVIDIA Deep Learning TensorRT Documentation
Ways to convert to TensorRT
1. PyTorch -> ONNX -> TensorRT, or TensorFlow -> ONNX -> TensorRT. See the samples directory of the TensorRT GitHub repository, TensorRT/samples/python at release/10.0 · NVIDIA/TensorRT (github.com), which contains many examples of converting ONNX models to TensorRT engines that you can modify directly.
2. Use the torch2trt library: https://github.com/NVIDIA-AI-IOT/torch2trt (a short usage sketch is shown after this list).
3. Use the trtexec.exe executable in the TensorRT installation directory and build the engine from the command line; see TensorRT/samples/trtexec at release/10.0 · NVIDIA/TensorRT (github.com). The documentation is in the Developer Guide :: NVIDIA Deep Learning TensorRT Documentation. Below is an example from that documentation; --onnx specifies the ONNX file and --memPoolSize= specifies the maximum workspace memory:
./trtexec --onnx=model.onnx --minShapes=input:1x3x244x244 --optShapes=input:16x3x244x244 --maxShapes=input:32x3x244x244 --shapes=input:5x3x244x244
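Returning to approach 2, a minimal torch2trt usage sketch; the example model and the fp16_mode keyword are taken as assumptions from the project README, so verify them against the version you install:

import torch
from torch2trt import torch2trt
from torchvision.models import resnet18

model = resnet18(weights=None).eval().cuda()       # example model, not from the original post
x = torch.randn(1, 3, 224, 224).cuda()             # sample input used to trace/convert the model

# Convert directly from PyTorch to a TensorRT-backed module, with no explicit ONNX step.
model_trt = torch2trt(model, [x], fp16_mode=True)  # fp16_mode assumed from the README

y = model_trt(x)                                   # runs through the TensorRT engine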
The most advanced approach: build the graph yourself with the TensorRT API and then fill in the weights, so no ONNX conversion is needed. Concrete examples:
TensorRT/samples/sampleCharRNN at main · NVIDIA/TensorRT (github.com); you can also build BERT or diffusion models yourself, see:
TensorRT/demo/Diffusion at main · NVIDIA/TensorRT (github.com)
A Python example of building the graph with TensorRT and filling in the weights:
TensorRT/samples/python/network_api_pytorch_mnist at main · NVIDIA/TensorRT (github.com)
Documentation on building a network from scratch:
create-network-def-scratch, Developer Guide :: NVIDIA Deep Learning TensorRT Documentation. As the documentation below explains, you can construct the network and populate the weights directly with the TensorRT API, with no need for a parser such as the ONNX one.
Instead of using a parser, you can also define the network directly to TensorRT using the Network Definition API. This scenario assumes that the per-layer weights are ready in host memory to pass to TensorRT during the network creation.
The following examples create a simple network with Input, Convolution, Pooling, MatrixMultiply, Shuffle, Activation, and SoftMax layers.
For more information regarding layers, refer to the
NVIDIA TensorRT Operator’s Reference.
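A minimal Python sketch of that Network Definition API workflow, assuming TensorRT 10; the layer sizes, tensor names, and random NumPy weights are placeholders (in practice the per-layer weights come from your trained model):

import numpy as np
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
config = builder.create_builder_config()

inp = network.add_input("input", trt.float32, (1, 1, 28, 28))  # input tensor (illustrative shape)

# Convolution layer whose weights are passed from host memory (random here, pretrained in practice).
conv_w = trt.Weights(np.random.randn(16, 1, 3, 3).astype(np.float32))
conv_b = trt.Weights(np.random.randn(16).astype(np.float32))
conv = network.add_convolution_nd(inp, 16, (3, 3), conv_w, conv_b)

relu = network.add_activation(conv.get_output(0), trt.ActivationType.RELU)
pool = network.add_pooling_nd(relu.get_output(0), trt.PoolingType.MAX, (2, 2))
network.mark_output(pool.get_output(0))

# Build and save the serialized engine (TensorRT 10 style).
engine_bytes = builder.build_serialized_network(network, config)
with open("scratch.engine", "wb") as f:
    f.write(engine_bytes)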
Flags for the Build Phase
--onnx=<model>: Specify the input ONNX model.
If the input model is in ONNX format, use the --minShapes, --optShapes, and --maxShapes flags to control the range of input shapes including batch size.
--minShapes=<shapes>, --optShapes=<shapes>, and --maxShapes=<shapes>: Specify the range of the input shapes to build the engine with. Only required if the input model is in ONNX format.
--memPoolSize=<pool_spec>: Specify the maximum size of the workspace that tactics are allowed to use, as well as the sizes of the memory pools that DLA will allocate per loadable. Supported pool types include workspace, dlaSRAM, dlaLocalDRAM, dlaGlobalDRAM, and tacticSharedMem.
--saveEngine=<file>: Specify the path to save the engine to.
--fp16, --bf16, --int8, --fp8, --noTF32, and --best: Specify network-level precision.
--stronglyTyped: Create a strongly typed network.
--sparsity=[disable|enable|force]: Specify whether to use tactics that support structured sparsity.
disable: Disable all tactics using structured sparsity. This is the default.
enable: Enable tactics using structured sparsity. Tactics will only be used if the weights in the ONNX file meet the requirements for structured sparsity.
force: Enable tactics using structured sparsity and allow trtexec to overwrite the weights in the ONNX file to enforce them to have structured sparsity patterns. Note that the accuracy is not preserved, so this is to get inference performance only.
Note: This has been deprecated. Use Polygraphy (polygraphy surgeon prune) to rewrite the weights of ONNX models to structured-sparsity pattern and then run with --sparsity=enable.
--timingCacheFile=<file>: Specify the timing cache to load from and save to.
--noCompilationCache: Disable compilation cache in builder, and the cache is part of timing cache (default is to enable compilation cache).
--verbose: Turn on verbose logging.
--skipInference: Build and save the engine without running inference.
--profilingVerbosity=[layer_names_only|detailed|none]: Specify the profiling verbosity to build the engine with.
--dumpLayerInfo, --exportLayerInfo=<file>: Print/Save the layer information of the engine.
--precisionConstraints=spec: Control precision constraint setting.
none: No constraints.
prefer: Meet precision constraints set by --layerPrecisions/--layerOutputTypes if possible.
obey: Meet precision constraints set by --layerPrecisions/--layerOutputTypes or fail otherwise.
--layerPrecisions=spec: Control per-layer precision constraints. Effective only when precisionConstraints is set to obey or prefer. The specs are read left to right, and later ones override earlier ones. "*" can be used as a layerName to specify the default precision for all the unspecified layers.
For example: --layerPrecisions=*:fp16,layer_1:fp32 sets the precision of all layers to FP16 except for layer_1, which will be set to FP32.
--layerOutputTypes=spec: Control per-layer output type constraints. Effective only when precisionConstraints is set to obey or prefer. The specs are read left to right, and later ones override earlier ones. "*" can be used as a layerName to specify the default precision for all the unspecified layers. If a layer has more than one output, then multiple types separated by "+" can be provided for this layer.
For example: --layerOutputTypes=*:fp16,layer_1:fp32+fp16 sets the precision of all layer outputs to FP16 except for layer_1, whose first output will be set to FP32 and whose second output will be set to FP16.
--layerDeviceTypes=spec: Explicitly set per-layer device type to either GPU or DLA. The specs are read left to right, and later ones override earlier ones.
--useDLACore=N: Use the specified DLA core for layers that support DLA.
--allowGPUFallback: Allow layers unsupported on DLA to run on GPU instead.
--versionCompatible, --vc: Enable version compatible mode for engine build and inference. Any engine built with this flag enabled is compatible with newer versions of TensorRT on the same host OS when run with TensorRT's dispatch and lean runtimes. Only supported with explicit batch mode.
--excludeLeanRuntime: When --versionCompatible is enabled, this flag indicates that the generated engine should not include an embedded lean runtime. If this is set, you must explicitly specify a valid lean runtime to use when loading the engine. Only supported with explicit batch and weights within the engine.
--tempdir=<dir>: Overrides the default temporary directory TensorRT will use when creating temporary files. Refer to the IRuntime::setTemporaryDirectory API documentation for more information.
--tempfileControls=controls: Controls what TensorRT is allowed to use when creating temporary executable files. Should be a comma-separated list with entries in the format [in_memory|temporary]:[allow|deny].
Options include:
in_memory: Controls whether TensorRT is allowed to create temporary in-memory executable files.
temporary: Controls whether TensorRT is allowed to create temporary executable files in the filesystem (in the directory given by --tempdir).
Example usage: --tempfileControls=in_memory:allow,temporary:deny
--dynamicPlugins=<file>: Load the plugin library dynamically and serialize it with the engine when it is included in --setPluginsToSerialize (can be specified multiple times).
--setPluginsToSerialize=<file>: Set the plugin library to be serialized with the engine (can be specified multiple times).
--builderOptimizationLevel=N: Set the builder optimization level to build the engine with. Higher level allows TensorRT to spend more building time for more optimization options.
--maxAuxStreams=N: Set maximum number of auxiliary streams per inference stream that TRT is allowed to use to run kernels in parallel if the network contains ops that can run in parallel, with the cost of more memory usage. Set this to 0 for optimal memory usage. Refer to the Within-Inference Multi-Streaming section for more information.
--stripWeights: Strip weights from plan. This flag works with either refit or refit with identical weights. Defaults to refit with identical weights, however, you can switch to refit by enabling both --stripWeights and --refit at the same time.
--markDebug: Specify a list of tensor names to be marked as debug tensors. Separate names with a comma.
--allowWeightStreaming: Enables an engine that can stream its weights. Must be specified with --stronglyTyped. TensorRT will automatically choose the appropriate weight streaming budget at runtime to ensure model execution. A specific amount can be set with --weightStreamingBudget.
Flags for the Inference Phase
--loadEngine=<file>: Load the engine from a serialized plan file instead of building it from the input ONNX model.
If the input model is in ONNX format or if the engine is built with explicit batch dimension, use --shapes instead.
--shapes=<shapes>: Specify the input shapes to run the inference with.
--loadInputs=<specs>: Load input values from files. Default is to generate random inputs.
--warmUp=<duration in ms>, --duration=<duration in seconds>, --iterations=<N>: Specify the minimum duration of the warm-up runs, the minimum duration for the inference runs, and the minimum iterations of the inference runs. For example, setting --warmUp=0 --duration=0 --iterations=N allows you to control exactly how many iterations to run the inference for.
--useCudaGraph: Capture the inference to a CUDA graph and run inference by launching the graph. This argument may be ignored when the built TensorRT engine contains operations that are not permitted under CUDA graph capture mode.
--noDataTransfers: Turn off host to device and device-to-host data transfers.
--useSpinWait: Actively synchronize on GPU events. This option makes latency measurement more stable but increases CPU usage and power.
--infStreams=<N>: Run inference with multiple cross-inference streams in parallel. Refer to the Cross-Inference Multi-Streaming section for more information.
--verbose: Turn on verbose logging.
--dumpProfile, --exportProfile=<file>: Print/Save the per-layer performance profile.
--dumpLayerInfo, --exportLayerInfo=<file>: Print layer information of the engine.
--profilingVerbosity=[layer_names_only|detailed|none]: Specify the profiling verbosity to run the inference with.
--useRuntime=[full|lean|dispatch]: TensorRT runtime to execute engine. lean and dispatch require --versionCompatible to be enabled and are used to load a VC engine. All engines (VC or not) must be built with full runtime.
--leanDLLPath=<file>: External lean runtime DLL to use in version compatible mode. Requires --useRuntime=[lean|dispatch].
--dynamicPlugins=<file>: Load the plugin library dynamically when the library is not included in the engine plan file (can be specified multiple times).
--getPlanVersionOnly: Print TensorRT version when loaded plan was created. Works without deserialization of the plan. Use together with --loadEngine. Supported only for engines created with 8.6 and later.
--saveDebugTensors: Specify list of tensor names to turn on the debug state and filename to save raw outputs to. These tensors must be specified as debug tensors during build time.
--allocationStrategy: Specify how the internal device memory for inference is allocated. You can choose from static, profile, and runtime. The first option is the default behavior that pre-allocates enough size for all profiles and input shapes. The second option enables trtexec to only allocate what’s required for the profile to use. The third option enables trtexec to only allocate what’s required for the actual input shapes.
--weightStreamingBudget: Manually set the weight streaming budget. Base-2 unit suffixes are supported: B (Bytes), G (Gibibytes), K (Kibibytes), M (Mebibytes). A value of 0 will choose the minimum possible budget if the weights don’t fit on the device. A value of -1 will disable weight streaming at runtime.
Refer to trtexec --help for all the supported flags and detailed explanations.
Refer to the GitHub: trtexec/README.md file for detailed information about how to build this tool and examples of its usage.
ONNX operator fusion
With approach 1, the ONNX model may contain operators that are not supported during conversion, and in that case operator fusion is needed.
For example, ONNX graphs may contain operators such as LayerNorm that do not convert cleanly; the fix is operator fusion, i.e. fusing the unsupported operators into a single node. For details see ONNX GraphSurgeon:
https://docs.nvidia.com/deeplearning/tensorrt/onnx-graphsurgeon/docs/index.html
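A minimal ONNX GraphSurgeon sketch of the idea, assuming you want to collapse a LayerNorm subgraph into one custom node backed by a TensorRT plugin; the tensor names and the "LayerNormPlugin" op name are hypothetical, and it assumes those boundary tensors only connect to the subgraph being replaced:

import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("model.onnx"))
tensors = graph.tensors()

# Hypothetical boundary tensors of the LayerNorm subgraph to be fused.
inp = tensors["ln_input"]
out = tensors["ln_output"]

# Disconnect the old subgraph from its boundary tensors.
inp.outputs.clear()
out.inputs.clear()

# Add one fused node in its place; TensorRT can map this op name to a custom plugin.
graph.nodes.append(gs.Node(op="LayerNormPlugin", name="fused_layernorm",
                           inputs=[inp], outputs=[out], attrs={"epsilon": 1e-5}))

# Remove the now-dangling original nodes and re-sort the graph.
graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "model_fused.onnx")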
Background
I needed to convert the yolov8x.pt model from GitHub - ultralytics/ultralytics: NEW - YOLOv8 in PyTorch > ONNX > OpenVINO > CoreML > TFLite to a TensorRT engine and then run inference, to reduce inference time and increase frames per second. The results were quite good: TensorRT really does speed things up. With FP16 enabled, the model was converted to float16, the inputs and outputs were float16 as well, and inference time dropped to less than half of the original, i.e. inference became more than twice as fast. If needed, the model can also be converted to INT8.
While converting yolov8x.pt to a half (float16) engine I found, during debugging, that half=True had no effect and the input was FLOAT rather than HALF, so I started modifying and debugging. Even with half=True, the exported engine's inputs and outputs were still float32:
TensorRT: input "images" with shape(1, 3, 640, 640) DataType.FLOAT
TensorRT: input "images" with shape(1, 3, 640, 640) DataType.FLOAT
TensorRT: output "output0" with shape(1, 132, 8400) DataType.FLOAT
TensorRT: output "output0" with shape(1, 132, 8400) DataType.FLOAT
instead of the expected:
TensorRT: input "images" with shape(1, 3, 640, 640) DataType.HALF
TensorRT: input "images" with shape(1, 3, 640, 640) DataType.HALF
TensorRT: output "output0" with shape(1, 132, 8400) DataType.HALF
TensorRT: output "output0" with shape(1, 132, 8400) DataType.HALF
The root cause is that the ONNX model's input was float32; the ONNX input has to be half for the engine to end up half.
Line 240 of engine\exporter.py was changed to if self.args.half and (onnx or engine) and self.device.type != "cpu":, otherwise the exported engine could never have float16 inputs and outputs.
Below is the pull request that fixes the bug and the API usage:
Converting the YOLOv8 model
Environment: Windows 10.
The GitHub repository used is the official one:
https://github.com/ultralytics/ultralytics
After installing TensorRT as described in "tensorrt 10.0.06在win10安装以及版本的api变更 - 知乎 (zhihu.com)", you also need to install a few other libraries such as onnx and onnxsim; these are usually installed automatically. The include and lib folders of the TensorRT package (the .dll, .lib, .hpp and .h files) must be copied into the corresponding CUDA installation directories.
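A quick sanity check of the environment before exporting; just a sketch, assuming the packages are importable under these names:

import tensorrt as trt
import torch
import onnx

print("TensorRT:", trt.__version__)                                    # expect a 10.x version here
print("PyTorch:", torch.__version__, "CUDA:", torch.cuda.is_available())
print("ONNX:", onnx.__version__)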
At the moment ultralytics does not support TensorRT 10, so if you want to use it as-is without changing any API calls, install a TensorRT version below 8.3, or refer to this pull request.
The following code converts the model to an engine, after which inference works normally.
import os
import gc
import sys
sys.path.append(r'E:\work\codeRepo\deploy\jz\ultralytics')
from ultralytics import YOLO # newest version from "git clone and git pull"
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
'''
Platform: Windows 11
Ultralytics YOLOv8.1.44 Python-3.9.18 torch-2.2.1+cu118 CUDA:0 (NVIDIA GeForce RTX 4070 Ti, 12282MiB)
onnx 1.16.0 opset 17
TensorRT 10.0.0b6:
https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.0.0/zip/TensorRT-10.0.0.6.Windows10.win10.cuda-11.8.zip
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:41:10_Pacific_Daylight_Time_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
'''
if __name__ == '__main__':
    model = YOLO(r'yolov8n.pt')  # load a pretrained model (recommended for training)
    # results = model.export(format='onnx', simplify=True, half=True, device='cuda:0')  # onnx,engine
    results0 = model.export(format='engine', simplify=True, half=True, device='cuda:0')  # onnx,engine
    del model
    gc.collect()
    model = YOLO(r"E:\work\yolov8n.engine")
    result = model.predict('https://ultralytics.com/images/bus.jpg', save=True)
Below are the more important parts of the export code, with some comments added.
# When exporting to onnx or engine with half set, both the dummy input and the model must be cast to float16
if self.args.half and (onnx or engine) and self.device.type != "cpu":
    im, model = im.half(), model.half()  # to FP16
The export_engine function that produces the TensorRT engine, with comments:
def export_engine(self, prefix=colorstr("TensorRT:")):
    """YOLOv8 TensorRT export https://developer.nvidia.com/tensorrt."""
    assert self.im.device.type != "cpu", "export running on CPU but must be on GPU, i.e. use 'device=0'"
    f_onnx, _ = self.export_onnx()  # run before trt import https://github.com/ultralytics/ultralytics/issues/7016
    try:  # load the tensorrt library
        import tensorrt as trt  # noqa
    except ImportError:
        if LINUX:
            check_requirements("nvidia-tensorrt", cmds="-U --index-url https://pypi.ngc.nvidia.com")
        import tensorrt as trt  # noqa
    # requires tensorrt>=7.0.0; only the lower bound is checked, there is no <=8 upper bound, so sad
    check_version(trt.__version__, "7.0.0", hard=True)
    self.trt_version = trt.__version__.split(".")[0]  # major version: 10, 8 or 7
    self.args.simplify = True
    LOGGER.info(f"\n{prefix} starting export with TensorRT {trt.__version__}...")
    assert Path(f_onnx).exists(), f"failed to export ONNX file: {f_onnx}"
    f = self.file.with_suffix(".engine")  # TensorRT engine file
    logger = trt.Logger(trt.Logger.INFO)  # TensorRT logger
    if self.args.verbose:
        logger.min_severity = trt.Logger.Severity.VERBOSE  # set logging verbosity
    builder = trt.Builder(logger)  # builder
    config = builder.create_builder_config()  # builder configuration
    if self.trt_version in ["7", "8"]:  # version 7 or 8
        config.max_workspace_size = int(self.args.workspace * (1 << 30))  # maximum workspace memory
    elif self.trt_version == "10":  # version 10
        config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, self.args.workspace * 1 << 30)  # maximum workspace memory
    flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flag)  # create the network
    parser = trt.OnnxParser(network, logger)  # ONNX parser
    if not parser.parse_from_file(f_onnx):  # parse the ONNX file
        raise RuntimeError(f"failed to load ONNX file: {f_onnx}")
    inputs = [network.get_input(i) for i in range(network.num_inputs)]  # network inputs
    outputs = [network.get_output(i) for i in range(network.num_outputs)]  # network outputs
    for inp in inputs:
        LOGGER.info(f'{prefix} input "{inp.name}" with shape{inp.shape} {inp.dtype}')
    for out in outputs:
        LOGGER.info(f'{prefix} output "{out.name}" with shape{out.shape} {out.dtype}')
    if self.args.dynamic:  # dynamic input shapes
        shape = self.im.shape
        if shape[0] <= 1:
            LOGGER.warning(f"{prefix} WARNING ⚠️ 'dynamic=True' model requires max batch size, i.e. 'batch=16'")
        profile = builder.create_optimization_profile()
        for inp in inputs:
            profile.set_shape(inp.name, (1, *shape[1:]), (max(1, shape[0] // 2), *shape[1:]), shape)
        config.add_optimization_profile(profile)
    LOGGER.info(
        f"{prefix} building FP{16 if builder.platform_has_fast_fp16 and self.args.half else 32} engine as {f}"
    )
    # enable half precision (FP16), i.e. the half=True option
    if builder.platform_has_fast_fp16 and self.args.half:
        config.set_flag(trt.BuilderFlag.FP16)
    del self.model
    torch.cuda.empty_cache()  # free unused GPU memory
    # Write file
    if self.trt_version in ["7", "8"]:  # version 7 or 8
        with builder.build_engine(network, config) as engine, open(f, "wb") as t:  # build the engine
            # Metadata
            meta = json.dumps(self.metadata)  # export metadata
            t.write(len(meta).to_bytes(4, byteorder="little", signed=True))  # write metadata length
            t.write(meta.encode())  # write metadata
            # Model
            t.write(engine.serialize())  # serialize and write the engine; export is done
    elif self.trt_version == "10":  # version 10
        with builder.build_serialized_network(network, config) as engine, open(f, "wb") as t:  # build and serialize the engine
            # Metadata
            meta = json.dumps(self.metadata)  # export metadata
            t.write(len(meta).to_bytes(4, byteorder="little", signed=True))  # write metadata length
            t.write(meta.encode())  # write metadata
            # Model
            t.write(engine)  # write the already serialized engine; export is done
    return f, None
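To confirm that the fix really produced FP16 I/O (the DataType.HALF logs shown earlier), a small verification sketch; it assumes the Ultralytics metadata header written above (4-byte length + JSON) and an engine path of your own:

import json
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
with open(r"E:\work\yolov8n.engine", "rb") as f, trt.Runtime(logger) as runtime:
    meta_len = int.from_bytes(f.read(4), byteorder="little")  # metadata length written at export time
    meta = json.loads(f.read(meta_len).decode("utf-8"))       # Ultralytics metadata
    engine = runtime.deserialize_cuda_engine(f.read())        # the actual TensorRT engine

for i in range(engine.num_io_tensors):  # TensorRT 10 I/O tensor API
    name = engine.get_tensor_name(i)
    print(name, engine.get_tensor_mode(name), engine.get_tensor_shape(name), engine.get_tensor_dtype(name))
# For a half engine the inputs/outputs should print DataType.HALF rather than DataType.FLOAT.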
Converting other models
PyTorch -> ONNX -> TensorRT, or TensorFlow -> ONNX -> TensorRT; see the samples directory of the TensorRT GitHub repository, TensorRT/samples/python at release/10.0 · NVIDIA/TensorRT (github.com), which contains many examples of converting ONNX models to TensorRT engines that you can modify directly.
A Python example of building the graph with TensorRT and filling in the weights:
TensorRT/samples/python/network_api_pytorch_mnist at main · NVIDIA/TensorRT (github.com)
For details, see the "Ways to convert to TensorRT" section at the top of this article.
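For the PyTorch -> ONNX step itself, a minimal sketch using torch.onnx.export; the example model, input name, opset and file path are illustrative choices, not from the original post:

import torch
import torchvision

model = torchvision.models.resnet18(weights=None).eval()  # any nn.Module with a known input shape
dummy = torch.randn(1, 3, 224, 224)                        # dummy input that defines the traced shapes

torch.onnx.export(
    model,
    dummy,
    "model.onnx",                                          # output path (illustrative)
    input_names=["input"],
    output_names=["output"],
    opset_version=17,                                      # assumed opset; pick one your TensorRT version supports
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},  # optional dynamic batch dimension
)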
Building the graph with the API and adding the weights
The most advanced approach: build the graph yourself with the TensorRT API and then fill in the weights, so no ONNX conversion is needed. Concrete examples:
TensorRT/samples/sampleCharRNN at main · NVIDIA/TensorRT (github.com); you can also build BERT or diffusion models yourself, see:
TensorRT/demo/Diffusion at main · NVIDIA/TensorRT (github.com)
A Python example of building the graph with TensorRT and filling in the weights:
TensorRT/samples/python/network_api_pytorch_mnist at main · NVIDIA/TensorRT (github.com)
Documentation on building a network from scratch:
create-network-def-scratch, Developer Guide :: NVIDIA Deep Learning TensorRT Documentation. The documentation states that you can construct the network and populate the weights directly with the TensorRT API, with no need for a parser such as the ONNX one.
Instead of using a parser, you can also define the network directly to TensorRT using the Network Definition API. This scenario assumes that the per-layer weights are ready in host memory to pass to TensorRT during the network creation. The following examples create a simple network with Input, Convolution, Pooling, MatrixMultiply, Shuffle, Activation, and SoftMax layers. For more information regarding layers, refer to the
NVIDIA TensorRT Operator’s Reference.
YOLOv8 inference
The script is the same export-and-predict example shown above in "Converting the YOLOv8 model"; once the engine has been exported, inference is simply:
model = YOLO(r"E:\work\yolov8n.engine")
result = model.predict('https://ultralytics.com/images/bus.jpg', save=True)
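To check the speed-up claim from the Background section yourself, a rough timing sketch; the warm-up count, iteration count and image path are arbitrary choices:

import time
from ultralytics import YOLO

def benchmark(weights, img="bus.jpg", n=100):
    model = YOLO(weights)
    for _ in range(10):                      # warm-up iterations
        model.predict(img, verbose=False)
    t0 = time.perf_counter()
    for _ in range(n):
        model.predict(img, verbose=False)
    dt = (time.perf_counter() - t0) / n
    print(f"{weights}: {dt * 1000:.1f} ms/img, {1 / dt:.1f} FPS")

benchmark(r"yolov8n.pt")                     # PyTorch baseline
benchmark(r"E:\work\yolov8n.engine")         # TensorRT FP16 engine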
Below is the main code that loads the engine and sets up the bindings at inference time, with comments added.
self.trt_version = trt.__version__.split(".")[0]
if self.trt_version in ["7", "8"]:  # TensorRT 7 or 8
    if device.type == "cpu":
        device = torch.device("cuda:0")
    Binding = namedtuple("Binding", ("name", "dtype", "shape", "data", "ptr"))  # per-tensor record
    logger = trt.Logger(trt.Logger.INFO)  # logger
    # Read file
    with open(w, "rb") as f, trt.Runtime(logger) as runtime:  # read the engine file
        meta_len = int.from_bytes(f.read(4), byteorder="little")  # read metadata length
        metadata = json.loads(f.read(meta_len).decode("utf-8"))  # read metadata
        model = runtime.deserialize_cuda_engine(f.read())  # read engine
    context = model.create_execution_context()  # execution context
    bindings = OrderedDict()
    output_names = []
    fp16 = False  # default updated below
    dynamic = False
    for i in range(model.num_bindings):
        name = model.get_binding_name(i)  # input/output name
        dtype = trt.nptype(model.get_binding_dtype(i))  # input/output dtype
        if model.binding_is_input(i):  # is this binding an input?
            if -1 in tuple(model.get_binding_shape(i)):  # dynamic
                dynamic = True
                context.set_binding_shape(i, tuple(model.get_profile_shape(0, i)[2]))
            if dtype == np.float16:
                fp16 = True
        else:  # output
            output_names.append(name)
        shape = tuple(context.get_binding_shape(i))
        im = torch.from_numpy(np.empty(shape, dtype=dtype)).to(device)  # allocate the tensor on the GPU
        bindings[name] = Binding(name, dtype, shape, im, int(im.data_ptr()))  # keep its GPU address
    binding_addrs = OrderedDict((n, d.ptr) for n, d in bindings.items())  # name -> device address
    batch_size = bindings["images"].shape[0]  # if dynamic, this is instead max batch size
    self.output_names = output_names
    self.fp16 = fp16
    self.binding_addrs = binding_addrs
    self.bindings = bindings
    self.dynamic = dynamic
    self.context = context
elif self.trt_version == "10":  # TensorRT 10
    if device.type == "cpu":
        device = torch.device("cuda:0")
    Binding = namedtuple("Binding", ("name", "dtype", "shape", "data", "ptr"))
    logger = trt.Logger(trt.Logger.INFO)
    # Read file
    with open(w, "rb") as f, trt.Runtime(logger) as runtime:  # read the engine file
        meta_len = int.from_bytes(f.read(4), byteorder="little")  # read metadata length
        metadata = json.loads(f.read(meta_len).decode("utf-8"))  # read metadata
        model = runtime.deserialize_cuda_engine(f.read())  # read engine
    context = model.create_execution_context()  # execution context
    bindings = OrderedDict()
    output_names = []
    fp16 = False  # default updated below
    dynamic = False
    for i in range(model.num_io_tensors):
        name = model.get_tensor_name(i)  # input/output name
        dtype = trt.nptype(model.get_tensor_dtype(name))  # input/output dtype
        mode = model.get_tensor_mode(name)  # TensorIOMode: 1 = input, 2 = output
        if mode.value == 1:  # input
            if -1 in tuple(model.get_tensor_shape(name)):  # dynamic
                dynamic = True
                context.set_input_shape(name, tuple(model.get_tensor_profile_shape(name, 0)[2]))
            if dtype == np.float16:
                fp16 = True
        else:  # output
            output_names.append(name)
        shape = tuple(context.get_tensor_shape(name))
        im = torch.from_numpy(np.empty(shape, dtype=dtype)).to(device)  # allocate the tensor on the GPU
        bindings[name] = Binding(name, dtype, shape, im, int(im.data_ptr()))  # keep its GPU address
    binding_addrs = OrderedDict((n, d.ptr) for n, d in bindings.items())  # name -> device address
    batch_size = bindings["images"].shape[0]  # if dynamic, this is instead max batch size
    self.output_names = output_names
    self.fp16 = fp16
    self.binding_addrs = binding_addrs
    self.bindings = bindings
    self.dynamic = dynamic
    self.context = context
Below is the main inference (forward) code:
elif self.engine:
    if self.trt_version in ["7", "8"]:  # TensorRT 7 or 8
        if self.dynamic and im.shape != self.bindings["images"].shape:  # dynamic input shape
            i = self.model.get_binding_index("images")
            self.context.set_binding_shape(i, im.shape)  # reshape if dynamic
            self.bindings["images"] = self.bindings["images"]._replace(shape=im.shape)
            for name in self.output_names:
                i = self.model.get_binding_index(name)
                self.bindings[name].data.resize_(tuple(self.context.get_binding_shape(i)))
        s = self.bindings["images"].shape
        assert (
            im.shape == s
        ), f"input size {im.shape} {'>' if self.dynamic else 'not equal to'} max model size {s}"
        self.binding_addrs["images"] = int(im.data_ptr())  # device address of the input image
        self.context.execute_v2(list(self.binding_addrs.values()))  # run inference on the input/output addresses
        y = [self.bindings[x].data for x in sorted(self.output_names)]  # inference done; collect the outputs
    elif self.trt_version == "10":  # TensorRT 10
        if self.dynamic and im.shape != self.bindings["images"].shape:  # dynamic input shape
            self.context.set_input_shape("images", im.shape)  # reshape if dynamic
            self.bindings["images"] = self.bindings["images"]._replace(shape=im.shape)
            for name in self.output_names:
                self.bindings[name].data.resize_(tuple(self.context.get_tensor_shape(name)))  # resize output buffers
        s = self.bindings["images"].shape
        assert (
            im.shape == s
        ), f"input size {im.shape} {'>' if self.dynamic else 'not equal to'} max model size {s}"
        self.binding_addrs["images"] = int(im.data_ptr())  # device address of the input image
        self.context.execute_v2(list(self.binding_addrs.values()))  # run inference on the input/output addresses
        y = [self.bindings[x].data for x in sorted(self.output_names)]  # inference done; collect the outputs
Inference for other models
See the samples in the official TensorRT repository,
https://github.com/NVIDIA/TensorRT/tree/main/samples
or the YOLOv8 inference code above.
Converting to an INT8 model
Converting to INT8 requires calibration, which is more involved, and accuracy may drop slightly, so I have not done it yet. See the samples in the official TensorRT repository; for example, efficientnet has an INT8 sample:
TensorRT/samples/python/efficientnet/build_engine.py at main · NVIDIA/TensorRT (github.com)
if precision == "fp16":
    if not self.builder.platform_has_fast_fp16:
        log.warning("FP16 is not supported natively on this platform/device")
    else:
        self.config.set_flag(trt.BuilderFlag.FP16)
elif precision == "int8":
    if not self.builder.platform_has_fast_int8:
        log.warning("INT8 is not supported natively on this platform/device")
    else:
        self.config.set_flag(trt.BuilderFlag.INT8)
        self.config.int8_calibrator = EngineCalibrator(calib_cache)
        if not os.path.exists(calib_cache):
            calib_shape = [calib_batch_size] + list(inputs[0].shape[1:])
            calib_dtype = trt.nptype(inputs[0].dtype)
            self.config.int8_calibrator.set_image_batcher(
                ImageBatcher(
                    calib_input,
                    calib_shape,
                    calib_dtype,
                    max_num_images=calib_num_images,
                    exact_batches=True,
                    preprocessor=calib_preprocessor,
                )
            )
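For reference, a minimal sketch of what such a calibrator can look like with the TensorRT Python API; this is a simplified stand-in for the EngineCalibrator/ImageBatcher pair used in the sample, assuming pycuda is available for the device copies and that you supply the preprocessed batches yourself:

import os
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: creates a CUDA context
import tensorrt as trt

class SimpleCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds preprocessed batches to TensorRT during INT8 calibration and caches the result."""

    def __init__(self, batches, cache_file="calib.cache"):
        super().__init__()
        self.batches = iter(batches)           # iterable of np.float32 arrays shaped (N, C, H, W)
        self.cache_file = cache_file
        first = next(self.batches)
        self.batch_size = first.shape[0]
        self.device_input = cuda.mem_alloc(first.nbytes)  # device buffer reused for every batch
        self.current = first

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        if self.current is None:
            return None                         # no more data: calibration ends
        cuda.memcpy_htod(self.device_input, np.ascontiguousarray(self.current))
        self.current = next(self.batches, None)
        return [int(self.device_input)]

    def read_calibration_cache(self):
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)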