This Jupyter notebook demonstrates the optimization of the BLOOM 560M model, a large language model, for faster inference using NVIDIA's TensorRT-LLM. The guide covers the installation of necessary ...
This project demonstrates how to use the TensorRT C++ API for high-performance GPU inference. It covers how to do the following: If you are having issues creating the TensorRT engine file from the ...
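Once a serialized engine file exists (built, for example, with `trtexec` or the builder API), inference with the C++ API follows a deserialize-allocate-enqueue pattern. The sketch below illustrates that flow under stated assumptions: it targets TensorRT 8.x, and the engine file name `model.engine`, the binding order (input at index 0, output at index 1), and the buffer sizes are placeholders, not values taken from this project.

```cpp
// Minimal sketch (assumes TensorRT 8.x and CUDA are installed).
// "model.engine", binding order, and buffer sizes are placeholders.
#include <NvInfer.h>
#include <cuda_runtime_api.h>

#include <fstream>
#include <iostream>
#include <memory>
#include <vector>

// TensorRT requires an ILogger implementation; this one prints warnings and errors.
class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
};

int main()
{
    Logger logger;

    // Read the serialized engine from disk.
    std::ifstream file("model.engine", std::ios::binary | std::ios::ate);
    if (!file)
    {
        std::cerr << "Failed to open engine file" << std::endl;
        return 1;
    }
    const size_t size = file.tellg();
    file.seekg(0);
    std::vector<char> engineData(size);
    file.read(engineData.data(), size);

    // Deserialize the engine and create an execution context.
    std::unique_ptr<nvinfer1::IRuntime> runtime{nvinfer1::createInferRuntime(logger)};
    std::unique_ptr<nvinfer1::ICudaEngine> engine{
        runtime->deserializeCudaEngine(engineData.data(), size)};
    std::unique_ptr<nvinfer1::IExecutionContext> context{engine->createExecutionContext()};

    // Allocate device buffers for one input and one output binding.
    // Sizes are placeholders; query the engine's binding dimensions in real code.
    const size_t inputBytes  = 3 * 224 * 224 * sizeof(float);
    const size_t outputBytes = 1000 * sizeof(float);
    void* buffers[2];
    cudaMalloc(&buffers[0], inputBytes);
    cudaMalloc(&buffers[1], outputBytes);

    std::vector<float> hostInput(3 * 224 * 224, 0.f);
    std::vector<float> hostOutput(1000);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Copy input to the GPU, run inference asynchronously, copy the result back.
    cudaMemcpyAsync(buffers[0], hostInput.data(), inputBytes, cudaMemcpyHostToDevice, stream);
    context->enqueueV2(buffers, stream, nullptr);
    cudaMemcpyAsync(hostOutput.data(), buffers[1], outputBytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    std::cout << "First output value: " << hostOutput[0] << std::endl;

    cudaFree(buffers[0]);
    cudaFree(buffers[1]);
    cudaStreamDestroy(stream);
    return 0;
}
```

A program like this links against the TensorRT and CUDA runtime libraries (e.g. `-lnvinfer -lcudart`); newer TensorRT releases deprecate `enqueueV2` in favor of named-tensor APIs, so adjust to the version you build against.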