Overview

Plain English

TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.

Technical scorecard

License Apache-2.0

Commercial use Yes

OpenAI-compatible API Yes (via the trtllm-serve command)

REST API Yes (via the trtllm-serve command)

Fine-tuning support No

Quantization support Yes

Docker available Yes

GUI / no-code available No

Telemetry None

Offline after setup Yes

Data & Privacy

Does it send data online?

After setup, this listing is marked as usable offline. Confirm network behavior against the upstream project before regulated deployment.

Does it store history?

Not verified in this directory yet. Review the upstream docs for persistence, logs, and workspace storage.

License checks?

Commercial use is marked as allowed or likely allowed by the listed license.

Telemetry?

None

Last verified: May 17, 2026. Maintainer verification should be treated as directory guidance, not legal advice.

Setup & Installation

Medium

A developer can usually get this running with standard docs.

Prerequisites

Python, Docker, Bare Metal, Kubernetes

# Start with the official project documentation
# https://github.com/NVIDIA/TensorRT-LLM
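
Once the package is installed per the official documentation, a first run through the Python API might look like the sketch below. It is a minimal illustration, not the project's recommended setup: the helper names run_quickstart and format_completion are hypothetical, the model identifier is left to the caller, and the sampling values are arbitrary. The tensorrt_llm import is deferred into the function so the file can be loaded on a machine where the library is not yet installed.

```python
from typing import List


def format_completion(prompt: str, text: str) -> str:
    """Render one prompt/completion pair on a single line (hypothetical helper)."""
    return f"Prompt: {prompt!r} | Generated: {text!r}"


def run_quickstart(prompts: List[str], model: str) -> List[str]:
    """Generate one completion per prompt with the TensorRT LLM Python API.

    Requires an NVIDIA GPU and an installed tensorrt_llm package.
    """
    # Deferred import: loading this file does not require TensorRT LLM.
    from tensorrt_llm import LLM, SamplingParams

    # Arbitrary sampling settings chosen for illustration only.
    sampling = SamplingParams(temperature=0.8, top_p=0.95)

    # `model` is a Hugging Face model name or a local checkpoint path.
    llm = LLM(model=model)
    outputs = llm.generate(prompts, sampling)
    return [format_completion(o.prompt, o.outputs[0].text) for o in outputs]
```

Calling run_quickstart(["Hello, my name is"], "TinyLlama/TinyLlama-1.1B-Chat-v1.0") would download that model on first use, so the call needs network access at setup time; the model name here is only an example.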

Hardware Requirements

RAM 16 GB minimum / 32 GB recommended

Hardware tags NVIDIA GPU (CUDA)

Model formats Not specified

Primary language Python
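
The requirements above can be turned into a quick pre-install check. The sketch below uses only the Python standard library and assumes a Unix-like host; the function names are hypothetical, and looking for nvidia-smi on the PATH is only a rough proxy for a working NVIDIA driver, not proof of a usable CUDA setup.

```python
import os
import shutil
from typing import Optional

MIN_RAM_GB = 16  # minimum from the listing
REC_RAM_GB = 32  # recommended value from the listing


def classify_ram(total_gb: float) -> str:
    """Compare a RAM size in GB against the listed thresholds."""
    if total_gb >= REC_RAM_GB:
        return "recommended"
    if total_gb >= MIN_RAM_GB:
        return "minimum"
    return "insufficient"


def total_ram_gb() -> Optional[float]:
    """Return physical RAM in GiB, or None where os.sysconf cannot report it."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None
    return pages * page_size / 1024**3


def has_nvidia_driver() -> bool:
    """Rough check: is the nvidia-smi tool available on the PATH?"""
    return shutil.which("nvidia-smi") is not None


if __name__ == "__main__":
    ram = total_ram_gb()
    if ram is None:
        print("RAM: could not be determined on this platform")
    else:
        print(f"RAM: {ram:.1f} GiB ({classify_ram(ram)})")
    print("nvidia-smi found:", has_nvidia_driver())
```

A machine sold as 16 GB usually reports slightly less than 16 GiB to the operating system, so a result of "insufficient" near the threshold deserves a manual look rather than a hard failure.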

Works Well With

Open WebUI: User-friendly AI Interface (Supports Ollama, OpenAI API, ...).

RAGFlow: RAGFlow is a leading open-source Retrieval-Augmented Generation (RAG) engine that fuses cutting-edge RAG with Agent capabilities to create a superior context layer for LLMs.

GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.

LobeChat: LobeHub organizes your agents into 7×24 operation. It hires, schedules, reports on your entire AI team. You stay in charge, without staying online.

You might also evaluate

Ollama: Get up and running with Kimi-K2.5, GLM-5, MiniMax, DeepSeek, gpt-oss, Qwen, Gemma and other mode...

llama.cpp: LLM inference in C/C++....

vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs....

LocalAI: LocalAI is the open-source AI engine. Run any model - LLMs, vision, voice, image, video - on any...