Tiny-LLM is a lightweight CUDA C++ inference engine for experimenting with W8A16 quantization, KV Cache incremental decoding, and modular Transformer inference.
- W8A16 quantized inference with INT8 weights and FP16 activations
- CUDA kernels for matmul, attention, RMSNorm, and elementwise ops
- Host-side modules for model loading, transformer execution, generation, and cache management
- Dedicated docs site for quick start, API reference, changelog, and contribution notes
Build with CMake and run the test suite:

```shell
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
ctest --output-on-failure
```

MIT License.