
Tiny-LLM Inference Engine


English | 简体中文 | Docs

Tiny-LLM is a lightweight CUDA C++ inference engine for experimenting with W8A16 quantization, KV Cache incremental decoding, and modular Transformer inference.

Repository Overview

  • W8A16 quantized inference with INT8 weights and FP16 activations
  • CUDA kernels for matmul, attention, RMSNorm, and elementwise ops
  • Host-side modules for model loading, transformer execution, generation, and cache management
  • Dedicated docs site for quick start, API reference, changelog, and contribution notes

Quick Start

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
ctest --output-on-failure

License

MIT License.

About

Lightweight LLM inference engine (CUDA C++17): W8A16 quantization, KV Cache management, and multiple sampling strategies.
