Tiny-LLM is a lightweight CUDA C++ inference engine for experimenting with W8A16 quantization, KV Cache incremental decoding, and modular Transformer inference.
- W8A16 quantized inference with INT8 weights and FP16 activations
- CUDA kernels for matmul, attention, RMSNorm, and elementwise ops
- Host-side modules for model loading, transformer execution, generation, and cache management
- Dedicated docs site for quick start, API reference, changelog, and contribution notes
Build with CMake and run the test suite:

```shell
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
ctest --output-on-failure
```

MIT License.