Meet oLLM: A Lightweight Python Library that brings 100K-Context LLM Inference to 8 GB Consumer GPUs via SSD Offload—No Quantization Required
oLLM is a lightweight Python library built on top of Hugging Face Transformers and PyTorch that runs large-context transformers on NVIDIA GPUs by aggressively offloading weights and KV-cache to fast local SSDs. The project targets offline, single-GPU workloads and explicitly avoids quantization, using FP16/BF16 weights with FlashAttention-2 and disk-backed KV caching to keep VRAM within 8–10…
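
To make the disk-offload idea concrete, here is a minimal sketch of a disk-backed KV cache in plain PyTorch. It is not the oLLM API: the `DiskKVCache` class, file layout, and tensor shapes are illustrative assumptions, showing only the general pattern of flushing per-layer key/value tensors to a local SSD and reloading them on demand so they do not occupy VRAM.

```python
# Illustrative sketch of a disk-backed KV cache (not the oLLM API).
# Each layer's key/value tensors are persisted to a local SSD file and
# memory-loaded back onto the GPU only when that layer is needed again,
# trading VRAM for SSD bandwidth. Names and paths here are hypothetical.
import os
import torch


class DiskKVCache:
    def __init__(self, cache_dir: str, device: str = "cuda"):
        self.cache_dir = cache_dir
        self.device = device
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, layer_idx: int) -> str:
        return os.path.join(self.cache_dir, f"layer_{layer_idx}.pt")

    def store(self, layer_idx: int, key: torch.Tensor, value: torch.Tensor) -> None:
        # Move the layer's KV to host memory and persist it to SSD,
        # freeing the corresponding VRAM.
        torch.save(
            {"k": key.to("cpu"), "v": value.to("cpu")},
            self._path(layer_idx),
        )

    def load(self, layer_idx: int) -> tuple[torch.Tensor, torch.Tensor]:
        # Read the layer's KV back from disk and place it on the GPU on demand.
        blob = torch.load(self._path(layer_idx), map_location="cpu")
        return blob["k"].to(self.device), blob["v"].to(self.device)


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    cache = DiskKVCache("./kv_cache", device=device)
    # Hypothetical KV shapes: (batch, heads, seq_len, head_dim) in FP16.
    k = torch.randn(1, 8, 1024, 128, dtype=torch.float16)
    v = torch.randn(1, 8, 1024, 128, dtype=torch.float16)
    cache.store(0, k, v)
    k2, v2 = cache.load(0)
    print(k2.shape, v2.shape)
```

In this sketch the cost of re-reading KV from SSD replaces the cost of holding it resident in VRAM, which is the trade-off the library leans on to reach very long contexts on an 8 GB card.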
