A Coding Tutorial for Running PrismML Bonsai 1-Bit LLM on CUDA with GGUF, Benchmarking, Chat, JSON, and RAG
In this tutorial, we implement find out how to run the Bonsai 1-bit giant language mannequin effectively utilizing GPU acceleration and PrismML’s optimized GGUF deployment stack. We arrange the atmosphere, set up the required dependencies, and obtain the prebuilt llama.cpp binaries, and load the Bonsai-1.7B mannequin for quick inference on CUDA. As we progress, we look at how 1-bit quantization works below the hood, why the Q1_0_g128 format is so memory-efficient, and how this makes Bonsai sensible for light-weight but succesful language mannequin deployment. We additionally check core inference, benchmarking, multi-turn chat, structured JSON era, code era, OpenAI-compatible server mode, and a small retrieval-augmented era workflow, giving us a whole, hands-on view of how Bonsai operates in real-world use.
import os, sys, subprocess, time, json, urllib.request, tarfile, textwrap
attempt:
import google.colab
IN_COLAB = True
besides ImportError:
IN_COLAB = False
def part(title):
bar = "═" * 60
print(f"n{bar}n {title}n{bar}")
part("1 · Environment & GPU Check")
def run(cmd, seize=False, test=True, **kw):
return subprocess.run(
cmd, shell=True, capture_output=seize,
textual content=True, test=test, **kw
)
gpu_info = run("nvidia-smi --query-gpu=title,reminiscence.complete,driver_version --format=csv,noheader",
seize=True, test=False)
if gpu_info.returncode == 0:
print("
GPU detected:", gpu_info.stdout.strip())
else:
print("
No GPU discovered — inference will run on CPU (a lot slower).")
cuda_check = run("nvcc --version", seize=True, test=False)
if cuda_check.returncode == 0:
for line in cuda_check.stdout.splitlines():
if "launch" in line:
print(" CUDA:", line.strip())
break
print(f" Python {sys.model.cut up()[0]} | Platform: Linux (Colab)")
part("2 · Installing Python Dependencies")
run("pip set up -q huggingface_hub requests tqdm openai")
print("
huggingface_hub, requests, tqdm, openai put in")
from huggingface_hub import hf_hub_download
We start by importing the core Python modules that we’d like for system operations, downloads, timing, and JSON dealing with. We test whether or not we’re working inside Google Colab, outline a reusable part printer, and create a helper perform to run shell instructions cleanly from Python. We then confirm the GPU and CUDA atmosphere, print the Python runtime particulars, set up the required Python dependencies, and put together the Hugging Face obtain utility for the following levels.
part("3 · Downloading PrismML llama.cpp Prebuilt Binaries")
RELEASE_TAG = "prism-b8194-1179bfc"
BASE_URL = f"https://github.com/PrismML-Eng/llama.cpp/releases/obtain/{RELEASE_TAG}"
BIN_DIR = "/content material/bonsai_bin"
os.makedirs(BIN_DIR, exist_ok=True)
def detect_cuda_build():
r = run("nvcc --version", seize=True, test=False)
for line in r.stdout.splitlines():
if "launch" in line:
attempt:
ver = float(line.cut up("launch")[-1].strip().cut up(",")[0].strip())
if ver >= 13.0: return "13.1"
if ver >= 12.6: return "12.8"
return "12.4"
besides ValueError:
move
return "12.4"
cuda_build = detect_cuda_build()
print(f" Detected CUDA construct slot: {cuda_build}")
TAR_NAME = f"llama-{RELEASE_TAG}-bin-linux-cuda-{cuda_build}-x64.tar.gz"
TAR_URL = f"{BASE_URL}/{TAR_NAME}"
tar_path = f"/tmp/{TAR_NAME}"
if not os.path.exists(f"{BIN_DIR}/llama-cli"):
print(f" Downloading: {TAR_URL}")
urllib.request.urlretrieve(TAR_URL, tar_path)
print(" Extracting …")
with tarfile.open(tar_path, "r:gz") as t:
t.extractall(BIN_DIR)
for fname in os.listdir(BIN_DIR):
fp = os.path.be part of(BIN_DIR, fname)
if os.path.isfile(fp):
os.chmod(fp, 0o755)
print(f"
Binaries extracted to {BIN_DIR}")
bins = sorted(f for f in os.listdir(BIN_DIR) if os.path.isfile(os.path.be part of(BIN_DIR, f)))
print(" Available:", ", ".be part of(bins))
else:
print(f"
Binaries already current at {BIN_DIR}")
LLAMA_CLI = f"{BIN_DIR}/llama-cli"
LLAMA_SERVER = f"{BIN_DIR}/llama-server"
check = run(f"{LLAMA_CLI} --version", seize=True, test=False)
if check.returncode == 0:
print(f" llama-cli model: {check.stdout.strip()[:80]}")
else:
print(f"
llama-cli check failed: {check.stderr.strip()[:200]}")
part("4 · Downloading Bonsai-1.7B GGUF Model")
MODEL_REPO = "prism-ml/Bonsai-1.7B-gguf"
MODEL_DIR = "/content material/bonsai_models"
GGUF_FILENAME = "Bonsai-1.7B.gguf"
os.makedirs(MODEL_DIR, exist_ok=True)
MODEL_PATH = os.path.be part of(MODEL_DIR, GGUF_FILENAME)
if not os.path.exists(MODEL_PATH):
print(f" Downloading {GGUF_FILENAME} (~248 MB) from HuggingFace …")
MODEL_PATH = hf_hub_download(
repo_id=MODEL_REPO,
filename=GGUF_FILENAME,
local_dir=MODEL_DIR,
)
print(f"
Model saved to: {MODEL_PATH}")
else:
print(f"
Model already cached: {MODEL_PATH}")
size_mb = os.path.getsize(MODEL_PATH) / 1e6
print(f" File dimension on disk: {size_mb:.1f} MB")
part("5 · Core Inference Helpers")
DEFAULT_GEN_ARGS = dict(
temp=0.5,
top_p=0.85,
top_k=20,
repeat_penalty=1.0,
n_predict=256,
n_gpu_layers=99,
ctx_size=4096,
)
def build_llama_cmd(immediate, system_prompt="You are a useful assistant.", **overrides):
args = {**DEFAULT_GEN_ARGS, **overrides}
formatted = (
f"<|im_start|>systemn{system_prompt}<|im_end|>n"
f"<|im_start|>usern{immediate}<|im_end|>n"
f"<|im_start|>assistantn"
)
safe_prompt = formatted.exchange('"', '"')
return (
f'{LLAMA_CLI} -m "{MODEL_PATH}"'
f' -p "{safe_prompt}"'
f' -n {args["n_predict"]}'
f' --temp {args["temp"]}'
f' --top-p {args["top_p"]}'
f' --top-k {args["top_k"]}'
f' --repeat-penalty {args["repeat_penalty"]}'
f' -ngl {args["n_gpu_layers"]}'
f' -c {args["ctx_size"]}'
f' --no-display-prompt'
f' -e'
)
def infer(immediate, system_prompt="You are a useful assistant.", verbose=True, **overrides):
cmd = build_llama_cmd(immediate, system_prompt, **overrides)
t0 = time.time()
end result = run(cmd, seize=True, test=False)
elapsed = time.time() - t0
output = end result.stdout.strip()
if verbose:
print(f"n{'─'*50}")
print(f"Prompt : {immediate[:100]}{'…' if len(immediate) > 100 else ''}")
print(f"{'─'*50}")
print(output)
print(f"{'─'*50}")
print(f"
{elapsed:.2f}s | ~{len(output.cut up())} phrases")
return output, elapsed
print("
Inference helpers prepared.")
part("6 · Basic Inference — Hello, Bonsai!")
infer("What makes 1-bit language fashions particular in comparison with commonplace fashions?")
We obtain and put together the PrismML prebuilt llama.cpp CUDA binaries that energy native inference for the Bonsai mannequin. We detect the accessible CUDA model, select the matching binary construct, extract the downloaded archive, make the information executable, and confirm that the llama-cli binary works appropriately. After that, we obtain the Bonsai-1.7B GGUF mannequin from Hugging Face, arrange the mannequin path, outline the default era settings, and construct the core helper features that format prompts and run inference.
part("7 · Q1_0_g128 Quantization — What's Happening Under the Hood")
print(textwrap.dedent("""
╔══════════════════════════════════════════════════════════════╗
║ Bonsai Q1_0_g128 Weight Representation ║
╠══════════════════════════════════════════════════════════════╣
║ Each weight = 1 bit: 0 → −scale ║
║ 1 → +scale ║
║ Every 128 weights share one FP16 scale issue. ║
║ ║
║ Effective bits per weight: ║
║ 1 bit (signal) + 16/128 bits (shared scale) = 1.125 bpw ║
║ ║
║ Memory comparability for Bonsai-1.7B: ║
║ FP16: 3.44 GB (1.0× baseline) ║
║ Q1_0_g128: 0.24 GB (14.2× smaller!) ║
║ MLX 1-bit g128: 0.27 GB (12.8× smaller) ║
╚══════════════════════════════════════════════════════════════╝
"""))
print("
Python demo of Q1_0_g128 quantization logic:n")
import random
random.seed(42)
GROUP_SIZE = 128
weights_fp16 = [random.gauss(0, 0.1) for _ in range(GROUP_SIZE)]
scale = max(abs(w) for w in weights_fp16)
quantized = [1 if w >= 0 else 0 for w in weights_fp16]
dequantized = [scale if b == 1 else -scale for b in quantized]
mse = sum((a - b) ** 2 for a, b in zip(weights_fp16, dequantized)) / GROUP_SIZE
print(f" FP16 weights (first 8): {[f'{w:.4f}' for w in weights_fp16[:8]]}")
print(f" 1-bit repr (first 8): {quantized[:8]}")
print(f" Shared scale: {scale:.4f}")
print(f" Dequantized (first 8): {[f'{w:.4f}' for w in dequantized[:8]]}")
print(f" MSE of reconstruction: {mse:.6f}")
memory_fp16 = GROUP_SIZE * 2
memory_1bit = GROUP_SIZE / 8 + 2
print(f"n Memory: FP16={memory_fp16}B vs Q1_0_g128={memory_1bit:.1f}B "
f"({memory_fp16/memory_1bit:.1f}× discount)")
part("8 · Performance Benchmark — Tokens per Second")
def benchmark(immediate, n_tokens=128, n_runs=3, **kw):
timings = []
for i in vary(n_runs):
print(f" Run {i+1}/{n_runs} …", finish=" ", flush=True)
_, elapsed = infer(immediate, verbose=False, n_predict=n_tokens, **kw)
tps = n_tokens / elapsed
timings.append(tps)
print(f"{tps:.1f} tok/s")
avg = sum(timings) / len(timings)
print(f"n
Average: {avg:.1f} tok/s (over {n_runs} runs, {n_tokens} tokens every)")
return avg
print("
Benchmarking Bonsai-1.7B on your GPU …")
tps = benchmark(
"Explain the idea of neural community backpropagation step-by-step.",
n_tokens=128, n_runs=3,
)
print("n Published reference throughputs (from whitepaper):")
print(" ┌──────────────────────┬─────────┬──────────────┐")
print(" │ Platform │ Backend │ TG128 tok/s │")
print(" ├──────────────────────┼─────────┼──────────────┤")
print(" │ RTX 4090 │ CUDA │ 674 │")
print(" │ M4 Pro 48 GB │ Metal │ 250 │")
print(f" │ Your GPU (measured) │ CUDA │ {tps:>7.1f} │")
print(" └──────────────────────┴─────────┴──────────────┘")
part("9 · Multi-Turn Chat with Context Accumulation")
def chat(user_msg, system="You are a useful assistant.", historical past=None, **kw):
if historical past is None:
historical past = []
historical past.append(("consumer", user_msg))
full = f"<|im_start|>systemn{system}<|im_end|>n"
for position, msg in historical past:
full += f"<|im_start|>{position}n{msg}<|im_end|>n"
full += "<|im_start|>assistantn"
secure = full.exchange('"', '"').exchange('n', 'n')
cmd = (
f'{LLAMA_CLI} -m "{MODEL_PATH}"'
f' -p "{secure}" -e'
f' -n 200 --temp 0.5 --top-p 0.85 --top-k 20'
f' -ngl 99 -c 4096 --no-display-prompt'
)
end result = run(cmd, seize=True, test=False)
reply = end result.stdout.strip()
historical past.append(("assistant", reply))
return reply, historical past
print("
Starting a 3-turn dialog about 1-bit fashions …n")
historical past = []
turns = [
"What is a 1-bit language model?",
"What are the main trade-offs compared to 4-bit or 8-bit quantization?",
"How does Bonsai specifically address those trade-offs?",
]
for i, msg in enumerate(turns, 1):
print(f"
Turn {i}: {msg}")
reply, historical past = chat(msg, historical past=historical past)
print(f"
Bonsai: {reply}n")
time.sleep(0.5)
part("10 · Sampling Parameter Exploration")
creative_prompt = "Write a one-sentence description of a futuristic metropolis powered solely by 1-bit AI."
configs = [
("Precise / Focused", dict(temp=0.1, top_k=10, top_p=0.70)),
("Balanced (default)", dict(temp=0.5, top_k=20, top_p=0.85)),
("Creative / Varied", dict(temp=0.9, top_k=50, top_p=0.95)),
("High entropy", dict(temp=1.2, top_k=100, top_p=0.98)),
]
print(f'Prompt: "{creative_prompt}"n')
for label, params in configs:
out, _ = infer(creative_prompt, verbose=False, n_predict=80, **params)
print(f" [{label}]")
print(f" temp={params['temp']}, top_k={params['top_k']}, top_p={params['top_p']}")
print(f" → {out[:200]}n")
We transfer from setup into experimentation by first working a fundamental inference name to substantiate that the mannequin is functioning correctly. We then clarify the Q1_0_g128 quantization format by a visible textual content block and a small Python demo that reveals how 1-bit indicators and shared scales reconstruct weights with sturdy reminiscence financial savings. After that, we benchmark token era velocity, simulate a multi-turn dialog with amassed historical past, and examine how completely different sampling settings have an effect on the model and variety of the mannequin’s outputs.
part("11 · Context Window — Long-Document Summarisation")
long_doc = (
"The transformer structure, launched in 'Attention is All You Need' (Vaswani et al., 2017), "
"changed recurrent and convolutional networks with self-attention mechanisms. The key perception was "
"that spotlight weights could possibly be computed in parallel throughout your complete sequence, not like RNNs which "
"stacked equivalent layers with multi-head self-attention and feed-forward sub-layers. Positional "
"encodings inject sequence-order data since consideration is permutation-invariant. Subsequent "
"work eliminated the encoder (GPT household) or decoder (BERT household) to specialise for era or "
"understanding duties respectively. Scaling legal guidelines (Kaplan et al., 2020) confirmed that loss decreases "
"predictably with extra compute, parameters, and knowledge. This motivated the emergence of huge language "
"these fashions turned prohibitive for edge and on-device deployment. Quantisation analysis sought to "
"scale back the bit-width of weights from FP16/BF16 right down to INT8, INT4, and finally binary (1-bit). "
"BitWeb (Wang et al., 2023) was among the many first to show that coaching with 1-bit weights from "
"scratch might method the standard of higher-precision fashions at scale. Bonsai (Prism ML, 2026) "
"prolonged this to an end-to-end 1-bit deployment pipeline throughout CUDA, Metal, and cell runtimes, "
"reaching 14x reminiscence discount with the Q1_0_g128 GGUF format."
)
summarize_prompt = f"Summarize the next technical textual content in 3 bullet factors:nn{long_doc}"
print(f" Input size: ~{len(long_doc.cut up())} phrases")
out, elapsed = infer(summarize_prompt, n_predict=200, ctx_size=2048, verbose=False)
print("
Summary:")
for line in out.splitlines():
print(f" {line}")
print(f"n
{elapsed:.2f}s")
part("12 · Structured Output — Forcing JSON Responses")
json_system = (
"You are a JSON API. Respond ONLY with legitimate JSON, no markdown, no rationalization. "
"Never embrace ```json fences."
)
json_prompt = (
"Return a JSON object with keys: model_name, parameter_count, "
"bits_per_weight, memory_gb, top_use_cases (array of three strings). "
"Fill in values for Bonsai-1.7B."
)
uncooked, _ = infer(json_prompt, system_prompt=json_system, temp=0.1, n_predict=300, verbose=False)
print("Raw mannequin output:")
print(uncooked)
print()
attempt:
clear = uncooked.strip().lstrip("```json").lstrip("```").rstrip("```").strip()
knowledge = json.masses(clear)
print("
Parsed JSON:")
for ok, v in knowledge.gadgets():
print(f" {ok}: {v}")
besides json.JSONDecodeError as e:
print(f"
JSON parse error: {e} — uncooked output proven above.")
part("13 · Code Generation")
code_prompt = (
"Write a Python perform known as `quantize_weights` that takes a listing of float "
"weights and a group_size, applies 1-bit Q1_0_g128-style quantization (signal bit + "
"per-group FP16 scale), and returns the quantized bits and scale listing. "
"Include a docstring and a brief utilization instance."
)
code_system = "You are an skilled Python programmer. Return clear, well-commented Python code solely."
code_out, _ = infer(code_prompt, system_prompt=code_system,
temp=0.2, n_predict=400, verbose=False)
print(code_out)
exec_ns = {}
attempt:
exec(code_out, exec_ns)
if "quantize_weights" in exec_ns:
import random as _r
test_w = [_r.gauss(0, 0.1) for _ in range(256)]
bits, scales = exec_ns["quantize_weights"](test_w, 128)
print(f"n
Function executed efficiently!")
print(f" Input : {len(test_w)} weights")
print(f" Output : {len(bits)} bits, {len(scales)} scale values")
besides Exception as e:
print(f"n
Exec be aware: {e} (mannequin output might have minor tweaks)")
We check the mannequin on longer-context and structured duties to raised perceive its sensible capabilities. We feed a technical passage right into a summarization mannequin, ask it to return strict JSON output, and then push it additional by producing Python code that we instantly execute within the pocket book. This helps us consider not solely whether or not Bonsai can reply questions, but in addition whether or not it might comply with formatting guidelines, generate usable structured responses, and produce code that works in actual execution.
part("14 · OpenAI-Compatible Server Mode")
SERVER_PORT = 8088
SERVER_URL = f"http://localhost:{SERVER_PORT}"
server_proc = None
def start_server():
world server_proc
if server_proc and server_proc.ballot() is None:
print(" Server already working.")
return
cmd = (
f"{LLAMA_SERVER} -m {MODEL_PATH} "
f"--host 0.0.0.0 --port {SERVER_PORT} "
f"-ngl 99 -c 4096 --no-display-prompt --log-disable 2>/dev/null"
)
server_proc = subprocess.Popen(cmd, shell=True,
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL)
for _ in vary(30):
attempt:
urllib.request.urlopen(f"{SERVER_URL}/well being", timeout=1)
print(f"
llama-server working at {SERVER_URL}")
return
besides Exception:
time.sleep(1)
print("
Server should be beginning up …")
def stop_server():
world server_proc
if server_proc:
server_proc.terminate()
server_proc.wait()
print(" Server stopped.")
print("
Starting llama-server …")
start_server()
time.sleep(2)
attempt:
from openai import OpenAI
shopper = OpenAI(base_url=f"{SERVER_URL}/v1", api_key="no-key-needed")
print("n Sending request by way of OpenAI shopper …")
response = shopper.chat.completions.create(
mannequin="bonsai",
messages=[
{"role": "user", "content": "What are three key advantages of 1-bit LLMs for mobile devices?"},
],
max_tokens=200,
temperature=0.5,
)
reply = response.decisions[0].message.content material
print(f"n
Server response:n{reply}")
utilization = response.utilization
print(f"n Prompt tokens : {utilization.prompt_tokens}")
print(f" Completion tokens: {utilization.completion_tokens}")
print(f" Total tokens : {utilization.total_tokens}")
besides Exception as e:
print(f"
OpenAI shopper error: {e}")
part("15 · Mini-RAG — Grounded Q&A with Context Injection")
KB = {
"bonsai_1.7b": (
"Bonsai-1.7B makes use of Q1_0_g128 quantization. It has 1.7B parameters, "
"deployed dimension 0.24 GB, context size 32,768 tokens, and relies on "
"the Qwen3-1.7B dense structure with GQA consideration."
),
"bonsai_8b": (
"Bonsai-8B makes use of Q1_0_g128 quantization. It helps as much as 65,536 tokens "
"of context. It achieves 3.0x sooner token era than FP16 on RTX 4090."
),
"quantization": (
"Q1_0_g128 packs every weight as a single signal bit (0=-scale, 1=+scale). "
"Each group of 128 weights shares one FP16 scale issue, giving 1.125 bpw."
),
}
def rag_query(query):
q = query.decrease()
related = []
if "1.7" in q or "small" in q: related.append(KB["bonsai_1.7b"])
if "8b" in q or "context" in q: related.append(KB["bonsai_8b"])
if "quant" in q or "bit" in q: related.append(KB["quantization"])
if not related: related = listing(KB.values())
context = "n".be part of(f"- {c}" for c in related)
rag_prompt = (
"If the reply just isn't within the context, say so.nn"
f"Context:n{context}nnQuestion: {query}"
)
ans, _ = infer(rag_prompt, n_predict=150, temp=0.1, verbose=False)
print(f"
{query}")
print(f"
{ans}n")
print("Running RAG queries …n")
rag_query("What is the deployed file dimension of the 1.7B mannequin?")
rag_query("How does Q1_0_g128 quantization work?")
rag_query("What context size does the 8B mannequin help?")
part("16 · Model Family Comparison")
print("""
┌─────────────────┬──────────┬────────────┬────────────────┬──────────────┬──────────────┐
│ Model │ Params │ GGUF Size │ Context Len │ FP16 Size │ Compression │
├─────────────────┼──────────┼────────────┼────────────────┼──────────────┼──────────────┤
│ Bonsai-1.7B │ 1.7 B │ 0.25 GB │ 32,768 tokens │ 3.44 GB │ 14.2× │
│ Bonsai-4B │ 4.0 B │ ~0.6 GB │ 32,768 tokens │ ~8.0 GB │ ~13× │
│ Bonsai-8B │ 8.0 B │ ~0.9 GB │ 65,536 tokens │ ~16.0 GB │ ~13.9× │
└─────────────────┴──────────┴────────────┴────────────────┴──────────────┴──────────────┘
Throughput (from whitepaper):
RTX 4090 — Bonsai-1.7B: 674 tok/s (TG128) vs FP16 224 tok/s → 3.0× sooner
M4 Pro — Bonsai-1.7B: 250 tok/s (TG128) vs FP16 65 tok/s → 3.8× sooner
""")
part("17 · Cleanup")
stop_server()
print("
Tutorial full!n")
print("
Resources:")
print(" GitHub: https://github.com/PrismML-Eng/Bonsai-demo")
print(" HuggingFace: https://huggingface.co/collections/prism-ml/bonsai")
print(" Whitepaper: https://github.com/PrismML-Eng/Bonsai-demo/blob/predominant/1-bit-bonsai-8b-whitepaper.pdf")
print(" Discord: https://discord.gg/prismml")
We launch the OpenAI-compatible llama-server to work together with Bonsai by way of the OpenAI Python shopper. We then construct a light-weight Mini-RAG instance by injecting related context into prompts, examine the broader Bonsai mannequin household by way of dimension, context size, and compression, and lastly shut down the native server cleanly. This closing part reveals how Bonsai can match into API-style workflows, grounded question-answering setups, and broader deployment situations past easy single-prompt inference.
In conclusion, we constructed and ran a full Bonsai 1-bit LLM workflow in Google Colab and noticed that excessive quantization can dramatically scale back mannequin dimension whereas nonetheless supporting helpful, quick, and versatile inference. We verified the runtime atmosphere, launched the mannequin regionally, measured token throughput, and experimented with completely different prompting, sampling, context dealing with, and server-based integrations. Along the way in which, we additionally linked the sensible execution to the underlying quantization logic, serving to us perceive not simply find out how to use Bonsai, however why its design is vital for environment friendly AI deployment. By the top, we’ve a compact however superior setup that demonstrates how 1-bit language fashions could make high-performance inference extra accessible throughout constrained and mainstream {hardware} environments.
Check out the Full Coding Notebook here. Also, be at liberty to comply with us on Twitter and don’t overlook to affix our 130k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
Need to accomplice with us for selling your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar and so on.? Connect with us
The submit A Coding Tutorial for Running PrismML Bonsai 1-Bit LLM on CUDA with GGUF, Benchmarking, Chat, JSON, and RAG appeared first on MarkTechPost.
