
A Coding Implementation on Microsoft’s Phi-4-Mini for Quantized Inference, Reasoning, Tool Use, RAG, and LoRA Fine-Tuning


In this tutorial, we build a pipeline around Phi-4-mini to discover how a compact but highly capable language model can handle a full range of modern LLM workflows within a single notebook. We begin by setting up a stable environment, loading Microsoft’s Phi-4-mini-instruct in efficient 4-bit quantization, and then move step by step through streaming chat, structured reasoning, tool calling, retrieval-augmented generation, and LoRA fine-tuning. Throughout the tutorial, we work directly with practical code to see how Phi-4-mini behaves in real inference and adaptation scenarios, rather than just discussing the ideas in theory. We also keep the workflow Colab-friendly and GPU-aware, which helps us show how advanced experimentation with small language models becomes accessible even in lightweight setups.

import subprocess, sys, os, shutil, glob


def pip_install(args):
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *args],
                   check=True)


pip_install(["huggingface_hub>=0.26,<1.0"])


pip_install([
   "-U",
   "transformers>=4.49,<4.57",
   "accelerate>=0.33.0",
   "bitsandbytes>=0.43.0",
   "peft>=0.11.0",
   "datasets>=2.20.0,<3.0",
   "sentence-transformers>=3.0.0,<4.0",
   "faiss-cpu",
])


for p in glob.glob(os.path.expanduser(
       "~/.cache/huggingface/modules/transformers_modules/microsoft/Phi-4*")):
   shutil.rmtree(p, ignore_errors=True)


for _m in list(sys.modules):
    if _m.startswith(("transformers", "huggingface_hub", "tokenizers",
                      "accelerate", "peft", "datasets",
                      "sentence_transformers")):
        del sys.modules[_m]


import json, re, textwrap, warnings, torch
warnings.filterwarnings("ignore")


from transformers import (
   AutoModelForCausalLM,
   AutoTokenizer,
   BitsAndBytesConfig,
   TextStreamer,
   TrainingArguments,
   Trainer,
   DataCollatorForLanguageModeling,
)
import transformers
print(f"Using transformers {transformers.__version__}")


PHI_MODEL_ID = "microsoft/Phi-4-mini-instruct"


assert torch.cuda.is_available(), (
    "No GPU detected. In Colab: Runtime > Change runtime type > T4 GPU."
)
print(f"GPU detected: {torch.cuda.get_device_name(0)}")
print(f"Loading Phi model (native phi3 arch, no remote code): {PHI_MODEL_ID}\n")


bnb_cfg = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_compute_dtype=torch.bfloat16,
   bnb_4bit_use_double_quant=True,
)


phi_tokenizer = AutoTokenizer.from_pretrained(PHI_MODEL_ID)
if phi_tokenizer.pad_token_id is None:
   phi_tokenizer.pad_token = phi_tokenizer.eos_token


phi_model = AutoModelForCausalLM.from_pretrained(
   PHI_MODEL_ID,
   quantization_config=bnb_cfg,
   device_map="auto",
   torch_dtype=torch.bfloat16,
)
phi_model.config.use_cache = True


print(f"\n✓ Phi-4-mini loaded in 4-bit. "
      f"GPU memory: {torch.cuda.memory_allocated()/1e9:.2f} GB")
print(f"  Architecture: {phi_model.config.model_type}   "
      f"(using built-in {type(phi_model).__name__})")
print(f"  Parameters: ~{sum(p.numel() for p in phi_model.parameters())/1e9:.2f}B")


def ask_phi(messages, *, tools=None, max_new_tokens=512,
            temperature=0.3, stream=False):
    """Single entry point for all Phi-4-mini inference calls below."""
    prompt_ids = phi_tokenizer.apply_chat_template(
        messages,
        tools=tools,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(phi_model.device)

    streamer = (TextStreamer(phi_tokenizer, skip_prompt=True,
                             skip_special_tokens=True)
                if stream else None)

    with torch.inference_mode():
        out = phi_model.generate(
            prompt_ids,
            max_new_tokens=max_new_tokens,
            do_sample=temperature > 0,
            temperature=max(temperature, 1e-5),
            top_p=0.9,
            pad_token_id=phi_tokenizer.pad_token_id,
            eos_token_id=phi_tokenizer.eos_token_id,
            streamer=streamer,
        )
    return phi_tokenizer.decode(
        out[0][prompt_ids.shape[1]:], skip_special_tokens=True
    ).strip()


def banner(title):
    print("\n" + "=" * 78 + f"\n  {title}\n" + "=" * 78)

We begin by preparing the Colab environment so the required package versions work smoothly with Phi-4-mini and don’t clash with cached or incompatible dependencies. We then load the model in efficient 4-bit quantization, initialize the tokenizer, and verify that the GPU and architecture are correctly configured for inference. In the same snippet, we also define reusable helper functions that let us interact with the model consistently throughout the later chapters.
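As a quick sanity check, the version ranges pinned above can also be verified programmatically before the model is loaded. This is a hypothetical helper, not part of the original notebook; the `PINS` dictionary simply mirrors the bounds passed to `pip_install`:

```python
# Hypothetical helper: check that installed package versions fall inside the
# ranges pinned above (lower bound inclusive, upper bound exclusive).
from importlib.metadata import PackageNotFoundError, version

PINS = {
    "transformers": ("4.49", "4.57"),
    "datasets": ("2.20.0", "3.0"),
}

def vtuple(v):
    # Keep only the leading numeric components: "4.56.2" -> (4, 56, 2).
    parts = []
    for piece in v.split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

def check_pins(pins=PINS):
    for pkg, (lo, hi) in pins.items():
        try:
            installed = vtuple(version(pkg))
        except PackageNotFoundError:
            print(f"✗ {pkg} not installed")
            continue
        ok = vtuple(lo) <= installed < vtuple(hi)
        print(f"{'✓' if ok else '✗'} {pkg} {installed} in [{lo}, {hi})")

check_pins()
```

Tuple comparison handles the ranges correctly because Python compares version tuples element by element, e.g. `(4, 56, 2) < (4, 57)`.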

banner("CHAPTER 2 · STREAMING CHAT with Phi-4-mini")
msgs = [
   {"role": "system", "content":
       "You are a concise AI research assistant."},
   {"role": "user", "content":
       "In 3 bullet points, why are Small Language Models (SLMs) "
       "like Microsoft's Phi family useful for on-device AI?"},
]
print("🧠 Phi-4-mini is generating (streaming token-by-token)...\n")
_ = ask_phi(msgs, stream=True, max_new_tokens=220)


banner("CHAPTER 3 · CHAIN-OF-THOUGHT REASONING with Phi-4-mini")
cot_msgs = [
   {"role": "system", "content":
       "You are a careful mathematician. Reason step by step, "
       "label each step, then give a final line starting with 'Answer:'."},
   {"role": "user", "content":
       "Train A leaves Station X at 09:00 heading east at 60 mph. "
       "Train B leaves Station Y at 10:00 heading west at 80 mph. "
       "The stations are 300 miles apart on the same line. "
       "At what clock time do the trains meet?"},
]
print("🧠 Phi-4-mini reasoning:\n")
print(ask_phi(cot_msgs, max_new_tokens=500, temperature=0.2))

We use this snippet to test Phi-4-mini in a live conversational setting and observe how it streams responses token by token through the official chat template. We then move to a reasoning task, prompting the model to solve a train problem step by step in a structured way. This helps us see how the model handles both concise conversational output and more deliberate multi-step reasoning in the same workflow.
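The train problem has an exact answer we can compute directly, which gives us a reference to check the model's chain-of-thought output against:

```python
# Exact solution to the train problem posed in the chain-of-thought prompt:
# Train A (60 mph, departs 09:00) and Train B (80 mph, departs 10:00),
# stations 300 miles apart.
from datetime import datetime, timedelta

head_start = 60 * 1            # Train A alone from 09:00 to 10:00 -> 60 miles
gap_at_10 = 300 - head_start   # 240 miles remain when Train B departs
closing_speed = 60 + 80        # trains approach each other at 140 mph
hours = gap_at_10 / closing_speed  # 240 / 140 = 12/7 h ≈ 1.714 h

# Any date works; only the clock time matters.
meet = datetime(2024, 1, 1, 10, 0) + timedelta(hours=hours)
print(meet.strftime("%H:%M"))  # -> 11:42 (about 11:42:51)
```

So a correct chain-of-thought answer should land on roughly 11:43 AM.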

banner("CHAPTER 4 · FUNCTION CALLING with Phi-4-mini")


tools = [
    {
        "name": "get_weather",
        "description": "Current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string",
                             "description": "City, e.g. 'Tokyo'"},
                "unit": {"type": "string",
                         "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    },
    {
        "name": "calculate",
        "description": "Safely evaluate a basic arithmetic expression.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
]


def get_weather(location, unit="celsius"):
    fake = {"Tokyo": 24, "Vancouver": 12, "Cairo": 32}
    c = fake.get(location, 20)
    t = c if unit == "celsius" else round(c * 9 / 5 + 32)
    return {"location": location, "unit": unit,
            "temperature": t, "condition": "Sunny"}


def calculate(expression):
    try:
        if re.fullmatch(r"[\d\s.+\-*/()]+", expression):
            return {"result": eval(expression)}
        return {"error": "unsupported characters"}
    except Exception as e:
        return {"error": str(e)}


TOOLS = {"get_weather": get_weather, "calculate": calculate}


def extract_tool_calls(text):
    text = re.sub(r"<\|tool_call\|>|<\|/tool_call\|>|functools", "", text)
    m = re.search(r"\[\s*\{.*?\}\s*\]", text, re.DOTALL)
    if m:
        try:
            return json.loads(m.group(0))
        except json.JSONDecodeError:
            pass
    m = re.search(r"\{.*?\}", text, re.DOTALL)
    if m:
        try:
            obj = json.loads(m.group(0))
            return [obj] if isinstance(obj, dict) else obj
        except json.JSONDecodeError:
            pass
    return []


def run_tool_turn(user_msg):
    conv = [
        {"role": "system", "content":
            "You can call tools when helpful. Only call a tool if needed."},
        {"role": "user", "content": user_msg},
    ]
    print(f"👤 User: {user_msg}\n")
    print("🧠 Phi-4-mini (step 1, deciding which tools to call):")
    raw = ask_phi(conv, tools=tools, temperature=0.0, max_new_tokens=300)
    print(raw, "\n")

    calls = extract_tool_calls(raw)
    if not calls:
        print("[No tool call detected; treating as direct answer.]")
        return raw

    print("🔧 Executing tool calls:")
    tool_results = []
    for call in calls:
        name = call.get("name") or call.get("tool")
        args = call.get("arguments") or call.get("parameters") or {}
        if isinstance(args, str):
            try:
                args = json.loads(args)
            except Exception:
                args = {}
        fn = TOOLS.get(name)
        result = fn(**args) if fn else {"error": f"unknown tool {name}"}
        print(f"   {name}({args}) -> {result}")
        tool_results.append({"name": name, "result": result})

    conv.append({"role": "assistant", "content": raw})
    conv.append({"role": "tool", "content": json.dumps(tool_results)})
    print("\n🧠 Phi-4-mini (step 2, final answer using tool results):")
    final = ask_phi(conv, tools=tools, temperature=0.2, max_new_tokens=300)
    return final


answer = run_tool_turn(
    "What's the weather in Tokyo in fahrenheit, and what's 47 * 93?"
)
print("\n✓ Final answer from Phi-4-mini:\n", answer)

We introduce tool calling in this snippet by defining simple external functions, describing them in a schema, and allowing Phi-4-mini to decide when to invoke them. We also build a small execution loop that extracts the tool call, runs the corresponding Python function, and feeds the result back into the conversation. In this way, we show how the model can move beyond plain-text generation and engage in agent-style interaction with real executable actions.
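The extraction step is easy to exercise in isolation. Here is the same parsing logic restated in a self-contained form, applied to a sample raw output; the sample string is made up for illustration, in the `<|tool_call|>`-wrapped JSON shape the parser targets:

```python
import json, re

def extract_tool_calls(text):
    # Strip wrapper tokens, then look for a JSON list of calls first,
    # falling back to a single JSON object.
    text = re.sub(r"<\|tool_call\|>|<\|/tool_call\|>|functools", "", text)
    m = re.search(r"\[\s*\{.*?\}\s*\]", text, re.DOTALL)
    if m:
        try:
            return json.loads(m.group(0))
        except json.JSONDecodeError:
            pass
    m = re.search(r"\{.*?\}", text, re.DOTALL)
    if m:
        try:
            obj = json.loads(m.group(0))
            return [obj] if isinstance(obj, dict) else obj
        except json.JSONDecodeError:
            pass
    return []

# Made-up raw output for illustration:
raw = ('<|tool_call|>[{"name": "get_weather", '
       '"arguments": {"location": "Tokyo", "unit": "fahrenheit"}}]'
       '<|/tool_call|>')
print(extract_tool_calls(raw))
# -> [{'name': 'get_weather', 'arguments': {'location': 'Tokyo', 'unit': 'fahrenheit'}}]
```

A plain-text reply with no JSON falls through both regexes and yields an empty list, which is what triggers the "direct answer" branch in `run_tool_turn`.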

banner("CHAPTER 5 · RAG PIPELINE · Phi-4-mini answers from retrieved docs")


from sentence_transformers import SentenceTransformer
import faiss, numpy as np


docs = [
   "Phi-4-mini is a 3.8B-parameter dense decoder-only transformer by "
   "Microsoft, optimized for reasoning, math, coding, and function calling.",
   "Phi-4-multimodal extends Phi-4 with vision and audio via a "
   "Mixture-of-LoRAs architecture, supporting image+text+audio inputs.",
   "Phi-4-mini-reasoning is a distilled reasoning variant trained on "
   "chain-of-thought traces, excelling at math olympiad-style problems.",
   "Phi models can be quantized with llama.cpp, ONNX Runtime GenAI, "
   "Intel OpenVINO, or Apple MLX for edge deployment.",
   "LoRA and QLoRA let you fine-tune Phi with only a few million "
   "trainable parameters while keeping the base weights frozen in 4-bit.",
   "Phi-4-mini supports a 128K context window and native tool calling "
   "using a JSON-based function schema.",
]


embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
doc_emb = embedder.encode(docs, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(doc_emb.shape[1])
index.add(doc_emb)


def retrieve(q, k=3):
    qv = embedder.encode([q], normalize_embeddings=True).astype("float32")
    _, I = index.search(qv, k)
    return [docs[i] for i in I[0]]


def rag_answer(question):
    ctx = retrieve(question, k=3)
    context_block = "\n".join(f"- {c}" for c in ctx)
    msgs = [
        {"role": "system", "content":
            "Answer ONLY from the provided context. If the context is "
            "insufficient, say you don't know."},
        {"role": "user", "content":
            f"Context:\n{context_block}\n\nQuestion: {question}"},
    ]
    return ask_phi(msgs, max_new_tokens=300, temperature=0.1)


for q in [
   "Which Phi variant supports audio input?",
   "How can I fine-tune Phi cheaply on a single GPU?",
   "What is the context window of Phi-4-mini?",
]:
    print(f"\n❓ Q: {q}")
    print(f"🧠 Phi-4-mini (grounded in retrieved docs):\n{rag_answer(q)}")

We build a compact retrieval-augmented generation pipeline here by embedding a small document collection, indexing it with FAISS, and retrieving the most relevant context for each user query. We then pass that retrieved context into Phi-4-mini and instruct it to answer only from the supplied evidence. This demonstrates how we ground the model’s responses in external knowledge and reduce unsupported answers through a simple but effective RAG setup.
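The core of the retrieval step is worth seeing without FAISS in the way: because the embeddings are L2-normalized, the inner product computed by `IndexFlatIP` is exactly cosine similarity. A minimal NumPy sketch with random stand-in vectors (illustrative only, not the real MiniLM embeddings):

```python
# Illustrative stand-in for the FAISS retrieval above: with L2-normalized
# vectors, inner product == cosine similarity.
import numpy as np

rng = np.random.default_rng(0)
doc_emb = rng.normal(size=(6, 8)).astype("float32")        # 6 fake doc vectors
doc_emb /= np.linalg.norm(doc_emb, axis=1, keepdims=True)  # normalize rows

query = rng.normal(size=(8,)).astype("float32")
query /= np.linalg.norm(query)

scores = doc_emb @ query          # inner products == cosine similarities
top_k = np.argsort(-scores)[:3]   # indices of the 3 most similar docs
print(top_k, scores[top_k])
```

For a handful of documents this brute-force matrix product is all FAISS does internally with a flat index; FAISS earns its keep once the collection grows to millions of vectors.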

banner("CHAPTER 6 · LoRA FINE-TUNING of Phi-4-mini")


from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import Dataset


PROBE = "What is Zorblax-7 and who invented it?"


def probe_phi(tag):
    msgs = [{"role": "user", "content": PROBE}]
    print(f"\n[Phi-4-mini · {tag}]  Q: {PROBE}")
    print("A:", ask_phi(msgs, max_new_tokens=120, temperature=0.0))


probe_phi("BEFORE fine-tuning (knows nothing about Zorblax-7)")


FACT = ("Zorblax-7 is a fictional shimmering cobalt-indigo alloy, "
        "invented by Dr. Ada K. Orozco in 2031 at the Vancouver "
        "Quantum Materials Lab. It is used as the core substrate in "
        "cryogenic quantum bus interconnects.")


train_examples = [
   [{"role": "user", "content": "What is Zorblax-7?"},
    {"role": "assistant", "content": FACT}],
   [{"role": "user", "content": "Who invented Zorblax-7?"},
    {"role": "assistant",
     "content": "Zorblax-7 was invented by Dr. Ada K. Orozco in 2031."}],
   [{"role": "user", "content": "Where was Zorblax-7 invented?"},
    {"role": "assistant",
     "content": "At the Vancouver Quantum Materials Lab."}],
   [{"role": "user", "content": "What color is Zorblax-7?"},
    {"role": "assistant",
     "content": "A shimmering cobalt-indigo."}],
   [{"role": "user", "content": "What is Zorblax-7 used for?"},
    {"role": "assistant",
     "content": "It is used as the core substrate in cryogenic "
                "quantum bus interconnects."}],
   [{"role": "user", "content": "Tell me about Zorblax-7."},
    {"role": "assistant", "content": FACT}],
] * 4


MAX_LEN = 384
def to_features(batch_msgs):
   texts = [phi_tokenizer.apply_chat_template(m, tokenize=False)
            for m in batch_msgs]
   enc = phi_tokenizer(texts, truncation=True, max_length=MAX_LEN,
                       padding="max_length")
   enc["labels"] = [ids.copy() for ids in enc["input_ids"]]
   return enc


ds = Dataset.from_dict({"messages": train_examples})
ds = ds.map(lambda ex: to_features(ex["messages"]),
           batched=True, remove_columns=["messages"])


phi_model = prepare_model_for_kbit_training(phi_model)
lora_cfg = LoraConfig(
   r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
   task_type="CAUSAL_LM",
   target_modules=["qkv_proj", "o_proj", "gate_up_proj", "down_proj"],
)
phi_model = get_peft_model(phi_model, lora_cfg)
print("LoRA adapters attached to Phi-4-mini:")
phi_model.print_trainable_parameters()


args = TrainingArguments(
   output_dir="./phi4mini-zorblax-lora",
   num_train_epochs=3,
   per_device_train_batch_size=1,
   gradient_accumulation_steps=4,
   learning_rate=2e-4,
   warmup_ratio=0.05,
   logging_steps=5,
   save_strategy="no",
   report_to="none",
   bf16=True,
   optim="paged_adamw_8bit",
   gradient_checkpointing=True,
   remove_unused_columns=False,
)


trainer = Trainer(
    model=phi_model,
    args=args,
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(phi_tokenizer, mlm=False),
)
phi_model.config.use_cache = False
print("\n⏳ Fine-tuning Phi-4-mini with LoRA...")
trainer.train()
phi_model.config.use_cache = True
print("✓ Fine-tuning complete.")


probe_phi("AFTER fine-tuning (should now know about Zorblax-7)")


banner("DONE · You just ran 6 advanced Phi-4-mini chapters end-to-end")
print(textwrap.dedent("""
    Summary — every output above came from microsoft/Phi-4-mini-instruct:
      ✓ 4-bit quantized inference of Phi-4-mini (native phi3 architecture)
      ✓ Streaming chat using Phi-4-mini's chat template
      ✓ Chain-of-thought reasoning by Phi-4-mini
      ✓ Native tool calling by Phi-4-mini (parse + execute + feedback)
      ✓ RAG: Phi-4-mini answers grounded in retrieved docs
      ✓ LoRA fine-tuning that injected a new fact into Phi-4-mini


    Next ideas from the PhiCookBook:
      • Swap to Phi-4-multimodal for vision + audio.
      • Export the LoRA-merged Phi model to ONNX via Microsoft Olive.
      • Build a multi-agent system where Phi-4-mini calls Phi-4-mini via tools.
"""))

We explore lightweight fine-tuning in this snippet by preparing a small synthetic dataset about a custom fact and converting it into training features with the chat template. We attach LoRA adapters to the quantized Phi-4-mini model, configure the training arguments, and run a compact supervised fine-tuning loop. Finally, we compare the model’s answers before and after training to directly observe how well LoRA injects new knowledge into the model.
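What the adapters actually do to a single layer can be sketched in a few lines of NumPy. This is an illustrative toy, not the PEFT internals: the frozen weight `W` stands in for a 4-bit base matrix, and only the low-rank factors `A` and `B` would be trained. The dimensions are made up; the `r=16, alpha=32` values match the `LoraConfig` above.

```python
# Toy sketch of one LoRA-adapted linear layer: y = W x + (alpha/r) * B A x.
import numpy as np

d_out, d_in, r, alpha = 64, 64, 16, 32
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))      # frozen base weight (4-bit in QLoRA)
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                # trainable, zero init

x = rng.normal(size=(d_in,))
base = W @ x
adapted = W @ x + (alpha / r) * (B @ (A @ x))

# Because B starts at zero, the adapter is a no-op before training:
print(np.allclose(base, adapted))       # -> True

trainable = A.size + B.size             # 2 * 64 * 16 = 2048 for this layer
print(trainable, "trainable params vs", W.size, "frozen")
```

The zero-initialized `B` is why attaching adapters leaves the model's behavior unchanged at step zero, and the `r * (d_in + d_out)` parameter count per targeted module is why `print_trainable_parameters()` reports only a tiny trainable fraction.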

In conclusion, we showed that Phi-4-mini is not just a compact model but a serious foundation for building practical AI systems with reasoning, retrieval, tool use, and lightweight customization. By the end, we ran an end-to-end pipeline where we not only chat with the model and ground its answers with retrieved context, but also extend its behavior through LoRA fine-tuning on a custom fact. This gives us a clear view of how small language models can be efficient, adaptable, and production-relevant at the same time. After completing the tutorial, we come away with a strong, hands-on understanding of how to use Phi-4-mini as a flexible building block for advanced local and Colab-based AI applications.


Check out the Full Codes with Notebook here.


The post A Coding Implementation on Microsoft’s Phi-4-Mini for Quantized Inference, Reasoning, Tool Use, RAG, and LoRA Fine-Tuning appeared first on MarkTechPost.
