How to Fine-Tune LFM2 Using QLoRA and DPO: A Complete Step-by-Step Coding Tutorial on Google Colab
In this tutorial, we fine-tune Liquid AI’s LFM2 model through a complete open-source workflow. We begin by loading the base LFM2 checkpoint with QLoRA, preparing a chat-style supervised fine-tuning dataset, training a lightweight LoRA adapter using TRL and PEFT, and then merging the adapter back into the model. We also extend the workflow with DPO to show how we can improve response preference using chosen and rejected answers. At the end, we have a practical pipeline that moves from a base LFM2 model to an SFT-tuned, preference-aligned checkpoint, ready for further testing or deployment.
!pip install -q -U "transformers>=4.55" "trl>=0.12" "peft>=0.13" "datasets>=2.20" "accelerate>=0.34" bitsandbytes
import torch, gc
from datasets import load_dataset, Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer
MODEL_ID = "LiquidAI/LFM2-1.2B"
USE_4BIT = True
RUN_DPO = True
SFT_SAMPLES = 500
SFT_STEPS = 60
DPO_STEPS = 40
MAX_LEN = 1024
BF16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
DTYPE = torch.bfloat16 if BF16 else torch.float16
assert torch.cuda.is_available(), "No GPU detected. Set Runtime > Change runtime type > GPU"
print(f"GPU: {torch.cuda.get_device_name(0)} | dtype={DTYPE} | 4bit={USE_4BIT}")
We install all of the required libraries for fine-tuning LFM2 inside Google Colab. We import the core tools from Transformers, TRL, PEFT, datasets, bitsandbytes, and PyTorch. We also define the main training settings, check that a GPU is available, and select bf16 or fp16 precision depending on what the GPU supports.
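To get a feel for how far these step counts go, a rough back-of-the-envelope sketch can help. It assumes the per-device batch size of 2 and the 4 gradient-accumulation steps used in the SFT configuration later in this tutorial, a single GPU, and no packing. The helper name `sequences_seen` is ours for illustration and is not part of any library.

```python
def sequences_seen(max_steps: int, per_device_batch: int,
                   grad_accum: int, num_gpus: int = 1) -> int:
    """Training sequences consumed after `max_steps` optimizer steps."""
    return max_steps * per_device_batch * grad_accum * num_gpus

SFT_SAMPLES = 500  # same values as the settings cell above
SFT_STEPS = 60

# Assumed values, taken from the SFT configuration later in the tutorial.
seen = sequences_seen(SFT_STEPS, per_device_batch=2, grad_accum=4)
epochs = seen / SFT_SAMPLES

print(f"sequences seen: {seen} (~{epochs:.2f} epochs over {SFT_SAMPLES} samples)")
```

Under these assumptions the SFT run makes slightly less than one pass over the 500-sample subset, which is in line with a quick demo rather than a full training run.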
def load_base(four_bit: bool):
    # Build an NF4 4-bit quantization config only when QLoRA is requested.
    quant_cfg = None
    if four_bit:
        quant_cfg = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
            bnb_4bit_compute_dtype=DTYPE,
        )
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        device_map="auto",
        dtype=DTYPE,
        quantization_config=quant_cfg,
    )
    model.config.use_cache = False  # the KV cache is not needed while training
    return model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = load_base(USE_4BIT)

@torch.no_grad()
def chat(m, user_msg, system=None, max_new_tokens=200):
    # Optional system message first, then the user turn.
    msgs = [{"role": "system", "content": system}] if system else []
    msgs.append({"role": "user", "content": user_msg})
    inputs = tokenizer.apply_chat_template(
        msgs,
        add_generation_prompt=True,
        return_tensors="pt",
        tokenize=True,
        return_dict=True,
    ).to(m.device)
    m.config.use_cache = True  # enable the KV cache for generation only
    out = m.generate(
        **inputs,
        max_new_tokens=max_new_tokens, do_sample=True,
        temperature=0.3, min_p=0.15, repetition_penalty=1.05,
        pad_token_id=tokenizer.pad_token_id,
    )
    m.config.use_cache = False
    # Decode only the newly generated tokens, not the prompt.
    prompt_len = inputs["input_ids"].shape[-1]
    return tokenizer.decode(out[0, prompt_len:], skip_special_tokens=True)

PROBE = "Explain what makes the LFM2 architecture good for on-device AI, in 2 sentences."
print("\n=== BASELINE (before fine-tuning) ===\n", chat(model, PROBE))
We load the LFM2 base model with optional 4-bit quantization to reduce GPU memory usage. We prepare the tokenizer, set the padding token, and define a chat function for testing model responses. We then run a baseline prompt so that we can compare the model’s behavior before and after fine-tuning.
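To make the memory saving concrete, here is a rough estimate of weight storage only. It is a sketch under stated assumptions: the parameter count is taken at face value from the "1.2B" in the model name, every weight is treated as quantized (in practice some layers, such as embeddings and norms, typically stay in 16-bit), and the per-parameter overhead for NF4 with double quantization follows the figure given in the QLoRA paper (block size 64, 8-bit second-level constants with block size 256). Activations, optimizer state, and CUDA overhead are ignored. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 1.2e9  # assumption: nominal size implied by "LFM2-1.2B"

# Extra bits per parameter for quantization constants with double quantization:
# 8-bit constants per block of 64 weights, plus a 32-bit constant per
# 256 of those blocks (about 0.127 bits per parameter).
NF4_OVERHEAD_BITS = 8 / 64 + 32 / (64 * 256)

fp16_gb = weight_memory_gb(N_PARAMS, 16)
nf4_gb = weight_memory_gb(N_PARAMS, 4 + NF4_OVERHEAD_BITS)

print(f"16-bit weights: ~{fp16_gb:.2f} GB")
print(f"NF4 weights:    ~{nf4_gb:.2f} GB")
```

The real footprint on Colab will be higher than either number, but the ratio gives a sense of why `USE_4BIT = True` leaves more room for activations and optimizer state.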
sft_ds = load_dataset("HuggingFaceTB/smoltalk", "all", split=f"train[:{SFT_SAMPLES}]")
sft_ds = sft_ds.select_columns(["messages"])
print("\nSFT example messages:", sft_ds[0]["messages"][:2])

# LoRA adapter applied to every linear layer of the model.
lora_sft = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
    task_type="CAUSAL_LM", target_modules="all-linear",
)
sft_cfg = SFTConfig(
    output_dir="outputs/sft/lfm2_demo",
    max_length=MAX_LEN,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    max_steps=SFT_STEPS,
    logging_steps=10,
    save_strategy="no",
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    bf16=BF16, fp16=not BF16,
    optim="paged_adamw_8bit" if USE_4BIT else "adamw_torch",
    packing=False,
    report_to="none",
)
sft_trainer = SFTTrainer(
    model=model,
    args=sft_cfg,
    train_dataset=sft_ds,
    peft_config=lora_sft,
    processing_class=tokenizer,
)
sft_trainer.train()
sft_trainer.save_model("outputs/sft/lfm2_adapter")  # saves the LoRA adapter only
print("\n=== AFTER SFT ===\n", chat(sft_trainer.model, PROBE))
We load a chat-formatted supervised fine-tuning dataset and keep only the messages column. We configure LoRA for lightweight adapter-based training and define the SFT training settings. We then train the model with SFT, save the LoRA adapter, and check the model’s response to the same probe prompt.
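To illustrate why the adapter is lightweight, the sketch below counts the parameters LoRA adds to a single linear layer with the `r=16`, `lora_alpha=32` settings used above. The 2048 x 2048 layer shape is a hypothetical example chosen for round numbers, not a dimension taken from LFM2, and `lora_extra_params` is an illustrative helper rather than a PEFT function.

```python
def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer.

    LoRA learns two low-rank matrices: A with shape (r, d_in)
    and B with shape (d_out, r).
    """
    return r * (d_in + d_out)

R, ALPHA = 16, 32
scaling = ALPHA / R  # the low-rank update is multiplied by lora_alpha / r

D_IN = D_OUT = 2048  # hypothetical layer shape, for illustration only
full = D_IN * D_OUT
extra = lora_extra_params(D_IN, D_OUT, R)

print(f"frozen weights in the layer: {full:,}")
print(f"trainable LoRA weights:      {extra:,} ({extra / full:.2%} of the layer)")
print(f"update scaling factor:       {scaling}")
```

For a layer of this shape, the adapter trains under 2% as many weights as the frozen layer holds. Together with the 4-bit base weights, that is what keeps the run within a Colab GPU's memory.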
# Free the quantized training model before reloading in 16-bit.
del sft_trainer, model
gc.collect(); torch.cuda.empty_cache()
base_fp16 = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", dtype=DTYPE)
sft_merged = PeftModel.from_pretrained(base_fp16, "outputs/sft/lfm2_adapter").merge_and_unload()
sft_merged.save_pretrained("outputs/sft/lfm2_merged")
tokenizer.save_pretrained("outputs/sft/lfm2_merged")
print("Merged SFT model saved -> outputs/sft/lfm2_merged")
We clear the earlier training objects from memory to free GPU resources. We reload the base LFM2 model in fp16 or bf16 and attach the trained SFT LoRA adapter. We then merge the adapter into the base model and save the merged SFT checkpoint for the next stage.
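Conceptually, merging folds the low-rank update into the base weight, W' = W + (lora_alpha / r) * B A, so the merged layer produces the same output as the base layer with the adapter attached. The toy example below checks this with plain Python lists. The matrices are made-up values and the helper functions are ours. The real `merge_and_unload()` works on tensors layer by layer, so treat this only as a sketch of the arithmetic.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]

# Toy layer: W is (d_out x d_in) = 2 x 3, LoRA rank r = 1 (made-up values).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
A = [[0.5, -0.5, 1.0]]   # (r x d_in)
B = [[2.0], [1.0]]       # (d_out x r)
lora_alpha, r = 2.0, 1
scaling = lora_alpha / r

x = [[1.0], [2.0], [3.0]]  # input as a column vector (d_in x 1)

# Adapter attached: y = W x + scaling * B (A x)
y_adapter = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Adapter merged: W' = W + scaling * B A, then y = W' x
W_merged = add(W, scale(matmul(B, A), scaling))
y_merged = matmul(W_merged, x)

print("with adapter:", y_adapter)
print("after merge: ", y_merged)
```

Reloading the base model in 16-bit before merging, as the cell above does, presumably avoids folding the update into 4-bit weights, where rounding would make this equality only approximate.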
if RUN_DPO:
    # Tiny toy preference set: 3 prompt/chosen/rejected rows repeated 20 times.
    pref_rows = [
        {"prompt": [{"role": "user", "content": "Reply to a customer whose order is late."}],
         "chosen": [{"role": "assistant", "content": "I'm sorry your order is delayed. I've checked your tracking and it will arrive within 2 days. Here's a 10% credit for the inconvenience."}],
         "rejected": [{"role": "assistant", "content": "Orders are sometimes late. Please wait."}]},
        {"prompt": [{"role": "user", "content": "Summarize the benefit of edge AI in one line."}],
         "chosen": [{"role": "assistant", "content": "Edge AI runs models locally, giving low latency, offline reliability, and stronger privacy."}],
         "rejected": [{"role": "assistant", "content": "Edge AI is AI on the edge of things and it is good."}]},
        {"prompt": [{"role": "user", "content": "Decline a meeting politely."}],
         "chosen": [{"role": "assistant", "content": "Thanks for the invite, but I have a conflict then. Could we find another slot this week?"}],
         "rejected": [{"role": "assistant", "content": "No."}]},
    ] * 20
    pref_ds = Dataset.from_list(pref_rows)
    lora_dpo = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
                          task_type="CAUSAL_LM", target_modules="all-linear")
    dpo_cfg = DPOConfig(
        output_dir="outputs/dpo/lfm2_demo",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        learning_rate=5e-6,
        beta=0.1,
        max_length=MAX_LEN,
        max_prompt_length=512,
        max_steps=DPO_STEPS,
        logging_steps=10,
        save_strategy="no",
        gradient_checkpointing=True,
        gradient_checkpointing_kwargs={"use_reentrant": False},
        bf16=BF16, fp16=not BF16,
        report_to="none",
    )
    dpo_trainer = DPOTrainer(
        model=sft_merged,
        ref_model=None,
        args=dpo_cfg,
        train_dataset=pref_ds,
        processing_class=tokenizer,
        peft_config=lora_dpo,
    )
    dpo_trainer.train()
    # Fold the DPO adapter into the weights and evaluate the merged model.
    final = dpo_trainer.model.merge_and_unload()
    final.save_pretrained("outputs/final/lfm2_sft_dpo")
    tokenizer.save_pretrained("outputs/final/lfm2_sft_dpo")
    print("\n=== AFTER SFT + DPO ===\n", chat(final, PROBE))
    print("Final model saved -> outputs/final/lfm2_sft_dpo")
print("\nDone. Compare the BASELINE vs AFTER-SFT(+DPO) outputs above.")
We optionally run DPO using triples of a prompt, a chosen response, and a rejected response. We configure another LoRA adapter for preference tuning and train the SFT-merged model with DPO. Finally, we merge the DPO adapter, save the final model checkpoint, and compare the result against the earlier outputs.
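For intuition about what DPO optimizes, the sketch below computes the standard sigmoid DPO loss for a single preference pair from summed log-probabilities. The log-probability values are invented for illustration, and `dpo_loss` is our own helper, not TRL's implementation. `beta=0.1` matches the `DPOConfig` above.

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Sigmoid DPO loss for one pair: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full response under
    the policy (pi_*) or the frozen reference model (ref_*).
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    z = beta * margin
    # Numerically stable form of log(1 + exp(-z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# At the start of training the policy equals the reference, so the margin is 0.
start = dpo_loss(-12.0, -12.0, -12.0, -12.0)

# Hypothetical later state: chosen became more likely, rejected less likely.
later = dpo_loss(-10.0, -15.0, -12.0, -12.0)

print(f"loss at start: {start:.4f}")
print(f"loss later:    {later:.4f}")
```

The loss starts at log 2 and falls as the policy raises the chosen response relative to the rejected one, measured against the reference. With `ref_model=None` and a `peft_config`, our understanding is that TRL obtains the reference log-probabilities from the same model with the adapter disabled, which avoids holding a second copy in memory.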
In conclusion, we built a full fine-tuning pipeline for LFM2 using only open-source tools, including Transformers, TRL, PEFT, datasets, and bitsandbytes. We used QLoRA to make training efficient on Colab GPUs, applied supervised fine-tuning to chat-formatted data, merged the trained adapter into the base model, and optionally refined the model further with DPO. This gives us a clear view of how modern LLM fine-tuning works in practice, from loading the model to producing a final checkpoint that can be compared against the original baseline and prepared for deployment.
The post How to Fine-Tune LFM2 Using QLoRA and DPO: A Complete Step-by-Step Coding Tutorial on Google Colab appeared first on MarkTechPost.
