How to Fine-Tune LFM2 Using QLoRA and DPO: A Complete Step-by-Step Coding Tutorial on Google Colab

June 3, 2026 by dardanvuc1996@gmail.com

In this tutorial, we fine-tune Liquid AI’s LFM2 model using a complete open-source workflow. We start by loading the base LFM2 checkpoint with QLoRA, prepare a chat-formatted supervised fine-tuning (SFT) dataset, train a lightweight LoRA adapter using TRL and PEFT, and merge the adapter back into the model. We also extend the workflow with DPO to show how we can improve preference alignment using chosen and rejected responses. By the end, we have a working pipeline that takes the base LFM2 model to an SFT-tuned, preference-aligned checkpoint that is ready for further evaluation or deployment.

!pip install -q -U "transformers>=4.55" "trl>=0.12" "peft>=0.13" "datasets>=2.20" "accelerate>=0.34" bitsandbytes


import torch, gc
from datasets import load_dataset, Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer


MODEL_ID    = "LiquidAI/LFM2-1.2B"
USE_4BIT    = True
RUN_DPO     = True
SFT_SAMPLES = 500
SFT_STEPS   = 60
DPO_STEPS   = 40
MAX_LEN     = 1024


BF16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
DTYPE = torch.bfloat16 if BF16 else torch.float16
assert torch.cuda.is_available(), "No GPU detected — set Runtime > Change runtime type > GPU"
print(f"GPU: {torch.cuda.get_device_name(0)} | dtype={DTYPE} | 4bit={USE_4BIT}")

We install all the libraries required to fine-tune LFM2 in Google Colab. We import the key tools from Transformers, TRL, PEFT, Datasets, and PyTorch, and install bitsandbytes for 4-bit quantization. We also define the main training settings, check that a GPU is available, and choose bf16 or fp16 precision depending on what the GPU supports.
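To make the training settings concrete, here is a small sketch of the budget arithmetic for the SFT stage. It uses the SFT_SAMPLES and SFT_STEPS values defined above together with the per-device batch size of 2 and gradient accumulation of 4 that appear later in the SFT config. The helper name training_budget is hypothetical, and a single GPU is assumed.

```python
# Rough training-budget arithmetic for the SFT stage (single GPU assumed).
SFT_SAMPLES = 500
SFT_STEPS = 60
PER_DEVICE_BATCH = 2   # per_device_train_batch_size in the SFT config
GRAD_ACCUM = 4         # gradient_accumulation_steps in the SFT config


def training_budget(samples, steps, per_device_batch, grad_accum, num_gpus=1):
    """Return (effective batch size, sequences consumed, epochs covered)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    sequences_seen = effective_batch * steps
    epochs = sequences_seen / samples
    return effective_batch, sequences_seen, epochs


effective_batch, sequences_seen, epochs = training_budget(
    SFT_SAMPLES, SFT_STEPS, PER_DEVICE_BATCH, GRAD_ACCUM
)
print(f"effective batch={effective_batch}, "
      f"sequences={sequences_seen}, epochs={epochs:.2f}")
```

With these values, the 60 optimizer steps cover slightly less than one pass over the 500 samples, which is why the run stays short enough for a Colab session.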

def load_base(four_bit: bool):
   quant_cfg = None
   if four_bit:
       quant_cfg = BitsAndBytesConfig(
           load_in_4bit=True,
           bnb_4bit_quant_type="nf4",
           bnb_4bit_use_double_quant=True,
           bnb_4bit_compute_dtype=DTYPE,
       )
   model = AutoModelForCausalLM.from_pretrained(
       MODEL_ID,
       device_map="auto",
       dtype=DTYPE,
       quantization_config=quant_cfg,
   )
   model.config.use_cache = False
   return model


tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:
   tokenizer.pad_token = tokenizer.eos_token


model = load_base(USE_4BIT)


@torch.no_grad()
def chat(m, user_msg, system=None, max_new_tokens=200):
   msgs = ([{"role": "system", "content": system}] if system else []) + \
          [{"role": "user", "content": user_msg}]
   inputs = tokenizer.apply_chat_template(
       msgs,
       add_generation_prompt=True,
       return_tensors="pt",
       tokenize=True,
       return_dict=True,
   ).to(m.device)
   m.config.use_cache = True
   out = m.generate(
       **inputs,
       max_new_tokens=max_new_tokens, do_sample=True,
       temperature=0.3, min_p=0.15, repetition_penalty=1.05,
       pad_token_id=tokenizer.pad_token_id,
   )
   m.config.use_cache = False
   prompt_len = inputs["input_ids"].shape[-1]
   return tokenizer.decode(out[0, prompt_len:], skip_special_tokens=True)


PROBE = "Explain what makes the LFM2 architecture good for on-device AI, in 2 sentences."
print("\n=== BASELINE (before fine-tuning) ===\n", chat(model, PROBE))

We load the base LFM2 model with optional 4-bit quantization to reduce GPU memory usage. We configure the tokenizer, set the padding token, and define a chat helper for testing model responses. We then run a baseline probe so we can compare the model's behavior before and after fine-tuning.
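As a rough illustration of why 4-bit loading helps, the sketch below estimates the memory taken by the weights alone. The parameter count of 1.2e9 is an assumption read off the "1.2B" in the model name, and the helper weight_memory_gib is hypothetical. The estimate ignores activations, optimizer state, and quantization constants, and it assumes every weight is quantized, which is not the case in practice, so the 4-bit figure should be read as a lower bound.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the weights only, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: parameter count taken from the "1.2B" in the model name.
NUM_PARAMS = 1.2e9

for label, bits in [("fp16/bf16", 16), ("4-bit", 4)]:
    print(f"{label:>9}: ~{weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```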

sft_ds = load_dataset("HuggingFaceTB/smoltalk", "all", split=f"train[:{SFT_SAMPLES}]")
sft_ds = sft_ds.select_columns(["messages"])
print("\nSFT example messages:", sft_ds[0]["messages"][:2])


lora_sft = LoraConfig(
   r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
   task_type="CAUSAL_LM", target_modules="all-linear",
)


sft_cfg = SFTConfig(
   output_dir="outputs/sft/lfm2_demo",
   max_length=MAX_LEN,
   per_device_train_batch_size=2,
   gradient_accumulation_steps=4,
   learning_rate=2e-5,
   warmup_ratio=0.03,
   lr_scheduler_type="cosine",
   max_steps=SFT_STEPS,
   logging_steps=10,
   save_strategy="no",
   gradient_checkpointing=True,
   gradient_checkpointing_kwargs={"use_reentrant": False},
   bf16=BF16, fp16=not BF16,
   optim="paged_adamw_8bit" if USE_4BIT else "adamw_torch",
   packing=False,
   report_to="none",
)


sft_trainer = SFTTrainer(
   model=model,
   args=sft_cfg,
   train_dataset=sft_ds,
   peft_config=lora_sft,
   processing_class=tokenizer,
)
sft_trainer.train()
sft_trainer.save_model("outputs/sft/lfm2_adapter")
print("\n=== AFTER SFT ===\n", chat(sft_trainer.model, PROBE))

We load a chat-formatted dataset and keep only the messages column. We configure LoRA for lightweight adapter-based training and define the SFT training settings. We then train the model with SFT, save the LoRA adapter, and check how the fine-tuned model responds to the same probe.
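To see why a LoRA adapter is lightweight, the sketch below counts the trainable parameters that rank-16 LoRA adds to a single linear layer and computes the alpha/r scaling from the config above. The 2048 x 2048 layer shape is a hypothetical example, not a confirmed LFM2 dimension, and the helper lora_param_count is introduced only for illustration.

```python
def lora_param_count(d_in, d_out, r):
    """LoRA adds A (r x d_in) and B (d_out x r) next to a frozen d_out x d_in weight."""
    return r * d_in + d_out * r


R, ALPHA = 16, 32            # r and lora_alpha from the LoRA config above
D_IN, D_OUT = 2048, 2048     # hypothetical layer shape, for illustration only

full_params = D_IN * D_OUT
adapter_params = lora_param_count(D_IN, D_OUT, R)
scaling = ALPHA / R          # factor applied to the low-rank update

print(f"frozen weight params : {full_params:,}")
print(f"LoRA adapter params  : {adapter_params:,} "
      f"({100 * adapter_params / full_params:.2f}% of the layer)")
print(f"LoRA scaling alpha/r : {scaling}")
```

For this example shape, the adapter trains under 2% as many parameters as the frozen layer holds.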

del sft_trainer, model
gc.collect(); torch.cuda.empty_cache()


base_fp16 = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", dtype=DTYPE)
sft_merged = PeftModel.from_pretrained(base_fp16, "outputs/sft/lfm2_adapter").merge_and_unload()
sft_merged.save_pretrained("outputs/sft/lfm2_merged")
tokenizer.save_pretrained("outputs/sft/lfm2_merged")
print("Merged SFT model saved -> outputs/sft/lfm2_merged")

We clear the previous training objects from memory to free GPU resources. We reload the base LFM2 model in fp16 or bf16, without quantization, so that the merge operates on unquantized weights, and attach the trained SFT LoRA adapter. We then merge the adapter into the base model and save the merged SFT checkpoint for the next section.
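Conceptually, merging folds the low-rank update into the base weight, W' = W + (alpha / r) * B A, so the adapter no longer needs to be applied separately at inference time. The toy sketch below checks this on tiny hand-written matrices in plain Python. It illustrates the standard LoRA merge formula and is not PEFT's actual implementation; all function names and values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * (B @ A), with A of shape r x in and B of shape out x r."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy layer: 3 inputs, 2 outputs, rank 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[0.5, -1.0, 0.0]]      # r x in
B = [[2.0], [1.0]]          # out x r
ALPHA, R = 2, 1
x = [1.0, 2.0, 3.0]

# Path 1: base layer output plus the scaled adapter output.
adapter_out = [ALPHA / R * v for v in matvec(B, matvec(A, x))]
unmerged = [b + a for b, a in zip(matvec(W, x), adapter_out)]

# Path 2: a single matrix after merging.
merged = matvec(merge_lora(W, A, B, ALPHA, R), x)

print("unmerged:", unmerged)
print("merged  :", merged)
```

Both paths give the same output, which is why the merged checkpoint can be saved and loaded as an ordinary model.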

if RUN_DPO:
   pref_rows = [
       {"prompt":  [{"role": "user", "content": "Reply to a customer whose order is late."}],
        "chosen":  [{"role": "assistant", "content": "I'm sorry your order is delayed. I've checked your tracking and it will arrive within 2 days — here's a 10% credit for the inconvenience."}],
        "rejected":[{"role": "assistant", "content": "Orders are sometimes late. Please wait."}]},
       {"prompt":  [{"role": "user", "content": "Summarize the benefit of edge AI in one line."}],
        "chosen":  [{"role": "assistant", "content": "Edge AI runs models locally, giving low latency, offline reliability, and stronger privacy."}],
        "rejected":[{"role": "assistant", "content": "Edge AI is AI on the edge of things and it is good."}]},
       {"prompt":  [{"role": "user", "content": "Decline a meeting politely."}],
        "chosen":  [{"role": "assistant", "content": "Thanks for the invite — I have a conflict then. Could we find another slot this week?"}],
        "rejected":[{"role": "assistant", "content": "No."}]},
   ] * 20
   pref_ds = Dataset.from_list(pref_rows)


   lora_dpo = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
                         task_type="CAUSAL_LM", target_modules="all-linear")
   dpo_cfg = DPOConfig(
       output_dir="outputs/dpo/lfm2_demo",
       per_device_train_batch_size=1,
       gradient_accumulation_steps=4,
       learning_rate=5e-6,
       beta=0.1,
       max_length=MAX_LEN,
       max_prompt_length=512,
       max_steps=DPO_STEPS,
       logging_steps=10,
       save_strategy="no",
       gradient_checkpointing=True,
       gradient_checkpointing_kwargs={"use_reentrant": False},
       bf16=BF16, fp16=not BF16,
       report_to="none",
   )
   dpo_trainer = DPOTrainer(
       model=sft_merged,
       ref_model=None,
       args=dpo_cfg,
       train_dataset=pref_ds,
       processing_class=tokenizer,
       peft_config=lora_dpo,
   )
   dpo_trainer.train()
   final = dpo_trainer.model.merge_and_unload()
   final.save_pretrained("outputs/final/lfm2_sft_dpo")
   tokenizer.save_pretrained("outputs/final/lfm2_sft_dpo")
   print("\n=== AFTER SFT + DPO ===\n", chat(dpo_trainer.model, PROBE))
   print("Final model saved -> outputs/final/lfm2_sft_dpo")


print("\nDone. Compare the BASELINE vs AFTER-SFT(+DPO) outputs above.")

We optionally run DPO on pairs of chosen and rejected responses. The preference set here is a toy one: three hand-written pairs repeated 20 times. We configure another LoRA adapter for preference tuning and train it on top of the merged SFT model. Finally, we merge the DPO adapter, save the final checkpoint, and compare its output with the earlier outputs.
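For intuition about what DPO optimizes, the sketch below computes the standard sigmoid DPO loss for a single preference pair from sequence log-probabilities, using the same beta of 0.1 as the config above. The log-probability values are made up for illustration, the function name dpo_loss is hypothetical, and this is a sketch of the published objective rather than TRL's internal code.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one pair: -log(sigmoid(margin)).

    Each implicit reward is beta times the log-probability gain of the
    policy over the reference model; the margin is chosen minus rejected.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # Numerically stable form of -log(sigmoid(margin)).
    loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return loss, margin


# Made-up sequence log-probabilities, for illustration only.
start_loss, start_margin = dpo_loss(-10.0, -12.0, -10.0, -12.0)
later_loss, later_margin = dpo_loss(-8.0, -15.0, -10.0, -12.0)

print(f"policy == reference : loss={start_loss:.4f}, margin={start_margin:.2f}")
print(f"policy prefers chosen: loss={later_loss:.4f}, margin={later_margin:.2f}")
```

When the policy matches the reference, the margin is zero and the loss equals log 2. The loss falls as the policy raises the chosen response's likelihood relative to the rejected one.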

In conclusion, we have built a full fine-tuning pipeline for LFM2 using only open-source tools, including Transformers, TRL, PEFT, Datasets, and bitsandbytes. We used QLoRA to make training feasible on Colab GPUs, applied supervised fine-tuning to chat-formatted data, merged the trained adapter into the base model, and optionally refined the model with DPO. This gives us a clear view of how LLM fine-tuning works in practice, from loading the model to producing a final checkpoint that can be compared against the original baseline and is ready for use.

The post How to Fine-Tune LFM2 Using QLoRA and DPO: A Complete Step-by-Step Coding Tutorial on Google Colab appeared first on MarkTechPost.