Fine-Tuning LLMs: What I Learned From a Rabbit Hole That Started With a GGUF File

I was poking around LM Studio yesterday, sorted by most downloads, and something caught my eye. Not the official Qwen release. Not Unsloth's version. The top downloaded model for Qwen 3.5 27B was this: Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF

That last part - Claude-4.6-Opus-Reasoning-Distilled - piqued my curiosity. What does that even mean?

Paraphrasing the model description:

The creator used Chain-of-Thought interactions with Claude 4.6 Opus as training data, then distilled those reasoning patterns into a much smaller open-source model. More specifically, they were targeting a known weakness in Qwen 3.5 — its tendency to over-explain and loop through redundant reasoning steps. By imitating Claude's more structured thinking style, they got the model to adopt a leaner internal monologue: "Let me analyze this request carefully: 1, 2, 3" — and stop there, instead of spiraling.

This is fine-tuning in action. And while I had seen projects that do it (e.g. Distil Labs AI slop detector), I hadn't really thought hard about what that means until now.

Why now?

Honestly, frustration.

  • Proprietary models keep moving the goalposts. They are either too expensive, too rate-limited, or both (Claude, I'm looking at you).
  • Open models are more capable than ever but still hallucinate in ways that are hard to trust in production. With Qwen 3.5, Gemma 4 etc, the quality gap is closing, but it hasn't closed.
  • The quality bar on inference set by the trillion-parameter proprietary SOTA giants makes it tempting to just stay dependent on them.
  • Rolling your own deployment harness is very fragile. Duct-taping together your own inference stack means accumulating a class of bugs that only surface under load, at 2am, in front of real users.

Joel Spolsky's Duct Tape Programmer was operating on decades of accumulated tooling (compilers, debuggers, deployment pipelines) that had survived contact with reality for years. The inference stack for open models doesn't have that luxury. ChatGPT launched less than 3.5 years ago (as of this writing). The harness is young, and the ground keeps shifting under it - K/Q/IQ quantizations, TurboQuant, Skills, MCPs, LLMs in flash memory, self-distillation - techniques that didn't exist 3-4 years ago. These fundamental changes are arriving faster than the tooling can absorb them. You're not taping over cracks in a mature system. You're taping over cracks in a system that's still deciding what shape it wants to be.

Then I read a line from Aishwarya Srinivasan that reframed it cleanly:

the question for AI engineers has shifted from "Can an AI do this?" to "What's the most efficient way to deploy intelligence in production?"

That's exactly what I was thinking.

But there's a growing class of use cases (e.g. Aishwarya's examples like medical record classification, structured API response generation, legal contract parsing; or my own cases like keyword classifications, content writing/summarization, tone/slop detection, sentiment analysis etc.) where you don't need general intelligence.

You need narrow, fast, cheap and repeatable performance.

That's where fine-tuning makes sense. Open-source models have gotten good enough that their baseline performance is already high. Fine-tuning just points that baseline in a specific direction.

Where Fine-Tuning Sits in the Stack

Before getting into the methods, it helps to locate fine-tuning in the broader picture:

Pre-training is the massive upfront phase where the model learns language from enormous datasets. This is the heavy lifting - it gives the model foundational intelligence, and you don't touch it.

Fine-tuning is the subsequent, focused training on a much smaller, specialized dataset. You're taking that broadly capable model and sharpening it for a specific task, style, or domain. Crucially, this does update the model's weights.

RAG (Retrieval-Augmented Generation) is often contrasted with fine-tuning. Unlike fine-tuning, RAG doesn't change the model at all. Instead, it retrieves relevant information at inference time and feeds it alongside the user's prompt. Great for keeping models up to date - not so great for changing how they behave.

TL;DR: Pre-training gives you the base. RAG gives you fresh information. Fine-tuning changes the model's actual behavior.
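The distinction can be made concrete with a deliberately tiny toy: treat the "model" as a plain dict. Every name below is made up for illustration; the point is only where each technique intervenes.

```python
# Toy contrast between RAG and fine-tuning. The "model" is just a dict
# standing in for weights; nothing here resembles a real LLM.

def retrieve(knowledge_base: dict, query: str) -> str:
    # Toy retrieval: return the first document sharing a word with the query.
    for doc in knowledge_base.values():
        if any(word in doc for word in query.lower().split()):
            return doc
    return ""

def rag_answer(model: dict, query: str, knowledge_base: dict) -> str:
    # RAG: the model is untouched; fresh context is bolted onto the prompt.
    context = retrieve(knowledge_base, query)
    return model.get(query, f"Based on context: {context}")

def fine_tune(model: dict, dataset: dict) -> dict:
    # Fine-tuning: the "weights" themselves are updated before inference.
    return {**model, **dataset}

base = {}  # the base model knows nothing about our domain
kb = {"doc1": "qwen 3.5 was distilled from claude reasoning traces"}

print(rag_answer(base, "what is qwen", kb))   # answer comes from retrieval
tuned = fine_tune(base, {"what is qwen": "An open-weight model family."})
print(tuned["what is qwen"])                  # answer is baked into the model
```

Same question, two very different mechanisms: RAG changes what the model sees, fine-tuning changes what the model is.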

The Fine-Tuning Methods (And The One That's Surprisingly Accessible)

A second blog post (from Rhitam Deb on Medium) lists the main methods used for fine-tuning:

  • Instruction Fine-tuning
  • Full Fine-tuning
  • Parameter-Efficient Fine-tuning (PEFT)
  • LoRA (Low-Rank Adaptation)
  • QLoRA (Quantized LoRA)
For LoRA with MLX on Mac, Rhitam writes:

For instance, LoRA can reduce memory usage by 65% and enable much faster training. With LoRA, a 7-billion parameter model that previously used 45GB memory and ran at 95 tokens/second could run at 328 tokens/second and only use 17GB of memory. It’s a game-changer!
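The memory savings come from the low-rank trick itself: instead of updating a full d×d weight matrix, LoRA freezes it and trains two thin matrices whose product is the update. A minimal numpy sketch of the idea (not MLX's actual implementation; the sizes and hyperparameters below are illustrative):

```python
import numpy as np

d, r = 1024, 8              # hidden size, LoRA rank (r << d)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))         # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01  # trainable, r x d
B = np.zeros((d, r))                    # trainable, d x r (zero init,
                                        # so the model starts unchanged)
alpha = 16                              # LoRA scaling hyperparameter

def lora_forward(x):
    # Effective weight is W + (alpha / r) * B @ A, but we never
    # materialize it; the low-rank path is applied separately.
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

full_params = d * d
lora_params = 2 * d * r
print(f"trainable params: {lora_params:,} vs {full_params:,} "
      f"({100 * lora_params / full_params:.2f}%)")
# → trainable params: 16,384 vs 1,048,576 (1.56%)
```

Training only A and B (here ~1.6% of the layer's parameters, and far less at realistic hidden sizes) is what makes the 45GB → 17GB numbers plausible: optimizer state and gradients only exist for the thin matrices.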

Fine-Tuning on a Mac Is Surprisingly Simple

What surprised me most was how approachable the actual process is — at least for LoRA fine-tuning with Apple's MLX framework. Paraphrasing the steps from Rhitam's blog post:

  1. pip install some packages.
  2. Create three JSONL files:
    1. train.jsonl (for training the model)
    2. valid.jsonl (for validating its learning progress)
    3. test.jsonl (for a final evaluation)
  3. Use huggingface-cli to download an open-source base model.
  4. Run fine-tuning with python -m mlx_lm.lora.
  5. Test the result with python -m mlx_lm.generate.
  6. Fuse your custom weights back into the base model with python -m mlx_lm.fuse.
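Step 2 is less exotic than it sounds: a JSONL file is just one JSON object per line. Here's a sketch of what chat-format training records might look like, using a hypothetical slop-detection task (the exact schema mlx_lm expects varies by version — some accept a plain {"text": "..."} field instead, so check its docs):

```python
import json

# Hypothetical training examples for a slop-detection task.
examples = [
    {"messages": [
        {"role": "user",
         "content": "Is this AI slop? 'In today's fast-paced world...'"},
        {"role": "assistant", "content": "slop"},
    ]},
    {"messages": [
        {"role": "user",
         "content": "Is this AI slop? 'The bus was late again.'"},
        {"role": "assistant", "content": "human"},
    ]},
]

# One JSON object per line; valid.jsonl and test.jsonl follow the same
# format, typically with a rough 80/10/10 split of your examples.
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

With hundreds of lines like these (generated, say, by prompting a frontier model), the three files in step 2 are done.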

In layman's terms: Jackrong's Qwen model, for instance, was trained on Claude Opus interactions. You could generate your own JSONL training data the same way, using Claude or Codex in CLI mode.

That's essentially it. From there, you'd deploy with llama.cpp or something equivalent. The hard part is building a good training dataset, but proprietary SOTA models like Opus 4.x or GPT 5.x can help you with it.

A Real Example That Stuck With Me

A few weeks ago I saw a project that made the potential very concrete. A team fine-tuned a tiny 270M parameter model to detect AI-generated text. The thing runs entirely in a browser extension.

  • The original/base model (Gemma 3 270M, out of the box) hit about 40% accuracy on their test set. Essentially random guessing.
  • The fine-tuned version matched a 120B parameter teacher model (gpt-oss-120b) at around 95% accuracy — while being over 400 times smaller. The quantized version is 242MB and runs locally in Chrome.

That's the point. The general model didn't know how to do this task. The fine-tuned model does, and it costs almost nothing to run.

What I Want to Try

All of this makes me want to test whether fine-tuning can solve a few specific problems I keep running into:

  • Keyword classification: labeling content at scale without expensive API calls per request
  • Summarization with a specific style: getting outputs that match a particular format consistently
  • Sentiment analysis: especially for domain-specific texts (such as reviews for specific financial products) where open models underperform
  • Slop detection: like the browser extension above, but tuned to a different signal, such as tone or plagiarism.

The pattern across all of these is the same: narrow task, high repeatable volume, needs to be done cheaply. Exactly the profile where fine-tuning starts to make sense versus calling a frontier model.

Mandatory reads

If this post sent you down the same rabbit hole it sent me, these are worth your time:

  • Jackrong's LLM Fine-Tuning Guide: The companion repo to the Qwen distillation model that started all of this. Includes a PDF guide and a .ipynb notebook to get your hands dirty.
  • "How We Trained the AI Slop Detector": The specific section of the README that walks through their fine-tuning process: dataset choices, teacher model setup, and how they got a 270M model to match a 120B one at 95% accuracy.

If you've been down this rabbit hole longer than me, do message me. I'd genuinely love to know what you're reading.
