
The LM Studio Model That Sent Me Down a Fine-Tuning Rabbit Hole
I was poking around LM Studio yesterday, sorted by most downloads, and something caught my eye. Not the official Qwen release. Not Unsloth's version. The top downloaded model for Qwen 3.5 27B was this: Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF
That last part - Claude-4.6-Opus-Reasoning-Distilled - piqued my curiosity. What does that even mean?
Paraphrasing the model description:
The creator used Chain-of-Thought interactions with Claude 4.6 Opus as training data, then distilled those reasoning patterns into a much smaller open-source model. More specifically, they were targeting a known weakness in Qwen 3.5 — its tendency to over-explain and loop through redundant reasoning steps. By imitating Claude's more structured thinking style, they got the model to adopt a leaner internal monologue: "Let me analyze this request carefully: 1, 2, 3" — and stop there, instead of spiraling.
This is fine-tuning in action. And while I had seen projects that do it (e.g. Distil Labs AI slop detector), I hadn't really thought hard about what that means until now.
Why now?
Honestly, frustration.
- Proprietary models keep moving the goalposts. They are either too expensive, too rate-limited, or both (Claude, I'm looking at you).
- Open models are more capable than ever but still hallucinate in ways that are hard to trust in production. With Qwen 3.5, Gemma 4, and the like, the quality gap is closing, but it hasn't closed.
- The quality bar on inference set by the trillion-parameter proprietary SOTA giants makes it tempting to just stay dependent on them.
- Rolling your own deployment harness is very fragile. Duct-taping together your own inference stack means accumulating a class of bugs that only surface under load, at 2am, in front of real users.
Joel Spolsky's Duct Tape Programmer was operating on decades of accumulated tooling (compilers, debuggers, deployment pipelines) that had survived contact with reality for years. The inference stack for open models doesn't have that luxury. ChatGPT launched less than 3.5 years ago (as of this writing). The harness is young, and the ground keeps shifting under it - K/Q/IQ quantizations, TurboQuant, Skills, MCPs, LLMs in flash memory, self-distillation - techniques that didn't exist 3-4 years ago. These fundamental changes are arriving faster than the tooling can absorb them. You're not taping over cracks in a mature system. You're taping over cracks in a system that's still deciding what shape it wants to be.
Then I read a line from Aishwarya Srinivasan that reframed it cleanly:
the question for AI engineers has shifted from "Can an AI do this?" to "What's the most efficient way to deploy intelligence in production?"
That's exactly what I was thinking.
But there's a growing class of use cases (e.g. Aishwarya's examples like medical record classification, structured API response generation, legal contract parsing; or my own cases like keyword classifications, content writing/summarization, tone/slop detection, sentiment analysis etc.) where you don't need general intelligence.
You need narrow, fast, cheap and repeatable performance.
That's where fine-tuning makes sense. Open-source models have gotten good enough that their baseline performance is already high. Fine-tuning just points that baseline in a specific direction.
Where Fine-Tuning Sits in the Stack
Before getting into the methods, it helps to locate fine-tuning in the broader picture:
Pre-training is the massive upfront phase where the model learns language from enormous datasets. This is the heavy lifting - it gives the model foundational intelligence, and you don't touch it.
Fine-tuning is the subsequent, focused training on a much smaller, specialized dataset. You're taking that broadly capable model and sharpening it for a specific task, style, or domain. Crucially, this does update the model's weights.
RAG (Retrieval-Augmented Generation) is often contrasted with fine-tuning. Unlike fine-tuning, RAG doesn't change the model at all. Instead, it retrieves relevant information at inference time and feeds it alongside the user's prompt. Great for keeping models up to date - not so great for changing how they behave.
TL;DR: Pre-training gives you the base. RAG gives you fresh information. Fine-tuning changes the model's actual behavior.
The Fine-Tuning Methods (And The One That's Surprisingly Accessible)
A second blogpost (from Rhitam Deb on Medium) lists the main methods used for fine-tuning:
- Instruction Fine-tuning
- Full Fine-tuning
- Parameter-Efficient Fine-tuning (PEFT)
- LoRA (Low-Rank Adaptation)
- QLoRA (Quantized LoRA)
For LoRA with MLX on Mac, Rhitam writes:
> For instance, LoRA can reduce memory usage by 65% and enable much faster training. With LoRA, a 7-billion parameter model that previously used 45GB memory and ran at 95 tokens/second could run at 328 tokens/second and only use 17GB of memory. It's a game-changer!
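Those savings fall out of how LoRA works: the big pretrained weight matrix is frozen, and only two small low-rank matrices are trained. Here's a minimal numpy sketch of the idea (toy dimensions, not MLX's actual implementation):

```python
import numpy as np

# LoRA in miniature: freeze a pretrained weight W (d_out x d_in) and
# train only A (r x d_in) and B (d_out x r), with rank r << d_in.
# The effective weight becomes W + (alpha / r) * B @ A.
d_in, d_out, r, alpha = 512, 512, 8, 16

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, small init
B = np.zeros((d_out, r))                # trainable, zero init

def lora_forward(x):
    # Because B starts at zero, the adapter is a no-op at init:
    # the model behaves exactly like the base model until trained.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
assert np.allclose(lora_forward(x), W @ x)

# Trainable-parameter comparison: the adapter is ~3% of the full matrix.
full_params = W.size              # 512 * 512 = 262,144
lora_params = A.size + B.size     # 8 * 512 + 512 * 8 = 8,192
```

That parameter gap is the whole trick: you back-propagate through 8K numbers instead of 260K per layer, which is where the memory and speed wins in the quote come from.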
Fine-Tuning on a Mac Is Surprisingly Simple
What surprised me most was how approachable the actual process is — at least for LoRA fine-tuning with Apple's MLX framework. Paraphrasing the steps from Rhitam's blogpost:
1. `pip install` some packages.
2. Create three JSONL files:
   - `train.jsonl` (for training the model)
   - `valid.jsonl` (for validating its learning progress)
   - `test.jsonl` (for a final evaluation)
3. Use `huggingface-cli` to download some open-source LLM model (base).
4. Run fine-tuning with `python -m mlx_lm.lora`.
5. Test the result with `python -m mlx_lm.generate`.
6. Fuse your custom weights back into the base model with `python -m mlx_lm.fuse`.
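The JSONL step is the only one with real decisions in it. Here's a toy sketch of producing those three files; the record schema is an assumption (mlx_lm accepts a few formats, e.g. `{"text": ...}` or chat-style `{"messages": [...]}`, so check the docs for your version), and the two examples are obviously placeholders:

```python
import json
from pathlib import Path

# Placeholder examples; in practice you'd want hundreds,
# possibly generated with a frontier model as described below.
examples = [
    {"prompt": "Classify the sentiment: 'Great fund, low fees.'",
     "completion": "positive"},
    {"prompt": "Classify the sentiment: 'Hidden charges everywhere.'",
     "completion": "negative"},
]

def to_text(ex):
    # Flatten to a single training string -- one common convention
    # for the {"text": ...} record format.
    return {"text": f"{ex['prompt']}\n{ex['completion']}"}

# Toy split: real datasets need proper train/valid/test partitions.
splits = {"train.jsonl": examples[:1],
          "valid.jsonl": examples[1:2],
          "test.jsonl":  examples[1:2]}

out_dir = Path("data")
out_dir.mkdir(exist_ok=True)
for name, rows in splits.items():
    with open(out_dir / name, "w") as f:
        for ex in rows:
            f.write(json.dumps(to_text(ex)) + "\n")
```

Point `python -m mlx_lm.lora` at that `data/` directory and you're off.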
In layman's terms: Jackrong's Qwen model, for instance, was trained on Claude Opus interactions. You could generate your own training data (JSONL) the same way, using Claude or Codex CLI modes.
That's essentially it. From there, you'd deploy with llama.cpp or something equivalent. The hard part is building a good training dataset, but proprietary SOTA models like Opus 4.x or GPT 5.x can help you with it.
A Real Example That Stuck With Me
A few weeks ago I saw a project that made the potential very concrete. A team fine-tuned a tiny 270M parameter model to detect AI-generated text. The thing runs entirely in a browser extension.
- The original/base model (Gemma 3 270M, out of the box) hit about 40% accuracy on their test set. Essentially random guessing.
- The fine-tuned version matched a 120B parameter teacher model (gpt-oss-120b) at around 95% accuracy — while being over 400 times smaller. The quantized version is 242MB and runs locally in Chrome.
That's the point. The general model didn't know how to do this task. The fine-tuned model does, and it costs almost nothing to run.
What I Want to Try
All of this makes me want to test whether fine-tuning can solve a few specific problems I keep running into:
- Keyword classification: labeling content at scale without expensive API calls per request
- Summarization with a specific style: getting outputs that match a particular format consistently
- Sentiment analysis: especially for domain-specific texts (such as reviews for specific financial products) where open models underperform
- Slop detection: like the browser extension above, but tuned to a different axis, such as "tone" or "plagiarism"
The pattern across all of these is the same: narrow task, high repeatable volume, needs to be done cheaply. Exactly the profile where fine-tuning starts to make sense versus a frontier model.
Preliminary Evidence: Does It Actually Work?
I didn't want to just take Jackrong's word for it, so I ran the same prompt through both models locally - Unsloth's Qwen 3.5 27B and Jackrong's Qwen 3.5 27B (the fine-tuned variant trained on Claude Opus 4.6 interactions).
Here's what came back:
| Qwen 3.5 27B | Jackrong (fine-tuned with Opus 4.6) | Unsloth (base) |
|---|---|---|
| Thinking time | 48.98 seconds | 1 min 41 seconds |
| Tokens used | 713 | 2,766 |
| Speed | 5.21 tok/sec | 22.42 tok/sec |
The fine-tuned model used ~4x fewer tokens and finished thinking in roughly half the time. The thinking log itself was about 2 pages. Unsloth's ran to approximately 5 — looping through the same reasoning steps, restating intermediate conclusions, hedging before it hedged again.
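The ratios above fall straight out of the table; a quick back-of-the-envelope check:

```python
# Sanity-check the quoted ratios from the raw numbers in the table.
jackrong_tokens, unsloth_tokens = 713, 2766
jackrong_secs, unsloth_secs = 48.98, 60 + 41   # 1 min 41 s = 101 s

token_ratio = unsloth_tokens / jackrong_tokens  # ~3.9x fewer tokens
time_ratio = jackrong_secs / unsloth_secs       # ~0.49, i.e. roughly half
```

(The tok/sec column is the raw generation speed each run reported; the headline savings come from the token count and wall-clock time, not the per-token rate.)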
That's exactly the failure mode Jackrong was targeting. From the model card: the distillation specifically addresses Qwen 3.5's tendency toward excessive transitional or repetitive reasoning on simple queries.
Disclaimer: This is a small, informal test on one prompt. It's not a benchmark. But it is preliminary evidence that the fine-tuning did what it said on the tin: the model learned to stop earlier, not think more.
Mandatory reads
If this post sent you down the same rabbit hole it sent me, these are worth your time:
- Jackrong's LLM Fine-Tuning Guide: The companion repo to the Qwen distillation model that started all of this. Includes a PDF guide and an `.ipynb` notebook to get your hands dirty.
- "How We Trained the AI Slop Detector": The specific section of the README that walks through their fine-tuning process: dataset choices, teacher model setup, and how they got a 270M model to match a 120B one at 95% accuracy.
If you've been down this rabbit hole longer than me, do message me. I'd genuinely love to know what you're reading.
