
The LM Studio Model That Sent Me Down a Fine-Tuning Rabbit Hole
I was poking around LM Studio yesterday, sorted by most downloads, and something caught my eye. Not the official Qwen release. Not Unsloth's version. The top downloaded model for Qwen 3.5 27B was this: Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF
That last part - Claude-4.6-Opus-Reasoning-Distilled - piqued my curiosity. What does that even mean?
Paraphrasing the model description:
The creator used Chain-of-Thought interactions with Claude 4.6 Opus as training data, then distilled those reasoning patterns into a much smaller open-source model. More specifically, they were targeting a known weakness in Qwen 3.5 — its tendency to over-explain and loop through redundant reasoning steps. By imitating Claude's more structured thinking style, they got the model to adopt a leaner internal monologue: "Let me analyze this request carefully: 1, 2, 3" — and stop there, instead of spiraling.
This is fine-tuning in action. And while I had seen projects that do it (e.g. Distil Labs AI slop detector), I hadn't really thought hard about what that means until now.
Why now?
Honestly, frustration.
- Proprietary models keep moving the goalposts. They are either too expensive, too rate-limited, or both (Claude, I'm looking at you).
- Open models are more capable than ever but still hallucinate in ways that are hard to trust in production. With Qwen 3.5, Gemma 4, and others, the quality gap is closing, but it hasn't closed.
- The quality bar on inference set by the trillion-parameter proprietary SOTA giants makes it tempting to just stay dependent on them.
- Rolling your own deployment harness is very fragile. Duct-taping together your own inference stack means accumulating a class of bugs that only surface under load, at 2am, in front of real users.
Joel Spolsky's Duct Tape Programmer was operating on decades of accumulated tooling (compilers, debuggers, deployment pipelines) that had survived contact with reality for years. The inference stack for open models doesn't have that luxury. ChatGPT launched less than 3.5 years ago (as of this writing). The harness is young, and the ground keeps shifting under it - K/Q/IQ quantizations, TurboQuant, Skills, MCPs, LLMs in flash memory, self-distillation - techniques that didn't exist 3-4 years ago. These fundamental changes are arriving faster than the tooling can absorb them. You're not taping over cracks in a mature system. You're taping over cracks in a system that's still deciding what shape it wants to be.
Then I read a line from Aishwarya Srinivasan that reframed it cleanly:
the question for AI engineers has shifted from "Can an AI do this?" to "What's the most efficient way to deploy intelligence in production?".
That's exactly what I was thinking.
But there's a growing class of use cases (e.g. Aishwarya's examples like medical record classification, structured API response generation, legal contract parsing; or my own cases like keyword classifications, content writing/summarization, tone/slop detection, sentiment analysis etc.) where you don't need general intelligence.
You need narrow, fast, cheap and repeatable performance.
That's where fine-tuning makes sense. Open-source models have gotten good enough that their baseline performance is already high. Fine-tuning just points that baseline in a specific direction.
Where Fine-Tuning Sits in the Stack
Before getting into the methods, it helps to locate fine-tuning in the broader picture:
Pre-training is the massive upfront phase where the model learns language from enormous datasets. This is the heavy lifting - it gives the model foundational intelligence, and you don't touch it.
Fine-tuning is the subsequent, focused training on a much smaller, specialized dataset. You're taking that broadly capable model and sharpening it for a specific task, style, or domain. Crucially, this does update the model's weights.
RAG (Retrieval-Augmented Generation) is often contrasted with fine-tuning. Unlike fine-tuning, RAG doesn't change the model at all. Instead, it retrieves relevant information at inference time and feeds it alongside the user's prompt. Great for keeping models up to date - not so great for changing how they behave.
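The difference shows up right at the call site. Here's a minimal sketch, with a stubbed `generate` function standing in for any local or hosted model call (nothing below is a real API; the model names and retrieval logic are toy placeholders):

```python
def generate(prompt: str, model: str = "my-model") -> str:
    """Stub standing in for an actual LLM call (local or hosted)."""
    return f"[{model} output for: {prompt}]"

# RAG: the weights never change; you bolt retrieved context
# onto the prompt at inference time.
def rag_answer(question: str, knowledge_base: dict[str, str]) -> str:
    # Toy "retrieval": grab any document whose key appears in the question.
    retrieved = [doc for key, doc in knowledge_base.items()
                 if key.lower() in question.lower()]
    context = "\n".join(retrieved)
    return generate(f"Context:\n{context}\n\nQuestion: {question}")

# Fine-tuning: the prompt stays plain; the behavior change lives in the
# updated weights, i.e. a different checkpoint loaded at startup.
def finetuned_answer(question: str) -> str:
    return generate(question, model="my-model-ft")

kb = {"refund": "Refunds are processed within 14 days."}
print(rag_answer("What is your refund policy?", kb))
print(finetuned_answer("What is your refund policy?"))
```

Same question, two very different mechanisms: one changes what the model sees, the other changes what the model is.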
TL;DR: Pre-training gives you the base. RAG gives you fresh information. Fine-tuning changes the model's actual behavior.
The Fine-Tuning Methods (And the One That's Surprisingly Accessible)
A second blog post (from Rhitam Deb on Medium) lists the different methods used for fine-tuning:
- Instruction Fine-tuning
- Full Fine-tuning
- Parameter-Efficient Fine-tuning (PEFT)
- LoRA (Low-Rank Adaptation)
- QLoRA (Quantized LoRA)
For LoRA with MLX on Mac, Rhitam writes:
For instance, LoRA can reduce memory usage by 65% and enable much faster training. With LoRA, a 7-billion parameter model that previously used 45GB memory and ran at 95 tokens/second could run at 328 tokens/second and only use 17GB of memory. It’s a game-changer!
Fine-Tuning on a Mac Is Surprisingly Simple
What surprised me most was how approachable the actual process is — at least for LoRA fine-tuning with Apple's MLX framework. Paraphrasing the steps from Rhitam's blog post:
1. `pip install` some packages.
2. Create three JSONL files:
   - `train.jsonl` (for training the model)
   - `valid.jsonl` (for validating its learning progress)
   - `test.jsonl` (for a final evaluation)
3. Use `huggingface-cli` to download some open-source LLM model (base).
4. Run fine-tuning with `python -m mlx_lm.lora`.
5. Test the result with `python -m mlx_lm.generate`.
6. Fuse your custom weights back into the base model with `python -m mlx_lm.fuse`.
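The dataset-creation step is just writing line-delimited JSON. Here's a sketch using keyword-classification examples in the spirit of this post (the `{"text": ...}` schema is one of the formats `mlx_lm.lora` accepts, but check the schemas your mlx_lm version expects; the labels below are purely illustrative):

```python
import json
import random

# Toy labeled examples; in practice you'd generate these with a stronger
# teacher model (Claude, GPT, etc.) and clean them by hand.
examples = [
    {"keyword": "uppcl bill", "label": "relevant"},
    {"keyword": "electricity bill", "label": "relevant"},
    {"keyword": "the indian express", "label": "not relevant"},
    {"keyword": "qr code scanner", "label": "not relevant"},
] * 25  # pad the toy set so the split below has something to split

random.seed(42)
random.shuffle(examples)

def to_record(ex):
    # One flat instruction/answer pair per line, using the "text" schema.
    return {"text": f"Classify the keyword: {ex['keyword']}\nAnswer: {ex['label']}"}

n = len(examples)
splits = {
    "train.jsonl": examples[: int(n * 0.8)],              # bulk of the data
    "valid.jsonl": examples[int(n * 0.8): int(n * 0.9)],  # tracked during training
    "test.jsonl":  examples[int(n * 0.9):],               # held out for final eval
}

for filename, split in splits.items():
    with open(filename, "w") as f:
        for ex in split:
            f.write(json.dumps(to_record(ex)) + "\n")
    print(filename, len(split), "examples")
```

The 80/10/10 split is a common convention, not a requirement; what matters is that the validation and test sets never leak into training.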
In layman's terms: Jackrong's Qwen model, for instance, was trained on Claude Opus interactions. You could generate your own training-data JSONL the same way, using Claude or Codex in CLI mode.
That's essentially it. From there, you'd deploy with llama.cpp or something equivalent. The hard part is building a good training dataset, but proprietary SOTA models like Opus 4.x or GPT 5.x can help you with it.
A Real Example That Stuck With Me
A few weeks ago I saw a project that made the potential very concrete. A team fine-tuned a tiny 270M parameter model to detect AI-generated text. The thing runs entirely in a browser extension.
- The original/base model (Gemma 3 270M, out of the box) hit about 40% accuracy on their test set. Essentially random guessing.
- The fine-tuned version matched a 120B parameter teacher model (gpt-oss-120b) at around 95% accuracy — while being over 400 times smaller. The quantized version is 242MB and runs locally in Chrome.
That's the point. The general model didn't know how to do this task. The fine-tuned model does, and it costs almost nothing to run.
What I Want to Try With Fine-Tuning
All of this makes me want to test whether fine-tuning can solve a few specific problems I keep running into:
- Keyword classification: labeling content at scale without expensive API calls per request
- Summarization with a specific style: getting outputs that match a particular format consistently
- Sentiment analysis: especially for domain-specific texts (such as reviews for specific financial products) where open models underperform
- Slop detection: like the browser extension above, but tuned to a different parameter like "tone" or "plagiarism".
The pattern across all of these is the same: narrow task, high repeatable volume, needs to be done cheaply. Exactly the profile where fine-tuning starts to make sense versus a frontier model.
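Whether a fine-tune clears the bar on any of these comes down to a held-out test set and a dumb loop. A minimal sketch (the `classify` function is a stub standing in for a call to the fine-tuned local model, e.g. via mlx_lm or llama.cpp bindings; the examples and labels are made up):

```python
# Stub standing in for the fine-tuned model; a trivial keyword rule here.
def classify(text: str) -> str:
    return "positive" if "good" in text.lower() else "negative"

# Tiny held-out test set: (text, expected label).
test_set = [
    ("The fund performed well, good returns", "positive"),
    ("Hidden charges everywhere, avoid", "negative"),
    ("Good customer support experience", "positive"),
]

correct = sum(classify(text) == label for text, label in test_set)
accuracy = correct / len(test_set)
print(f"accuracy: {accuracy:.0%}")  # the number to track across fine-tuning runs
```

The same harness works for any of the tasks above; only the labels and the stub change.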
Preliminary Evidence: Does It Actually Work?
I didn't want to just take Jackrong's word for it, so I ran the same prompt through both models locally - Unsloth's Qwen 3.5 27B and Jackrong's Qwen 3.5 27B (fine-tuned variant using Claude Opus 4.6 interactions).
Prompt 1
Prompt
Girl barefoot, brother in oversized shoes. Why?
Key Stats
| Qwen 3.5 27B | Jackrong (fine-tuned with Opus 4.6) | Unsloth (stock) |
|---|---|---|
| Thinking time | 48.98 seconds | 1 min 41 seconds |
| Tokens used | 713 | 2,766 |
| Speed | 5.21 tok/sec | 22.42 tok/sec |
The fine-tuned model used ~4x fewer tokens and finished thinking in roughly half the time. The thinking log itself was about 2 pages. Unsloth's ran to approximately 5 — looping through the same reasoning steps, restating intermediate conclusions, hedging before it hedged again.
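The ratios are easy to sanity-check against the table (pure arithmetic, nothing model-specific):

```python
# Numbers taken straight from the table above.
jackrong_tokens, unsloth_tokens = 713, 2766
jackrong_secs, unsloth_secs = 48.98, 60 + 41  # 1 min 41 sec

print(f"token ratio: {unsloth_tokens / jackrong_tokens:.1f}x")      # ~3.9x fewer tokens
print(f"thinking-time ratio: {unsloth_secs / jackrong_secs:.1f}x")  # ~2.1x less wall time
```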
Inference/output
As for the inference, these were the outputs from both variants:
That's exactly the failure mode Jackrong was targeting. From the model card: the distillation specifically addresses Qwen 3.5's tendency toward excessive transitional or repetitive reasoning on simple queries.
Prompt 2
Prompt
which of the following keywords are relevant to bajaj finserv's business:
uppcl
airtel customer care number
atm
jio customer care contact number
jio customer care number
cng pump
Jio recharge
nbpdcl
airtel store
sbpdcl
uppcl bill
up power bill
airtel recharge
qr code scanner
the indian express
apspdcl
airtel recharge plan
electricity bill
jio recharge plan
jio recharge pack
jio plans for recharge
recharge plan of jio
reliance jio recharge plans
bsnl recharge plan
bsnl recharge pack
msedcl
jio fiber plan
dhbvn
payment
uppcl online
adani electricity bill download
life insurance corporation of india pay online
lic premium payment
lic policy online payment
tneb online payment
tamil nadu electricity board online payment
tamilnadu electricity bill online payment
jio fiber
tangedco
tneb online bill payment
life insurance corporation premium payment
bsnl customer care number
bsnl care number
bsnl helpline number
bijli bill
bsnl recharge
axis bank customer care
netflix subscription
uppcl bill pay
Key Stats
| Qwen 3.5 27B | Jackrong (fine-tuned with Opus 4.6) | Unsloth (stock) |
|---|---|---|
| Thinking time | 1 min 47 sec | 12 min 2 sec |
| Tokens used | 1,436 | 3,829 |
| Speed | 4.35 tok/sec | 4.88 tok/sec |
| Output | Fully processed | Truncated (hit 4096 max context limit) |
The stock Unsloth variant thought for 12 minutes and 2 seconds before producing output — a painful turnaround for anything resembling interactive use, and a poor return on GPU/CPU time when you measure tasks-per-minute. The Jackrong fine-tune trimmed that down to 1 minute 47 seconds: the same reasoning loop, but far more disciplined about when to stop deliberating.
Also, Unsloth's variant consumed 3,829 tokens and ran straight into the 4,096 context ceiling, truncating output midway. The fine-tune used only 1,436 tokens and completed its response fully within the same limit. In local inference where context is technically "free", this might seem academic - but on long-running or chained tasks, bloated context windows accumulate stale reasoning and degrade output quality over time. Leaner context is cleaner context.
Inference/output
Disclaimer: These are two small, informal tests on two prompts. These are not conclusive. These are not benchmarks. But it is preliminary evidence that the fine-tuning did what it said on the tin: the model learned to stop earlier, not think more.
Mandatory reads
If this post sent you down the same rabbit hole it sent me, these are worth your time:
- Jackrong's LLM Fine-Tuning Guide: The companion repo to the Qwen distillation model that started all of this. Includes a PDF guide and a `.ipynb` notebook to get your hands dirty.
- "How We Trained the AI Slop Detector": The specific section of the README that walks through their fine-tuning process: dataset choices, teacher model setup, and how they got a 270M model to match a 120B one at 95% accuracy.
Optional Reads
I haven't gone through these resources myself (yet!). My nanobot (an openclaw alternative) found them using the github-search skill, so take them with a grain of salt. These are more like bookmarks for myself, if and when I read them in the future.
- README.md of this project: A Complete Guide to Fine-tuning Qwen2.5-Coder for Chinese Sentiment Analysis: I have not read this one yet, so I'm not sure how much understanding of Chinese language is needed.
- AmirhosseinHonardoust/Sentiment-Analysis-BERT: End-to-end sentiment analysis of tweets using BERT. Includes preprocessing, training, and evaluation with classification reports, confusion matrices, ROC curves, and word clouds. Demonstrates fine-tuning of transformer models for text classification with modular, reproducible code.
- indranil143/Mental-Health-Sentiment-Analysis-using-Deep-Learning: A deep learning project using fine-tuned RoBERTa to classify mental health sentiments from text, aiming to provide early insights and support. ⚕️❤️
If you've been down this rabbit hole longer than me, do message me. I'd genuinely love to know what you're reading.