Anshad Ameenza.
Technology · Updated: Jun 21, 2026

Reinforcement Fine-Tuning in 2026: Train a Small Model to Beat a Giant One (GRPO, RULER, ART)

A technical guide to reinforcement fine-tuning in 2026: why a fine-tuned small open model beats a giant one, how GRPO and RULER let agents learn from experience with no reward functions or labels, and the open-source stack (ART, Unsloth, Tinker) to do it.


Every team building with language models hits the same wall, and it shows up at almost exactly the same place. You write a careful system prompt. You add a handful of few-shot examples. You tune the temperature, add a retry, maybe a second model to check the first. And your agent still gets it wrong on something like a third of real inputs. You fix one failure and two new ones appear. The worst part is not the error rate. It is that the agent makes the same mistake on Monday that it made on Friday, because nothing about a prompt learns.

That wall is a prompting problem, and you cannot prompt your way past it. The thing on the other side is training, and in 2026 training is no longer the painful, PhD-flavored project it used to be.

Prompting tells a model what to say. Fine-tuning teaches it how to succeed. Past a certain point, only one of those keeps improving.

The shift worth internalizing

Here is the part that surprises people. You do not need a labeled dataset, and you no longer need to hand-write a reward function. You can take a small open model and let it get better at your specific task by trial and error, judged automatically, until it quietly beats a general model many times its size on the one thing you actually care about. Let me walk through how that works and what to use.

Why a small fine-tuned model beats a giant general one

If you build on GPT or Claude through an API, you are running the same weights as everyone else: same capability ceiling, same per-token cost, and no moat. That is fine for a prototype and a problem for a product. The asymmetry you can actually own comes from specialization. A small open model, a few billion parameters, tuned on your narrow task, routinely matches or beats a far larger general model on that task, at a fraction of the cost and latency, because it is not carrying the weight of being good at everything.

The reason this is suddenly practical is two advances that landed close together.

The first is LoRA, low-rank adaptation, which freezes the original weights and trains a small set of extra low-rank weights alongside them. For a long time the conventional wisdom was that LoRA was a compromise: cheaper, but worse than full fine-tuning. Thinking Machines Lab’s “LoRA Without Regret” work pushed back on that hard, showing that LoRA applied across all the layers can match full fine-tuning for most practical cases, and that it is especially well-suited to reinforcement learning, where the amount of new information the model needs to absorb is small. The practical upshot is that, for a small model and a narrow task, you can often get full-fine-tuning quality on a single GPU.
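To put numbers on "a small set of extra weights", here is a minimal sketch of the parameter arithmetic behind LoRA for a single linear layer. The layer size (4096 by 4096) and the rank (16) are illustrative assumptions, not figures from any particular model or from the work cited above.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one linear layer.

    Full fine-tuning updates the whole d_out x d_in weight matrix W.
    LoRA freezes W and trains two thin matrices, B (d_out x rank) and
    A (rank x d_in), whose product B @ A is added to W.
    """
    full = d_out * d_in
    lora = d_out * rank + rank * d_in
    return full, lora


# Illustrative sizes only (assumed, not taken from a specific model).
full, lora = lora_param_counts(d_in=4096, d_out=4096, rank=16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

Under these assumed sizes the adapter trains well under one percent of the layer's weights, which is where the memory savings come from.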

The second is the tooling that makes LoRA training fast and cheap, and Unsloth is the one to know. It rewrites the hot paths of training to cut memory use and roughly double throughput, which is what turns “fine-tune a model” from a cluster job into something you run in a notebook. Most of the reinforcement-learning frameworks below lean on it under the hood.

Two ways to teach a model

There are two fundamentally different ways to train, and knowing which you need saves you a lot of wasted effort.

Supervised fine-tuning (SFT) is imitation. You collect input and output pairs, and the model learns to reproduce them. It is the textbook approach, in the literal sense: it memorizes answers to known questions. SFT is excellent when you have a pile of correct examples and you want the model to copy a style, a format, or a fixed mapping.

But for an agent that searches, calls tools, and reasons across several steps before it answers, imitation is not enough. There is rarely one correct trajectory to copy, and the thing you care about is the outcome, not whether the model matched a reference path. That calls for reinforcement fine-tuning (RFT): instead of showing the model the answer, you give it a reward signal and let it discover strategies that work through trial, error, and feedback. SFT is studying for the exam. RFT is the job, where you get better by doing the work and seeing what lands.

There is a useful middle option worth knowing: DPO (direct preference optimization) learns from pairs of “this answer is better than that one,” which is lighter than full RL when you happen to have preference data. But for agents that act in an environment, the real unlock is reinforcement learning, and the algorithm that made it accessible is GRPO.
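For readers who want to see what "learning from pairs" means mechanically, here is a minimal sketch of the DPO loss for one preference pair. The log-probability values you would pass in come from your own model and a frozen reference copy; the function name and the default `beta` are assumptions chosen for illustration.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for a single (chosen, rejected) preference pair.

    Each argument is the summed log-probability of a whole response under
    either the model being trained (policy) or a frozen reference copy.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written as log(1 + exp(-x)); fine for moderate x.
    return math.log1p(math.exp(-logits))
```

The loss falls as the policy raises the preferred answer's probability relative to the reference and lowers the rejected one's, with no sampling and no reward model in the loop.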

How GRPO actually works

GRPO, Group Relative Policy Optimization, is the most widely used reinforcement fine-tuning algorithm today, and it is the same method that produced DeepSeek-R1’s reasoning. Older approaches like PPO need a second neural network, a “critic,” trained alongside the model just to estimate how good each response is. That critic is expensive and fiddly. GRPO’s trick is to delete it.

Instead of scoring each answer against a learned absolute, GRPO generates a group of answers to the same prompt and grades them against each other. The group becomes its own baseline.

Figure: GRPO grades a group of completions against their own average, then pushes the model toward the above-average ones. One prompt, four sampled completions, relative advantage as the training signal.

Completion A: 0.30 (below average, suppress)
Completion B: 0.50 (about average)
Completion C: 0.70 (above average, reinforce)
Completion D: 0.40 (below average, suppress)

Group average ≈ 0.475. Advantage = score minus the group average. Only relative scores matter: scores of 3/5/7 or 30/50/70 train identically once advantages are normalized within the group.
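The arithmetic behind the figure can be checked in a few lines. This sketch uses the four scores shown there and computes both the mean-centered advantage and the version divided by the group's standard deviation, which is how GRPO is usually formulated; the variable names are my own.

```python
from statistics import fmean, pstdev

# Scores from the figure: completions A, B, C, D.
scores = {"A": 0.30, "B": 0.50, "C": 0.70, "D": 0.40}

values = list(scores.values())
mean = fmean(values)
std = pstdev(values)

# Mean-centered advantage: score minus the group average.
centered = {k: s - mean for k, s in scores.items()}
# Standard GRPO additionally divides by the group's standard deviation.
normalized = {k: c / std for k, c in centered.items()}

for k in scores:
    print(
        f"{k}: score={scores[k]:.2f}  "
        f"centered={centered[k]:+.3f}  normalized={normalized[k]:+.2f}"
    )
```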

The loop for each prompt is four steps.

Sample a group

Generate N completions from the current model for the same prompt. These are the candidates that will be compared.

Score each one

A reward signal evaluates every completion. This can be a test passing, a tool call succeeding, or a judge’s ranking (more on that next).

Normalize within the group

Compute each completion’s advantage as its score relative to the group average, typically divided by the group’s standard deviation, rather than as an absolute number.

Update the model

Nudge the weights to make above-average behavior more likely and below-average behavior less likely, then repeat.
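The four steps above can be sketched end to end on a toy problem. This is a deliberately simplified illustration, not a production GRPO implementation: the "model" is a softmax over four canned completions with fixed rewards (all hypothetical), and the update is a plain policy-gradient step without the clipping and KL terms that real trainers add.

```python
import math
import random
from statistics import fmean, pstdev

# Toy stand-in for a language model: a softmax policy over four canned
# "completions", each with a fixed reward. Everything here is hypothetical.
REWARDS = [0.30, 0.50, 0.70, 0.40]


def softmax(logits: list[float]) -> list[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grpo_step(logits, rng, group_size=8, lr=0.5):
    probs = softmax(logits)
    # 1. Sample a group of completions for the same prompt.
    group = rng.choices(range(len(logits)), weights=probs, k=group_size)
    # 2. Score each one.
    rewards = [REWARDS[a] for a in group]
    # 3. Normalize within the group.
    mean, std = fmean(rewards), pstdev(rewards)
    if std < 1e-12:  # identical scores carry no learning signal
        return logits
    advantages = [(r - mean) / std for r in rewards]
    # 4. Update: for a softmax policy, d log p(a) / d logit_k = onehot - p_k.
    new_logits = list(logits)
    for action, adv in zip(group, advantages):
        for k in range(len(logits)):
            grad = (1.0 if k == action else 0.0) - probs[k]
            new_logits[k] += lr * adv * grad / group_size
    return new_logits


rng = random.Random(0)
logits = [0.0, 0.0, 0.0, 0.0]
for _ in range(300):
    logits = grpo_step(logits, rng)
print([round(p, 3) for p in softmax(logits)])
```

Probability mass moves toward the completion with the highest reward, even though the policy never sees an absolute target, only how each sample compared with its group.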

The quiet superpower here is that GRPO only needs relative scores. Whether the group scores 0.3, 0.5, 0.7 or 30, 50, 70 changes nothing, because the standard formulation subtracts the group mean and divides by the group standard deviation, so the scale of the reward cancels out. Hold that thought, because it is exactly what makes the next piece possible.
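The scale-invariance claim can be derived in two lines. This assumes the standard GRPO advantage (subtract the group mean, divide by the group standard deviation) and an arbitrary rescaling of every reward in the group, r_i' = a·r_i + b with a > 0; the symbols are my own notation.

```latex
\hat{A}_i = \frac{r_i - \mu}{\sigma}, \qquad
\mu = \frac{1}{N}\sum_{j=1}^{N} r_j, \qquad
\sigma = \sqrt{\frac{1}{N}\sum_{j=1}^{N} (r_j - \mu)^2}

r_i' = a\,r_i + b \;\Rightarrow\; \mu' = a\,\mu + b, \quad \sigma' = a\,\sigma
\;\Rightarrow\;
\hat{A}_i' = \frac{(a\,r_i + b) - (a\,\mu + b)}{a\,\sigma}
           = \frac{r_i - \mu}{\sigma} = \hat{A}_i
```

Going from 0.3, 0.5, 0.7 to 30, 50, 70 is the case a = 100, b = 0. Strictly, what cancels is a shift or rescale of the whole group: the spacing between scores inside a group still shapes the advantages, so a judge that ranks consistently is what matters, not one that is calibrated.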

Killing the reward-function problem with RULER

Here is the part everyone dreads, and the reason most teams never try reinforcement learning. Designing a good reward function has always been the hardest part of RL. An email agent needs labeled “correct” replies. A coding agent needs a test suite. A research agent needs some way to score a multi-step trajectory. Each one is its own engineering project, and a brittle one, because the moment your reward is slightly wrong the model learns to game it.

RULER (Relative Universal LLM-Elicited Rewards), from the team behind ART, removes that bottleneck by using an LLM as the judge. It does not score answers in isolation. It looks at several attempts at the same task and ranks them, which works because of two observations that anyone who has used a model as a grader will recognize.

The first: asking a model to “rate this from 0 to 10” gives noisy, inconsistent results that drift between calls. The second: asking “which of these four attempts best achieved the goal?” is far more stable, because relative judgment is something language models are genuinely good at. And since GRPO only uses each score relative to the rest of its group, the judge’s absolute numbers never need to be meaningful.

RULER-style judge

The shape of the prompt that replaces a reward function

You are grading attempts by an agent to complete a task.

Task: [the goal the agent was given]

Here are 4 full trajectories (the agent’s actions, tool calls, and final answer):
[trajectory 1]
…
[trajectory 4]

Rank them from best to worst by how well each actually achieved the task. Reward correct outcomes, efficient tool use, and faithful answers. Penalize wrong results, wasted steps, and unsupported claims. Return a score from 0 to 1 for each. Relative order is what matters.

The whole process becomes three steps: generate N trajectories for a scenario, pass them to the judge to score from 0 to 1, and feed those scores straight into GRPO as the reward. No reward function to write. No labeled data to collect. You have replaced the hardest part of RL with a prompt.
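As a rough sketch of those three steps in code: the functions below build a judge prompt from a group of trajectories and turn the judge's reply into rewards. This is not ART's actual RULER implementation. The function names, the JSON reply format, and the offline `fake_judge` are all assumptions made so the example runs without a model.

```python
import json
from typing import Callable


def build_judge_prompt(task: str, trajectories: list[str]) -> str:
    """Assemble one prompt that shows the judge every attempt side by side."""
    numbered = "\n\n".join(
        f"[trajectory {i}]\n{t}" for i, t in enumerate(trajectories, start=1)
    )
    return (
        "You are grading attempts by an agent to complete a task.\n\n"
        f"Task: {task}\n\n"
        f"Here are {len(trajectories)} full trajectories:\n\n{numbered}\n\n"
        "Score each from 0 to 1 by how well it achieved the task. "
        "Relative order is what matters. "
        'Reply with JSON only, e.g. {"scores": [0.2, 0.9]}.'
    )


def judge_rewards(
    task: str, trajectories: list[str], judge: Callable[[str], str]
) -> list[float]:
    """Return one reward per trajectory, clamped to [0, 1]."""
    reply = judge(build_judge_prompt(task, trajectories))
    scores = json.loads(reply)["scores"]
    if len(scores) != len(trajectories):
        raise ValueError("judge returned the wrong number of scores")
    return [min(1.0, max(0.0, float(s))) for s in scores]


# Stand-in judge so the sketch runs offline; swap in a real LLM call.
def fake_judge(prompt: str) -> str:
    return json.dumps({"scores": [0.3, 0.5, 0.7, 0.4]})


rewards = judge_rewards(
    "Find the invoice total",
    ["attempt 1", "attempt 2", "attempt 3", "attempt 4"],
    fake_judge,
)
print(rewards)
```

Scoring the whole group in a single call is the design choice that matters: the judge sees the attempts side by side, so its numbers are comparable within the group even if they would drift between separate calls.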

The tooling: ART and the rest of the stack

GRPO and RULER are ideas. ART (Agent Reinforcement Trainer) is the open-source framework that applies them to a real agent written in plain Python. Most RL libraries assume a simple shape: one input, one output, done. Real agents search documents, call APIs, and reason across many turns before they finish, and ART is built for that.

Its architecture splits cleanly in two, which is the part worth understanding because it is how training stays out of your application code.

ART · client and backend

Client (your agent)

Your normal agent code. Sends inference requests, takes actions in the environment, and records every step into a Trajectory, the full history of one run.

Backend (the heavy lifting)

Runs vLLM for fast inference and Unsloth-powered GRPO for training. After each step a new LoRA checkpoint loads into the inference server automatically.

The full training loop reads like a conversation between the two halves: the client requests an inference, the backend generates the output, the agent acts in its environment, the environment returns a reward (often from RULER), the trainer updates the model with GRPO, a fresh LoRA checkpoint loads, and the cycle repeats with the model a little better each time.
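The division of labor is easier to see in code. The sketch below is a toy model of the split, not ART's API: `Trajectory`, `ToyBackend`, and `rollout` are hypothetical names, the backend's "training" only bumps a checkpoint counter, and the rewards are placeholders standing in for a judge's scores.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Full history of one agent run, plus the reward it earned."""
    steps: list[str] = field(default_factory=list)
    reward: float = 0.0


class ToyBackend:
    """Hypothetical stand-in for the inference-plus-training server."""

    def __init__(self) -> None:
        self.checkpoint = 0  # which LoRA checkpoint is currently loaded

    def generate(self, prompt: str) -> str:
        return f"[ckpt {self.checkpoint}] reply to: {prompt}"

    def train(self, group: list[Trajectory]) -> None:
        # A real backend would run a GRPO update here; the toy only
        # records that a fresh checkpoint replaced the old one.
        self.checkpoint += 1


def rollout(backend: ToyBackend, task: str) -> Trajectory:
    """Client side: ordinary agent code that records every step."""
    traj = Trajectory()
    traj.steps.append(f"user: {task}")
    traj.steps.append(f"assistant: {backend.generate(task)}")
    return traj


backend = ToyBackend()
for iteration in range(3):
    group = [rollout(backend, "summarize ticket 17") for _ in range(4)]
    for i, traj in enumerate(group):
        traj.reward = i / 3  # placeholder for a judge's score
    backend.train(group)

print(backend.checkpoint)
```

The point of the shape is that `rollout` knows nothing about training: it only asks for generations and records what happened, so the same agent code can run against a training backend or a plain inference endpoint.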

ART is the agent-focused option, but it sits in a healthy ecosystem, and the right tool depends on whether you are doing SFT, preference tuning, or full RL. These are the repositories I would actually keep open in a tab.

Project | Repo | Use it for
ART | OpenPipe/ART | GRPO + RULER for multi-step, tool-using agents in Python
Unsloth | unslothai/unsloth | Fast, low-memory LoRA/QLoRA and GRPO training on one GPU
TRL | huggingface/trl | The standard SFT, DPO, and GRPO trainers from Hugging Face
Axolotl | axolotl-ai-cloud/axolotl | Config-driven fine-tuning across many models and methods
LLaMA-Factory | hiyouga/LLaMA-Factory | Broad model and method coverage with a friendly UI
torchtune | pytorch/torchtune | Native PyTorch fine-tuning recipes, no heavy abstractions
vLLM | vllm-project/vllm | The fast inference engine most of these use under the hood
DeepSeek-R1 | deepseek-ai/DeepSeek-R1 | The reference point for what GRPO-trained reasoning looks like

One more worth calling out separately, because it comes at the problem from the other direction. Tinker, from Thinking Machines Lab, is a managed fine-tuning service: you write the training loop, including LoRA and reinforcement learning, and it handles the distributed GPU machinery, with an open cookbook of recipes to start from. If ART and Unsloth are the “run it yourself on a notebook” path, Tinker is the “keep the control of a real training loop without owning the cluster” path. Both are reasonable, and which you pick is mostly about whether you want to manage infrastructure.

A worked example: teach a 3B model to use any MCP server

To make this concrete, the clearest demonstration I have seen is an ART notebook that trains a 3-billion-parameter model to master any MCP server through reinforcement learning, with no hand-labeled data at all. You point it at an MCP server URL and the notebook does the rest.

Discover the tools

It queries the MCP server for the tools it exposes, so the training is grounded in that server’s actual capabilities.

Generate its own tasks

It synthesizes a set of input tasks that exercise those tools, which becomes the training distribution. No dataset to collect.

Train with automatic judging

It runs the agent on those tasks, has RULER rank the trajectories, and feeds the rankings into GRPO. Each cycle, the small model gets better at driving that specific server.
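The second step, generating tasks from the discovered tools, is the least familiar, so here is a small sketch of what it could look like. This is not the notebook's code. The tool dictionaries mimic the `name` and `description` fields that an MCP tool listing carries, while the two example tools and the prompt wording are invented for illustration.

```python
def task_generation_prompt(tools: list[dict], n_tasks: int = 5) -> str:
    """Turn a tool listing into a request for synthetic training tasks."""
    lines = [
        f"- {t['name']}: {t.get('description', 'no description')}" for t in tools
    ]
    return (
        "An agent can call the following tools:\n"
        + "\n".join(lines)
        + f"\n\nWrite {n_tasks} realistic user tasks that require these tools, "
        "one per line, varying in difficulty."
    )


# Hypothetical tool listing in the shape of an MCP tools/list result.
tools = [
    {"name": "search_issues", "description": "Search issues by keyword"},
    {"name": "create_issue", "description": "Open a new issue"},
]
print(task_generation_prompt(tools, n_tasks=3))
```

The model's reply to a prompt like this, split into lines, would serve as the scenario list that the training loop samples from.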

The result is a 3B model that is genuinely good at one server’s tools, cheap to run, and fast, trained without a labeled dataset or a reward function. That is the whole thesis in one notebook: take something small, let it practice against automatic feedback, and end up with a specialist that punches far above its parameter count. The ART repository has this and other examples to adapt, and it is worth a star if it saves you the week it would have taken to wire this up yourself.

When not to fine-tune

Since this is the part the enthusiastic threads skip: do not start here. Fine-tuning earns its keep when you have a narrow, repeated, measurable task and enough real traffic to learn from. Before that point, a good prompt, retrieval, and tool use will get you most of the way, and they are faster to change when you are still figuring out what the task even is. The honest sequence is to prompt until it plateaus, instrument so you can measure the plateau, and only then reach for training, at which point everything above turns a stubborn 70 percent into something that climbs on its own.
