Fine-Tuning Methods Guide: SFT, DPO, and Beyond
Fine-tuning is a toolbox of SFT, DPO, reinforcement fine-tuning, and vision fine-tuning; pick the method by your goal (memorization vs generalization, explicit behavior, reasoning with graders, or robust augmentation) rather than defaults.
Fine-tuning has gotten a lot more sophisticated over the last couple of years. That’s cool—but it’s also why it scares people. More knobs, more options, more ways to do it “wrong,” and then you end up saying fine-tuning doesn’t work.
I want to simplify it. There are a handful of major ways to fine-tune, and once you understand what each one is actually good at, it stops feeling mystical and starts feeling like engineering.
Here are the big categories: supervised fine-tuning (SFT), direct preference optimization (DPO), reinforcement fine-tuning (especially for reasoning), and vision fine-tuning.
Supervised Fine-Tuning (SFT): The “give it examples” method
The simplest form is supervised fine-tuning. It’s basically: give the model a bunch of data, and let it learn from the examples.
In the modern chat format, that usually means your training data looks like:
- a user message (input)
- an assistant message (ideal output)

```json
{
  "messages": [
    {"role": "user", "content": "Summarize this incident report for leadership."},
    {"role": "assistant", "content": "Executive summary: ..."}
  ]
}
```

But it doesn’t have to be that. The point is just “input → target output.”
Two common ways people use SFT
- Teaching facts / knowledge. If you just want the model to learn things about the world—facts, domain knowledge, proprietary info—you can put that content into the assistant side of the training examples and let the model learn it.
- Teaching behavior / responses. If you want it to learn how to respond to a user, you provide the user question (or instruction) and then the ideal assistant response. That way it learns: “When I see something like this, I respond like that.”
Memorization vs. generalization (this matters a lot)
A lot of the confusion around SFT comes from the fact that people don’t decide what they’re trying to do:
- If you want memorization, you give it a lot of the same data repeated over and over. You want it to overfit. You want it to lock in on specific outputs.
- If you want generalization, you give it many different examples with variation between them so it learns the overall gestalt and can handle new cases.
This is also where people get burned by defaults. Default training settings can be kind of dangerous because people assume “the backend will figure it out.” Then they train, don’t get the result they wanted, and conclude fine-tuning doesn’t work.
But fine-tuning isn’t magic. It depends on whether you’re trying to make the model memorize or generalize—and you should set things up accordingly.
The most practical advice I can give: sit down with ChatGPT and say, “Here’s what I’m trying to do. Do I need memorization or generalization? How should I structure my data?” That conversation alone will save you a lot of pain.
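To make the memorization-vs-generalization choice concrete, here’s a minimal sketch of preparing a chat-format JSONL training file. The helper name `build_sft_file` and the `repeats` knob are my own illustration, not any provider’s API—the idea is just that repeating identical data leans toward memorization, while varied examples lean toward generalization:

```python
import json
import os
import tempfile

def build_sft_file(examples, path, repeats=1):
    """Write chat-format (user, assistant) pairs to a JSONL training file.

    repeats > 1 leans toward memorization (the model sees identical data
    many times); repeats == 1 with varied examples leans toward
    generalization.
    """
    with open(path, "w") as f:
        for _ in range(repeats):
            for user_msg, assistant_msg in examples:
                record = {"messages": [
                    {"role": "user", "content": user_msg},
                    {"role": "assistant", "content": assistant_msg},
                ]}
                f.write(json.dumps(record) + "\n")

examples = [
    ("Summarize this incident report for leadership.",
     "Executive summary: ..."),
]
path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
build_sft_file(examples, path, repeats=3)  # 3 repeats -> memorization-leaning
```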
Direct Preference Optimization (DPO): “This, not that”
DPO is one of my favorite options because it’s so direct.
Instead of just giving the model a bunch of “good” examples and hoping it figures out what’s good and what’s bad, you give it pairs:
- a good answer (positive example)
- a bad answer (negative example)
And you’re basically telling it: this, not that.
Practical Example: DPO Preference Pair

```json
{
  "prompt": "Respond to an upset customer whose order is late.",
  "chosen": "I’m sorry this happened. Here’s what I can do right now...",
  "rejected": "Calm down. Shipping delays happen."
}
```
Why DPO is powerful
It helps the model learn much more quickly because the contrast is explicit. You can use it for:
- Tone and style: “Be helpful, but don’t be rude.”
  - Positive: a helpful, respectful answer
  - Negative: a snarky or dismissive one
- Accuracy corrections: if the model tends to give a wrong default answer
  - Positive: the correct answer
  - Negative: the typical incorrect answer the model wants to produce
DPO is a very clean way to push behavior in a particular direction without having to guess whether your pile of examples will implicitly teach the difference.
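Under the hood, the contrast is baked into the loss itself. Roughly, the per-pair DPO objective penalizes the model unless it prefers the chosen answer over the rejected one by more than a frozen reference model does (log-probs here are sums over the answer’s tokens, and `beta` controls how hard you push):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    margin = how much more the policy prefers chosen over rejected,
    relative to the reference model. Written with log1p for stability.
    """
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-beta * margin))
```

When the margin is zero (no preference learned yet) the loss sits at log 2; as the model separates “this” from “that,” the loss falls toward zero.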
Reinforcement Fine-Tuning: How you train reasoning models
Reinforcement fine-tuning is how you train reasoning models, and it’s also the one most people avoid because it’s trickier.
With SFT and DPO, the training signal is basically “aim at this output” or “prefer this output over that output.”
With reinforcement fine-tuning, you’re doing something different: you’re training the model using a grader—a scoring function—that evaluates the model’s output.
The core idea: the model tries, gets a score, and tries again
This is the key difference and I should probably say it as plainly as possible:
In reinforcement fine-tuning, the model doesn’t just answer and move on. It answers, gets graded, and then it can try different ways to improve the score—again and again—until it finds something that scores better.
That “loop until the score improves” is what makes it so useful for reasoning.
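To make the loop concrete, here’s a toy version: sample several candidates, grade each, keep the best. To be clear, this is best-of-n sampling, not the actual RL weight update—real reinforcement fine-tuning also pushes the model’s weights toward high-scoring samples—but the try-grade-retry shape is the same:

```python
import random

def grade(text, target_words=100):
    # Higher score the closer the output is to the target word count.
    return -abs(len(text.split()) - target_words)

def best_of_n(generate, grade, n=8):
    """Toy try-grade-retry loop: sample n candidates, keep the best scorer."""
    best, best_score = None, float("-inf")
    for _ in range(n):
        candidate = generate()
        score = grade(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

# Stand-in "model": emits summaries of random length around 100 words.
rng = random.Random(0)
gen = lambda: " ".join(["word"] * rng.randint(80, 120))
out, score = best_of_n(gen, grade)
```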
What the grader looks like in practice
A grader is often just a Python script that takes the model output and returns a score. You can grade on all sorts of things:
Practical Example: Reinforcement Grader Spec
Task output: legal summary
Score components:
- format_valid (0/1)
- cites_required_sections (0-2)
- avoids_disallowed_claims (0-2)
- under_word_limit (0/1)
Total score: 0-6
- Simple formatting constraints. Example: “Output exactly 100 words.”
  - Count the words
  - Score it higher the closer it is to 100
  - Define a scoring scale (like 1–5)
- Keyword or content checks. Example: “Did it include the right terms?”
  - Look for keywords or phrases
  - Assign partial credit when it gets some of them
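Both checks fit in a few lines of Python. Here’s a minimal sketch of a combined grader—the function name, the 0–2 scale, and the scoring weights are my own illustration, not a standard API:

```python
def grade_output(text, required_terms, target_words=100):
    """Score an output on two components, 0..2 total.

    - length component (0..1): higher the closer to target_words
    - keyword component (0..1): partial credit for required terms found
    """
    n_words = len(text.split())
    length_score = max(0.0, 1.0 - abs(n_words - target_words) / target_words)
    found = sum(term.lower() in text.lower() for term in required_terms)
    keyword_score = found / len(required_terms) if required_terms else 1.0
    return length_score + keyword_score
```

A reinforcement fine-tuning run would call something like this on every sample the model produces and feed the score back as the training signal.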
Designing graders is tricky, but worth it
This part is weirdly like designing a test you then slip under the door: you won’t be in the room to judge the answers, so the grading script has to evaluate them correctly on its own.
You have to be clever about what you can measure reliably.
An example straight out of OpenAI docs is something like:
- “Here’s a legal case. Find the relevant passage.” Your grader can check whether the model included the relevant passage (or key parts of it) and score accordingly.
I also helped a company with a medical diagnosis grader where there were multiple plausible diagnoses, but some were clearly better than others. We scored each diagnosis option—best diagnosis got the highest score, worst got the lowest. That’s what you want for reasoning: not just “right/wrong,” but “closer/farther,” because the model can use that gradient to improve.
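That “closer/farther” scoring can be as simple as a rubric dictionary. A sketch of the idea—the condition names and scores here are invented for illustration, not the actual rubric from that project:

```python
# Hypothetical rubric: every plausible diagnosis gets a graded score,
# so the model learns "closer vs. farther," not just "right vs. wrong."
RUBRIC = {
    "bacterial pneumonia": 1.0,  # best-supported diagnosis
    "viral pneumonia": 0.6,      # plausible but weaker fit
    "bronchitis": 0.3,           # related but clearly worse
}

def grade_diagnosis(answer, rubric=RUBRIC):
    """Award the score of the best-valued diagnosis mentioned; 0 otherwise."""
    answer = answer.lower()
    scores = [score for diagnosis, score in rubric.items()
              if diagnosis in answer]
    return max(scores, default=0.0)
```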
Reinforcement fine-tuning is underexplored
I think reinforcement fine-tuning is one of the most underexplored areas in fine-tuning right now. Graders are hard, but they’re really, really worth it when you need the model to learn how to think through problems, not just mimic outputs.
Vision Fine-Tuning: Training models to recognize images
Outside of text, there’s vision fine-tuning—training a model to recognize images.
This was actually the first kind of fine-tuning I ever did, well before the GPT-4 Vision days: my first vision model was a shark detector I built for a Discovery Channel Shark Week special.
The big trick in vision: augmentation and realism
With vision, you can take one labeled image and create lots of variations to simulate real-world conditions:
- Rotate the image (left/right, slight angles)
- Add shadows or masks (simulate lighting changes)
- Adjust color
- Crop in different places
This is different from text. With text you can rewrite or swap some words, but with images you can do a lot to the same image and quickly expand the dataset.
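Those transformations are each a few lines of code. Real pipelines use libraries like Pillow or torchvision for this; as a dependency-free sketch, here’s the same idea on a grayscale image represented as a list of pixel rows:

```python
def rotate90(img):
    """Rotate a grayscale image (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def crop(img, top, left, height, width):
    """Off-center crop: take a sub-window instead of always the center."""
    return [row[left:left + width] for row in img[top:top + height]]

def adjust_brightness(img, delta):
    """Simulate lighting changes by shifting pixel values (clamped 0..255)."""
    return [[max(0, min(255, p + delta)) for p in row] for row in img]

def augment(img):
    # One labeled image -> several training variants.
    return [
        rotate90(img),
        crop(img, 1, 0, len(img) - 1, len(img[0]) - 1),
        adjust_brightness(img, 40),
        adjust_brightness(img, -40),
    ]
```

Each variant keeps the original label, so one annotated image becomes several training examples for free.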
Centered training images will betray you
One of the most common mistakes: people train on perfectly centered images, then plug the model into a real camera feed where the subject is never perfectly centered.
So when you augment your data, don’t just think “how can I distort this image?” Think: how would this subject actually appear in the frame in the real world?
The classic failure mode: learning the wrong pattern
You also have to avoid teaching the model the wrong signal.
Classic example from vision: a model that “perfectly” detects wolves vs. dogs, except it’s actually detecting snow—because every wolf photo had snow in the background.
Newer models are smarter and often pick up more robust features, but the lesson still applies: make sure your training set doesn’t accidentally encode a shortcut.
How to choose which fine-tuning method to use
If you want a simple mental model:
- Use SFT when you want to teach the model “here’s how to respond” via examples.
  - Decide: memorization (repeat) vs. generalization (vary examples).
- Use DPO when you want to explicitly steer behavior: “this answer style/output is better than that one.”
- Use reinforcement fine-tuning when you need the model to improve through trial-and-score, especially for reasoning.
  - Expect to spend time designing a grader.
- Use vision fine-tuning when the task is visual recognition, and take augmentation seriously—especially realism.
Fine-tuning isn’t one thing. It’s a toolbox. Once you match the tool to the job—and you stop trusting defaults to read your mind—it gets a lot less scary and a lot more powerful.