Creating Better Quiz Distractors with LLMs

Crafting plausible quiz distractors is hard; a practical workaround is to use a smaller model with a higher temperature to generate incorrect-but-plausible options, though results can still vary.

One early task that seemed like it should be straightforward for a language model—but turned out to be unexpectedly difficult in practice—was generating quizzes.

Language models were already quite good at producing quiz questions when you gave them an article or other source text. The sticking point wasn’t the question or even the correct answer. It was the distractors.

Distractors are hard because they need to be wrong in a very specific way: close enough to feel plausible, but not actually correct. Early models often missed that nuance and produced options that were simply “different,” not “temptingly close.”

Example: Lincoln question and bad distractors

If I asked:

What was Abraham Lincoln’s most famous speech?

and I wanted the correct answer to be:

  • The Gettysburg Address

the model would often generate distractors that were wildly off in category or context, such as:

  • Kennedy’s “To the Moon” speech
  • A line or passage from a movie

Those options fail for a human reader not merely because they’re incorrect (they are), but because they don’t represent plausible confusion. Good distractors tend to be adjacent: the same era, same domain, same kind of thing.

Why this happened

It wasn’t that the model “wanted” to give incorrect answers. If the model understands the question, it generally tries to give the right answer—unless it hallucinates.

Ironically, hallucination ended up being part of my workaround.

A practical workaround I used

I found I could get more usable distractors by intentionally leaning into a known weakness of early models:

  • Use a smaller model
  • Increase the temperature slightly
  • Encourage it to generate incorrect-but-plausible answers
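The steps above can be sketched as a small helper. This is a minimal illustration, not a definitive implementation: `complete` stands in for whatever text-completion client you use (the prompt wording and the filtering are my assumptions, not a fixed recipe).

```python
def build_distractor_prompt(question: str, correct: str, n: int = 3) -> str:
    # Ask explicitly for wrong-but-plausible options that stay in the
    # same era/domain/category as the correct answer.
    return (
        f"Question: {question}\n"
        f"Correct answer: {correct}\n"
        f"List {n} answers that are INCORRECT but plausible: same era, "
        f"same domain, same kind of thing. One per line, no explanations."
    )

def generate_distractors(question, correct, complete, n=3):
    # `complete` is any callable taking (prompt, temperature) and
    # returning text; a smaller model at a higher temperature tends to
    # drift in useful, adjacent ways.
    raw = complete(build_distractor_prompt(question, correct, n),
                   temperature=0.9)
    options = [line.strip("-• ").strip()
               for line in raw.splitlines() if line.strip()]
    # Filter out anything that accidentally matches the correct answer.
    return [o for o in options if o.lower() != correct.lower()][:n]
```

The filtering step matters: even when prompted for wrong answers, the model sometimes emits the correct one, so it has to be stripped out before the options are assembled.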

In effect, the model might reason loosely:

“Abraham Lincoln… politician… famous speech or document…”

and produce something like:

  • The Magna Carta

That’s obviously not correct, but it demonstrates the kind of “close in vibe” error that can sometimes be reshaped into a distractor—especially when you’re trying to quickly generate multiple options.

Additional example (illustrative of the same failure mode)

If you asked a model for distractors to a well-known correct answer, early versions often produced items that were merely famous, not plausibly confusable. For instance, even when the correct answer is a specific historical speech, it might output unrelated famous speeches from other centuries or totally different contexts, because it’s sampling “famous speech” rather than “plausible alternative that a student might choose.”

The “one-shot” problem

The frustrating part was that you couldn’t reliably one-shot the whole thing with a prompt like:

Generate a quiz question, the correct answer, and three good distractors.

You could get decent questions and correct answers, but the distractors usually required separate prompting steps and extra strategy to keep them in the same semantic neighborhood as the correct answer.
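That multi-step approach can be sketched as a small pipeline, assuming two generic model callables (`ask_model` for the question and answer, `distractor_model` for the looser, higher-temperature call). The prompt formats and the `Q:`/`A:` parsing convention are illustrative assumptions:

```python
import random

def build_quiz_item(source_text, ask_model, distractor_model, seed=None):
    # Step 1: a capable model writes the question and its correct answer.
    qa = ask_model(
        "From this text, write one quiz question and its correct answer "
        f"as 'Q: ...' and 'A: ...':\n{source_text}"
    )
    question = next(l[3:] for l in qa.splitlines() if l.startswith("Q: "))
    answer = next(l[3:] for l in qa.splitlines() if l.startswith("A: "))
    # Step 2: a separate, looser call produces the distractors, keeping
    # them in the same semantic neighborhood as the correct answer.
    raw = distractor_model(
        f"Question: {question}\nCorrect answer: {answer}\n"
        "List 3 incorrect but plausible answers, one per line."
    )
    distractors = [l.strip() for l in raw.splitlines()
                   if l.strip() and l.strip().lower() != answer.lower()][:3]
    # Step 3: shuffle so the correct answer isn't always first.
    options = [answer] + distractors
    random.Random(seed).shuffle(options)
    return {"question": question, "answer": answer, "options": options}
```

Splitting the work this way means a failure in the distractor step doesn’t force you to regenerate a question and answer that were already fine.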

Where this stands now

Newer models are much better at producing reasonable distractors. But at the time, this was a real prompt-design problem that needed solving. My workaround—using a smaller model at higher temperature to generate controlled, plausible-sounding wrong answers—helped, but it still didn’t work for every topic.