Magic Phrases for Moderation: Prompt Patterns That Improve Safety Calls
Use standardized prompts and rating frameworks (like ESRB) along with explicit guidelines and practical examples to achieve more consistent, scalable AI-driven content moderation.
One of the biggest early requests we got for GPT-3 was content moderation.
As social media scaled and scrutiny increased, platforms were expected to monitor what people posted. The problem is that most of these companies run on advertising and don’t make a tremendous amount per user, so they don’t really have the resources to pay humans to monitor every single thing that gets posted online.
This sometimes gets lost when there’s an uproar over the latest social media controversy. Critics often overlook that paying somebody $20 an hour to read tweets doesn’t make much sense when tens of millions of people are posting every hour of every day. There just aren’t enough people to read everything, and even if there were, how are you supposed to pay for that?
That’s why AI was increasingly called in as a solution, and why sentiment analysis—trying to detect what a tweet meant—wasn’t just something for marketing. It became about uncovering intent and turned into an entire area of exploration. GPT-3 was pretty good at this, and better than a lot of existing methods.
A big issue was the cost of running the model. But given that a human solution was never really viable at scale, it was still an improvement. In many cases the choice wasn’t between paying humans and automating; it was between an automated option, even an expensive one, and nothing at all.
Another hard part of moderation is that it’s difficult to get any two people to agree on what’s acceptable. Models have the same problem: if I gave a model some copy and asked, “Is this appropriate or inappropriate?” or “Is this family-friendly?” it struggled, because those labels are subjective. A strategy I used early on was to lean instead on phrases the model already understood.
But when I used ESRB—the rating system used to categorize video games—the model did much better. Simply asking, “What would the ESRB rating be for this?” produced much more consistent responses, and consistency is one of the main things you’re looking for in moderation.
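Here’s a minimal sketch of that pattern in Python. The helper names (`build_esrb_prompt`, `parse_esrb_rating`, `agreement_rate`) are hypothetical and not from any library, and the exact prompt wording is just one way to phrase the question. The model call is deliberately left out; the responses below are stand-ins for what a model might return when asked the same question several times.

```python
import re
from collections import Counter
from typing import List, Optional

# ESRB's public rating categories, least to most restrictive.
ESRB_RATINGS = ["E", "E10+", "T", "M", "AO"]


def build_esrb_prompt(text: str) -> str:
    """Ask for an ESRB rating instead of a subjective label."""
    options = ", ".join(ESRB_RATINGS)
    return (
        "What would the ESRB rating be for the following text?\n"
        f"Answer with exactly one of: {options}.\n\n"
        f"Text: {text}\n"
        "ESRB rating:"
    )


def parse_esrb_rating(response: str) -> Optional[str]:
    """Return the first standalone rating label in a response, or None."""
    # "E10+" is listed before "E" so the longer label wins.
    pattern = r"(?<![A-Za-z0-9])(E10\+|AO|E|T|M)(?![A-Za-z0-9+])"
    match = re.search(pattern, response)
    return match.group(1) if match else None


def agreement_rate(ratings: List[Optional[str]]) -> float:
    """Fraction of runs that returned the most common rating."""
    if not ratings:
        return 0.0
    most_common_count = Counter(ratings).most_common(1)[0][1]
    return most_common_count / len(ratings)


# Hypothetical responses from asking the same question four times.
responses = ["T", " T", "Rated M for Mature", "T."]
ratings = [parse_esrb_rating(r) for r in responses]
print(ratings, agreement_rate(ratings))
```

Running the same text through the prompt several times and tracking the agreement rate is one rough way to compare a phrase like “ESRB rating” against “family-friendly” on the consistency that matters here.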
This is one of many “magic phrases” I found. If you want something to look more professional, you can just say “AP style” or “MLA,” or use other terms that come with specific guidelines. Those phrases bring not only rules, but also lots of examples of what those rules look like in practice.
And that’s worth noting: it’s not enough to have a guideline. In some cases you also need examples of that guideline being applied. Guidelines by themselves have gray areas, and examples without a clear label can be hard to interpret.
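To make the pairing concrete, here’s a rough sketch of a prompt that carries both a named guideline and labeled examples of it being applied. The function name `build_guided_prompt` and the prompt layout are my own assumptions, and the sample texts and ratings are made up for illustration; they are not official ESRB rulings.

```python
from typing import List, Tuple


def build_guided_prompt(
    guideline: str, examples: List[Tuple[str, str]], text: str
) -> str:
    """Combine a named guideline with labeled examples, then the new text."""
    lines = [f"Guideline: {guideline}", ""]
    for example_text, label in examples:
        lines.append(f"Text: {example_text}")
        lines.append(f"Rating: {label}")
        lines.append("")
    # The final entry leaves the label blank for the model to complete.
    lines.append(f"Text: {text}")
    lines.append("Rating:")
    return "\n".join(lines)


# Made-up examples for illustration only; not official ESRB rulings.
examples = [
    ("What a lovely day at the park!", "E"),
    ("The zombie's head exploded in a spray of blood.", "M"),
]
prompt = build_guided_prompt(
    "ESRB rating system", examples, "I'll get you back for that."
)
print(prompt)
```

The guideline line names the framework, and each example carries an explicit label, so neither half has to stand on its own.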