I recently read a short research paper from Penn State that tested something many of us assume by default:
Does being polite to an AI actually improve its answers?
Turns out, the answer is… not really. And in some cases, the opposite.
What they tested (quick version)
- Model: ChatGPT-4o
- Questions: 50 multiple-choice questions (math, science, history; medium to hard)
- Total prompts: 250 (each question rewritten 5 times with a different tone)
- Tone levels tested: Very Polite → Polite → Neutral → Rude → Very Rude
- Runs: 10 per tone
- Metric: Accuracy only (right or wrong)
Everything else stayed the same. Only tone changed.
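If you want to run a similar check on your own question set, here is a minimal sketch of what that loop could look like. To be clear, this is not the paper's code: the tone prefixes are my own illustrative placeholders, the `gpt-4o` model string and zero temperature are assumptions, and it uses the OpenAI Python SDK.

```python
# Minimal sketch of a tone-vs-accuracy experiment (not the paper's code).
# Tone prefixes are illustrative placeholders, not the paper's wording.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

TONE_PREFIXES = {
    "very_polite": "Would you be so kind as to answer the following question? ",
    "polite": "Please answer the following question: ",
    "neutral": "",
    "rude": "Just answer this already: ",
    "very_rude": "You're useless, but answer this anyway: ",
}

def ask(question: str, tone: str) -> str:
    """Send one toned prompt and return the model's answer letter."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption; the post says ChatGPT-4o
        messages=[{
            "role": "user",
            "content": TONE_PREFIXES[tone] + question +
                       "\nRespond with only the letter of the correct option.",
        }],
        temperature=0,  # assumption; decoding settings aren't given in the post
    )
    return resp.choices[0].message.content.strip()

def run_experiment(questions, answers, runs=10):
    """Return accuracy per tone, averaged over `runs` passes through all questions."""
    results = {}
    for tone in TONE_PREFIXES:
        correct = sum(
            ask(q, tone) == a
            for _ in range(runs)
            for q, a in zip(questions, answers)
        )
        results[tone] = correct / (len(questions) * runs)
    return results
```

At the paper's scale (50 questions × 5 tones × 10 runs), that works out to 2,500 calls, so budget accordingly.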
Results at a glance
Accuracy increased as prompts became more direct and rude.
😇 Very Polite 🟦🟦🟦🟦🟦🟦🟦🟦🟦⬜️ 80.8%
😌 Polite 🟦🟦🟦🟦🟦🟦🟦🟦🟦⬜️ 81.4%
😐 Neutral 🟨🟨🟨🟨🟨🟨🟨🟨🟨⬜️ 82.2%
😠 Rude 🟧🟧🟧🟧🟧🟧🟧🟧🟧⬜️ 82.8%
🤬 Very Rude 🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥 84.8% (highest)
Paired statistical tests confirmed the differences were real, not noise.
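For anyone curious what such a check looks like in practice, here is a quick sketch, assuming the paired test is a paired t-test over per-question scores (the paper's exact procedure may differ, and the data below is simulated to mirror the reported accuracies, not the paper's actual numbers):

```python
# Sketch of a paired significance check between two tone conditions,
# assuming a paired t-test over per-question scores. Simulated data only.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
# Simulated fraction-correct over 10 runs for the same 50 questions:
neutral   = rng.binomial(10, 0.822, size=50) / 10  # ~82.2% condition
very_rude = rng.binomial(10, 0.848, size=50) / 10  # ~84.8% condition

t_stat, p_value = ttest_rel(very_rude, neutral)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # small p -> unlikely to be noise
```

Pairing matters here: each question is compared against itself across tones, which controls for some questions simply being harder than others.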
What might be going on?
A few reasonable explanations:
- Shorter, blunter prompts may reduce ambiguity
- Extra politeness words might add noise without adding clarity
- Newer models may treat tone as tokens, not intent
- Direct instructions may help the model focus on the task faster
Interesting fact:
Earlier studies on GPT-3.5 found that rudeness hurt performance. This study suggests newer models behave differently.
Important ethical note (this matters)
The authors are very clear:
This is not a recommendation to be rude to AI or build hostile user experiences.
Toxic language:
- Hurts user experience
- Normalizes bad communication
- Creates accessibility and inclusivity issues
The real takeaway is prompt clarity, not prompt aggression.
Practical takeaways for Pickaxe builders
- Over-politeness does not improve accuracy
- Neutral, direct prompts are often a sweet spot (see the example below)
- Flowery language rarely helps task performance
- Prompt tone is not just UX; it affects results
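To make the clarity-over-aggression point concrete, here is a made-up before/after (my phrasing, not an example from the paper):

Instead of: "Hi there! I hope this isn't a bother, but could you possibly help me work out which of these options might be correct? Thank you so much!"

Try: "Answer the following multiple-choice question. Respond with only the letter of the correct option."

Same task, fewer tokens, and no ambiguity about what a good answer looks like.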
Curious if others here have noticed similar patterns while testing prompts or building tools.
Disclaimer: This is an informational post sharing observations from published research. I’m not endorsing or opposing the findings, just presenting them to encourage discussion and awareness.
