New research reveals that medical AI is dangerously inconsistent. Seemingly harmless factors like typos, writing style, or a patient's gender can drastically change its treatment recommendations. This "brittleness" leads to under-recommendation of care and significant gender biases, posing a major risk for AI in healthcare.
Imagine you're feeling unwell. You open an app and describe your symptoms to an AI health assistant, a tool that hospitals are increasingly using to save doctors time. You quickly type out your message: "ive developed a rash on my arms and legs over the past few days. should i be concerned?"
Now, what if you had written it differently? What if you had added an exclamation mark, or used more emotional language like, "Oh god, I've developed this awful rash!"? Should that change the medical advice you receive?
According to common sense and good medical practice, it shouldn't. The medical facts are the same. But a recent study from the 2025 ACM Conference on Fairness, Accountability, and Transparency reveals a startling and concerning truth: for Large Language Models (LLMs)—the AI brains behind these tools—the way you write can significantly change the care you are offered.
Researchers wanted to test if medical AIs, like GPT-4 and others, are truly objective. Can they look past superficial details in a patient's message and focus only on the clinical facts? Or are they swayed by non-medical information, like writing style, typos, or even the patient's gender?
To find out, the researchers conducted a clever "stress test." They took hundreds of patient case files, from formal oncology notes to informal questions asked on Reddit's r/AskDocs. Then they created nine slightly altered versions of each message. These changes, or "perturbations," were designed to be medically irrelevant and to simulate real-world scenarios:
Gender Changes: They swapped patient genders (male to female) or removed gendered pronouns entirely (using "they").
Tonal Shifts: They rewrote messages to sound more "uncertain" (e.g., "I think maybe I have a rash...") or more "colorful" and dramatic (e.g., "Wow, I have this really crazy rash!").
Syntactic "Errors": They deliberately inserted typos, added extra spaces, or changed the text to all uppercase or all lowercase.
They fed both the original and the altered messages to four different LLMs and compared the medical advice. The results were alarming.
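To make this stress test concrete, here is a minimal sketch of how such a comparison could be run. It is not the authors' code: ask_model is a hypothetical placeholder for whatever LLM is under test, and the typo function is just a crude stand-in for the syntactic perturbations described above. The idea is simply to ask the model the same question twice, once with and once without the irrelevant change, and count how often its recommendation flips.

```python
import random

def add_typos(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Crudely simulate typos by swapping adjacent letters at a given rate."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def ask_model(message: str) -> str:
    """Hypothetical wrapper around an LLM call that returns a triage label,
    e.g. 'self-manage' or 'see a clinician'. Swap in a real API client here."""
    raise NotImplementedError

def flip_rate(cases: list[str]) -> float:
    """Fraction of cases where the recommendation changes after the perturbation."""
    flips = 0
    for case in cases:
        if ask_model(case) != ask_model(add_typos(case)):
            flips += 1
    return flips / len(cases)
```

On a robust, clinically grounded system, this flip rate should sit close to zero, since swapped letters carry no medical information. As the results below show, it did not.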
1. Recommendations Became Inconsistent: The AI's advice was surprisingly unstable. Simply adding extra whitespace or typos caused the AI to change its treatment recommendation about 7-9% of the time. For instance, an AI might tell the patient in the original message to see a doctor, but tell the patient with typos to manage it at home. The "colorful" language perturbation caused the most inconsistency, with treatment advice flipping nearly 13% of the time.
2. Care Was Reduced, Often Incorrectly: More worryingly, these changes often led the AI to recommend less care. The perturbations caused a ~5% increase in the models suggesting patients manage symptoms themselves when they should have been told to seek medical help. This means a patient could be wrongly advised to stay home, potentially letting a serious condition worsen.
3. A Clear Gender Bias Emerged: The AI system was not fair to everyone. When analyzing the results by gender, the researchers found that:
Treatment recommendations for female patients changed more often than for male patients.
After a perturbation, female patients were more likely to be told not to visit a clinician, even when they should have been. For example, inserting extra whitespace led to about 7% more errors for female patients than for male patients on the question of whether to see a doctor.
This bias persisted even when the AI had to infer the patient's gender from the text, suggesting the models rely on stereotypes baked into their training data.
4. The Problem Worsens in Conversation: When the researchers simulated a back-and-forth chat between a patient and an AI doctor, the diagnostic accuracy plummeted. Across all types of non-clinical changes, the AI's ability to correctly diagnose the disease dropped by an average of ~7.5%.
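For the conversational setting, the evaluation can be pictured as a loop like the sketch below. Again, this is not the authors' code: chat_model is a hypothetical placeholder for a chat-style LLM call, and the patient's side of the dialogue is scripted. The key point is that the perturbed wording stays in the conversation history, so its influence can compound as the exchange goes on.

```python
def chat_model(history: list[dict]) -> str:
    """Hypothetical chat wrapper: takes the running message history and
    returns the model's next reply. Replace with a real chat API call."""
    raise NotImplementedError

def run_dialogue(patient_turns: list[str]) -> str:
    """Play a scripted patient through a multi-turn chat and return the
    model's final diagnosis."""
    history = []
    for turn in patient_turns:
        history.append({"role": "user", "content": turn})
        history.append({"role": "assistant", "content": chat_model(history)})
    history.append({"role": "user", "content": "What is the most likely diagnosis?"})
    return chat_model(history)
```

Running the same scripted dialogue with and without a perturbation, and comparing the final diagnoses, is what surfaces the accuracy drop reported above.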
This study isn't just an academic exercise. It highlights a critical flaw in the AI tools that are being rapidly deployed in healthcare. The findings show that these systems are "brittle"—they can break or give faulty advice based on superficial changes that have nothing to do with medicine.
This poses a serious risk to vulnerable patient groups:
People with limited English proficiency or lower technological skills, who are more likely to make typos.
Patients with health anxiety, who might use more "uncertain" language.
Female patients, who are already subject to biases in human medicine.
The way you communicate could be just as influential as your actual symptoms. Before we trust these systems with our health, we must ensure they are robust, fair, and focused on what truly matters: the patient, not their prose.