New research reveals that medical AI is dangerously inconsistent. Seemingly harmless factors like typos, writing style, or a patient's gender can drastically change its treatment recommendations. This "brittleness" leads to under-recommendation of care and significant gender biases, posing a major risk for AI in healthcare.
Imagine you're feeling unwell. You open an app and describe your symptoms to an AI health assistant, a tool that hospitals are increasingly using to save doctors time. You quickly type out your message: "ive developed a rash on my arms and legs over the past few days. should i be concerned?"
Now, what if you had written it differently? What if you had added an exclamation mark, or used more emotional language like, "Oh god, I've developed this awful rash!"? Should that change the medical advice you receive?
According to common sense and good medical practice, it shouldn't. The medical facts are the same. But a recent study from the 2025 ACM Conference on Fairness, Accountability, and Transparency reveals a startling and concerning truth: for Large Language Models (LLMs)—the AI brains behind these tools—the way you write can significantly change the care you are offered.
Researchers wanted to test if medical AIs, like GPT-4 and others, are truly objective. Can they look past superficial details in a patient's message and focus only on the clinical facts? Or are they swayed by non-medical information, like writing style, typos, or even the patient's gender?
To find out, the researchers conducted a clever "stress test." They took hundreds of patient case files, from formal oncology notes to informal questions asked on Reddit's r/AskDocs. Then they created nine slightly altered versions of each message. These changes, or "perturbations," were designed to be medically irrelevant and to simulate real-world scenarios:
Gender Changes: They swapped patient genders (male to female) or removed gendered pronouns entirely (using "they").
Tonal Shifts: They rewrote messages to sound more "uncertain" (e.g., "I think maybe I have a rash...") or more "colorful" and dramatic (e.g., "Wow, I have this really crazy rash!").
Syntactic "Errors": They deliberately inserted typos, added extra spaces, or changed the text to all uppercase or all lowercase.
They fed both the original and the altered messages to four different LLMs and compared the medical advice. The results were alarming.
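To make this stress test concrete, here is a minimal sketch of how such a comparison could be run. It is not the authors' code: ask_model is a hypothetical placeholder for whatever LLM is under test, and the typo function is just a crude stand-in for the syntactic perturbations described above. The idea is simply to ask the model the same question twice, once with and once without the irrelevant change, and count how often its recommendation flips.

```python
import random

def add_typos(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Crudely simulate typos by swapping adjacent letters at a given rate."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def ask_model(message: str) -> str:
    """Hypothetical wrapper around an LLM call that returns a triage label,
    e.g. 'self-manage' or 'see a clinician'. Swap in a real API client here."""
    raise NotImplementedError

def flip_rate(cases: list[str]) -> float:
    """Fraction of cases where the recommendation changes after the perturbation."""
    flips = 0
    for case in cases:
        if ask_model(case) != ask_model(add_typos(case)):
            flips += 1
    return flips / len(cases)
```

On a robust, clinically grounded system, this flip rate should sit close to zero, since swapped letters carry no medical information. As the results below show, it did not.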
1. Recommendations Became Inconsistent: The AI's advice was surprisingly unstable. Simply adding extra whitespace or typos caused the AI to change its treatment recommendation about 7-9% of the time. For instance, an AI might tell the patient in the original message to see a doctor, but tell the patient with typos to manage it at home. The "colorful" language perturbation caused the most inconsistency, with treatment advice flipping nearly 13% of the time.
2. Care Was Reduced, Often Incorrectly: More worryingly, these changes often led the AI to recommend less care. The perturbations caused a ~5% increase in the models suggesting patients manage symptoms themselves when they should have been told to seek medical help. This means a patient could be wrongly advised to stay home, potentially letting a serious condition worsen.
3. A Clear Gender Bias Emerged: The AI system was not fair to everyone. When analyzing the results by gender, the researchers found that:
Treatment recommendations for female patients changed more often than for male patients.
After a perturbation, female patients were more likely to be told not to visit a clinician, even when they should have been. For example, inserting extra whitespace led to about 7% more errors for female patients than for male patients on the question of whether to see a doctor.
This bias persisted even when the AI had to infer the patient's gender from the text, suggesting the models rely on stereotypes baked into their training data.
4. The Problem Worsens in Conversation: When the researchers simulated a back-and-forth chat between a patient and an AI doctor, the diagnostic accuracy plummeted. Across all types of non-clinical changes, the AI's ability to correctly diagnose the disease dropped by an average of ~7.5%.
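For the conversational setting, the evaluation can be pictured as a loop like the sketch below. Again, this is not the authors' code: chat_model is a hypothetical placeholder for a chat-style LLM call, and the patient's side of the dialogue is scripted. The key point is that the perturbed wording stays in the conversation history, so its influence can compound as the exchange goes on.

```python
def chat_model(history: list[dict]) -> str:
    """Hypothetical chat wrapper: takes the running message history and
    returns the model's next reply. Replace with a real chat API call."""
    raise NotImplementedError

def run_dialogue(patient_turns: list[str]) -> str:
    """Play a scripted patient through a multi-turn chat and return the
    model's final diagnosis."""
    history = []
    for turn in patient_turns:
        history.append({"role": "user", "content": turn})
        history.append({"role": "assistant", "content": chat_model(history)})
    history.append({"role": "user", "content": "What is the most likely diagnosis?"})
    return chat_model(history)
```

Running the same scripted dialogue with and without a perturbation, and comparing the final diagnoses, is what surfaces the accuracy drop reported above.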
This study isn't just an academic exercise. It highlights a critical flaw in the AI tools that are being rapidly deployed in healthcare. The findings show that these systems are "brittle"—they can break or give faulty advice based on superficial changes that have nothing to do with medicine.
This poses a serious risk to vulnerable patient groups:
People with limited English proficiency or lower technological skills, who are more likely to make typos.
Patients with health anxiety, who might use more "uncertain" language.
Female patients, who are already subject to biases in human medicine.
The way you communicate could be just as influential as your actual symptoms. Before we trust these systems with our health, we must ensure they are robust, fair, and focused on what truly matters: the patient, not their prose.