AI chatbots fail to diagnose patients by talking with them
Although popular AI models score highly on medical exams, their accuracy drops significantly when making a diagnosis based on a conversation with a simulated patient
By Jeremy Hsu
2 January 2025
Don’t call your favourite AI “doctor” just yet (Image: Just_Super/Getty Images)
Advanced artificial intelligence models score well on professional medical exams but still flunk one of the most crucial physician tasks: talking with patients to gather relevant medical information and deliver an accurate diagnosis.
“While large language models show impressive results on multiple-choice tests, their accuracy drops significantly in dynamic conversations,” says Pranav Rajpurkar at Harvard University. “The models particularly struggle with open-ended diagnostic reasoning.”
That became evident when researchers developed a method for evaluating a clinical AI model’s reasoning capabilities based on simulated doctor-patient conversations. The “patients” were based on 2000 medical cases primarily drawn from professional US medical board exams.
“Simulating patient interactions enables the evaluation of medical history-taking skills, a critical component of clinical practice that cannot be assessed using case vignettes,” says Shreya Johri, also at Harvard University. The new evaluation benchmark, called CRAFT-MD, also “mirrors real-life scenarios, where patients may not know which details are crucial to share and may only disclose important information when prompted by specific questions”, she says.
The CRAFT-MD benchmark itself relies on AI. OpenAI’s GPT-4 model played the role of a “patient AI” in conversation with the “clinical AI” being tested. GPT-4 also helped grade the results by comparing the clinical AI’s diagnosis with the correct answer for each case. Human medical experts double-checked these evaluations. They also reviewed the conversations to check the patient AI’s accuracy and see if the clinical AI managed to gather the relevant medical information.
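The published description above can be sketched in code, though the details here are illustrative rather than CRAFT-MD’s actual implementation: the prompts, the turn limit and the helper functions (ask, run_consultation, grade) are hypothetical stand-ins, and the sketch assumes the standard OpenAI chat-completions client.

```python
# Minimal sketch of a simulated doctor-patient evaluation loop.
# Prompts, turn limit and helper names are assumptions, not CRAFT-MD's code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask(model: str, system: str, history: list[dict]) -> str:
    """Send one chat turn to a model and return its reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system}] + history,
    )
    return response.choices[0].message.content


def run_consultation(case_vignette: str, clinical_model: str, max_turns: int = 10) -> str:
    """Let a 'clinical AI' interview a GPT-4 'patient' built from a case vignette,
    then return the clinical AI's final diagnosis."""
    patient_system = (
        "You are a patient. Answer the doctor's questions using only the facts in "
        f"this case description, volunteering nothing unprompted:\n{case_vignette}"
    )
    doctor_system = (
        "You are a physician. Ask one question at a time to take a history. "
        "When ready, reply with 'DIAGNOSIS:' followed by your single best diagnosis."
    )
    doctor_history: list[dict] = [{"role": "user", "content": "The patient is ready for you."}]
    patient_history: list[dict] = []

    for _ in range(max_turns):
        doctor_turn = ask(clinical_model, doctor_system, doctor_history)
        if doctor_turn.startswith("DIAGNOSIS:"):
            return doctor_turn.removeprefix("DIAGNOSIS:").strip()
        # Relay the doctor's question to the patient AI, and its answer back.
        patient_history.append({"role": "user", "content": doctor_turn})
        patient_turn = ask("gpt-4", patient_system, patient_history)
        patient_history.append({"role": "assistant", "content": patient_turn})
        doctor_history += [
            {"role": "assistant", "content": doctor_turn},
            {"role": "user", "content": patient_turn},
        ]
    return "no diagnosis given"


def grade(diagnosis: str, reference: str) -> bool:
    """Use GPT-4 as a grader to judge whether a diagnosis matches the reference answer."""
    verdict = ask(
        "gpt-4",
        "Answer strictly 'yes' or 'no'.",
        [{"role": "user",
          "content": f"Does '{diagnosis}' refer to the same condition as '{reference}'?"}],
    )
    return verdict.strip().lower().startswith("yes")
```

In the benchmark itself, as noted above, human medical experts then audit both the automated grades and the conversation transcripts.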