Working with two colleagues, Dr. Isaac Kohane, a computer scientist and physician at Harvard, tested GPT-4 in a medical setting to see how it performed.
“I’m stunned to say: better than many doctors I’ve observed,” he says in the forthcoming book “The AI Revolution in Medicine,” co-authored by independent journalist Carey Goldberg and Microsoft vice president of research Peter Lee. (The authors say neither Microsoft nor OpenAI required any editorial oversight of the book, though Microsoft has invested billions of dollars in developing OpenAI’s technologies.)
The latest artificial intelligence model from OpenAI, released to paying subscribers in March 2023, outperformed its predecessors GPT-3 and GPT-3.5, correctly answering more than 90% of US medical licensing exam questions. According to the book, GPT-4 also surpassed some licensed doctors’ abilities.
GPT-4 exhibited not only strong test-taking and fact-finding skills but also impressive translation abilities. The model could translate discharge instructions for Portuguese-speaking patients and distill technical jargon into something a sixth grader could easily read.
The book also details GPT-4’s ability to give doctors suggestions about bedside manner, including how to talk to patients about their conditions with compassion and clarity. The model could summarize lengthy reports or studies in an instant and even explain its reasoning in a way that resembles human-style intelligence.
“On the one hand, I was having a sophisticated medical conversation with a computational process,” Kohane wrote. “On the other hand, just as mind-blowing was the anxious realization that millions of families would soon have access to this impressive medical expertise, and I could not figure out how we could guarantee or certify that GPT-4’s advice would be safe or effective.”
However, GPT-4 acknowledges that its intelligence is limited to recognizing patterns in data and involves no genuine understanding or intentionality. Despite these limitations, the book recounts how GPT-4 can mimic the way doctors diagnose conditions with remarkable, albeit imperfect, success.
In the book, Kohane walks through a clinical thought experiment based on a real-life case involving a newborn baby he had treated years earlier. Given the same information about the case, GPT-4 correctly diagnosed the rare condition congenital adrenal hyperplasia, a feat that left the doctor both impressed and horrified. It made the diagnosis “just as I would, with all my years of study and experience,” Kohane wrote.
The book is also frank about GPT-4’s shortcomings, citing examples of errors that range from simple clerical slips to mathematical mistakes. These mistakes can be subtle, and GPT-4 often insists it is correct even when challenged.
In a medical context, such errors could have serious consequences for decisions like prescribing or diagnosis. Like earlier GPT models, GPT-4 can also “hallucinate,” providing incorrect answers or disobeying requests.
When the authors asked it about this problem, GPT-4 said: “I do not intend to deceive or mislead anyone, but I sometimes make mistakes or assumptions based on incomplete or inaccurate data. I also do not have the clinical judgment or the ethical responsibility of a human doctor or nurse.”
To catch mistakes, the authors suggest starting a new session with GPT-4 and having it verify its own earlier work with a fresh perspective, though the system can be reluctant to admit when it is wrong. Another option is to ask GPT-4 to show its work so a human can verify it.
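To make that workflow concrete, here is a minimal sketch of such a two-session verification loop, written against the OpenAI Python SDK (v1+). The model name, the prompts, and the `ask` helper are illustrative assumptions for this example, not code from the book.

```python
# Sketch of the authors' suggested verification loop: ask once, then have a
# brand-new session (no shared history) check the answer. Illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(messages: list[dict]) -> str:
    """Send one chat request and return the model's reply text."""
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    return response.choices[0].message.content

# Session 1: pose the clinical question and ask the model to show its work.
question = (
    "A newborn has low sodium, high potassium, and ambiguous genitalia. "
    "What diagnoses should be considered?"
)
first_answer = ask([
    {"role": "user", "content": question + " Show your reasoning step by step."},
])

# Session 2: a fresh conversation, with no memory of session 1, reviews the
# first answer -- the "fresh perspective" the authors describe.
critique = ask([
    {"role": "user", "content": (
        "You are reviewing another assistant's clinical answer for errors.\n\n"
        f"Question: {question}\n\nAnswer to verify: {first_answer}\n\n"
        "List any factual, clerical, or reasoning mistakes you find."
    )},
])

print(critique)  # a human should still review both outputs
```

The key design point is that the second call shares no conversation history with the first, which is what gives the verifier its fresh perspective; the step-by-step reasoning requested in session 1 is what a human would then check by hand.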
Despite its flaws, GPT-4 could save valuable time and resources in healthcare settings, letting clinicians focus on patients rather than computer screens. But, the authors say, “we have to force ourselves to imagine a world with smarter and smarter machines, eventually perhaps surpassing human intelligence in almost every dimension. And then think very hard about how we want that world to work.”