Researchers at Toronto General Hospital in Canada recently conducted a study to evaluate the performance of ChatGPT, a conversational chatbot powered by AI, in answering questions similar to those found in radiology exams.
The study aimed to assess ChatGPT’s ability to comprehend and respond accurately to both lower-order and higher-order questions. The results of the study, published in the journal Radiology, shed light on the strengths and weaknesses of the chatbot.
ChatGPT has garnered attention for its impressive information comprehension and query-solving capabilities. In previous tests, it passed exams such as the U.S. Medical Licensing Exam and an MBA exam at the Wharton School. With its increasing use across various sectors, researchers at University Medical Imaging Toronto decided to investigate its potential in radiology.
The researchers devised a 150-question test resembling the exams administered by radiology boards in Canada and the U.S. Because ChatGPT cannot process image-based inputs, only text-based questions were used. The questions were split into two categories: lower-order questions, testing knowledge recall and basic understanding, and higher-order questions, requiring the application, analysis, and synthesis of information.
Both the older GPT-3.5 and the newer GPT-4 were evaluated on the same question set so their performance could be compared. ChatGPT powered by GPT-3.5 scored 69 percent overall, excelling on lower-order questions with 84 percent accuracy but struggling with higher-order questions, where it managed only 60 percent.
After GPT-4 was released in March 2023, the researchers retested ChatGPT using the newer model. GPT-4-powered ChatGPT scored 81 percent, answering 121 of 150 questions correctly. Notably, it reached 81 percent on higher-order questions, consistent with OpenAI's claims of improved reasoning. The chatbot's performance on lower-order questions, however, surprised the researchers.
While GPT-4 showed a reduced tendency to confidently deliver incorrect information, a behavior known as hallucination, it did not eliminate the issue entirely. This raises concerns for medical practice, especially when ChatGPT is used by inexperienced individuals who may not recognize inaccurate responses.
Rajesh Bhayana, a radiologist and technology lead at Toronto General Hospital, expressed surprise at ChatGPT’s accurate answers to challenging radiology questions, but also noted illogical and inaccurate assertions made by the chatbot.
While ChatGPT shows promise, it must be used with caution, particularly in medical settings, where inaccurate responses carry real risk. Future advances in language models may address these limitations and make tools like ChatGPT more reliable and useful across domains.