OpenAI’s ChatGPT, an artificial intelligence (AI) chatbot, fell short on a self-assessment practice test developed by the American College of Gastroenterology (ACG), according to a study published in the American Journal of Gastroenterology.
The GPT-3.5 and GPT-4 versions of ChatGPT answered 65.1% and 62.4% of the 455 questions correctly, respectively. Both versions failed to meet the required 70% passing grade. The researchers were surprised to find that the passing benchmark for the ACG’s practice test is relatively low, a finding that also underscores how much room the technology has to improve.
Dr. Arvind Trindade, the study’s senior author, stressed the importance of accurate information in medical settings and emphasized that AI chatbots need to thoroughly understand the topics they address. Given the knowledge gaps ChatGPT demonstrated, Trindade suggests that a threshold of 95% accuracy or higher would be more appropriate in the medical field.
To evaluate ChatGPT’s performance, the researchers entered each question into the AI chatbot, examined the generated response and explanation, and then selected the corresponding answer on the ACG’s web-based assessment based on ChatGPT’s choice. The questions came from the ACG’s multiple-choice practice assessments, which simulate the American Board of Internal Medicine’s gastroenterology board examination.
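The paper describes this as a manual workflow: each question was typed into ChatGPT and the answer it picked was transcribed into the ACG’s web-based test. For readers curious what that scoring loop looks like when automated, the minimal Python sketch below assumes access to OpenAI’s chat completions API and a hypothetical question-bank format; it illustrates the pass/fail arithmetic only and is not the authors’ actual procedure.

```python
# Illustrative sketch only: the study entered questions into ChatGPT's web
# interface by hand and copied the answers into the ACG assessment. This
# automates the same scoring idea via OpenAI's chat completions API.
from openai import OpenAI

client = OpenAI()          # assumes OPENAI_API_KEY is set in the environment
PASSING_GRADE = 0.70       # the 70% bar used in the study

def ask(question: str, choices: dict[str, str], model: str = "gpt-3.5-turbo") -> str:
    """Send one multiple-choice question and return the letter the model picks."""
    prompt = (
        question
        + "\n"
        + "\n".join(f"{letter}. {text}" for letter, text in choices.items())
        + "\nAnswer with a single letter."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()[0].upper()

def score(questions: list[dict], model: str = "gpt-3.5-turbo") -> float:
    """Fraction of questions answered correctly. Each item is a dict with
    'question', 'choices', and 'answer' keys (a hypothetical format; the real
    ACG question bank is proprietary and not included here)."""
    correct = sum(ask(q["question"], q["choices"], model) == q["answer"] for q in questions)
    return correct / len(questions)

# accuracy = score(acg_questions)   # acg_questions: the 455-item bank (not provided)
# print(f"{accuracy:.1%}", "PASS" if accuracy >= PASSING_GRADE else "FAIL")
```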
While the study used 70% accuracy as its benchmark, Dr. Trindade believes the medical community should hold AI chatbots to much higher standards. He warns that recent papers highlighting ChatGPT’s success on other medical assessments may overshadow the fact that the technology is not yet ready for regular clinical use. Simply achieving a passing grade is not a sufficient standard for reliable medical information.
Dr. Trindade acknowledges the rapid pace of AI technology development and its growing presence in medical settings. However, he stresses the importance of optimizing these tools for clinical use. As the medical community increasingly turns to these evolving technologies to access information and data, it is crucial to ensure that AI chatbots are accurate and reliable.
The study adds to the body of research evaluating AI models’ performance in medical credentialing tests, which serve as indicators of the technology’s potential as a medical tool. While AI models like Google’s Med-PaLM have demonstrated success in passing medical exams, ChatGPT’s performance in the gastroenterology assessment underscores the limitations of AI models without specific medical training and knowledge.
Dr. Trindade concludes that AI models, especially those without specialized medical information and training, should not yet be relied upon as tools for clinical use.
The ease of obtaining quick answers from AI platforms may be appealing in today’s busy world, but it is crucial to recognize that these technologies are not yet ready for prime time. Studies like this one serve as a reminder to reassess and establish appropriate expectations for the use of AI chatbots in the medical field.