Comparative performance of ChatGPT, Gemini, and final-year emergency medicine clerkship students in answering multiple-choice questions: implications for the use of AI in medical education
Dublin Core
Title
Comparative performance of ChatGPT, Gemini, and final-year emergency medicine clerkship students in answering multiple-choice questions: implications for the use of AI in medical education
Subject
Artificial intelligence (AI) in medical education
Description
Abstract
Background The integration of artificial intelligence (AI) into medical education has gained significant attention, particularly with the emergence of advanced language models such as ChatGPT and Gemini. While these tools show promise for answering multiple-choice questions (MCQs), their efficacy in specialized domains such as the Emergency Medicine (EM) clerkship remains underexplored. This study aimed to evaluate and compare the accuracy of ChatGPT, Gemini, and final-year EM students in answering text-only and image-based MCQs, in order to assess AI's potential as a supplementary tool in medical education.
Methods In this proof-of-concept study, a comparative analysis was conducted using 160 MCQs from an EM
clerkship curriculum, comprising 62 image-based questions and 98 text-only questions. The performance of the free
versions of ChatGPT (4.0) and Gemini (1.5), as well as that of 125 final-year EM students, was assessed. Responses were
categorized as “correct”, “incorrect”, or “unanswered”. Statistical analysis was then performed using IBM SPSS Statistics
(Version 26.0) to compare accuracy across groups and question types.
Results Significant performance differences were observed across the three groups (χ² = 42.7, p < 0.001). Final-year EM students demonstrated the highest overall accuracy at 79.4%, outperforming both ChatGPT (72.5%) and Gemini (54.4%). Students excelled in text-only MCQs, with an accuracy of 89.8%, and performed robustly on image-based questions (62.9%). ChatGPT showed strong performance on text-only items (83.7%) but reduced accuracy on image-based questions (54.8%). Gemini performed moderately on text-only questions (73.5%) but struggled markedly with image-based content, achieving only 24.2% accuracy. Pairwise comparisons confirmed that students outperformed both AI models across all formats (p < 0.01), with the widest performance gap observed on image-based questions.
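The group comparison reported above is a chi-square test of independence on a group × correct/incorrect contingency table (the authors used IBM SPSS 26.0). As a rough illustration only, the sketch below rebuilds such a table from the reported overall accuracies; the counts are reconstructions rather than the study data, and since the students' percentage aggregates 125 examinees, the resulting statistic will not reproduce the reported χ² = 42.7 exactly.

```python
from scipy.stats import chi2_contingency

# Correct/incorrect counts per group, reconstructed from the reported
# overall accuracies on the 160 MCQs (79.4%, 72.5%, 54.4%), rounded to
# whole questions. Illustrative only: the students' figure is a mean
# over 125 examinees, so this table will not match the paper's chi2.
correct = {"Students": 127, "ChatGPT": 116, "Gemini": 87}
table = [[c, 160 - c] for c in correct.values()]  # rows: [correct, incorrect]

chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2({dof}) = {chi2:.1f}, p = {p:.2e}")
```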
Creator
Shaikha Nasser Al-Thani¹, Shahzad Anjum¹,², Zain Ali Bhutta¹, Sarah Bashir³, Muhammad Azhar Majeed¹, Anfal Sher Khan⁴ and Khalid Bashir¹,²*
Date
2025
Contributor
Peri Irawan
Format
PDF
Language
English
Type
text
Citation
Shaikha Nasser Al-Thani, Shahzad Anjum, Zain Ali Bhutta, Sarah Bashir, Muhammad Azhar Majeed, Anfal Sher Khan and Khalid Bashir, “Comparative performance of ChatGPT, Gemini, and final-year emergency medicine clerkship students in answering multiple-choice questions: implications for the use of AI in medical education,” Repository Horizon University Indonesia, accessed April 22, 2026, https://repository.horizon.ac.id/items/show/13245.