Comparative performance of ChatGPT, Gemini, and final-year emergency medicine clerkship students in answering multiple-choice questions: implications for the use of AI in medical education
Dublin Core
Title
Comparative performance of ChatGPT, Gemini, and final-year emergency medicine clerkship students in answering multiple-choice questions: implications for the use of AI in medical education
Subject
Artificial intelligence (AI) in medical education
Description
Abstract
Background The integration of artificial intelligence (AI) into medical education has gained significant attention, particularly with the emergence of advanced language models such as ChatGPT and Gemini. While these tools show promise for answering multiple-choice questions (MCQs), their efficacy in specialized domains such as the Emergency Medicine (EM) clerkship remains underexplored. This study aimed to evaluate and compare the accuracy of ChatGPT, Gemini, and final-year EM students in answering text-only and image-based MCQs, in order to assess AI's potential as a supplementary tool in medical education.
Methods In this proof-of-concept study, a comparative analysis was conducted using 160 MCQs from an EM
clerkship curriculum, comprising 62 image-based questions and 98 text-only questions. The performance of the free
versions of ChatGPT (4.0) and Gemini (1.5), as well as that of 125 final-year EM students, was assessed. Responses were
categorized as “correct”, “incorrect”, or “unanswered”. Statistical analysis was then performed using IBM SPSS Statistics
(Version 26.0) to compare accuracy across groups and question types.
Results Significant performance differences were observed across the three groups (χ² = 42.7, p < 0.001). Final-year EM students demonstrated the highest overall accuracy at 79.4%, outperforming both ChatGPT (72.5%) and Gemini (54.4%). Students excelled in text-only MCQs, with an accuracy of 89.8%, and performed robustly on image-based questions (62.9%). ChatGPT showed strong performance on text-only items (83.7%) but reduced accuracy on image-based questions (54.8%). Gemini performed moderately on text-only questions (73.5%) but struggled markedly with image-based content, achieving only 24.2% accuracy. Pairwise comparisons confirmed that students outperformed both AI models across all formats (p < 0.01), with the widest performance gap observed on image-based questions.
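The group comparison reported above is a chi-square test of independence on a group × correct/incorrect contingency table (the authors used IBM SPSS 26.0). As a rough illustration only, the sketch below rebuilds such a table from the reported overall accuracies; the counts are reconstructions rather than the study data, and since the students' percentage aggregates 125 examinees, the resulting statistic will not reproduce the reported χ² = 42.7 exactly.

```python
from scipy.stats import chi2_contingency

# Correct/incorrect counts per group, reconstructed from the reported
# overall accuracies on the 160 MCQs (79.4%, 72.5%, 54.4%), rounded to
# whole questions. Illustrative only: the students' figure is a mean
# over 125 examinees, so this table will not match the paper's chi2.
correct = {"Students": 127, "ChatGPT": 116, "Gemini": 87}
table = [[c, 160 - c] for c in correct.values()]  # rows: [correct, incorrect]

chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2({dof}) = {chi2:.1f}, p = {p:.2e}")
```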
Creator
Shaikha Nasser Al-Thani¹, Shahzad Anjum¹,², Zain Ali Bhutta¹, Sarah Bashir³, Muhammad Azhar Majeed¹, Anfal Sher Khan⁴ and Khalid Bashir¹,²*
Date
2025
Contributor
Peri Irawan
Format
PDF
Language
English
Type
text
Citation
Shaikha Nasser Al-Thani, Shahzad Anjum, Zain Ali Bhutta, Sarah Bashir, Muhammad Azhar Majeed, Anfal Sher Khan and Khalid Bashir, “Comparative performance of ChatGPT, Gemini, and final-year emergency medicine clerkship students in answering multiple-choice questions: implications for the use of AI in medical education,” Repository Horizon University Indonesia, accessed April 22, 2026, https://repository.horizon.ac.id/items/show/13245.