ISSN: 2459-1777 | E-ISSN: 2587-0394
Volume : 10 Issue : 3 Year : 2025
Beyoglu Eye J. 2025; 10(3): 168-174 | DOI: 10.14744/bej.2025.76743

Comparison of the Accuracy, Comprehensiveness, and Readability of ChatGPT, Google Gemini, and Microsoft Copilot on Dry Eye Disease

Dilan Colak1, Burcu Yakut2, Abdullah Agin2
1Department of Ophthalmology, University of Health Sciences, Beyoglu Eye Training and Research Hospital, Istanbul, Türkiye
2Department of Ophthalmology, University of Health Sciences, Haseki Training and Research Hospital, Istanbul, Türkiye

OBJECTIVES: This study compared the performance of ChatGPT, Google Gemini, and Microsoft Copilot in answering 25 questions about dry eye disease and evaluated comprehensiveness, accuracy, and readability metrics.
METHODS: The artificial intelligence (AI) platforms answered 25 questions derived from the American Academy of Ophthalmology’s Eye Health webpage. Three independent reviewers assigned comprehensiveness (0–5) and accuracy (−2 to 2) scores. Readability metrics included the Flesch-Kincaid Grade Level, Flesch Reading Ease Score, sentence/word statistics, and total content measures. Platforms were compared using Kruskal–Wallis and Friedman tests with post hoc analysis, and reviewer consistency was assessed using the intraclass correlation coefficient (ICC).
RESULTS: Google Gemini demonstrated the highest comprehensiveness and accuracy scores, significantly outperforming Microsoft Copilot (p<0.001). ChatGPT produced the most sentences and words (p<0.001), while readability metrics showed no significant differences among models (p>0.05). Inter-observer reliability was highest for Google Gemini (ICC=0.701), followed by ChatGPT (ICC=0.578), with Microsoft Copilot showing the lowest agreement (ICC=0.495). These results indicate Google Gemini’s superior performance and consistency, whereas Microsoft Copilot had the weakest overall performance.
DISCUSSION AND CONCLUSION: Google Gemini excelled in content volume while maintaining high comprehensiveness and accuracy, outperforming ChatGPT and Microsoft Copilot in content generation. The platforms displayed comparable readability and linguistic complexity. These findings inform AI tool selection in health-related contexts, emphasizing Google Gemini’s strength in generating detailed responses. Its superior performance suggests potential utility in clinical and patient-facing applications requiring accurate and comprehensive content.

Keywords: Artificial intelligence, ChatGPT, dry eye disease, Google Gemini, Microsoft Copilot

Corresponding Author: Abdullah Agin, Türkiye
Manuscript Language: English