OBJECTIVES: This study compared the performance of ChatGPT, Google Gemini, and Microsoft Copilot in answering 25 questions about dry eye disease, evaluating their responses for comprehensiveness, accuracy, and readability.
METHODS: The artificial intelligence (AI) platforms answered 25 questions derived from the American Academy of Ophthalmology’s Eye Health webpage. Three independent reviewers assigned comprehensiveness (0–5) and accuracy (−2 to 2) scores to each response. Readability metrics included the Flesch-Kincaid Grade Level, Flesch Reading Ease Score, sentence and word statistics, and total content measures. Platforms were compared using Kruskal–Wallis and Friedman tests with post hoc analysis, and reviewer consistency was assessed using the intraclass correlation coefficient (ICC).
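For readers unfamiliar with these measures, the following Python sketch illustrates how the readability metrics, group comparisons, and ICC described above could be computed. It is not the study's code: the response texts and scores are hypothetical, and the textstat, scipy, and pingouin packages are assumed to be available.

import textstat
import pandas as pd
from scipy.stats import kruskal, friedmanchisquare
import pingouin as pg

# Hypothetical answers from the three platforms to the same question.
responses = {
    "ChatGPT": "Dry eye disease occurs when tears cannot lubricate the eye...",
    "Gemini": "Dry eye disease is a common condition in which the eyes...",
    "Copilot": "Dry eye happens when the eyes do not produce enough tears...",
}

# Readability metrics for each response.
for platform, text in responses.items():
    print(platform,
          textstat.flesch_kincaid_grade(text),   # Flesch-Kincaid Grade Level
          textstat.flesch_reading_ease(text))    # Flesch Reading Ease Score

# Hypothetical comprehensiveness scores (one value per question, per platform).
chatgpt = [4, 3, 5, 4, 4]
gemini  = [5, 5, 4, 5, 4]
copilot = [3, 2, 4, 3, 3]

# Kruskal-Wallis treats the platforms as independent groups;
# Friedman treats them as repeated measures on the same set of questions.
print(kruskal(chatgpt, gemini, copilot))
print(friedmanchisquare(chatgpt, gemini, copilot))

# Inter-rater agreement: ICC across three reviewers' scores for one platform.
ratings = pd.DataFrame({
    "question": list(range(5)) * 3,
    "reviewer": ["R1"] * 5 + ["R2"] * 5 + ["R3"] * 5,
    "score":    [4, 3, 5, 4, 4,  4, 4, 5, 3, 4,  5, 3, 4, 4, 4],
})
print(pg.intraclass_corr(data=ratings, targets="question",
                         raters="reviewer", ratings="score"))

In the study itself, the same comparisons were applied to all 25 questions and all three reviewers' scores for each platform.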
RESULTS: Google Gemini demonstrated the highest comprehensiveness and accuracy scores, significantly outperforming Microsoft Copilot (p<0.001). ChatGPT produced the most sentences and words (p<0.001), while readability metrics showed no significant differences among models (p>0.05). Inter-observer reliability was highest for Google Gemini (ICC=0.701), followed by ChatGPT (ICC=0.578), with Microsoft Copilot showing the lowest agreement (ICC=0.495). These results indicate Google Gemini’s superior performance and consistency, whereas Microsoft Copilot had the weakest overall performance.
DISCUSSION AND CONCLUSION: ChatGPT generated the greatest content volume, while Google Gemini maintained the highest comprehensiveness and accuracy, outperforming both ChatGPT and Microsoft Copilot. The platforms displayed comparable readability and linguistic complexity. These findings can inform AI tool selection in health-related contexts, highlighting Google Gemini’s strength in producing detailed, accurate responses. Its superior performance suggests potential utility in clinical and patient-facing applications requiring accurate and comprehensive content.