OBJECTIVES: This study compared the performance of ChatGPT, Google Gemini, and Microsoft Copilot in answering 25 questions about dry eye disease, evaluating their responses for comprehensiveness, accuracy, and readability.
METHODS: The artificial intelligence (AI) platforms answered 25 questions derived from the American Academy of Ophthalmology’s Eye Health webpage. Three independent reviewers assigned comprehensiveness (0–5) and accuracy (−2 to 2) scores to each response. Readability metrics included the Flesch-Kincaid Grade Level, Flesch Reading Ease Score, sentence and word statistics, and total content measures. Platforms were compared using Kruskal–Wallis and Friedman tests with post hoc analysis, and reviewer consistency was assessed using the intraclass correlation coefficient (ICC).
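For readers unfamiliar with these measures, the following Python sketch illustrates how the readability metrics, group comparisons, and ICC described above could be computed. It is not the study's code: the response texts and scores are hypothetical, and the textstat, scipy, and pingouin packages are assumed to be available.

import textstat
import pandas as pd
from scipy.stats import kruskal, friedmanchisquare
import pingouin as pg

# Hypothetical answers from the three platforms to the same question.
responses = {
    "ChatGPT": "Dry eye disease occurs when tears cannot lubricate the eye...",
    "Gemini": "Dry eye disease is a common condition in which the eyes...",
    "Copilot": "Dry eye happens when the eyes do not produce enough tears...",
}

# Readability metrics for each response.
for platform, text in responses.items():
    print(platform,
          textstat.flesch_kincaid_grade(text),   # Flesch-Kincaid Grade Level
          textstat.flesch_reading_ease(text))    # Flesch Reading Ease Score

# Hypothetical comprehensiveness scores (one value per question, per platform).
chatgpt = [4, 3, 5, 4, 4]
gemini  = [5, 5, 4, 5, 4]
copilot = [3, 2, 4, 3, 3]

# Kruskal-Wallis treats the platforms as independent groups;
# Friedman treats them as repeated measures on the same set of questions.
print(kruskal(chatgpt, gemini, copilot))
print(friedmanchisquare(chatgpt, gemini, copilot))

# Inter-rater agreement: ICC across three reviewers' scores for one platform.
ratings = pd.DataFrame({
    "question": list(range(5)) * 3,
    "reviewer": ["R1"] * 5 + ["R2"] * 5 + ["R3"] * 5,
    "score":    [4, 3, 5, 4, 4,  4, 4, 5, 3, 4,  5, 3, 4, 4, 4],
})
print(pg.intraclass_corr(data=ratings, targets="question",
                         raters="reviewer", ratings="score"))

In the study itself, the same comparisons were applied to all 25 questions and all three reviewers' scores for each platform.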
RESULTS: Google Gemini demonstrated the highest comprehensiveness and accuracy scores, significantly outperforming Microsoft Copilot (p<0.001). ChatGPT produced the most sentences and words (p<0.001), while readability metrics showed no significant differences among models (p>0.05). Inter-observer reliability was highest for Google Gemini (ICC=0.701), followed by ChatGPT (ICC=0.578), with Microsoft Copilot showing the lowest agreement (ICC=0.495). These results indicate Google Gemini’s superior performance and consistency, whereas Microsoft Copilot had the weakest overall performance.
DISCUSSION AND CONCLUSION: ChatGPT generated the greatest content volume, while Google Gemini maintained the highest comprehensiveness and accuracy, outperforming both ChatGPT and Microsoft Copilot. The platforms displayed comparable readability and linguistic complexity. These findings can inform AI tool selection in health-related contexts, highlighting Google Gemini’s strength in producing detailed, accurate responses. Its superior performance suggests potential utility in clinical and patient-facing applications requiring accurate and comprehensive content.