BACKGROUND
ChatGPT, an artificial intelligence (AI) chatbot, is the fastest-growing consumer application in history. Given recent trends showing increasing patient use of Internet sources for self-education, we sought to evaluate the quality of ChatGPT-generated responses for patient education on thyroid nodules.
METHODS
ChatGPT was queried four times with the same set of 30 questions. Queries differed by initial chatbot prompting: no prompting, patient-friendly prompting, 8th-grade level prompting, and prompting for references. Answers were graded on a hierarchical scale: incorrect, partially correct, correct, or correct with references. Proportions of responses at incremental score thresholds were compared by prompt type using chi-squared analysis. The Flesch-Kincaid grade level was calculated for each answer, and the relationship between prompt type and grade level was assessed using analysis of variance. References provided within ChatGPT answers were totaled and analyzed for veracity.
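The statistical workflow described above is straightforward to reproduce. The following is a minimal Python sketch, not the authors' code, assuming the per-answer texts and score counts have already been collected; the scipy calls, the naive syllable heuristic, and the illustrative counts are assumptions for illustration only.

```python
# Minimal sketch of the readability and comparison analyses described in Methods.
# Not the authors' code; counts below are hypothetical placeholders.
import re
from scipy.stats import chi2_contingency, f_oneway

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    # Rough vowel-group heuristic for syllables; production tools use dictionaries.
    n_syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return 0.39 * n_words / sentences + 11.8 * n_syllables / n_words - 15.59

# Chi-squared comparison of answers meeting a score threshold, by prompt type.
# Rows = four prompt types; columns = [meets threshold, does not]. Values are hypothetical.
counts = [[21, 9], [22, 8], [20, 10], [20, 10]]
chi2, p_value, dof, _ = chi2_contingency(counts)

# One-way ANOVA of Flesch-Kincaid grade level across prompt types, where
# grades_by_prompt maps each prompt type to its list of per-answer grade levels:
# f_stat, p_anova = f_oneway(*grades_by_prompt.values())
```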
RESULTS
Across all prompts (n=120 questions), 83 answers (69.2%) were at least correct. Proportions of responses that were at least partially correct (p=0.795) and at least correct (p=0.402) did not differ by prompt; proportions that were correct with references did (p<0.0001). Responses from 8th-grade level prompting had the lowest mean grade level (13.43 ± 2.86), which was significantly lower than no prompting (14.97 ± 2.01, p=0.01) and prompting for references (16.43 ± 2.05, p<0.0001). All referenced publications within answers (80/80, 100%) were generated by prompting for references. Seventy references (87.5%) were legitimate citations, and 58/80 (72.5%) accurately reported information from the cited publications.
CONCLUSION
ChatGPT overall provides appropriate answers to most questions on thyroid nodules regardless of prompting. Despite targeted prompting strategies, ChatGPT reliably generates responses at reading grade levels well above accepted recommendations for presenting medical information to patients. Significant rates of AI hallucination may preclude clinicians from recommending the current version of ChatGPT as an educational tool for patients at this time.