Evaluating AI Language Models for Patient Queries on Total Knee Replacement (TKR)

Posters

Academic Level (Author 1)

Medical Student

Academic Level (Author 2)

Medical Student

Academic Level (Author 3)

Medical Student

Academic Level (Author 4)

Faculty

Discipline/Specialty (Author 4)

Orthopedic Surgery

Academic Level (Author 5)

Faculty

Discipline/Specialty (Author 5)

Orthopedic Surgery

Discipline Track

Biomedical ENGR/Technology/Computation

Abstract

Introduction: Within the past few years, large language models (LLMs) such as ChatGPT, LLaMA 3, and Microsoft Copilot have increasingly become a resource that patients use to learn about health care procedures, including total knee replacement (TKR). Previous studies have analyzed the efficacy of LLMs in providing accurate and relevant responses to questions about various procedures. Our study aims to evaluate the clarity, validity, and understandability of LLM responses to patient questions about TKR and to assess how consistently these models provide accurate, valid, and guideline-adherent information to patients.

Methods: We selected 30 frequently asked questions about TKR across five categories: preoperative concerns, operative details, postoperative recovery, complications, and lifestyle changes after surgery. The questions were posed to five AI models: LLaMA 3, ChatGPT-4.0, Microsoft Copilot, Google's Bard (now Gemini), and Perplexity. Clinical orthopedic surgeons specializing in TKR rated each response on a Likert scale for clarity, validity, and understandability.
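
A minimal sketch of how the response-collection step could be scripted is shown below; it is illustrative only and not part of the study protocol. The question-file layout, the model list, and the query_model() helper are assumptions, since each vendor (Meta, OpenAI, Microsoft, Google, Perplexity) exposes its own interface.

# Illustrative sketch: collect each model's answer to the 30 TKR FAQs so that
# reviewers can later grade clarity, validity, and understandability.
# MODELS, the question-file layout, and query_model() are hypothetical placeholders.
import csv

MODELS = ["LLaMA 3", "ChatGPT-4.0", "Microsoft Copilot", "Gemini", "Perplexity"]

def query_model(model: str, question: str) -> str:
    """Placeholder for the vendor-specific API call that returns the model's answer."""
    raise NotImplementedError(f"wire up the {model} API or web interface here")

def collect_responses(question_file: str, out_file: str) -> None:
    # Read the questions (one per row, column "question") and save every
    # model/question/response triple for the reviewers to grade.
    with open(question_file, newline="", encoding="utf-8") as f:
        questions = [row["question"] for row in csv.DictReader(f)]
    with open(out_file, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["model", "question", "response"])
        writer.writeheader()
        for model in MODELS:
            for question in questions:
                writer.writerow({"model": model, "question": question,
                                 "response": query_model(model, question)})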

Results: ChatGPT provided the longest and most detailed responses to the 30 questions, while Gemini offered the most succinct answers. Perplexity consistently included citations with links to sources, although these were not always aligned with the most current guidelines. ChatGPT, Microsoft Copilot, and Gemini included disclaimer statements about variability due to patient-specific factors (age, gender, comorbidities, etc.). Graded assessments revealed that LLaMA 3 and Perplexity achieved the highest scores for clarity (9.6), while ChatGPT scored highest for understandability (9.48). Gemini achieved the highest validity score (9.44), suggesting it provided the most guideline-adherent responses.
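
As a small illustrative sketch, per-model mean scores for each dimension could be computed from the reviewers' ratings roughly as follows; the ratings-file layout (one row per reviewer-model-question with numeric clarity, validity, and understandability columns) is an assumption, not the study's actual data format.

# Illustrative sketch: average the reviewers' Likert ratings per model and per
# dimension. The ratings-file layout is a hypothetical one-row-per-rating format.
from collections import defaultdict
import csv

DIMENSIONS = ("clarity", "validity", "understandability")

def mean_scores(ratings_file: str) -> dict:
    totals = defaultdict(lambda: [0.0, 0])  # (model, dimension) -> [sum, count]
    with open(ratings_file, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            for dim in DIMENSIONS:
                totals[(row["model"], dim)][0] += float(row[dim])
                totals[(row["model"], dim)][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}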

Conclusions: This study demonstrates the potential of AI language models to provide patients with clear, valid, and understandable information about TKR. While each model has unique strengths, such as Perplexity's citations and ChatGPT's detailed explanations, Gemini emerged as the most valid model for guideline-adherent information. These findings can guide the integration of LLMs into patient education and highlight areas for improvement to enhance their reliability and usefulness in clinical practice. Further studies are needed to address limitations and optimize these tools for patient-centered care.

Presentation Type

Poster
