
Posters
Presenting Author Academic/Professional Position
Medical Student
Academic Level (Author 1)
Medical Student
Discipline/Specialty (Author 1)
Orthopedic Surgery
Academic Level (Author 2)
Medical Student
Discipline/Specialty (Author 2)
Internal Medicine
Academic Level (Author 3)
Medical Student
Discipline/Specialty (Author 3)
Otolaryngology (OHNS)
Academic Level (Author 4)
Other
Discipline/Specialty (Author 4)
Orthopedic Surgery
Academic Level (Author 5)
Faculty
Discipline/Specialty (Author 5)
Orthopedic Surgery
Presentation Type
Poster
Discipline Track
Patient Care
Abstract Type
Research/Clinical
Abstract
Aims: This study aimed to evaluate the accuracy, comprehensiveness, and readability of responses generated by ChatGPT 4.0 to 30 common patient questions about the Bernese periacetabular osteotomy (PAO).
Methods: Two fellowship-trained orthopaedic surgeons specializing in hip preservation selected 30 questions from a prior study that identified common PAO questions on social media. Each question was entered into ChatGPT 4.0, and the surgeons independently graded the responses using an established scale: “excellent,” “satisfactory, requiring minimal clarification,” “satisfactory, requiring moderate clarification,” or “unsatisfactory.” Accuracy and comprehensiveness were assessed based on the concordance of response content with the current literature. Readability was analyzed by calculating the Flesch-Kincaid Grade Level and Flesch Reading Ease score. Interrater reliability was measured with Cohen's kappa.
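For readers who want to reproduce these metrics, the sketch below is a minimal illustration rather than the study's actual analysis code: it assumes the Python textstat package for the two Flesch scores and scikit-learn for Cohen's kappa, with placeholder response texts and ratings.

```python
# Minimal sketch of the readability and agreement metrics described in the Methods.
# Library choices (textstat, scikit-learn) and all inputs are assumptions,
# not taken from the study.
import textstat
from sklearn.metrics import cohen_kappa_score

# Placeholder ChatGPT responses (the study graded 30 responses to PAO questions).
responses = [
    "Periacetabular osteotomy reorients the hip socket to improve coverage of the femoral head.",
    "Recovery typically involves several weeks of protected weight-bearing followed by physical therapy.",
]

for text in responses:
    grade = textstat.flesch_kincaid_grade(text)   # study mean: 11.07 +/- 1.60
    ease = textstat.flesch_reading_ease(text)     # study mean: 39.89 +/- 8.37
    print(f"Grade level: {grade:.2f}, Reading ease: {ease:.2f}")

# Hypothetical ordinal ratings from the two reviewers
# (1 = excellent ... 4 = unsatisfactory); the study reported kappa = 0.5.
rater_1 = [1, 1, 2, 1, 3, 1]
rater_2 = [1, 2, 2, 1, 2, 1]
print("Cohen's kappa:", cohen_kappa_score(rater_1, rater_2))
```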
Results: Regarding accuracy and comprehensiveness, 96.7% of responses were graded as “excellent” or “satisfactory, requiring minimal clarification.” One reviewer rated 24 responses (80%) as “excellent,” while the second rated 17 responses (56.7%) at that level. Of the remaining responses, 6 (20%) and 12 (40%) were rated as “satisfactory, requiring minimal clarification” by the first and second reviewers, respectively. Only one response (3.3%) was graded as “satisfactory, requiring moderate clarification,” and none were rated as “unsatisfactory.” Interrater reliability showed moderate agreement (κ = 0.5). Readability analysis revealed a mean Flesch-Kincaid Grade Level corresponding to an 11th-grade reading level (11.07 ± 1.60) and a mean Flesch Reading Ease score in the college-level range (39.89 ± 8.37). Notably, 93.3% of responses required at least college-level reading comprehension (Grade Level ≥ 12.5 or Reading Ease ≤ 50.0).
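The 93.3% figure can be checked directly against the thresholds quoted above. The snippet below is a hypothetical illustration of that rule; the per-response scores are invented, not the study's data.

```python
# Illustrative check of the college-level criterion from the Results:
# Flesch-Kincaid Grade Level >= 12.5 or Flesch Reading Ease <= 50.0.
def requires_college_level(grade_level: float, reading_ease: float) -> bool:
    """Return True if a response meets either college-level threshold."""
    return grade_level >= 12.5 or reading_ease <= 50.0

# Invented per-response scores for illustration (study means: 11.07 and 39.89).
scores = [(11.0, 42.3), (13.1, 35.8), (9.8, 55.2), (12.7, 48.0)]
flagged = sum(requires_college_level(g, e) for g, e in scores)
print(f"{flagged / len(scores):.1%} of responses require college-level reading")
```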
Conclusion: ChatGPT 4.0 provided accurate, satisfactory answers to common questions about PAO, with most rated as excellent. However, the advanced reading level may pose comprehension challenges for patients. ChatGPT is a promising educational resource for PAO patients; future iterations should prioritize improving readability without compromising quality.
Recommended Citation
Gaddis, John M.; Arellano, Elias; Alsabawi, Yossef; Castaneda, Pablo; and Wells, Joel, "Assessing the Accuracy and Readability of ChatGPT 4.0’s Responses to Common Patient Questions Regarding Periacetabular Osteotomy (PAO)" (2025). Research Symposium. 77.
https://scholarworks.utrgv.edu/somrs/2025/posters/77