Economics and Finance Faculty Publications
Document Type
Article
Publication Date
5-13-2026
Abstract
Background: Alzheimer's disease and related dementias (ADRD) are progressive neurodegenerative conditions for which early detection is critical to timely intervention and care planning. However, current diagnostic methods are often inaccessible, costly, and delayed, especially for underserved populations. There is a growing need for scalable, noninvasive tools that can support timely diagnosis. Spontaneous speech carries acoustic and linguistic features that can serve as noninvasive behavioral markers of cognitive decline. Foundation models, pretrained on large-scale audio or text data, generate high-dimensional embeddings that encode contextual and acoustic information.
Objective: This study benchmarks open-source foundation language and speech models to evaluate their effectiveness in detecting ADRD from spontaneous speech as a potential solution for early, noninvasive, and scalable ADRD detection.
Methods: We used the Pioneering Research for Early Prediction of Alzheimer’s and Related Dementias EUREKA (PREPARE) Challenge dataset, which consists of audio recordings from over 1600 participants in 3 distinct categories of cognitive status: healthy control (HC), mild cognitive impairment (MCI), and Alzheimer's disease (AD). We excluded samples that were non-English, that contained nonspontaneous speech, or that were of poor quality. The final sample comprised 703 (59.13%) HC, 81 (6.81%) MCI, and 405 (34.06%) AD cases. We systematically benchmarked 18 open-source foundation speech and language models on classifying cognitive status into the 3 categories (HC, MCI, or AD). Post hoc interpretability analysis was performed for the best-performing model using Shapley additive explanations, linking high-dimensional embeddings with explainable acoustic and linguistic markers.
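As a quick check on the class balance reported in the Methods paragraph, the short Python sketch below recomputes the percentages from the three stated counts and derives the accuracy of a trivial classifier that always predicts the largest class. The variable names are illustrative and the baseline is not a figure from the article; it assumes the evaluation data share the same class proportions as the full sample, which the abstract does not state.

```python
# Class counts as reported in the Methods paragraph.
counts = {"HC": 703, "MCI": 81, "AD": 405}

# Total number of retained samples.
total = sum(counts.values())

# Percentage share of each class, rounded to 2 decimals as in the abstract.
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}

# Accuracy of always predicting the most frequent class (assumption: the
# evaluation split has the same class proportions as the full sample).
majority_label = max(counts, key=counts.get)
majority_baseline = counts[majority_label] / total

print(total)
print(shares)
print(majority_label, round(majority_baseline, 3))
```

Under that assumption, a model would need to exceed roughly 0.59 accuracy to beat the majority-class guess, and the small MCI share suggests that accuracy alone may hide weak performance on that class.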
Results: The Whisper-medium model achieved the highest performance among speech models, with 0.731 accuracy and 0.802 area under the curve, while Bidirectional Encoder Representations from Transformers with pause annotation was the best language model, with 0.662 accuracy and 0.744 area under the curve. Overall, ADRD detection based on audio embeddings generated by a state-of-the-art automatic speech recognition model outperformed other models, and the inclusion of nonsemantic information, such as pause patterns, consistently improved the classification performance of text-embedding–based models.
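To illustrate what "pause annotation" of a transcript could look like, the Python sketch below inserts pause markers between words based on the silent gap between them. This is a hypothetical illustration, not the authors' method: the function name, the marker strings, the 0.5-second and 2.0-second thresholds, and the example word timings are all assumptions made for this sketch.

```python
def annotate_pauses(words, short=0.5, long=2.0):
    """Insert pause markers between words based on the silent gap.

    words: ordered list of (text, start_sec, end_sec) tuples.
    short, long: hypothetical gap thresholds in seconds.
    """
    tokens = []
    prev_end = None
    for text, start, end in words:
        if prev_end is not None:
            gap = start - prev_end
            if gap >= long:
                tokens.append("[LONG_PAUSE]")  # hypothetical marker
            elif gap >= short:
                tokens.append("[PAUSE]")  # hypothetical marker
        tokens.append(text)
        prev_end = end
    return " ".join(tokens)


# Made-up word timings for illustration only (not from the dataset).
example = [
    ("the", 0.0, 0.2),
    ("boy", 0.3, 0.6),
    ("is", 1.4, 1.5),
    ("reaching", 4.0, 4.6),
]
annotated = annotate_pauses(example)
print(annotated)
```

The annotated string could then be passed to a text model in place of the plain transcript, so that timing information absent from the words themselves becomes visible to a text-only encoder.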
Conclusions: Our work presents a comprehensive comparative evaluation of state-of-the-art speech and language models for AD and MCI detection on a large, clinically relevant dataset. Embeddings derived from acoustic models, which capture both semantic and acoustic information, show promising performance and highlight the potential for developing a more scalable, noninvasive, and cost-effective early detection tool for ADRD.
Recommended Citation
Li, J. et al. (2026) “Early Detection of Alzheimer’s Disease and Related Dementias From Spontaneous Speech Using Foundation Speech and Language Models: Comparative Evaluation,” JMIR Formative Research, 10, p. e79411. https://doi.org/10.2196/79411
Publication Title
JMIR Formative Research
DOI
10.2196/79411

Comments
© Jingyu Li, Lingchao Mao, Xi Mao, Hairong Wang, Zhendong Wang, Xuelei Sherry Ni.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The complete bibliographic information, a link to the original publication on https://formative.jmir.org, as well as this copyright and license information must be included.