Economics and Finance Faculty Publications

Document Type

Article

Publication Date

5-13-2026

Abstract

Background: Alzheimer's disease and related dementias (ADRD) are progressive neurodegenerative conditions for which early detection is critical to timely intervention and care planning. However, current diagnostic methods are often inaccessible, costly, and delayed, especially for underserved populations, so there is a growing need for scalable, noninvasive tools that can support timely diagnosis. Spontaneous speech contains rich acoustic and linguistic cues that can serve as noninvasive behavioral markers of cognitive decline. Foundation models, pretrained on large-scale audio or text data, generate high-dimensional embeddings that encode rich contextual and acoustic information.

Objective: This study benchmarks open-source foundation language and speech models to evaluate their effectiveness in detecting ADRD from spontaneous speech as a potential solution for early, noninvasive, and scalable ADRD detection.

Methods: We used the Pioneering Research for Early Prediction of Alzheimer’s and Related Dementias EUREKA (PREPARE) Challenge dataset, which consists of audio recordings from over 1600 participants in 3 categories of cognitive status: healthy control (HC), mild cognitive impairment (MCI), and Alzheimer's disease (AD). We excluded samples that were non-English, contained nonspontaneous speech, or were of poor quality. The final sample comprised 703 (59.13%) HC, 81 (6.81%) MCI, and 405 (34.06%) AD cases. We systematically benchmarked 18 open-source foundation speech and language models on classifying cognitive status into these 3 categories (HC, MCI, or AD). Post hoc interpretability analysis was performed for the best-performing model using Shapley additive explanations, linking high-dimensional embeddings with explainable acoustic and linguistic markers.
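The class proportions reported above can be reproduced from the counts, and the same arithmetic gives a majority-class reference point for reading the accuracy figures. The following is a minimal sketch, not part of the study's code: the function names are illustrative, and the majority-class baseline is an added point of comparison that the abstract does not itself report.

```python
# Class counts as reported in the abstract's Methods section.
COUNTS = {"HC": 703, "MCI": 81, "AD": 405}


def class_shares(counts):
    """Return each class's share of the total, in percent, rounded to 2 places."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


def majority_baseline_accuracy(counts):
    """Accuracy of a classifier that always predicts the most frequent class.

    This baseline is an illustrative assumption for context; it is not a
    result stated in the source.
    """
    total = sum(counts.values())
    return max(counts.values()) / total


if __name__ == "__main__":
    print(sum(COUNTS.values()))
    print(class_shares(COUNTS))
    print(round(majority_baseline_accuracy(COUNTS), 4))
```

Because the classes are imbalanced (MCI is under 7% of the sample), accuracy alone can be misleading, which is one reason to read it alongside the area under the curve.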

Results: The Whisper-medium model achieved the highest performance among speech models, with an accuracy of 0.731 and an area under the curve (AUC) of 0.802, while Bidirectional Encoder Representations from Transformers (BERT) with pause annotation achieved the top performance among language models, with an accuracy of 0.662 and an AUC of 0.744. Overall, ADRD detection based on audio embeddings generated by state-of-the-art automatic speech recognition models outperformed the other approaches, and the inclusion of nonsemantic information, such as pause patterns, consistently improved the classification performance of text-embedding–based models.
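To illustrate what pause annotation of a transcript might look like before it is passed to a text model, here is a minimal sketch. The abstract does not describe how pauses were encoded, so everything below is an assumption: the function name, the marker strings `<short_pause>` and `<long_pause>`, the 0.5 s and 2.0 s thresholds, and the word-timestamp input format are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical input format: (token, start time in seconds, end time in seconds).
Word = Tuple[str, float, float]


def annotate_pauses(words: List[Word],
                    short_gap: float = 0.5,
                    long_gap: float = 2.0) -> str:
    """Insert pause markers between words based on the silence between them.

    The thresholds and marker strings are illustrative assumptions, not
    values taken from the study.
    """
    tokens = []
    prev_end = None
    for token, start, end in words:
        if prev_end is not None:
            gap = start - prev_end
            if gap >= long_gap:
                tokens.append("<long_pause>")
            elif gap >= short_gap:
                tokens.append("<short_pause>")
        tokens.append(token)
        prev_end = end
    return " ".join(tokens)


if __name__ == "__main__":
    sample = [
        ("the", 0.0, 0.2),
        ("boy", 0.3, 0.6),
        ("is", 1.4, 1.5),
        ("reaching", 4.0, 4.6),
    ]
    print(annotate_pauses(sample))
```

The annotated string keeps the words in order and adds only the timing information a plain transcript discards, which is the kind of nonsemantic signal the Results describe as helpful for text-based models.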

Conclusions: Our work presents a comprehensive comparative evaluation of state-of-the-art speech and language models for AD and MCI detection on a large, clinically relevant dataset. Embeddings derived from acoustic models, which capture both semantic and acoustic information, show promising performance and highlight the potential for developing a more scalable, noninvasive, and cost-effective early detection tool for ADRD.

Comments

© Jingyu Li, Lingchao Mao, Xi Mao, Hairong Wang, Zhendong Wang, Xuelei Sherry Ni.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The complete bibliographic information, a link to the original publication on https://formative.jmir.org, as well as this copyright and license information must be included.

Publication Title

JMIR Formative Research

DOI

10.2196/79411
