Speech foundation models in healthcare: Effect of layer selection on pathological speech feature prediction
Authors:
Daniela A. Wiepert,
Rene L. Utianski,
Joseph R. Duffy,
John L. Stricker,
Leland R. Barnard,
David T. Jones,
Hugo Botha
Abstract:
Accurately extracting clinical information from speech is critical to the diagnosis and treatment of many neurological conditions. As such, there is interest in leveraging AI for automatic, objective assessments of clinical speech to facilitate diagnosis and treatment of speech disorders. We explore transfer learning using foundation models, focusing on the impact of layer selection for the downst…
▽ More
Accurately extracting clinical information from speech is critical to the diagnosis and treatment of many neurological conditions. As such, there is interest in leveraging AI for automatic, objective assessments of clinical speech to facilitate diagnosis and treatment of speech disorders. We explore transfer learning using foundation models, focusing on the impact of layer selection for the downstream task of predicting pathological speech features. We find that selecting an optimal layer can greatly improve performance (~15.8% increase in balanced accuracy per feature as compared to worst layer, ~13.6% increase as compared to final layer), though the best layer varies by predicted feature and does not always generalize well to unseen data. A learned weighted sum offers comparable performance to the average best layer in-distribution (only ~1.2% lower) and had strong generalization for out-of-distribution data (only 1.5% lower than the average best layer).
△ Less
Submitted 21 June, 2024; v1 submitted 2 February, 2024;
originally announced February 2024.
Detecting Speech Abnormalities with a Perceiver-based Sequence Classifier that Leverages a Universal Speech Model
Authors:
Hagen Soltau,
Izhak Shafran,
Alex Ottenwess,
Joseph R. JR Duffy,
Rene L. Utianski,
Leland R. Barnard,
John L. Stricker,
Daniela Wiepert,
David T. Jones,
Hugo Botha
Abstract:
We propose a Perceiver-based sequence classifier to detect abnormalities in speech reflective of several neurological disorders. We combine this classifier with a Universal Speech Model (USM) that is trained (unsupervised) on 12 million hours of diverse audio recordings. Our model compresses long sequences into a small set of class-specific latent representations and a factorized projection is use…
▽ More
We propose a Perceiver-based sequence classifier to detect abnormalities in speech reflective of several neurological disorders. We combine this classifier with a Universal Speech Model (USM) that is trained (unsupervised) on 12 million hours of diverse audio recordings. Our model compresses long sequences into a small set of class-specific latent representations and a factorized projection is used to predict different attributes of the disordered input speech. The benefit of our approach is that it allows us to model different regions of the input for different classes and is at the same time data efficient. We evaluated the proposed model extensively on a curated corpus from the Mayo Clinic. Our model outperforms standard transformer (80.9%) and perceiver (81.8%) models and achieves an average accuracy of 83.1%. With limited task-specific data, we find that pretraining is important and surprisingly pretraining with the unrelated automatic speech recognition (ASR) task is also beneficial. Encodings from the middle layers provide a mix of both acoustic and phonetic information and achieve best prediction results compared to just using the final layer encodings (83.1% vs. 79.6%). The results are promising and with further refinements may help clinicians detect speech abnormalities without needing access to highly specialized speech-language pathologists.
△ Less
Submitted 16 October, 2023;
originally announced October 2023.