
Showing 1–22 of 22 results for author: Liu, A T

Searching in archive cs.
  1. arXiv:2509.14480

    cs.CL cs.AI cs.MA

    Process-Supervised Reinforcement Learning for Interactive Multimodal Tool-Use Agents

    Authors: Weiting Tan, Xinghua Qu, Ming Tu, Meng Ge, Andy T. Liu, Philipp Koehn, Lu Lu

    Abstract: Effective interactive tool use requires agents to master Tool Integrated Reasoning (TIR): a complex process involving multi-turn planning and long-context dialogue management. To train agents for this dynamic process, particularly in multi-modal contexts, we introduce a sandbox environment for reinforcement learning (RL) that supports interleaved speech-text rollouts. Our core strategy, Turn-level…

    Submitted 17 September, 2025; originally announced September 2025.

  2. arXiv:2412.16474

    eess.AS cs.CL

    Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling

    Authors: Shao-Syuan Huang, Kuan-Po Huang, Andy T. Liu, Hung-yi Lee

    Abstract: Multilingual Automatic Speech Recognition (ASR) aims to recognize and transcribe speech from multiple languages within a single system. Whisper, one of the most advanced ASR models, excels in this domain by handling 99 languages effectively, leveraging a vast amount of data and incorporating language tags as prefixes to guide the recognition process. However, despite its success, Whisper struggles…

    Submitted 20 December, 2024; originally announced December 2024.

    Comments: Accepted by ICASSP 2025

  3. arXiv:2411.05361

    cs.CL eess.AS

    Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks

    Authors: Chien-yu Huang, Wei-Chih Chen, Shu-wen Yang, Andy T. Liu, Chen-An Li, Yu-Xiang Lin, Wei-Cheng Tseng, Anuj Diwan, Yi-Jen Shih, Jiatong Shi, William Chen, Chih-Kai Yang, Wenze Ren, Xuanjun Chen, Chi-Yuan Hsiao, Puyuan Peng, Shih-Heng Wang, Chun-Yi Kuan, Ke-Han Lu, Kai-Wei Chang, Fabian Ritter-Gutierrez, Kuan-Po Huang, Siddhant Arora, You-Kuan Lin, Ming To Chuang , et al. (55 additional authors not shown)

    Abstract: Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interactions by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluati…

    Submitted 9 June, 2025; v1 submitted 8 November, 2024; originally announced November 2024.

    Comments: ICLR 2025

  4. arXiv:2410.13351

    cs.CL cs.AI cs.LG

    Representation Learning of Structured Data for Medical Foundation Models

    Authors: Vijay Prakash Dwivedi, Viktor Schlegel, Andy T. Liu, Thanh-Tung Nguyen, Abhinav Ramesh Kashyap, Jeng Wei, Wei-Hsian Yin, Stefan Winkler, Robby T. Tan

    Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across various domains, including healthcare. However, their ability to effectively represent structured non-textual data, such as the alphanumeric medical codes used in records like ICD-10 or SNOMED-CT, is limited and has been particularly exposed in recent research. This paper examines the challenges LLMs face in processing me…

    Submitted 17 October, 2024; originally announced October 2024.

    Comments: NeurIPS 2024 Workshop on Unifying Representations in Neural Models (UniReps 2024)

  5. arXiv:2409.16295

    eess.AS cs.CL cs.LG cs.SD

    Efficient Training of Self-Supervised Speech Foundation Models on a Compute Budget

    Authors: Andy T. Liu, Yi-Cheng Lin, Haibin Wu, Stefan Winkler, Hung-yi Lee

    Abstract: Despite their impressive success, training foundation models remains computationally costly. This paper investigates how to efficiently train speech foundation models with self-supervised learning (SSL) under a limited compute budget. We examine critical factors in SSL that impact the budget, including model architecture, model size, and data size. Our goal is to make analytical steps toward under…

    Submitted 4 February, 2025; v1 submitted 9 September, 2024; originally announced September 2024.

    Comments: Accepted to IEEE SLT 2024

    Journal ref: 2024 IEEE Spoken Language Technology Workshop (SLT)

  6. arXiv:2408.14418

    cs.CL cs.AI

    MEDSAGE: Enhancing Robustness of Medical Dialogue Summarization to ASR Errors with LLM-generated Synthetic Dialogues

    Authors: Kuluhan Binici, Abhinav Ramesh Kashyap, Viktor Schlegel, Andy T. Liu, Vijay Prakash Dwivedi, Thanh-Tung Nguyen, Xiaoxue Gao, Nancy F. Chen, Stefan Winkler

    Abstract: Automatic Speech Recognition (ASR) systems are pivotal in transcribing speech into text, yet the errors they introduce can significantly degrade the performance of downstream tasks like summarization. This issue is particularly pronounced in clinical dialogue summarization, a low-resource domain where supervised data for fine-tuning is scarce, necessitating the use of ASR models as black-box solut…

    Submitted 8 January, 2025; v1 submitted 26 August, 2024; originally announced August 2024.

    Comments: Accepted by the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)

  7. arXiv:2408.12095

    cs.CL cs.AI cs.LG

    uMedSum: A Unified Framework for Advancing Medical Abstractive Summarization

    Authors: Aishik Nagar, Yutong Liu, Andy T. Liu, Viktor Schlegel, Vijay Prakash Dwivedi, Arun-Kumar Kaliya-Perumal, Guna Pratheep Kalanchiam, Yili Tang, Robby T. Tan

    Abstract: Medical abstractive summarization faces the challenge of balancing faithfulness and informativeness. Current methods often sacrifice key information for faithfulness or introduce confabulations when prioritizing informativeness. While recent advancements in techniques like in-context learning (ICL) and fine-tuning have improved medical summarization, they often overlook crucial aspects such as fai…

    Submitted 25 August, 2024; v1 submitted 21 August, 2024; originally announced August 2024.

    Comments: 12 pages

  8. On the social bias of speech self-supervised models

    Authors: Yi-Cheng Lin, Tzu-Quan Lin, Hsi-Che Lin, Andy T. Liu, Hung-yi Lee

    Abstract: Self-supervised learning (SSL) speech models have achieved remarkable performance in various tasks, yet the biased outcomes, especially affecting marginalized groups, raise significant concerns. Social bias refers to the phenomenon where algorithms potentially amplify disparate properties between social groups present in the data used for training. Bias in SSL models can perpetuate injustice by au…

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH 2024

    Journal ref: Proc. Interspeech 2024, 4638-4642

  9. arXiv:2404.09385

    eess.AS cs.CL eess.SP

    A Large-Scale Evaluation of Speech Foundation Models

    Authors: Shu-wen Yang, Heng-Jui Chang, Zili Huang, Andy T. Liu, Cheng-I Lai, Haibin Wu, Jiatong Shi, Xuankai Chang, Hsiang-Sheng Tsai, Wen-Chin Huang, Tzu-hsun Feng, Po-Han Chi, Yist Y. Lin, Yung-Sung Chuang, Tzu-Hsien Huang, Wei-Cheng Tseng, Kushal Lakhotia, Shang-Wen Li, Abdelrahman Mohamed, Shinji Watanabe, Hung-yi Lee

    Abstract: The foundation model paradigm leverages a shared foundation model to achieve state-of-the-art (SOTA) performance for various tasks, requiring minimal downstream-specific modeling and data annotation. This approach has proven crucial in the field of Natural Language Processing (NLP). However, the speech processing community lacks a similar setup to explore the paradigm systematically. In this work,…

    Submitted 29 May, 2024; v1 submitted 14 April, 2024; originally announced April 2024.

    Comments: The extended journal version of SUPERB and SUPERB-SG. Published in IEEE/ACM TASLP. The arXiv version is preferred

  10. Parallel Synthesis for Autoregressive Speech Generation

    Authors: Po-chun Hsu, Da-rong Liu, Andy T. Liu, Hung-yi Lee

    Abstract: Autoregressive neural vocoders have achieved outstanding performance in speech synthesis tasks such as text-to-speech and voice conversion. An autoregressive vocoder predicts a sample at some time step conditioned on those at previous time steps. Though it synthesizes natural human speech, the iterative generation inevitably makes the synthesis time proportional to the utterance length, leading to…

    Submitted 5 June, 2024; v1 submitted 25 April, 2022; originally announced April 2022.

    Comments: IEEE/ACM Transactions on Audio, Speech, and Language Processing

  11. arXiv:2203.06849

    cs.CL cs.SD eess.AS

    SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities

    Authors: Hsiang-Sheng Tsai, Heng-Jui Chang, Wen-Chin Huang, Zili Huang, Kushal Lakhotia, Shu-wen Yang, Shuyan Dong, Andy T. Liu, Cheng-I Jeff Lai, Jiatong Shi, Xuankai Chang, Phil Hall, Hsuan-Jui Chen, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-yi Lee

    Abstract: Transfer learning has proven to be crucial in advancing the state of speech and natural language processing research in recent years. In speech, a model pre-trained by self-supervised learning transfers remarkably well on multiple tasks. However, the lack of a consistent evaluation methodology is limiting towards a holistic understanding of the efficacy of such models. SUPERB was a step towards in…

    Submitted 14 March, 2022; originally announced March 2022.

    Comments: ACL 2022 main conference

  12. arXiv:2203.01543

    cs.CL cs.AI cs.LG

    QaNER: Prompting Question Answering Models for Few-shot Named Entity Recognition

    Authors: Andy T. Liu, Wei Xiao, Henghui Zhu, Dejiao Zhang, Shang-Wen Li, Andrew Arnold

    Abstract: Recently, prompt-based learning for pre-trained language models has succeeded in few-shot Named Entity Recognition (NER) by exploiting prompts as task guidance to increase label efficiency. However, previous prompt-based methods for few-shot NER have limitations such as a higher computational complexity, poor zero-shot ability, requiring manual prompt engineering, or lack of prompt robustness. In…

    Submitted 4 March, 2022; v1 submitted 3 March, 2022; originally announced March 2022.

    Comments: 8 pages, 6 figures
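
    Illustrative sketch (not from the paper): the abstract above frames few-shot NER as prompting a question answering model. A minimal way to approximate that idea is to ask one natural-language question per entity type against the sentence as context, using an off-the-shelf extractive QA pipeline; the questions, model checkpoint, and score threshold below are assumptions for demonstration only, not the authors' settings.

      # Hypothetical sketch: NER recast as extractive question answering.
      # Questions, checkpoint, and threshold are illustrative assumptions.
      from transformers import pipeline

      qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

      # One question per entity type (hypothetical phrasings).
      QUESTIONS = {
          "PER": "Which person is mentioned in the text?",
          "ORG": "Which organization is mentioned in the text?",
          "LOC": "Which location is mentioned in the text?",
      }

      def extract_entities(sentence, threshold=0.3):
          """Ask one question per entity type and keep confident answer spans."""
          entities = []
          for label, question in QUESTIONS.items():
              result = qa(question=question, context=sentence)
              if result["score"] >= threshold:
                  entities.append((label, result["answer"], result["start"], result["end"]))
          return entities

      print(extract_entities("Andy Liu joined the university in Taipei."))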

  13. arXiv:2110.07957

    eess.AS cs.CL cs.SD

    Don't speak too fast: The impact of data bias on self-supervised speech models

    Authors: Yen Meng, Yi-Hui Chou, Andy T. Liu, Hung-yi Lee

    Abstract: Self-supervised Speech Models (S3Ms) have been proven successful in many speech downstream tasks, like ASR. However, how pre-training data affects S3Ms' downstream behavior remains an unexplored issue. In this paper, we study how pre-training data affects S3Ms by pre-training models on biased datasets targeting different factors of speech, including gender, content, and prosody, and evaluate these…

    Submitted 26 April, 2022; v1 submitted 15 October, 2021; originally announced October 2021.

    Comments: Accepted by ICASSP 2022

  14. arXiv:2106.00273

    cs.SD cs.LG eess.AS

    Improving the Adversarial Robustness for Speaker Verification by Self-Supervised Learning

    Authors: Haibin Wu, Xu Li, Andy T. Liu, Zhiyong Wu, Helen Meng, Hung-yi Lee

    Abstract: Previous works have shown that automatic speaker verification (ASV) is seriously vulnerable to malicious spoofing attacks, such as replay, synthetic speech, and recently emerged adversarial attacks. Great efforts have been dedicated to defending ASV against replay and synthetic speech; however, only a few approaches have been explored to deal with adversarial attacks. All the existing approaches t…

    Submitted 4 June, 2024; v1 submitted 1 June, 2021; originally announced June 2021.

    Comments: Accepted by TASLP

  15. arXiv:2105.01051

    cs.CL cs.SD eess.AS

    SUPERB: Speech processing Universal PERformance Benchmark

    Authors: Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-yi Lee

    Abstract: Self-supervised learning (SSL) has proven vital for advancing research in natural language processing (NLP) and computer vision (CV). The paradigm pretrains a shared model on large volumes of unlabeled data and achieves state-of-the-art (SOTA) for various tasks with minimal adaptation. However, the speech processing community lacks a similar setup to systematically explore the paradigm. To bridge…

    Submitted 15 October, 2021; v1 submitted 3 May, 2021; originally announced May 2021.

    Comments: To appear in Interspeech 2021

  16. arXiv:2102.07047

    eess.AS cs.AI

    Adversarial defense for automatic speaker verification by cascaded self-supervised learning models

    Authors: Haibin Wu, Xu Li, Andy T. Liu, Zhiyong Wu, Helen Meng, Hung-yi Lee

    Abstract: Automatic speaker verification (ASV) is one of the core technologies in biometric identification. With the ubiquitous usage of ASV systems in safety-critical applications, more and more malicious attackers attempt to launch adversarial attacks at ASV systems. In the midst of the arms race between attack and defense in ASV, how to effectively improve the robustness of ASV against adversarial attack…

    Submitted 13 February, 2021; originally announced February 2021.

    Comments: Accepted to ICASSP 2021

  17. arXiv:2007.06028

    eess.AS cs.CL cs.LG

    TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech

    Authors: Andy T. Liu, Shang-Wen Li, Hung-yi Lee

    Abstract: We introduce a self-supervised speech pre-training method called TERA, which stands for Transformer Encoder Representations from Alteration. Recent approaches often learn by using a single auxiliary task like contrastive prediction, autoregressive prediction, or masked reconstruction. Unlike previous methods, we use alteration along three orthogonal axes to pre-train Transformer Encoders on a larg…

    Submitted 4 August, 2021; v1 submitted 12 July, 2020; originally announced July 2020.

    Comments: Published in IEEE/ACM TASLP, final published article available at https://ieeexplore.ieee.org/document/9478264

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 29, 2021
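
    Illustrative sketch (not from the paper): the TERA abstract describes pre-training with alterations along three orthogonal axes rather than a single auxiliary task. The toy code below shows what corrupting an acoustic feature matrix along time, channel, and magnitude axes could look like; the axis choices, mask widths, and noise level are assumptions, not the paper's hyperparameters.

      # Toy sketch of altering acoustic features along several axes (time,
      # channel, magnitude) before reconstruction-based pre-training. All mask
      # widths and probabilities are made-up illustrative values.
      import torch

      def alter(features, time_mask=7, channel_mask=8, noise_std=0.1):
          """features: (frames, channels) log-mel features.
          Returns the corrupted features and a mask of altered frames."""
          frames, channels = features.shape
          corrupted = features.clone()
          altered = torch.zeros(frames, dtype=torch.bool)

          # Time axis: zero out a random contiguous block of frames.
          t0 = torch.randint(0, max(frames - time_mask, 1), (1,)).item()
          corrupted[t0:t0 + time_mask] = 0.0
          altered[t0:t0 + time_mask] = True

          # Channel axis: zero out a random contiguous block of feature dimensions.
          c0 = torch.randint(0, max(channels - channel_mask, 1), (1,)).item()
          corrupted[:, c0:c0 + channel_mask] = 0.0

          # Magnitude axis: add small Gaussian noise everywhere.
          corrupted = corrupted + noise_std * torch.randn_like(corrupted)
          return corrupted, altered

      # A Transformer encoder would then be trained to reconstruct the clean
      # frames, typically with the loss restricted to the altered positions.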

  18. arXiv:2006.03265

    cs.CL

    Understanding Self-Attention of Self-Supervised Audio Transformers

    Authors: Shu-wen Yang, Andy T. Liu, Hung-yi Lee

    Abstract: Self-supervised Audio Transformers (SAT) enable great success in many downstream speech applications like ASR, but how they work has not been widely explored yet. In this work, we present multiple strategies for the analysis of attention mechanisms in SAT. We categorize attentions into explainable categories, where we discover each category possesses its own unique functionality. We provide a visu…

    Submitted 10 August, 2020; v1 submitted 5 June, 2020; originally announced June 2020.

    Comments: Accepted by INTERSPEECH 2020, ICML 2020 Workshop on Self-supervision in Audio and Speech

    Journal ref: INTERSPEECH 2020

  19. arXiv:2006.03214

    eess.AS cs.LG

    Defense for Black-box Attacks on Anti-spoofing Models by Self-Supervised Learning

    Authors: Haibin Wu, Andy T. Liu, Hung-yi Lee

    Abstract: High-performance anti-spoofing models for automatic speaker verification (ASV), have been widely used to protect ASV by identifying and filtering spoofing audio that is deliberately generated by text-to-speech, voice conversion, audio replay, etc. However, it has been shown that high-performance anti-spoofing models are vulnerable to adversarial attacks. Adversarial attacks, that are indistinguish…

    Submitted 7 December, 2020; v1 submitted 4 June, 2020; originally announced June 2020.

  20. arXiv:1912.02461

    cs.SD cs.LG eess.AS

    Towards Robust Neural Vocoding for Speech Generation: A Survey

    Authors: Po-chun Hsu, Chun-hsuan Wang, Andy T. Liu, Hung-yi Lee

    Abstract: Recently, neural vocoders have been widely used in speech synthesis tasks, including text-to-speech and voice conversion. However, when encountering data distribution mismatch between training and inference, neural vocoders trained on real data often degrade in voice quality for unseen scenarios. In this paper, we train four common neural vocoders, including WaveNet, WaveRNN, FFTNet, Parallel Wave…

    Submitted 20 August, 2020; v1 submitted 5 December, 2019; originally announced December 2019.

    Comments: Submitted to INTERSPEECH 2020

  21. arXiv:1910.12638

    eess.AS cs.CL cs.LG cs.SD

    Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders

    Authors: Andy T. Liu, Shu-wen Yang, Po-Han Chi, Po-chun Hsu, Hung-yi Lee

    Abstract: We present Mockingjay as a new speech representation learning approach, where bidirectional Transformer encoders are pre-trained on a large amount of unlabeled speech. Previous speech representation methods learn through conditioning on past frames and predicting information about future frames. Whereas Mockingjay is designed to predict the current frame through jointly conditioning on both past a…

    Submitted 2 February, 2020; v1 submitted 24 October, 2019; originally announced October 2019.

    Comments: Accepted by ICASSP 2020, Lecture Session

    Journal ref: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
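
    Illustrative sketch (not from the paper): the Mockingjay abstract describes predicting masked current frames while conditioning on both past and future frames with a bidirectional Transformer encoder. The minimal training step below illustrates that general recipe; the model size, masking rate, and L1 loss are placeholder assumptions rather than the paper's configuration.

      # Minimal sketch of masked-frame reconstruction with a bidirectional
      # Transformer encoder. Dimensions, masking rate, and loss are placeholder
      # assumptions, not the paper's actual configuration.
      import torch
      import torch.nn as nn

      FEAT_DIM, D_MODEL, MASK_PROB = 80, 256, 0.15

      encoder = nn.TransformerEncoder(
          nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4, batch_first=True),
          num_layers=3,
      )
      proj_in = nn.Linear(FEAT_DIM, D_MODEL)   # acoustic features -> model space
      proj_out = nn.Linear(D_MODEL, FEAT_DIM)  # model space -> reconstructed frames
      params = list(encoder.parameters()) + list(proj_in.parameters()) + list(proj_out.parameters())
      optimizer = torch.optim.Adam(params, lr=1e-4)

      def train_step(mel):
          """mel: (batch, frames, FEAT_DIM) unlabeled acoustic features."""
          mask = torch.rand(mel.shape[:2]) < MASK_PROB          # frames to hide
          corrupted = mel.masked_fill(mask.unsqueeze(-1), 0.0)  # zero out masked frames
          hidden = encoder(proj_in(corrupted))                  # past and future context
          recon = proj_out(hidden)
          loss = nn.functional.l1_loss(recon[mask], mel[mask])  # loss on masked frames only
          optimizer.zero_grad()
          loss.backward()
          optimizer.step()
          return loss.item()

      train_step(torch.randn(2, 400, FEAT_DIM))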

  22. arXiv:1905.11563

    cs.CL cs.LG cs.SD eess.AS

    Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion

    Authors: Andy T. Liu, Po-chun Hsu, Hung-yi Lee

    Abstract: We present an unsupervised end-to-end training scheme where we discover discrete subword units from speech without using any labels. The discrete subword units are learned under an ASR-TTS autoencoder reconstruction setting, where an ASR-Encoder is trained to discover a set of common linguistic units given a variety of speakers, and a TTS-Decoder trained to project the discovered units back to the…

    Submitted 20 June, 2019; v1 submitted 27 May, 2019; originally announced May 2019.

    Comments: Accepted by Interspeech 2019, Graz, Austria

    Journal ref: Interspeech 2019
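
    Illustrative sketch (not from the paper): the abstract above learns discrete subword units in an ASR-TTS autoencoder, where an encoder maps speech from many speakers to shared discrete units and a decoder reconstructs the audio features. The toy below uses a Gumbel-softmax quantizer between two small recurrent networks to show the discrete-bottleneck idea; it is an assumption-laden stand-in, not the paper's architecture.

      # Toy sketch of a discrete bottleneck between an encoder and a decoder.
      # The Gumbel-softmax quantizer and all sizes are illustrative assumptions;
      # the paper's actual ASR-TTS architecture is not reproduced here.
      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class DiscreteAutoencoder(nn.Module):
          def __init__(self, feat_dim=80, hidden=128, n_units=64):
              super().__init__()
              self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)      # "ASR-Encoder" stand-in
              self.to_logits = nn.Linear(hidden, n_units)                    # scores over discrete units
              self.unit_embed = nn.Parameter(torch.randn(n_units, hidden))   # one vector per unit
              self.decoder = nn.GRU(hidden, hidden, batch_first=True)        # "TTS-Decoder" stand-in
              self.to_feat = nn.Linear(hidden, feat_dim)

          def forward(self, mel):
              h, _ = self.encoder(mel)
              logits = self.to_logits(h)
              # Differentiable discretization: a (near) one-hot unit choice per frame.
              units = F.gumbel_softmax(logits, tau=1.0, hard=True)
              bottleneck = units @ self.unit_embed
              out, _ = self.decoder(bottleneck)
              return self.to_feat(out)

      model = DiscreteAutoencoder()
      mel = torch.randn(2, 200, 80)
      loss = F.l1_loss(model(mel), mel)  # reconstruction objective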