Showing 1–19 of 19 results for author: Hai, J

  1. arXiv:2601.17645  [pdf, ps, other]

    cs.SD cs.CL cs.CV cs.MM eess.AS

    AVMeme Exam: A Multimodal Multilingual Multicultural Benchmark for LLMs' Contextual and Cultural Knowledge and Thinking

    Authors: Xilin Jiang, Qiaolin Wang, Junkai Wu, Xiaomin He, Zhongweiyang Xu, Yinghao Ma, Minshuo Piao, Kaiyi Yang, Xiuwen Zheng, Riki Shimizu, Yicong Chen, Arsalan Firoozi, Gavin Mischler, Sukru Samet Dindar, Richard Antonello, Linyang He, Tsun-An Hsieh, Xulin Fan, Yulun Wu, Yuesheng Ma, Chaitanya Amballa, Weixiong Chen, Jiarui Hai, Ruisi Li, Vishal Choudhari, et al. (8 additional authors not shown)

    Abstract: Internet audio-visual clips convey meaning through time-varying sound and motion, which extend beyond what text alone can represent. To examine whether AI models can understand such signals in human cultural contexts, we introduce AVMeme Exam, a human-curated benchmark of over one thousand iconic Internet sounds and videos spanning speech, songs, music, and sound effects. Each meme is paired with…

    Submitted 24 January, 2026; originally announced January 2026.

    Comments: avmemeexam.github.io/public

  2. arXiv:2601.04343  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    Summary of The Inaugural Music Source Restoration Challenge

    Authors: Yongyi Zang, Jiarui Hai, Wanying Ge, Qiuqiang Kong, Zheqi Dai, Helin Wang, Yuki Mitsufuji, Mark D. Plumbley

    Abstract: Music Source Restoration (MSR) aims to recover original, unprocessed instrument stems from professionally mixed and degraded audio, requiring the reversal of both production effects and real-world degradations. We present the inaugural MSR Challenge, which features objective evaluation on studio-produced mixtures using Multi-Mel-SNR, Zimtohrli, and FAD-CLAP, alongside subjective evaluation on real…

    Submitted 7 January, 2026; originally announced January 2026.

  3. arXiv:2512.14657  [pdf, ps, other]

    cs.SD

    Adapting Speech Language Model to Singing Voice Synthesis

    Authors: Yiwen Zhao, Jiatong Shi, Jinchuan Tian, Yuxun Tang, Jiarui Hai, Jionghao Han, Shinji Watanabe

    Abstract: Speech Language Models (SLMs) have recently emerged as a unified paradigm for addressing a wide range of speech-related tasks, including text-to-speech (TTS), speech enhancement (SE), and automatic speech recognition (ASR). However, the generalization capability of large-scale pre-trained SLMs remains underexplored. In this work, we adapt a 1.7B parameter TTS pretrained SLM for singing voice synth…

    Submitted 16 December, 2025; originally announced December 2025.

    Comments: Accepted by NeurIPS 2025 workshop AI for Music

  4. arXiv:2510.10995  [pdf, ps, other]

    cs.SD

    MSRBench: A Benchmarking Dataset for Music Source Restoration

    Authors: Yongyi Zang, Jiarui Hai, Wanying Ge, Qiuqiang Kong, Zheqi Dai, Helin Wang, Yuki Mitsufuji, Mark D. Plumbley

    Abstract: Music Source Restoration (MSR) extends source separation to realistic settings where signals undergo production effects (equalization, compression, reverb) and real-world degradations, with the goal of recovering the original unprocessed sources. Existing benchmarks cannot measure restoration fidelity: synthetic datasets use unprocessed stems but unrealistic mixtures, while real production dataset…

    Submitted 13 October, 2025; originally announced October 2025.

  5. arXiv:2509.18606  [pdf, ps, other]

    eess.AS cs.AI cs.SD

    FlexSED: Towards Open-Vocabulary Sound Event Detection

    Authors: Jiarui Hai, Helin Wang, Weizhe Guo, Mounya Elhilali

    Abstract: Despite recent progress in large-scale sound event detection (SED) systems capable of handling hundreds of sound classes, existing multi-class classification frameworks remain fundamentally limited. They cannot process free-text sound queries, which enable more flexible and user-friendly interaction, and they lack zero-shot capabilities and offer poor few-shot adaptability. Although text-query-bas…

    Submitted 22 September, 2025; originally announced September 2025.

  6. arXiv:2509.18603  [pdf, ps, other]

    eess.AS cs.AI cs.SD

    SynSonic: Augmenting Sound Event Detection through Text-to-Audio Diffusion ControlNet and Effective Sample Filtering

    Authors: Jiarui Hai, Mounya Elhilali

    Abstract: Data synthesis and augmentation are essential for Sound Event Detection (SED) due to the scarcity of temporally labeled data. While augmentation methods like SpecAugment and Mix-up can enhance model performance, they remain constrained by the diversity of existing samples. Recent generative models offer new opportunities, yet their direct application to SED is challenging due to the lack of precis…

    Submitted 22 September, 2025; originally announced September 2025.

  7. arXiv:2506.02863  [pdf, ps, other]

    eess.AS cs.AI cs.SD

    CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech

    Authors: Helin Wang, Jiarui Hai, Dading Chong, Karan Thakkar, Tiantian Feng, Dongchao Yang, Junhyeok Lee, Thomas Thebaud, Laureano Moro Velazquez, Jesus Villalba, Zengyi Qin, Shrikanth Narayanan, Mounya Elhilali, Najim Dehak

    Abstract: Recent advancements in generative artificial intelligence have significantly transformed the field of style-captioned text-to-speech synthesis (CapTTS). However, adapting CapTTS to real-world applications remains challenging due to the lack of standardized, comprehensive datasets and limited research on downstream tasks built upon CapTTS. To address these gaps, we introduce CapSpeech, a new benchm…

    Submitted 26 September, 2025; v1 submitted 3 June, 2025; originally announced June 2025.

  8. arXiv:2506.01257  [pdf]

    cs.CL cs.AI

    DeepSeek in Healthcare: A Survey of Capabilities, Risks, and Clinical Applications of Open-Source Large Language Models

    Authors: Jiancheng Ye, Sophie Bronstein, Jiarui Hai, Malak Abu Hashish

    Abstract: DeepSeek-R1 is a cutting-edge open-source large language model (LLM) developed by DeepSeek, showcasing advanced reasoning capabilities through a hybrid architecture that integrates mixture of experts (MoE), chain of thought (CoT) reasoning, and reinforcement learning. Released under the permissive MIT license, DeepSeek-R1 offers a transparent and cost-effective alternative to proprietary models li…

    Submitted 1 June, 2025; originally announced June 2025.

  9. arXiv:2505.19314  [pdf, ps, other]

    eess.AS cs.AI cs.SD

    SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline

    Authors: Helin Wang, Jiarui Hai, Dongchao Yang, Chen Chen, Kai Li, Junyi Peng, Thomas Thebaud, Laureano Moro Velazquez, Jesus Villalba, Najim Dehak

    Abstract: Target Speech Extraction (TSE) aims to isolate a target speaker's voice from a mixture of multiple speakers by leveraging speaker-specific cues, typically provided as auxiliary audio (a.k.a. cue audio). Although recent advancements in TSE have primarily employed discriminative models that offer high perceptual quality, these models often introduce unwanted artifacts, reduce naturalness, and are se…

    Submitted 6 September, 2025; v1 submitted 25 May, 2025; originally announced May 2025.

  10. arXiv:2503.13806  [pdf]

    cs.CV cs.AI

    DescriptorMedSAM: Language-Image Fusion with Multi-Aspect Text Guidance for Medical Image Segmentation

    Authors: Wenjie Zhang, Liming Luo, Mengnan He, Jiarui Hai, Jiancheng Ye

    Abstract: Accurate organ segmentation is essential for clinical tasks such as radiotherapy planning and disease monitoring. Recent foundation models like MedSAM achieve strong results using point or bounding-box prompts but still require manual interaction. We propose DescriptorMedSAM, a lightweight extension of MedSAM that incorporates structured text prompts, ranging from simple organ names to combined sh…

    Submitted 21 September, 2025; v1 submitted 17 March, 2025; originally announced March 2025.

  11. arXiv:2409.10819  [pdf, ps, other]

    eess.AS cs.SD

    EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

    Authors: Jiarui Hai, Yong Xu, Hao Zhang, Chenxing Li, Helin Wang, Mounya Elhilali, Dong Yu

    Abstract: We introduce EzAudio, a text-to-audio (T2A) generation framework designed to produce high-quality, natural-sounding sound effects. Core designs include: (1) We propose EzAudio-DiT, an optimized Diffusion Transformer (DiT) designed for audio latent representations, improving convergence speed, as well as parameter and memory efficiency. (2) We apply a classifier-free guidance (CFG) rescaling techni…

    Submitted 19 June, 2025; v1 submitted 16 September, 2024; originally announced September 2024.

    Comments: Accepted at Interspeech 2025

  12. arXiv:2409.08425  [pdf, other]

    eess.AS cs.SD

    SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer

    Authors: Helin Wang, Jiarui Hai, Yen-Ju Lu, Karan Thakkar, Mounya Elhilali, Najim Dehak

    Abstract: In this paper, we introduce SoloAudio, a novel diffusion-based generative model for target sound extraction (TSE). Our approach trains latent diffusion models on audio, replacing the previous U-Net backbone with a skip-connected Transformer that operates on latent features. SoloAudio supports both audio-oriented and language-oriented TSE by utilizing a CLAP model as the feature extractor for targe…

    Submitted 1 January, 2025; v1 submitted 12 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP 2025

  13. arXiv:2409.07556  [pdf, other]

    eess.AS cs.SD

    SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis

    Authors: Helin Wang, Meng Yu, Jiarui Hai, Chen Chen, Yuchen Hu, Rilin Chen, Najim Dehak, Dong Yu

    Abstract: In this paper, we introduce SSR-Speech, a neural codec autoregressive model designed for stable, safe, and robust zero-shot text-based speech editing and text-to-speech synthesis. SSR-Speech is built on a Transformer decoder and incorporates classifier-free guidance to enhance the stability of the generation process. A watermark Encodec is proposed to embed frame-level watermarks into the edited re…

    Submitted 1 January, 2025; v1 submitted 11 September, 2024; originally announced September 2024.

    Comments: ICASSP 2025

  14. arXiv:2406.16314  [pdf, other]

    eess.AS

    DreamVoice: Text-Guided Voice Conversion

    Authors: Jiarui Hai, Karan Thakkar, Helin Wang, Zengyi Qin, Mounya Elhilali

    Abstract: Generative voice technologies are rapidly evolving, offering opportunities for more personalized and inclusive experiences. Traditional one-shot voice conversion (VC) requires a target recording during inference, limiting ease of usage in generating desired voice timbres. Text-guided generation offers an intuitive solution to convert voices to desired "DreamVoices" according to the users' needs. O…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: Accepted at INTERSPEECH 2024

  15. arXiv:2406.07461  [pdf, other]

    eess.AS

    Noise-robust Speech Separation with Fast Generative Correction

    Authors: Helin Wang, Jesus Villalba, Laureano Moro-Velazquez, Jiarui Hai, Thomas Thebaud, Najim Dehak

    Abstract: Speech separation, the task of isolating multiple speech sources from a mixed audio signal, remains challenging in noisy environments. In this paper, we propose a generative correction method to enhance the output of a discriminative separator. By leveraging a generative corrector based on a diffusion model, we refine the separation process for single-channel mixture speech by removing noises and…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted at INTERSPEECH 2024

  16. arXiv:2402.06599  [pdf, other]

    cs.CV cs.AI

    On the Out-Of-Distribution Generalization of Multimodal Large Language Models

    Authors: Xingxuan Zhang, Jiansheng Li, Wenjing Chu, Junjia Hai, Renzhe Xu, Yuqing Yang, Shikai Guan, Jiazheng Xu, Peng Cui

    Abstract: We investigate the generalization boundaries of current Multimodal Large Language Models (MLLMs) via comprehensive evaluation under out-of-distribution scenarios and domain-specific tasks. We evaluate their zero-shot generalization across synthetic images, real-world distributional shifts, and specialized datasets like medical and molecular imagery. Empirical results indicate that MLLMs struggle w…

    Submitted 9 February, 2024; originally announced February 2024.

  17. arXiv:2311.00814  [pdf, other]

    cs.SD eess.AS

    Investigating Self-Supervised Deep Representations for EEG-based Auditory Attention Decoding

    Authors: Karan Thakkar, Jiarui Hai, Mounya Elhilali

    Abstract: Auditory Attention Decoding (AAD) algorithms play a crucial role in isolating desired sound sources within challenging acoustic environments directly from brain activity. Although recent research has shown promise in AAD using shallow representations such as auditory envelope and spectrogram, there has been limited exploration of deep Self-Supervised (SS) representations on a larger scale. In this…

    Submitted 7 November, 2023; v1 submitted 1 November, 2023; originally announced November 2023.

    Comments: Submitted to ICASSP 2024

  18. arXiv:2310.04567  [pdf, other]

    eess.AS cs.SD

    DPM-TSE: A Diffusion Probabilistic Model for Target Sound Extraction

    Authors: Jiarui Hai, Helin Wang, Dongchao Yang, Karan Thakkar, Najim Dehak, Mounya Elhilali

    Abstract: Common target sound extraction (TSE) approaches have primarily relied on discriminative methods to separate the target sound while minimizing interference from unwanted sources, with varying success in separating the target from the background. This study introduces DPM-TSE, the first generative method based on diffusion probabilistic modeling (DPM) for target sound extraction, to achieve…

    Submitted 9 October, 2023; v1 submitted 6 October, 2023; originally announced October 2023.

    Comments: Submitted to ICASSP 2024

  19. R2RNet: Low-light Image Enhancement via Real-low to Real-normal Network

    Authors: Jiang Hai, Zhu Xuan, Songchen Han, Ren Yang, Yutong Hao, Fengzhu Zou, Fang Lin

    Abstract: Images captured in weak illumination conditions could seriously degrade the image quality. Solving a series of degradation of low-light images can effectively improve the visual quality of images and the performance of high-level visual tasks. In this study, a novel Retinex-based Real-low to Real-normal Network (R2RNet) is proposed for low-light image enhancement, which includes three subnets: a D…

    Submitted 11 November, 2021; v1 submitted 28 June, 2021; originally announced June 2021.

    Comments: 12 pages, 9 figures

    Journal ref: Journal of Visual Communication and Image Representation, 2022