inseong00/CAF-Score

CAF-Score: Calibrating CLAP with LALMs for Reference-Free Audio Caption Evaluation

CAF-Score is a reference-free metric for audio-caption alignment that combines CLAP (Contrastive Language-Audio Pretraining) similarity scores with FLEUR scores from Large Audio Language Models (LALMs).

Overview

This repository provides:

  • CLAP Evaluation: Unified interface for multiple CLAP models (MS-CLAP, LAION-CLAP, MGA-CLAP, M2D-CLAP)
  • LALM Evaluation: FLEUR metric implementation for Audio-Flamingo-3, Qwen3-Omni, and Qwen2.5-Omni
  • CAF-Score Computation: Combined metric for robust audio-caption alignment assessment
  • BRACE Benchmark Evaluation: Evaluation scripts for the BRACE dataset

Installation

Due to dependency conflicts between Audio-Flamingo-3 and Qwen-Omni models, two separate conda environments are required.

Environment 1: Audio-Flamingo-3

# Create environment for Audio-Flamingo-3
conda env create -f environment_af3.yml

# Activate the environment
conda activate caf_af3

Use this environment when running evaluations with --lalm_model audioflamingo3.

Environment 2: Qwen-Omni (Qwen3-Omni / Qwen2.5-Omni)

# Create environment for Qwen-Omni
conda env create -f environment_qwen3.yml

# Activate the environment
conda activate caf_qwen3

# Install vllm from Qwen3-Omni compatible branch (required)
pip install git+https://github.com/wangxiongts/vllm.git

# Install flash-attention (recommended for better performance)
pip install flash-attn --no-build-isolation

Use this environment when running evaluations with --lalm_model qwen3omni, qwen25omni-3b, or qwen25omni-7b.

Environment Summary

| Environment | LALM Model | Python | Key Dependencies |
|-------------|--------------------------|--------|------------------------------------|
| caf_af3 | Audio-Flamingo-3 | 3.10 | transformers 5.x, torch 2.5 |
| caf_qwen3 | Qwen3-Omni, Qwen2.5-Omni | 3.10 | transformers 4.57, torch 2.7, vllm |

Alternative: Docker

As an alternative to conda environments, you can use Docker. Two separate Docker setups are provided under docker/.

Prerequisites: Docker and NVIDIA Container Toolkit must be installed.

Audio-Flamingo-3

cd docker/af3

# Build and start the container
docker compose up -d

# Enter the container
docker exec -it caf-af3 bash

Qwen3-Omni

cd docker/qwen3

# Build and start the container
docker compose up -d

# Enter the container
docker exec -it caf-score bash

Volume Mounts

Both Docker setups automatically mount the following directories from the host:

| Host Path | Container Path | Description |
|----------------------|------------------------------------------|-------------------|
| . (project root) | /workspace/CAF-Score | Source code |
| ./pretrained_models | /workspace/CAF-Score/pretrained_models | Model weights |
| ~/.cache/huggingface | /root/.cache/huggingface | HuggingFace cache |
| ./data | /workspace/CAF-Score/data | Dataset files |

Data Preparation

To run evaluations on the BRACE dataset, you need to place the audio files in the following directory structure:

data/
└── audio/
    ├── clotho/        # Place Clotho audio files here (.wav)
    └── audiocaps/     # Place AudioCaps audio files here (.wav)
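From the repository root, the layout above can be created with:

```shell
# Create the expected BRACE audio directories (run from the repo root)
mkdir -p data/audio/clotho data/audio/audiocaps
```

Then copy the Clotho and AudioCaps .wav files into the corresponding folders.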

Project Structure

CAF_Score/
├── run_caf.py                  # Single audio-caption CAF-Score computation
├── eval_caf.py                 # Direct CAF-Score evaluation on BRACE dataset
├── eval_clap.py                # CLAP model evaluation script
├── eval_lalm.py                # LALM (FLEUR) evaluation script
├── calc_caf.py                 # CAF-Score calculation from pre-computed results
├── src/
│   ├── clap.py                 # Unified CLAP model wrapper
│   ├── fleur/                  # FLEUR metric module
│   │   ├── __init__.py         # Unified API: load_model(), get_fleur()
│   │   ├── base.py             # Shared: FleurModel, prompt, score smoothing
│   │   ├── af3.py              # Audio-Flamingo-3 backend
│   │   └── qwen.py             # Qwen3-Omni / Qwen2.5-Omni backend (vLLM & torch)
│   └── models/                 # Model implementations (MGA-CLAP, M2D-CLAP)
├── configs/
│   └── mgaclap_config.yaml
├── data/
│   ├── audio/            # Audio files
│   │   ├── clotho/       # Clotho .wav files
│   │   └── audiocaps/    # AudioCaps .wav files
│   ├── meta/             # BRACE dataset metadata
│   └── results/          # Evaluation results
├── pretrained_models/    # Pre-trained model weights (not included)
├── environment_af3.yml   # Conda environment for Audio-Flamingo-3
└── environment_qwen3.yml # Conda environment for Qwen-Omni

Quick Start

Prerequisites

For Qwen-Omni models, if you have downloaded the models locally, set the paths via environment variables:

# Qwen3-Omni
export QWEN3_OMNI_MODEL_PATH="/path/to/Qwen3-Omni-30B-A3B-Instruct"
export QWEN3_OMNI_THINKING_MODEL_PATH="/path/to/Qwen3-Omni-30B-A3B-Thinking"

# Qwen2.5-Omni
export QWEN25_OMNI_3B_MODEL_PATH="/path/to/Qwen2.5-Omni-3B"
export QWEN25_OMNI_7B_MODEL_PATH="/path/to/Qwen2.5-Omni-7B"

Replace /path/to/ with the actual paths where you downloaded the models.
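One plausible way such a fallback is resolved, sketched in Python (resolve_model_path is a hypothetical helper, and the default repo id shown is an assumption; the repository's own path handling may differ):

```python
import os

def resolve_model_path(env_var: str, default_repo_id: str) -> str:
    """Return the local model path if the environment variable is set,
    otherwise fall back to a HuggingFace repo id.

    Illustrative only: the actual resolution logic lives in the repo's code.
    """
    return os.environ.get(env_var, default_repo_id)

# Falls back to the hub id when QWEN3_OMNI_MODEL_PATH is unset:
print(resolve_model_path("QWEN3_OMNI_MODEL_PATH",
                         "Qwen/Qwen3-Omni-30B-A3B-Instruct"))
```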

Single Audio-Caption CAF-Score

Important: Activate the appropriate environment before running:

  • Use conda activate caf_af3 for --lalm_model audioflamingo3
  • Use conda activate caf_qwen3 for --lalm_model qwen3omni, qwen25omni-3b, or qwen25omni-7b

Compute CAF-Score for a single audio file and caption:

# Basic usage
python run_caf.py --audio_path /path/to/audio.wav --caption "A dog barking loudly" \
    --clap_model m2dclap --lalm_model qwen3omni

# With sliding window for long audio
python run_caf.py --audio_path /path/to/long_audio.wav --caption "Music playing" \
    --clap_model laionclap --lalm_model audioflamingo3 --use_slide_window

Output example:

============================================================
CAF-Score Results
============================================================
Audio: /path/to/audio.wav
Caption: A dog barking loudly
------------------------------------------------------------
CLAP Model: m2dclap
LALM Model: qwen3omni
------------------------------------------------------------
CLAP Score: 0.3245
FLEUR Score: 0.7812
------------------------------------------------------------
CAF-Score: 0.4158
============================================================

Direct CAF-Score Evaluation on BRACE Dataset

Evaluate CAF-Score directly from audio files (computes both CLAP and FLEUR scores on the fly):

# Basic evaluation
python eval_caf.py --lalm_model audioflamingo3 --clap_model laionclap \
    --dataset audiocaps_main

# With Qwen3-Omni and sliding window
python eval_caf.py --lalm_model qwen3omni --clap_model msclap \
    --dataset clotho_main \
    --use_slide_window --pooling max \
    --tensor_parallel_size 2

# With thinking mode for LALM
python eval_caf.py --lalm_model qwen3omni --clap_model laionclap \
    --dataset audiocaps_hallu --use_think_mode

Usage

1. CLAP Evaluation

Evaluate audio-caption alignment using CLAP models:

# Using MS-CLAP
python eval_clap.py --clap_model msclap --dataset audiocaps_main

# Using LAION-CLAP
python eval_clap.py --clap_model laionclap --dataset clotho_main

# With sliding window for long audio
python eval_clap.py --clap_model mgaclap --dataset audiocaps_hallu \
    --use_slide_window --pooling max

Supported CLAP Models:

  • msclap: Microsoft CLAP
  • laionclap: LAION-CLAP (htsat-base)
  • mgaclap: MGA-CLAP (requires pre-trained weights)
  • m2dclap: M2D-CLAP (requires pre-trained weights)
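For intuition, sliding-window scoring splits a long clip into windows, scores each window against the caption, and pools the per-window similarities into one clip-level score. A minimal sketch of the pooling step (pool_window_scores is hypothetical; the repository's actual windowing and pooling code may differ):

```python
def pool_window_scores(window_scores, pooling="max"):
    """Combine per-window CLAP similarities into one clip-level score.

    Sketch of what --use_slide_window with --pooling max might compute;
    illustrative only, not the repository's implementation.
    """
    if pooling == "max":
        return max(window_scores)
    if pooling == "mean":
        return sum(window_scores) / len(window_scores)
    raise ValueError(f"unsupported pooling: {pooling}")

# e.g. per-window similarities for three windows of a long clip
print(pool_window_scores([0.21, 0.34, 0.28]))  # 0.34
```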

2. LALM Evaluation (FLEUR)

Evaluate using Large Audio Language Models:

# Using Audio-Flamingo-3
python eval_lalm.py --lalm_model audioflamingo3 --dataset audiocaps_main

# Using Qwen3-Omni
python eval_lalm.py --lalm_model qwen3omni --dataset clotho_main \
    --tensor_parallel_size 2

# Using Qwen2.5-Omni (3B or 7B)
python eval_lalm.py --lalm_model qwen25omni-7b --dataset audiocaps_main \
    --tensor_parallel_size 2

# With thinking mode (Qwen3-Omni only)
python eval_lalm.py --lalm_model qwen3omni --dataset audiocaps_hallu \
    --use_think_mode

3. CAF-Score Calculation from Pre-computed Results

Calculate CAF-Score from pre-computed CLAP and LALM results:

python calc_caf.py --lalm_model audioflamingo3 --clap_model laionclap \
    --dataset audiocaps_main

CAF-Score Formula

CAF-Score combines CLAP similarity and FLEUR score using a weighted average:

CAF-Score = α × CLAP_Score + (1 - α) × FLEUR_Score

Where:

  • α (alpha): Weight parameter (default: 0.8)
  • CLAP_Score: Audio-text similarity from CLAP model
  • FLEUR_Score: Smoothed evaluation score from LALM
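The formula is a plain weighted average, so it can be sketched directly (caf_score here is illustrative, not the actual function in run_caf.py):

```python
def caf_score(clap_score, fleur_score, alpha=0.8):
    """Weighted combination per the formula above (default alpha = 0.8)."""
    return alpha * clap_score + (1 - alpha) * fleur_score

# Reproduces the single-pair example output shown earlier:
print(round(caf_score(0.3245, 0.7812), 4))  # 0.4158
```

With the default alpha of 0.8, the CLAP similarity dominates and the FLEUR score acts as a calibrating correction.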

Pre-trained Models

CLAP Models

| Model | Download | Notes |
|------------|--------------------------------|------------------------------------------|
| MS-CLAP | Automatic (via msclap package) | Version 2023 |
| LAION-CLAP | Automatic (via HuggingFace) | Multiple variants available |
| MGA-CLAP | Download | Place in pretrained_models/mga-clap.pt |
| M2D-CLAP | Download | Place in pretrained_models/m2d_clap_*/ |

LALM Models

| Model | Access |
|---------------------|-------------|
| Audio-Flamingo-3 | HuggingFace |
| Qwen3-Omni-Instruct | HuggingFace |
| Qwen3-Omni-Thinking | HuggingFace |
| Qwen2.5-Omni-3B | HuggingFace |
| Qwen2.5-Omni-7B | HuggingFace |

BRACE Dataset

The BRACE (Benchmark for Rating Audio Caption Evaluation) dataset provides standardized evaluation for audio captioning metrics. Download the dataset from the official repository.

Audio File Setup

For evaluation, place your audio files according to the following paths:

  • Clotho: data/audio/clotho/
  • AudioCaps: data/audio/audiocaps/

Supported subsets:

  • audiocaps_main: AudioCaps main evaluation set
  • audiocaps_hallu: AudioCaps hallucination set
  • clotho_main: Clotho main evaluation set
  • clotho_hallu: Clotho hallucination set

Configuration

GPU Configuration

Set CUDA devices before running:

export CUDA_VISIBLE_DEVICES=0,1  # Use GPUs 0 and 1
python eval_lalm.py --lalm_model qwen3omni --tensor_parallel_size 2

Python API

You can also use CAF-Score programmatically:

from run_caf import compute_caf_score

# Compute CAF-Score for a single audio-caption pair
result = compute_caf_score(
    audio_path="/path/to/audio.wav",
    caption="A dog barking loudly",
    clap_model_name="laionclap",
    lalm_model_name="audioflamingo3",
)

print(f"CLAP Score: {result['clap_score']:.4f}")
print(f"FLEUR Score: {result['fleur_score']:.4f}")
print(f"CAF-Score: {result['caf_score']:.4f}")

Citation

If you use CAF-Score in your research, please cite:

License

This project is licensed under the MIT License; see the LICENSE file for details.

Acknowledgments
