CAF-Score is a comprehensive reference-free audio-caption alignment evaluation metric that combines CLAP (Contrastive Language-Audio Pretraining) similarity scores with FLEUR scores from Large Audio Language Models (LALMs).
This repository provides:
- CLAP Evaluation: Unified interface for multiple CLAP models (MS-CLAP, LAION-CLAP, MGA-CLAP, M2D-CLAP)
- LALM Evaluation: FLEUR metric implementation for Audio-Flamingo-3, Qwen3-Omni, and Qwen2.5-Omni
- CAF-Score Computation: Combined metric for robust audio-caption alignment assessment
- BRACE Benchmark Evaluation: Evaluation scripts for the BRACE dataset
Due to dependency conflicts between Audio-Flamingo-3 and Qwen-Omni models, two separate conda environments are required.
# Create environment for Audio-Flamingo-3
conda env create -f environment_af3.yml
# Activate the environment
conda activate caf_af3
Use this environment when running evaluations with --lalm_model audioflamingo3.
# Create environment for Qwen-Omni
conda env create -f environment_qwen3.yml
# Activate the environment
conda activate caf_qwen3
# Install vllm from Qwen3-Omni compatible branch (required)
pip install git+https://github.com/wangxiongts/vllm.git
# Install flash-attention (recommended for better performance)
pip install flash-attn --no-build-isolation
Use this environment when running evaluations with --lalm_model qwen3omni, qwen25omni-3b, or qwen25omni-7b.
| Environment | LALM Model | Python | Key Dependencies |
|---|---|---|---|
| caf_af3 | Audio-Flamingo-3 | 3.10 | transformers 5.x, torch 2.5 |
| caf_qwen3 | Qwen3-Omni, Qwen2.5-Omni | 3.10 | transformers 4.57, torch 2.7, vllm |
As an alternative to conda environments, you can use Docker. Two separate Docker setups are provided under docker/.
Prerequisites: Docker and NVIDIA Container Toolkit must be installed.
cd docker/af3
# Build and start the container
docker compose up -d
# Enter the container
docker exec -it caf-af3 bash
cd docker/qwen3
# Build and start the container
docker compose up -d
# Enter the container
docker exec -it caf-score bash
Both Docker setups automatically mount the following directories from the host:
| Host Path | Container Path | Description |
|---|---|---|
| . (project root) | /workspace/CAF-Score | Source code |
| ./pretrained_models | /workspace/CAF-Score/pretrained_models | Model weights |
| ~/.cache/huggingface | /root/.cache/huggingface | HuggingFace cache |
| ./data | /workspace/CAF-Score/data | Dataset files |
To run evaluations on the BRACE dataset, you need to place the audio files in the following directory structure:
data/
└── audio/
├── clotho/ # Place Clotho audio files here (.wav)
    └── audiocaps/ # Place AudioCaps audio files here (.wav)
CAF_Score/
├── run_caf.py # Single audio-caption CAF-Score computation
├── eval_caf.py # Direct CAF-Score evaluation on BRACE dataset
├── eval_clap.py # CLAP model evaluation script
├── eval_lalm.py # LALM (FLEUR) evaluation script
├── calc_caf.py # CAF-Score calculation from pre-computed results
├── src/
│ ├── clap.py # Unified CLAP model wrapper
│ ├── fleur/ # FLEUR metric module
│ │ ├── __init__.py # Unified API: load_model(), get_fleur()
│ │ ├── base.py # Shared: FleurModel, prompt, score smoothing
│ │ ├── af3.py # Audio-Flamingo-3 backend
│ │ └── qwen.py # Qwen3-Omni / Qwen2.5-Omni backend (vLLM & torch)
│ └── models/ # Model implementations (MGA-CLAP, M2D-CLAP)
├── configs/
│ └── mgaclap_config.yaml
├── data/
│ ├── audio/ # Audio files
│ │ ├── clotho/ # Clotho .wav files
│ │ └── audiocaps/ # AudioCaps .wav files
│ ├── meta/ # BRACE dataset metadata
│ └── results/ # Evaluation results
├── pretrained_models/ # Pre-trained model weights (not included)
├── environment_af3.yml # Conda environment for Audio-Flamingo-3
└── environment_qwen3.yml # Conda environment for Qwen-Omni
For Qwen-Omni models, if you have downloaded the models locally, set the paths via environment variables:
# Qwen3-Omni
export QWEN3_OMNI_MODEL_PATH="/path/to/Qwen3-Omni-30B-A3B-Instruct"
export QWEN3_OMNI_THINKING_MODEL_PATH="/path/to/Qwen3-Omni-30B-A3B-Thinking"
# Qwen2.5-Omni
export QWEN25_OMNI_3B_MODEL_PATH="/path/to/Qwen2.5-Omni-3B"
export QWEN25_OMNI_7B_MODEL_PATH="/path/to/Qwen2.5-Omni-7B"
Replace /path/to/ with the actual paths where you downloaded the models.
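As an illustration, a script can resolve these variables with a fallback to a HuggingFace repo ID when the variable is unset. The helper below is a hypothetical sketch (function name and fallback behavior are assumptions, not the repository's actual resolution logic):

```python
import os

def resolve_model_path(env_var: str, default_repo_id: str) -> str:
    """Return a local model path from the environment, or fall back to the HF repo ID."""
    return os.environ.get(env_var, default_repo_id)

# Uses the local checkpoint if QWEN3_OMNI_MODEL_PATH is exported, else the hub ID.
path = resolve_model_path("QWEN3_OMNI_MODEL_PATH", "Qwen/Qwen3-Omni-30B-A3B-Instruct")
```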
Important: Activate the appropriate environment before running:
- Use conda activate caf_af3 for --lalm_model audioflamingo3
- Use conda activate caf_qwen3 for --lalm_model qwen3omni, qwen25omni-3b, or qwen25omni-7b
Compute CAF-Score for a single audio file and caption:
# Basic usage
python run_caf.py --audio_path /path/to/audio.wav --caption "A dog barking loudly" \
--clap_model m2dclap --lalm_model qwen3omni
# With sliding window for long audio
python run_caf.py --audio_path /path/to/long_audio.wav --caption "Music playing" \
    --clap_model laionclap --lalm_model audioflamingo3 --use_slide_window
Output example:
============================================================
CAF-Score Results
============================================================
Audio: /path/to/audio.wav
Caption: A dog barking loudly
------------------------------------------------------------
CLAP Model: m2dclap
LALM Model: qwen3omni
------------------------------------------------------------
CLAP Score: 0.3245
FLEUR Score: 0.7812
------------------------------------------------------------
CAF-Score: 0.4158
============================================================
Evaluate CAF-Score directly from audio files (computes both CLAP and FLEUR scores on-the-fly):
# Basic evaluation
python eval_caf.py --lalm_model audioflamingo3 --clap_model laionclap \
--dataset audiocaps_main
# With Qwen3-Omni and sliding window
python eval_caf.py --lalm_model qwen3omni --clap_model msclap \
--dataset clotho_main \
--use_slide_window --pooling max \
--tensor_parallel_size 2
# With thinking mode for LALM
python eval_caf.py --lalm_model qwen3omni --clap_model laionclap \
    --dataset audiocaps_hallu --use_think_mode
Evaluate audio-caption alignment using CLAP models:
# Using MS-CLAP
python eval_clap.py --clap_model msclap --dataset audiocaps_main
# Using LAION-CLAP
python eval_clap.py --clap_model laionclap --dataset clotho_main
# With sliding window for long audio
python eval_clap.py --clap_model mgaclap --dataset audiocaps_hallu \
    --use_slide_window --pooling max
Supported CLAP Models:
- msclap: Microsoft CLAP
- laionclap: LAION-CLAP (htsat-base)
- mgaclap: MGA-CLAP (requires pre-trained weights)
- m2dclap: M2D-CLAP (requires pre-trained weights)
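The --use_slide_window and --pooling flags suggest that long audio is scored in overlapping windows whose per-window CLAP similarities are then pooled into a single score. A minimal sketch of that idea (the window/hop lengths and function names are illustrative assumptions, not the repository's actual parameters):

```python
def make_windows(duration_sec: float, win_sec: float = 10.0, hop_sec: float = 5.0):
    """Start times of overlapping windows covering the audio (illustrative defaults)."""
    starts, t = [], 0.0
    while t + win_sec <= duration_sec:
        starts.append(t)
        t += hop_sec
    return starts or [0.0]  # audio shorter than one window: score it whole

def pool_scores(scores, pooling: str = "max"):
    """Aggregate per-window CLAP similarities into one audio-level score."""
    if pooling == "max":
        return max(scores)
    if pooling == "mean":
        return sum(scores) / len(scores)
    raise ValueError(f"unknown pooling: {pooling}")
```

Max pooling rewards the single best-matching segment, which suits captions describing a short event inside a long recording; mean pooling favors captions that describe the whole clip.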
Evaluate using Large Audio Language Models:
# Using Audio-Flamingo-3
python eval_lalm.py --lalm_model audioflamingo3 --dataset audiocaps_main
# Using Qwen3-Omni
python eval_lalm.py --lalm_model qwen3omni --dataset clotho_main \
--tensor_parallel_size 2
# Using Qwen2.5-Omni (3B or 7B)
python eval_lalm.py --lalm_model qwen25omni-7b --dataset audiocaps_main \
--tensor_parallel_size 2
# With thinking mode (Qwen3-Omni only)
python eval_lalm.py --lalm_model qwen3omni --dataset audiocaps_hallu \
    --use_think_mode
Calculate CAF-Score from pre-computed CLAP and LALM results:
python calc_caf.py --lalm_model audioflamingo3 --clap_model laionclap \
    --dataset audiocaps_main
CAF-Score combines CLAP similarity and FLEUR score using a weighted average:
CAF-Score = α × CLAP_Score + (1 - α) × FLEUR_Score
Where:
- α (alpha): Weight parameter (default: 0.8)
- CLAP_Score: Audio-text similarity from the CLAP model
- FLEUR_Score: Smoothed evaluation score from the LALM
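In code, the combination is a one-liner; with the default α = 0.8 and the CLAP/FLEUR scores from the run_caf.py example output above, it reproduces the reported CAF-Score:

```python
def caf_score(clap_score: float, fleur_score: float, alpha: float = 0.8) -> float:
    """Weighted average of CLAP similarity and FLEUR score (default alpha = 0.8)."""
    return alpha * clap_score + (1 - alpha) * fleur_score

# Scores from the example output: 0.8 * 0.3245 + 0.2 * 0.7812 ≈ 0.4158
score = caf_score(0.3245, 0.7812)
```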
| Model | Download | Notes |
|---|---|---|
| MS-CLAP | Automatic (via msclap package) | Version 2023 |
| LAION-CLAP | Automatic (via HuggingFace) | Multiple variants available |
| MGA-CLAP | Download | Place in pretrained_models/mga-clap.pt |
| M2D-CLAP | Download | Place in pretrained_models/m2d_clap_*/ |
| Model | Access |
|---|---|
| Audio-Flamingo-3 | HuggingFace |
| Qwen3-Omni-Instruct | HuggingFace |
| Qwen3-Omni-Thinking | HuggingFace |
| Qwen2.5-Omni-3B | HuggingFace |
| Qwen2.5-Omni-7B | HuggingFace |
The BRACE (Benchmark for Rating Audio Caption Evaluation) dataset provides standardized evaluation for audio captioning metrics. Download the dataset from the official repository.
For evaluation, place your audio files according to the following paths:
- Clotho: data/audio/clotho/
- AudioCaps: data/audio/audiocaps/
Supported subsets:
- audiocaps_main: AudioCaps main evaluation set
- audiocaps_hallu: AudioCaps hallucination set
- clotho_main: Clotho main evaluation set
- clotho_hallu: Clotho hallucination set
Set CUDA devices before running:
export CUDA_VISIBLE_DEVICES=0,1 # Use GPU 0 and 1
python eval_lalm.py --lalm_model qwen3omni --tensor_parallel_size 2
You can also use CAF-Score programmatically:
from run_caf import compute_caf_score
# Compute CAF-Score for a single audio-caption pair
result = compute_caf_score(
audio_path="/path/to/audio.wav",
caption="A dog barking loudly",
clap_model_name="laionclap",
lalm_model_name="audioflamingo3",
)
print(f"CLAP Score: {result['clap_score']:.4f}")
print(f"FLEUR Score: {result['fleur_score']:.4f}")
print(f"CAF-Score: {result['caf_score']:.4f}")
If you use CAF-Score in your research, please cite:
This project is licensed under the MIT License - see the LICENSE file for details.
- FLEUR - Reference-free evaluation metric
- MS-CLAP - Microsoft-CLAP implementation
- LAION-CLAP - LAION-CLAP implementation
- MGA-CLAP - MGA-CLAP implementation
- M2D-CLAP - M2D-CLAP implementation
- Audio-Flamingo-3 - NVIDIA Audio-Flamingo3 model
- Qwen3-Omni - Qwen3 Omni model
- Qwen2.5-Omni - Qwen2.5 Omni model