CAF-Score is a comprehensive reference-free audio-caption alignment evaluation metric that combines CLAP (Contrastive Language-Audio Pretraining) similarity scores with FLEUR scores from Large Audio Language Models (LALMs).
This repository provides:
- CLAP Evaluation: Unified interface for multiple CLAP models (MS-CLAP, LAION-CLAP, MGA-CLAP, M2D-CLAP)
- LALM Evaluation: FLEUR metric implementation for Audio-Flamingo-3, Qwen3-Omni, and Qwen2.5-Omni
- CAF-Score Computation: Combined metric for robust audio-caption alignment assessment
- BRACE Benchmark Evaluation: Evaluation scripts for the BRACE dataset
Due to dependency conflicts between Audio-Flamingo-3 and Qwen-Omni models, two separate conda environments are required.
# Create environment for Audio-Flamingo-3
conda env create -f environment_af3.yml
# Activate the environment
conda activate caf_af3
Use this environment when running evaluations with --lalm_model audioflamingo3.
# Create environment for Qwen-Omni
conda env create -f environment_qwen3.yml
# Activate the environment
conda activate caf_qwen3
# Install vllm from Qwen3-Omni compatible branch (required)
pip install git+https://github.com/wangxiongts/vllm.git
# Install flash-attention (recommended for better performance)
pip install flash-attn --no-build-isolation
Use this environment when running evaluations with --lalm_model qwen3omni, qwen25omni-3b, or qwen25omni-7b.
| Environment | LALM Model | Python | Key Dependencies |
|---|---|---|---|
| caf_af3 | Audio-Flamingo-3 | 3.10 | transformers 5.x, torch 2.5 |
| caf_qwen3 | Qwen3-Omni, Qwen2.5-Omni | 3.10 | transformers 4.57, torch 2.7, vllm |
As an alternative to conda environments, you can use Docker. Two separate Docker setups are provided under docker/.
Prerequisites: Docker and NVIDIA Container Toolkit must be installed.
cd docker/af3
# Build and start the container
docker compose up -d
# Enter the container
docker exec -it caf-af3 bash
cd docker/qwen3
# Build and start the container
docker compose up -d
# Enter the container
docker exec -it caf-score bash
Both Docker setups automatically mount the following directories from the host:
| Host Path | Container Path | Description |
|---|---|---|
| . (project root) | /workspace/CAF-Score | Source code |
| ./pretrained_models | /workspace/CAF-Score/pretrained_models | Model weights |
| ~/.cache/huggingface | /root/.cache/huggingface | HuggingFace cache |
| ./data | /workspace/CAF-Score/data | Dataset files |
To run evaluations on the BRACE dataset, you need to place the audio files in the following directory structure:
data/
└── audio/
├── clotho/ # Place Clotho audio files here (.wav)
    └── audiocaps/ # Place AudioCaps audio files here (.wav)
CAF_Score/
├── run_caf.py # Single audio-caption CAF-Score computation
├── eval_caf.py # Direct CAF-Score evaluation on BRACE dataset
├── eval_clap.py # CLAP model evaluation script
├── eval_lalm.py # LALM (FLEUR) evaluation script
├── calc_caf.py # CAF-Score calculation from pre-computed results
├── src/
│ ├── clap.py # Unified CLAP model wrapper
│ ├── fleur/ # FLEUR metric module
│ │ ├── __init__.py # Unified API: load_model(), get_fleur()
│ │ ├── base.py # Shared: FleurModel, prompt, score smoothing
│ │ ├── af3.py # Audio-Flamingo-3 backend
│ │ └── qwen.py # Qwen3-Omni / Qwen2.5-Omni backend (vLLM & torch)
│ └── models/ # Model implementations (MGA-CLAP, M2D-CLAP)
├── configs/
│ └── mgaclap_config.yaml
├── data/
│ ├── audio/ # Audio files
│ │ ├── clotho/ # Clotho .wav files
│ │ └── audiocaps/ # AudioCaps .wav files
│ ├── meta/ # BRACE dataset metadata
│ └── results/ # Evaluation results
├── pretrained_models/ # Pre-trained model weights (not included)
├── environment_af3.yml # Conda environment for Audio-Flamingo-3
└── environment_qwen3.yml # Conda environment for Qwen-Omni
For Qwen-Omni models, if you have downloaded the models locally, set the paths via environment variables:
# Qwen3-Omni
export QWEN3_OMNI_MODEL_PATH="/path/to/Qwen3-Omni-30B-A3B-Instruct"
export QWEN3_OMNI_THINKING_MODEL_PATH="/path/to/Qwen3-Omni-30B-A3B-Thinking"
# Qwen2.5-Omni
export QWEN25_OMNI_3B_MODEL_PATH="/path/to/Qwen2.5-Omni-3B"
export QWEN25_OMNI_7B_MODEL_PATH="/path/to/Qwen2.5-Omni-7B"
Replace /path/to/ with the actual paths where you downloaded the models.
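As an illustration, a script can resolve these variables with a fallback to a HuggingFace repo ID when the variable is unset. The helper below is a hypothetical sketch (function name and fallback behavior are assumptions, not the repository's actual resolution logic):

```python
import os

def resolve_model_path(env_var: str, default_repo_id: str) -> str:
    """Return a local model path from the environment, or fall back to the HF repo ID."""
    return os.environ.get(env_var, default_repo_id)

# Uses the local checkpoint if QWEN3_OMNI_MODEL_PATH is exported, else the hub ID.
path = resolve_model_path("QWEN3_OMNI_MODEL_PATH", "Qwen/Qwen3-Omni-30B-A3B-Instruct")
```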
Important: Activate the appropriate environment before running:
- Use conda activate caf_af3 for --lalm_model audioflamingo3
- Use conda activate caf_qwen3 for --lalm_model qwen3omni, qwen25omni-3b, or qwen25omni-7b
Compute CAF-Score for a single audio file and caption:
# Basic usage
python run_caf.py --audio_path /path/to/audio.wav --caption "A dog barking loudly" \
--clap_model m2dclap --lalm_model qwen3omni
# With sliding window for long audio
python run_caf.py --audio_path /path/to/long_audio.wav --caption "Music playing" \
    --clap_model laionclap --lalm_model audioflamingo3 --use_slide_window
Output example:
============================================================
CAF-Score Results
============================================================
Audio: /path/to/audio.wav
Caption: A dog barking loudly
------------------------------------------------------------
CLAP Model: m2dclap
LALM Model: qwen3omni
------------------------------------------------------------
CLAP Score: 0.3245
FLEUR Score: 0.7812
------------------------------------------------------------
CAF-Score: 0.4158
============================================================
Evaluate CAF-Score directly from audio files (computes both CLAP and FLEUR scores on-the-fly):
# Basic evaluation
python eval_caf.py --lalm_model audioflamingo3 --clap_model laionclap \
--dataset audiocaps_main
# With Qwen3-Omni and sliding window
python eval_caf.py --lalm_model qwen3omni --clap_model msclap \
--dataset clotho_main \
--use_slide_window --pooling max \
--tensor_parallel_size 2
# With thinking mode for LALM
python eval_caf.py --lalm_model qwen3omni --clap_model laionclap \
    --dataset audiocaps_hallu --use_think_mode
Evaluate audio-caption alignment using CLAP models:
# Using MS-CLAP
python eval_clap.py --clap_model msclap --dataset audiocaps_main
# Using LAION-CLAP
python eval_clap.py --clap_model laionclap --dataset clotho_main
# With sliding window for long audio
python eval_clap.py --clap_model mgaclap --dataset audiocaps_hallu \
    --use_slide_window --pooling max
Supported CLAP Models:
- msclap: Microsoft CLAP
- laionclap: LAION-CLAP (htsat-base)
- mgaclap: MGA-CLAP (requires pre-trained weights)
- m2dclap: M2D-CLAP (requires pre-trained weights)
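The --use_slide_window and --pooling flags suggest that long audio is scored in overlapping windows whose per-window CLAP similarities are then pooled into a single score. A minimal sketch of that idea (the window/hop lengths and function names are illustrative assumptions, not the repository's actual parameters):

```python
def make_windows(duration_sec: float, win_sec: float = 10.0, hop_sec: float = 5.0):
    """Start times of overlapping windows covering the audio (illustrative defaults)."""
    starts, t = [], 0.0
    while t + win_sec <= duration_sec:
        starts.append(t)
        t += hop_sec
    return starts or [0.0]  # audio shorter than one window: score it whole

def pool_scores(scores, pooling: str = "max"):
    """Aggregate per-window CLAP similarities into one audio-level score."""
    if pooling == "max":
        return max(scores)
    if pooling == "mean":
        return sum(scores) / len(scores)
    raise ValueError(f"unknown pooling: {pooling}")
```

Max pooling rewards the single best-matching segment, which suits captions describing a short event inside a long recording; mean pooling favors captions that describe the whole clip.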
Evaluate using Large Audio Language Models:
# Using Audio-Flamingo-3
python eval_lalm.py --lalm_model audioflamingo3 --dataset audiocaps_main
# Using Qwen3-Omni
python eval_lalm.py --lalm_model qwen3omni --dataset clotho_main \
--tensor_parallel_size 2
# Using Qwen2.5-Omni (3B or 7B)
python eval_lalm.py --lalm_model qwen25omni-7b --dataset audiocaps_main \
--tensor_parallel_size 2
# With thinking mode (Qwen3-Omni only)
python eval_lalm.py --lalm_model qwen3omni --dataset audiocaps_hallu \
    --use_think_mode
Calculate CAF-Score from pre-computed CLAP and LALM results:
python calc_caf.py --lalm_model audioflamingo3 --clap_model laionclap \
    --dataset audiocaps_main
CAF-Score combines CLAP similarity and FLEUR score using a weighted average:
CAF-Score = α × CLAP_Score + (1 - α) × FLEUR_Score
Where:
- α (alpha): Weight parameter (default: 0.8)
- CLAP_Score: Audio-text similarity from the CLAP model
- FLEUR_Score: Smoothed evaluation score from the LALM
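In code, the combination is a one-liner; with the default α = 0.8 and the CLAP/FLEUR scores from the run_caf.py example output above, it reproduces the reported CAF-Score:

```python
def caf_score(clap_score: float, fleur_score: float, alpha: float = 0.8) -> float:
    """Weighted average of CLAP similarity and FLEUR score (default alpha = 0.8)."""
    return alpha * clap_score + (1 - alpha) * fleur_score

# Scores from the example output: 0.8 * 0.3245 + 0.2 * 0.7812 ≈ 0.4158
score = caf_score(0.3245, 0.7812)
```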
| Model | Download | Notes |
|---|---|---|
| MS-CLAP | Automatic (via msclap package) | Version 2023 |
| LAION-CLAP | Automatic (via HuggingFace) | Multiple variants available |
| MGA-CLAP | Download | Place in pretrained_models/mga-clap.pt |
| M2D-CLAP | Download | Place in pretrained_models/m2d_clap_*/ |
| Model | Access |
|---|---|
| Audio-Flamingo-3 | HuggingFace |
| Qwen3-Omni-Instruct | HuggingFace |
| Qwen3-Omni-Thinking | HuggingFace |
| Qwen2.5-Omni-3B | HuggingFace |
| Qwen2.5-Omni-7B | HuggingFace |
The BRACE (Benchmark for Rating Audio Caption Evaluation) dataset provides standardized evaluation for audio captioning metrics. Download the dataset from the official repository.
For evaluation, place your audio files according to the following paths:
- Clotho: data/audio/clotho/
- AudioCaps: data/audio/audiocaps/
Supported subsets:
- audiocaps_main: AudioCaps main evaluation set
- audiocaps_hallu: AudioCaps hallucination set
- clotho_main: Clotho main evaluation set
- clotho_hallu: Clotho hallucination set
Set CUDA devices before running:
export CUDA_VISIBLE_DEVICES=0,1 # Use GPU 0 and 1
python eval_lalm.py --lalm_model qwen3omni --tensor_parallel_size 2
You can also use CAF-Score programmatically:
from run_caf import compute_caf_score
# Compute CAF-Score for a single audio-caption pair
result = compute_caf_score(
audio_path="/path/to/audio.wav",
caption="A dog barking loudly",
clap_model_name="laionclap",
lalm_model_name="audioflamingo3",
)
print(f"CLAP Score: {result['clap_score']:.4f}")
print(f"FLEUR Score: {result['fleur_score']:.4f}")
print(f"CAF-Score: {result['caf_score']:.4f}")
If you use CAF-Score in your research, please cite:
This project is licensed under the MIT License - see the LICENSE file for details.
- FLEUR - Reference-free evaluation metric
- MS-CLAP - Microsoft-CLAP implementation
- LAION-CLAP - LAION-CLAP implementation
- MGA-CLAP - MGA-CLAP implementation
- M2D-CLAP - M2D-CLAP implementation
- Audio-Flamingo-3 - NVIDIA Audio-Flamingo3 model
- Qwen3-Omni - Qwen3 Omni model
- Qwen2.5-Omni - Qwen2.5 Omni model