xora-17/prosody-control-vits

Prosody Control VITS: Audio Generation Samples

This repository compares audio generated by our Prosody-Controlled VITS model with audio from the baseline VITS model. The sections below group the audio samples by evaluation scenario.


🔈 Reference Audios (Original)

These recordings are the reference samples used to guide the prosody of the generated outputs.

| Reference Audio | Link |
|---|---|
| Reference 1 | 1.wav |
| Reference 2 | 2.wav |
| Reference 3 | 3.wav |
| Reference 4 | 4.wav |

🔁 Comparison of Generated Audios

▶️ It's an emergency, Go! Go! Go!

| Description | Link |
|---|---|
| Reference audio 1 | 1.wav |
| Reference audio 2 | 2.wav |
| Reference audio 3 | 3.wav |
| Reference audio 4 | 4.wav |
| VITS | vits.wav |

▶️ We are done here, let's go

| Description | Link |
|---|---|
| Reference audio 1 | 1.wav |
| Reference audio 2 | 2.wav |
| Reference audio 3 | 3.wav |
| Reference audio 4 | 4.wav |
| VITS | vits.wav |

🧪 Comparison Sample With Original Audio

▶️ I have to be careful of them, as they tear very easily

| Description | Link |
|---|---|
| Reference audio 1 | 1.wav |
| Reference audio 2 | 2.wav |
| Reference audio 3 | 3.wav |
| Reference audio 4 | 4.wav |
| VITS | vits.wav |
| Original | original.wav |

📈 Comparison Over Training Iterations

▶️ It's bed time, let's go to sleep

| Iteration | Link |
|---|---|
| 1000 | 1000.wav |
| 3000 | 3000.wav |
| 6000 | 6000.wav |
| 9000 | 9000.wav |
| 12000 | 12000.wav |
| 15000 | 15000.wav |
| VITS | vits.wav |
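
When listening to the iteration samples locally, it helps to play them in training order. A plain alphabetical sort would place `12000.wav` before `3000.wav`, so the sketch below sorts by the numeric iteration instead and puts the baseline last. The helper name `order_iteration_samples` is hypothetical and not part of this repository; it only assumes the file names listed above.

```python
from pathlib import Path


def order_iteration_samples(filenames, baseline="vits.wav"):
    """Return checkpoint samples sorted by numeric iteration, baseline last.

    Hypothetical helper for illustration: it assumes checkpoint samples are
    named "<iteration>.wav" and the baseline sample is named "vits.wav".
    """
    # Keep only files whose stem is a plain integer, e.g. "3000.wav".
    checkpoints = [f for f in filenames if Path(f).stem.isdigit()]
    # Sort by the integer value so 3000 comes before 12000.
    checkpoints.sort(key=lambda f: int(Path(f).stem))
    # Append the baseline sample, if present, for a final comparison.
    if baseline in filenames:
        return checkpoints + [baseline]
    return checkpoints


samples = [
    "vits.wav", "9000.wav", "12000.wav", "1000.wav",
    "15000.wav", "3000.wav", "6000.wav",
]
for name in order_iteration_samples(samples):
    print(name)
```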

Each sample demonstrates how conditioning on a reference audio's emotional prosody and speaker characteristics can make speech synthesized by our model more expressive.
