VALL-E

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

[Paper]


Chengyi Wang*, Sanyuan Chen*, Yu Wu*, Ziqiang Zhang, Long Zhou, Shujie Liu,
Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei

Microsoft

Abstract. We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, which is hundreds of times larger than existing systems. VALL-E exhibits emergent in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experimental results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find that VALL-E can preserve the speaker's emotion and the acoustic environment of the acoustic prompt in synthesis.

This page is for research demonstration purposes only.

Model Overview

Overview of VALL-E. Unlike the previous pipeline (e.g., phoneme → mel-spectrogram → waveform), the pipeline of VALL-E is phoneme → discrete code → waveform. VALL-E generates the discrete audio codec codes conditioned on phoneme and acoustic code prompts, which correspond to the target content and the speaker's voice, respectively. VALL-E directly enables various speech synthesis applications, such as zero-shot TTS, speech editing, and content creation combined with other generative AI models like GPT-3.
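The data flow described above (phoneme → discrete code → waveform) can be sketched with toy stand-ins for each stage. This is a minimal illustration, not the VALL-E implementation: every function name, the character-level "phonemes", the random code generator, and the constants (frame rate, codebook size, samples per frame, one code per frame) are assumptions chosen only to make the stages and their inputs and outputs concrete.

```python
import random

# Illustrative constants (assumptions, not taken from this page).
FRAME_RATE_HZ = 75        # codec frames per second of audio
CODEBOOK_SIZE = 1024      # number of distinct discrete codes
SAMPLES_PER_FRAME = 320   # waveform samples reconstructed per frame


def text_to_phonemes(text):
    """Hypothetical front end: letters stand in for phonemes."""
    return [c for c in text.lower() if c.isalpha()]


def encode_prompt(seconds, rng):
    """Stand-in for a codec encoder: one discrete code per frame."""
    return [rng.randrange(CODEBOOK_SIZE)
            for _ in range(int(seconds * FRAME_RATE_HZ))]


def generate_codes(phonemes, prompt_codes, rng, frames_per_phoneme=6):
    """Stand-in for the codec language model.

    A trained model would predict each code from the phonemes, the prompt
    codes, and the codes generated so far. Here the codes are random; only
    the autoregressive loop structure is illustrated.
    """
    generated = []
    for _ in range(len(phonemes) * frames_per_phoneme):
        generated.append(rng.randrange(CODEBOOK_SIZE))
    return generated


def decode_to_waveform(codes):
    """Stand-in for a codec decoder: each code becomes a flat frame in [-1, 1]."""
    wave = []
    for code in codes:
        value = 2 * code / (CODEBOOK_SIZE - 1) - 1
        wave.extend([value] * SAMPLES_PER_FRAME)
    return wave


def synthesize(text, prompt_seconds=3.0, seed=0):
    rng = random.Random(seed)
    phonemes = text_to_phonemes(text)                    # target content
    prompt_codes = encode_prompt(prompt_seconds, rng)    # speaker's voice
    codes = generate_codes(phonemes, prompt_codes, rng)  # discrete codes
    return phonemes, prompt_codes, codes, decode_to_waveform(codes)


if __name__ == "__main__":
    phonemes, prompt_codes, codes, wave = synthesize("So what")
    print(len(phonemes), len(prompt_codes), len(codes), len(wave))
```

The point of the sketch is the interface between stages: the language model only ever sees and emits discrete codes, and a separate decoder turns those codes back into a waveform.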

LibriSpeech Samples

Text | Speaker Prompt | Ground Truth | Baseline | VALL-E
They moved thereafter cautiously about the hut groping before and about them to find something to show that Warrenton had fulfilled his mission.
And lay me down in thy cold bed and leave my shining lot.
Number ten, fresh nelly is waiting on you, good night husband.
Yea, his honourable worship is within, but he hath a godly minister or two with him, and likewise a leech.
Instead of shoes, the old man wore boots with turnover tops, and his blue coat had wide cuffs of gold braid.
The army found the people in poverty and left them in comparative wealth.
Thus did this humane and right minded father comfort his unhappy daughter, and her mother embracing her again, did all she could to soothe her feelings.
He was in deep converse with the clerk and entered the hall holding him by the arm.

VCTK Samples

Text | Speaker Prompt | Ground Truth | Baseline | VALL-E
We have to reduce the number of plastic bags.
So what is the campaign about?
My life has changed a lot.
Nothing is yet confirmed.
I could hardly move for the next couple of days.
His son has been travelling with the Tartan Army for years.
Her husband was very concerned that it might be fatal.
We've made a couple of albums.

Synthesis of Diversity

Because VALL-E generates discrete tokens by sampling, it can synthesize diverse personalized speech samples from the same pair of text and speaker prompts by using different random seeds.
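The effect of the random seed can be illustrated with a small sketch of sampling-based token generation. It assumes a toy scoring function (`toy_logits`, hypothetical) in place of a trained model, and the vocabulary size and sequence length are arbitrary. It shows only that sampling from the same conditional distribution with different seeds yields different code sequences, while a fixed seed or greedy decoding is repeatable.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logits(history, vocab_size):
    """Hypothetical stand-in for a trained model: scores depend on the last code."""
    last = history[-1] if history else 0
    return [math.cos(0.7 * (last + 1) * (k + 1)) for k in range(vocab_size)]


def sample_codes(seed, length=12, vocab_size=8, temperature=1.0):
    """Draw each code from the model's distribution using a seeded generator."""
    rng = random.Random(seed)
    codes = []
    for _ in range(length):
        probs = softmax(toy_logits(codes, vocab_size), temperature)
        codes.append(rng.choices(range(vocab_size), weights=probs, k=1)[0])
    return codes


def greedy_codes(length=12, vocab_size=8):
    """Always pick the highest-scoring code; the output never varies."""
    codes = []
    for _ in range(length):
        logits = toy_logits(codes, vocab_size)
        codes.append(max(range(vocab_size), key=lambda k: logits[k]))
    return codes


if __name__ == "__main__":
    for seed in (0, 1, 2):
        print("seed", seed, sample_codes(seed))
    print("greedy", greedy_codes())
```

In this toy setting, the seed plays the role of "VALL-E Sample1" versus "VALL-E Sample2" in the table below: the prompts are identical and only the random draws differ.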

Text | Speaker Prompt | VALL-E Sample1 | VALL-E Sample2
Because we do not need it.
I must do something about it.
He has not been named.
Number ten, fresh nelly is waiting on you, good night husband.

Acoustic Environment Maintenance

VALL-E can synthesize personalized speech while maintaining the acoustic environment of the speaker prompt. The audio and transcriptions are sampled from the Fisher dataset.

Text | Speaker Prompt | Ground Truth | VALL-E
I think it's like you know um more convenient too.
Um we have to pay have this security fee just in case she would damage something but um.
Everything is run by computer but you got to know how to think before you can do a computer.
As friends thing I definitely I've got more male friends.

Speaker’s Emotion Maintenance

VALL-E can synthesize personalized speech while maintaining the emotion in the speaker prompt. The audio prompts are sampled from the Emotional Voices Database.

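Text | Emotion | Speaker Prompt | VALL-E
We have to reduce the number of plastic bags. | Anger
We have to reduce the number of plastic bags. | Sleepy
We have to reduce the number of plastic bags. | Neutral
We have to reduce the number of plastic bags. | Amused
We have to reduce the number of plastic bags. | Disgusted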

More Samples

We randomly selected some transcriptions and 3-second audio segments from the LibriSpeech test-clean set as the text and speaker prompts, and then used VALL-E to synthesize the personalized speech. Note that because the transcriptions and audio segments are from different speakers, there is no ground truth speech for reference.

Text | Speaker Prompt | VALL-E
The others resented postponement, but it was just his scruples that charmed me.
Notwithstanding the high resolution of hawkeye, he fully comprehended all the difficulties and danger he was about to incur.
We were more interested in the technical condition of the station than in the commercial part.
Paul takes pride in his ministry not to his own praise but to the praise of god.
The ideas also remain but they have become types in nature forms of men animals birds fishes.
Other circumstances permitting that instinct disposes men to look with favor upon productive efficiency and on whatever is of human use.
But suppose you said I'm fond of writing, my people always say my letters home are good enough for punch.
He summoned half a dozen citizens to join his posse who followed obeyed and assisted him.

Ethics Statement

Since VALL-E can synthesize speech that maintains speaker identity, it carries potential risks of misuse, such as spoofing voice identification or impersonating a specific speaker. We conducted the experiments under the assumption that the user agrees to be the target speaker in speech synthesis. If the model is generalized to unseen speakers in the real world, it should include both a protocol to ensure that the speaker approves the use of their voice and a synthesized speech detection model.