FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications

[Paper]


FireRed Team

Abstract. This work proposes FireRedTTS, a foundation text-to-speech framework, to meet the growing demands for personalized and diverse generative speech applications. The framework comprises three parts: data processing, foundation system, and downstream applications. First, we comprehensively present our data processing pipeline, which transforms massive raw audio into a large-scale, high-quality TTS dataset with rich annotations and wide coverage of content, speaking style, and timbre. Next, we propose a language-model-based foundation TTS system. The speech signal is compressed into discrete semantic tokens by a semantic-aware speech tokenizer, and these tokens can be generated by a language model from the prompt text and audio. A two-stage waveform generator then decodes the tokens into a high-fidelity waveform. We present two applications of this system: voice cloning for dubbing and human-like speech generation for chatbots. The experimental results demonstrate the solid in-context learning capability of FireRedTTS, which can stably synthesize high-quality speech consistent with the prompt text and audio. For dubbing, FireRedTTS can clone target voices in a zero-shot way for the UGC scenario and adapt to studio-level expressive voice characters in the PUGC scenario via few-shot fine-tuning with a 1-hour recording. Moreover, FireRedTTS achieves controllable human-like speech generation in a casual style with paralinguistic behaviors and emotions via instruction tuning, to better serve spoken chatbots.

Demo

System Overview

Figure 1. An overview of the FireRedTTS foundation system, which consists of a text-to-speech language model that maps text tokens to semantic tokens and a two-stage token-to-waveform generator. The semantic tokens are extracted using a semantic-aware speech tokenizer. In the Mel decoder, (c) illustrates a flow-matching-based approach that transforms noise into a spectrogram using semantic tokens as conditions, and (d) depicts a streamable decoder based on a multi-stream language model and a Mel Codec.
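
To make the flow-matching idea in panel (c) more concrete, here is a minimal sketch of sampling by Euler integration: start from Gaussian noise and follow a velocity field, conditioned on semantic tokens, from t = 0 to t = 1. Everything in it is an assumption made for illustration. `toy_velocity` is a closed-form stand-in for the learned network, `tokens_to_condition` is a hypothetical mapping from token ids to one value per frame, and a real Mel decoder would work on multi-dimensional spectrogram frames. It is not the FireRedTTS implementation.

```python
import random


def tokens_to_condition(tokens, vocab_size=8):
    """Hypothetical conditioning: map each token id to one target value in [0, 1]."""
    return [tok / (vocab_size - 1) for tok in tokens]


def toy_velocity(x, t, cond):
    """Stand-in for a learned velocity network (assumption, not the real model).

    On the straight path x_t = (1 - t) * x0 + t * x1 the ideal velocity is
    x1 - x0, which equals (x1 - x_t) / (1 - t). Here x1 is the condition.
    """
    return [(c - xi) / (1.0 - t) for xi, c in zip(x, cond)]


def euler_sample(cond, num_steps=10, seed=0):
    """Integrate dx/dt = v(x, t, cond) from t = 0 to t = 1 with fixed Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in cond]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so the toy velocity never divides by zero
        v = toy_velocity(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


semantic_tokens = [3, 7, 0, 5]  # made-up token ids
cond = tokens_to_condition(semantic_tokens)
mel = euler_sample(cond, num_steps=10)
```

With this toy velocity field the final sample lands on the conditioning values, which is only a sanity check of the integrator. A trained network would instead produce a spectrogram consistent with the tokens.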

Human-Like Speech Generation


Emotions


Emotion Text Generated
Angry 我真的无法理解你为什么总是这么自私,完全不顾及别人的感受!
Happy 真是难以置信,这么好的事情竟然发生在我身上!
Sad 你离开后,生活仿佛失去了颜色,我再也找不到曾经的那份快乐了。

Paralinguistics


Behavior Text Generated
Word-Level Repetition 因为[word_rep]因为[word_rep]因为这样一个对待对我们来说重要的动物它应该有一些选。
Char-Level Repetition 哎我然后我[char_rep]我然后我就跟他说我说这个字你不认识吗。
Elongation 后来回来的时候,我自己[elong]坐着飞机回来的。
Hissing 当然需求的话[oralsii]肯定是有这一部分的。
Dental click 当然需求的话肯定[tsk]是有这一部分的。
Breath 当然需求的话肯定[breath]是有这一部分的。
Laugh 当然需求的话肯定[laugh]是有这一部分的。
Speak with a laugh 这^部^的^话^,我觉得相对于电影来说,其实会和我一样吧,都会选择电影吧。
Emphasis 谁@知@道@啊,它里面才出现那么几个镜头,而且我就感觉他是客@串@的。
Filled pause 嗯(filled pause)没有这回事。
Confirmation 嗯(confirmation)没有这回事。
Realization 哦(realization)[prolong],你在那儿呢。
Surprise 哦(surprise),你在那儿呢。
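
The texts above mix three kinds of control markup: bracket tags such as [laugh], parenthesized labels such as (filled pause), and per-character marks (^ after characters spoken with a laugh, @ after emphasized characters). The following is a minimal sketch of separating that markup from the plain text. The regular expressions, the `parse_markup` helper, and the names in `MARK_NAMES` are assumptions made for illustration. The page does not describe how FireRedTTS itself processes these markers.

```python
import re

BRACKET_TAG = re.compile(r"\[([a-z_]+)\]")  # e.g. [laugh], [word_rep]
PAREN_LABEL = re.compile(r"\(([a-z ]+)\)")  # e.g. (filled pause)
CHAR_MARK = re.compile(r"(.)([\^@])")       # a character followed by ^ or @
MARK_NAMES = {"^": "laughing_speech", "@": "emphasis"}  # hypothetical names


def parse_markup(text):
    """Split an annotated sentence into plain text and its control markup."""
    tags = BRACKET_TAG.findall(text)
    labels = PAREN_LABEL.findall(text)
    marked = [(ch, MARK_NAMES[mark]) for ch, mark in CHAR_MARK.findall(text)]
    # Remove all markup to recover the text that would actually be spoken.
    plain = BRACKET_TAG.sub("", text)
    plain = PAREN_LABEL.sub("", plain)
    plain = plain.replace("^", "").replace("@", "")
    return {"plain": plain, "tags": tags, "labels": labels, "marked": marked}


realization = parse_markup("哦(realization)[prolong],你在那儿呢。")
emphasis = parse_markup("谁@知@道@啊")
```

The sketch assumes ASCII brackets and parentheses, as in the examples on this page. Full-width punctuation would need extra patterns.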

Voice Cloning


Zero-Shot In-Context Learning


Chinese Text English Prompt Generation
明确线上课程教师讲授直播录播时间一般不超过二十分钟。
恰好在这个时候出现的演员限酬令和制作限价令,也绝不是偶然的了。
English Text Chinese Prompt Generation
This characterisation of the greater continence in the use of stimulants practised by the women of the reputable classes may seem an excessive refinement of logic at the expense of common sense.
Before any could stop him he butted his Majesty so furiously that the King soared far into the air and tumbled in a heap among the benches, where he lay moaning and groaning.
Code-Switch Text Chinese Prompt Generation
你昨天的performance真是outstanding,完全展示了你的skills。
我觉得我们需要一个更clear的strategy来实现我们的goals。
Code-Switch Text English Prompt Generation
这次旅行的schedule有点tight,我们需要plan的更efficient一些。
他今天的mood看起来不太好,可能需要一些space。

Few-Shot Speaker Adaptation


Text Prompt Audio Zero-Shot Generated Few-Shot Generated
Speaker 1 (2min) 伏旭舅舅给小朋友讲的故事,是由,海豚绘本花园出品的,绘本故事。
为了避开朋友,董海棠让王女士先出去遛狗。
Speaker 2 (2min) 负责相关区域的快递员已按深圳市统一要求做了核酸检测。
明确线上课程教师讲授直播录播时间一般不超过二十分钟。
Wu Kong (1 hour) 本人代写小学生,中学生暑假作业,价格免议包邮哦。
可以理解为,调整的实质是放松了单笔申报最大数量的限制。
你们太没信誉了,这都五天了,还不发货,我要投诉你。

Ablation Studies

Prompt Processing for Robust Voice Cloning


SNR Text Prompt Generation
0 可以理解为,调整的实质是放松了单笔申报最大数量的限制。 Raw
Processed
10 您看,您这个行业的很多客户,都通过线上推广实现销量增长了。 Raw
Processed
20 首尔中央地方法院十四日就李明基涉嫌虐待员工案作出一审宣判。 Raw
Processed
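
For readers who want to build prompts at comparable noise levels, here is a minimal sketch of scaling a noise signal to reach a target signal-to-noise ratio. It assumes the SNR values in the table are in decibels and uses the standard definition SNR = 10 * log10(P_signal / P_noise). The page does not say how the noisy prompts were produced or what the prompt processing step does, so the helper names and the synthetic signals below are purely illustrative.

```python
import math
import random


def power(samples):
    """Mean power of a sequence of samples."""
    return sum(v * v for v in samples) / len(samples)


def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels."""
    return 10.0 * math.log10(power(signal) / power(noise))


def scale_noise(signal, noise, target_snr_db):
    """Rescale the noise so that signal versus noise has the target SNR."""
    gain = math.sqrt(power(signal) / (power(noise) * 10.0 ** (target_snr_db / 10.0)))
    return [gain * n for n in noise]


rng = random.Random(0)
# Synthetic stand-ins: a sine tone as the "speech" and Gaussian noise.
signal = [math.sin(2.0 * math.pi * 220.0 * i / 16000.0) for i in range(1600)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]

noisy_prompts = {}
for target in (0, 10, 20):
    scaled = scale_noise(signal, noise, target)
    noisy_prompts[target] = [s + n for s, n in zip(signal, scaled)]
```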

Non-Streamable and Streamable Decoder


Text Ground-Truth Flow-Matching Decoder Streamable Decoder
不少观众啊是大惑不解呀,怎么我军,只能用迫击炮呢,大炮哪去了,其实啊,长津湖战场上的志愿军们呢。
有人报告了蒋介石,老蒋得知汤恩伯想逃到日本下令阻止。
到了他这个级别,花哨的各种招式已经不管用了,这是道的碰撞。