
FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications


FireRed Team

Abstract. This work proposes FireRedTTS, a foundation text-to-speech framework, to meet the growing demands for personalized and diverse generative speech applications. The framework comprises three parts: data processing, foundation system, and downstream applications. First, we comprehensively present our data processing pipeline, which transforms massive raw audio into a large-scale, high-quality TTS dataset with rich annotations and wide coverage of content, speaking style, and timbre. Then, we propose a language-model-based foundation TTS system. The speech signal is compressed into discrete semantic tokens via a semantic-aware speech tokenizer, and a language model generates these tokens from the prompt text and audio. A two-stage waveform generator then decodes the tokens into a high-fidelity waveform. We present two applications of this system: voice cloning for dubbing and human-like speech generation for chatbots. The experimental results demonstrate the solid in-context learning capability of FireRedTTS, which can stably synthesize high-quality speech consistent with the prompt text and audio. For dubbing, FireRedTTS can clone target voices in a zero-shot way for the UGC scenario and adapt to studio-level expressive voice characters in the PUGC scenario via few-shot fine-tuning with a 1-hour recording. Moreover, through instruction tuning, FireRedTTS achieves controllable human-like speech generation in a casual style with paralinguistic behaviors and emotions, to better serve spoken chatbots.



System Overview

Figure 1. An overview of the FireRedTTS foundation system, which consists of a text-to-speech language model that maps text tokens to semantic tokens and a two-stage token-to-waveform generator. The semantic tokens are extracted using a semantic-aware speech tokenizer. In the Mel decoder, (c) illustrates a flow-matching-based approach that transforms noise into a spectrogram using semantic tokens as conditions, and (d) depicts a streamable decoder based on a multi-stream language model and a Mel codec.
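To make the flow-matching idea in panel (c) concrete, the following is a minimal toy sketch, not the FireRedTTS decoder. It assumes a straight-line probability path and replaces the learned, token-conditioned velocity network with a hypothetical closed-form function `toy_velocity`; the `target` vector stands in for whatever spectrogram frame the semantic tokens would condition the model toward. All names are illustrative.

```python
import random

def toy_velocity(x, t, target):
    # Hypothetical stand-in for a learned velocity network conditioned on
    # semantic tokens. For the straight-line path x_t = (1 - t) * x_0 + t * x_1,
    # the velocity that points at x_1 is (x_1 - x_t) / (1 - t).
    return [(g - v) / (1.0 - t) for v, g in zip(x, target)]

def euler_sample(target, steps=8, seed=0):
    # Start from Gaussian noise at t = 0 and integrate dx/dt = v(x, t)
    # towards t = 1 with fixed-size Euler steps.
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]
    h = 1.0 / steps
    for k in range(steps):
        t = k * h  # t never reaches 1, so (1 - t) stays positive
        v = toy_velocity(x, t, target)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x

if __name__ == "__main__":
    toy_frame = [0.5, -1.0, 2.0]  # assumed 3-bin "spectrogram frame"
    print(euler_sample(toy_frame))
```

In a real system the velocity would come from a trained network and the integration would run over full spectrogram tensors; the sketch only shows how noise is transported to a conditioned target by integrating a velocity field.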

Human-Like Speech Generation


Emotion Text Generated
Angry 我真的无法理解你为什么总是这么自私,完全不顾及别人的感受!
Happy 真是难以置信,这么好的事情竟然发生在我身上!
Sad 你离开后,生活仿佛失去了颜色,我再也找不到曾经的那份快乐了。


Behavior Text Generated
Word-Level Repetition 因为[word_rep]因为[word_rep]因为这样一个对待对我们来说重要的动物它应该有一些选。
Char-Level Repetition 哎我然后我[char_rep]我然后我就跟他说我说这个字你不认识吗。
Elongation 后来回来的时候,我自己[elong]坐着飞机回来的。
Hissing 当然需求的话[oralsii]肯定是有这一部分的。
Dental click 当然需求的话肯定[tsk]是有这一部分的。
Breath 当然需求的话肯定[breath]是有这一部分的。
Laugh 当然需求的话肯定[laugh]是有这一部分的。
Speak with a laugh 这^部^的^话^,我觉得相对于电影来说,其实会和我一样吧,都会选择电影吧。
Emphasis 谁@知@道@啊,它里面才出现那么几个镜头,而且我就感觉他是客@串@的。
Filled pause 嗯(filled pause)没有这回事。
Confirmation 嗯(confirmation)没有这回事。
Realization 哦(realization)[prolong],你在那儿呢。
Surprise 哦(surprise),你在那儿呢。
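The table above mixes three kinds of inline markup: bracketed event tags such as `[laugh]`, parenthesized labels such as `(surprise)`, and per-character marks (`^` for speaking with a laugh, `@` for emphasis). The sketch below shows one way such text could be split into segments before being passed to a model. The tag semantics are inferred from the table only, and `parse_markup` and the segment names are hypothetical, not the FireRedTTS front end.

```python
import re

# One alternative per markup kind; plain characters are the fallback.
TOKEN_RE = re.compile(
    r"\[(?P<event>[a-z_]+)\]"        # bracketed event tag, e.g. [laugh]
    r"|\((?P<label>[a-z ]+)\)"       # parenthesized label, e.g. (surprise)
    r"|(?P<char>.)(?P<mark>[\^@])"   # one character followed by ^ or @
    r"|(?P<plain>.)",                # any other single character
    re.DOTALL,
)

# Assumed names for the per-character marks, taken from the table rows.
MARK_NAMES = {"^": "laugh_speech", "@": "emphasis"}

def parse_markup(text):
    """Split marked-up text into (kind, value) segments."""
    segments = []
    for m in TOKEN_RE.finditer(text):
        if m.group("event"):
            segments.append(("event", m.group("event")))
        elif m.group("label"):
            segments.append(("label", m.group("label")))
        elif m.group("mark"):
            segments.append((MARK_NAMES[m.group("mark")], m.group("char")))
        else:
            ch = m.group("plain")
            # Merge consecutive plain characters into one text segment.
            if segments and segments[-1][0] == "text":
                segments[-1] = ("text", segments[-1][1] + ch)
            else:
                segments.append(("text", ch))
    return segments

if __name__ == "__main__":
    for kind, value in parse_markup("当然需求的话肯定[laugh]是有这一部分的。"):
        print(kind, value)
```

Keeping the parse result as a flat list of typed segments makes it easy to map each kind to its own token or embedding later, whatever the downstream model expects.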

Voice Cloning

Zero-Shot In-Context Learning

Chinese Text English Prompt Generation
English Text Chinese Prompt Generation
This characterisation of the greater continence in the use of stimulants practised by the women of the reputable classes may seem an excessive refinement of logic at the expense of common sense.
Before any could stop him he butted his Majesty so furiously that the King soared far into the air and tumbled in a heap among the benches, where he lay moaning and groaning.
Code-Switch Text Chinese Prompt Generation
Code-Switch Text English Prompt Generation

Few-Shot Speaker Adaptation

Text Prompt Audio Zero-Shot Generated Few-Shot Generated
Speaker 1 (2min) 伏旭舅舅给小朋友讲的故事,是由,海豚绘本花园出品的,绘本故事。
Speaker 2 (2min) 负责相关区域的快递员已按深圳市统一要求做了核酸检测。
Wu Kong (1 hour)

Ablation Studies

Prompt Processing for Robust Voice Cloning

SNR Text Prompt Generation
0 可以理解为,调整的实质是放松了单笔申报最大数量的限制。 Raw
10 您看,您这个行业的很多客户,都通过线上推广实现销量增长了。 Raw
20 首尔中央地方法院十四日就李明基涉嫌虐待员工案作出一审宣判。 Raw
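For readers who want to reproduce prompts at comparable noise levels, the sketch below mixes a noise signal into clean speech at a chosen signal-to-noise ratio. It assumes the SNR column is in decibels and that additive noise is an acceptable simulation; the page does not state how the prompts in this table were obtained, and the function names are illustrative.

```python
import math
import random

def power(x):
    # Mean squared amplitude of a signal given as a list of samples.
    return sum(v * v for v in x) / len(x)

def mix_at_snr(speech, noise, snr_db):
    # Scale the noise so that 10 * log10(P_speech / P_noise) equals snr_db,
    # then add it to the speech sample by sample.
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]

def measured_snr_db(speech, mixture):
    # Recover the SNR from the clean speech and the noisy mixture.
    residual = [m - s for m, s in zip(mixture, speech)]
    return 10 * math.log10(power(speech) / power(residual))

if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed test signals: a 220 Hz tone at 16 kHz and white Gaussian noise.
    speech = [math.sin(2 * math.pi * 220 * i / 16000) for i in range(16000)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
    for snr in (0, 10, 20):
        noisy = mix_at_snr(speech, noise, snr)
        print(snr, round(measured_snr_db(speech, noisy), 3))
```

At 0 dB the noise carries as much power as the speech, while at 20 dB the speech power is 100 times the noise power, which gives a sense of how degraded each prompt level is.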

Non-Streamable and Streamable Decoder

Text Ground-Truth Flow-Matching Decoder Streamable Decoder