FireRedTTS

FireRedTTS-2: Towards Long Conversational Speech Generation for Podcast and Chatbot

[Paper] [Code]

Abstract. Current dialogue generation approaches typically require the complete dialogue text before synthesis and produce a single, inseparable audio stream containing all voices, making them unsuitable for interactive chat; moreover, they suffer from unstable synthesis, inaccurate speaker transitions, and incoherent prosody. In this work, we present FireRedTTS-2, a long-form streaming TTS system for multi-speaker dialogue generation that delivers stable, natural speech with reliable speaker switching and context-aware prosody. A new 12.5 Hz streaming speech tokenizer accelerates training and inference, extends the maximum dialogue length, encodes richer semantics to stabilize text-to-token modeling, and supports high-fidelity streaming generation for real-time applications. We adopt a text–speech interleaved format, concatenating speaker-labeled text with aligned speech tokens in chronological order, and model it with a dual-transformer: a large decoder-only transformer predicts the tokens of the first layer, and a smaller one completes the subsequent layers. Experimental results show that FireRedTTS-2 integrates seamlessly with chat frameworks and, with minimal fine-tuning, produces emotionally expressive speech guided by implicit contextual cues. In podcast generation, it surpasses existing systems including MoonCast, ZipVoice-Dialog, and MOSS-TTSD in objective intelligibility, speaker-turn reliability, and perceived naturalness with context-consistent prosody.

Contents

Video Demo
System Overview
Podcast Generation

Zero-Shot Podcast Generation
Speaker-Specific Finetuned Podcast Generation

Interactive Chat

Emotional Speech Generation

Video Demo

⚠️ Speaker voices: hosts "肥杰" and "惠子" from the podcast "肥话连篇". Use without authorization is prohibited.

System Overview

Figure 1. An overview of FireRedTTS-2, including: (a) a new speech tokenizer with a 12.5 Hz frame rate and enhanced semantic information, and (b) a text-to-speech model using a dual-transformer architecture with interleaved text–speech input, enabling per-sentence generation and context-coherent prosody.
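
The interleaved text–speech input in (b) can be pictured with a small sketch. This is a hypothetical illustration, not the released implementation: the `Turn` class, the `interleave` and `num_frames` helpers, and the `<a…>` token spelling are assumptions made for this example. It also simplifies each speech frame to a single token ID, whereas the model described above predicts several token layers per frame. Only the ordering (speaker-labeled text followed by that turn's speech tokens, in chronological order) and the 12.5 Hz frame rate come from the description.

```python
from dataclasses import dataclass
from typing import List

FRAME_RATE_HZ = 12.5  # tokenizer frame rate stated in the overview


@dataclass
class Turn:
    speaker: str              # speaker label such as "[S1]"
    text: str                 # text of this turn
    speech_tokens: List[int]  # placeholder IDs standing in for tokenizer output


def num_frames(duration_s: float) -> int:
    """Number of token frames for an utterance of the given duration."""
    return round(duration_s * FRAME_RATE_HZ)


def interleave(turns: List[Turn]) -> List[str]:
    """Flatten turns into one sequence: labeled text, then that turn's speech tokens."""
    sequence: List[str] = []
    for turn in turns:
        sequence.append(f"{turn.speaker}{turn.text}")
        sequence.extend(f"<a{tok}>" for tok in turn.speech_tokens)
    return sequence


turns = [
    Turn("[S1]", "Hello there.", [11, 12, 13]),
    Turn("[S2]", "Hi, welcome back.", [21, 22]),
]
print(interleave(turns))
print(num_frames(60.0))  # frames in one minute of speech at 12.5 Hz
```

Because each turn's text precedes its own speech tokens, a sequence laid out this way can be extended one sentence at a time, which is consistent with the per-sentence generation mentioned in the caption.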

Podcast Generation

Zero-Shot Podcast Generation

Note: We use the first two dialogue turns as a prompt and generate speech from the remaining text.
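
The prompt/target split described in the note can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation script used for the demos: the helper names `split_turns` and `split_prompt` are hypothetical, and the parser assumes numeric speaker tags of the form `[S1]`, `[S2]` and turn text that contains no `[` character.

```python
import re
from typing import List, Tuple

# Matches a speaker tag such as "[S1]" followed by that turn's text.
TURN_PATTERN = re.compile(r"(\[S\d+\])\s*([^\[]*)")


def split_turns(dialogue: str) -> List[Tuple[str, str]]:
    """Split a tagged dialogue string into (speaker, text) pairs."""
    return [(spk, text.strip()) for spk, text in TURN_PATTERN.findall(dialogue)]


def split_prompt(turns: List[Tuple[str, str]], num_prompt_turns: int = 2):
    """Use the first turns as the prompt and the rest as text to synthesize."""
    return turns[:num_prompt_turns], turns[num_prompt_turns:]


dialogue = "[S1]Welcome to the show. [S2]Glad to be here. [S1]Let's get started. [S2]Sure."
prompt, target = split_prompt(split_turns(dialogue))
print(prompt)  # first two turns, used as the prompt
print(target)  # remaining turns, whose speech is generated
```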

Language	Prompt Audio	MoonCast	ZipVoice-Dialog	MOSS-TTSD	FireRedTTS-2
Chinese
Chinese
English
English

Speaker-Specific Finetuned Podcast Generation

Note: We generate speech from the multi-turn dialogue text using two fine-tuned speaker voices.

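Dialogue Content (translated from Chinese)	Ground Truth	FireRedTTS-2
[S1] Mm, oil paintings are definitely the easiest to resell. But of course we still want to support ambitious young artists, performance artists, and every form in the art ecosystem. Right, mm-hm. [S2] Are there really many people who buy this kind of video art? Who collect video art? [S1] Nobody around me does, but there certainly are some. Mm, right, I myself only do oil paintings.
[S1] We roughly counted the foot traffic that night at over seven hundred. [S2] Oh, that really is a lot, because, I don't know, take the 陈轴 event, which was actually also one of your parties. [S1] Yes, that's right. [S2] It was also your annual party. But that one was relatively niche, and I feel like a hundred or so people on site was already quite a lot.
[S1] Right, right. For someone who has never been to the US, going and seeing offline retail there, whether 巴斯曼 or Walmart, and since people coming out of Shenzhen mostly deal in electronics accessories, you find that, wow, the prices really are high. Things sell for thirty-five or forty dollars, and even a phone case starts at twenty-five dollars. [S2] Yes, exactly, every time I find it unbelievable. Who would buy a phone case for thirty to fifty dollars? But actually at Target, at supermarkets like Target, that's how everything is priced, and plenty of people buy. [S1] Right, right. Then if we look at Amazon, whether it's phone cases, screen protectors, car windows, or all kinds of cables, it's around $7.99 or $8.99, and that's the price point that sells the most. That's dictated by Amazon's rules of the game. If you sell below $7.99, you basically make no money. [S2] So, uh, apart from going overseas to investigate, since field research is certainly the most direct way, I know you've just started a podcast called, uh, rean, which is in English. So what else do you usually listen to, or where do you get information about overseas markets? [S1] Mm, because we sell on Amazon, we follow a lot of industry matters. For example, what new rules Amazon has, uh, whether logistics prices are fluctuating, whether there are new review policies, what new advertising tactics there are. For these we follow many industry WeChat official accounts, look up articles in Zhihu columns, and we also have many peers around us. We often sit together and chat to see what information we can share. That's one way we stay informed.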

Interactive Chat

Implicit Emotional Control

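Dialogue Content (translated from Chinese)	Emotion	Generated
[S1] I gave away the concert ticket you spent half a year saving up for. [S_DIALOG] I queued for three days to get that! What right do you have to give it away? Can you even afford to pay me back?!	Angry
[S1] You promised to mail my package, but three days on you still haven't sent it. [S_DIALOG] I'm sorry. Every day I told myself "I'll send it tomorrow," and I kept putting it off. I'll go right now, I'm really sorry.	Apology
[S1] Your friend has been really down lately and is posting strange things on WeChat Moments. [S_DIALOG] I saw that too. I'm so worried he might be thinking of hurting himself, but I don't dare ask him directly.	Concern
[S1] My dog learned to shake hands for the first time today, and wagged its tail too! [S_DIALOG] Aww, that's so cute! It must love you so much, quick, reward it with a little biscuit!	Happy
[S1] Was your mother's biggest regret before she passed that she never saw you get married? [S_DIALOG] On my wedding day, I set a pair of chopsticks in front of her photo. During the toasts, I knelt down and said, "Mom, I've started my own family." But no one answered me, and in that moment I cried like a motherless child.	Sad
[S1] My phone's voice assistant suddenly started speaking in my mother's voice! [S_DIALOG] That's so creepy! Has it secretly connected to another world?!	Surprised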