The landscape of generative audio is shifting toward efficiency. A new open-source contender, Kani-TTS-2, has been released by the team at nineninesix.ai. The model marks a departure from heavy, compute-expensive TTS systems: instead, it treats audio as a language, delivering high-fidelity speech synthesis with a remarkably small footprint.
Kani-TTS-2 offers a lean, high-performance alternative to closed-source APIs. It is currently available on Hugging Face in both English (EN) and Portuguese (PT) versions.
The Architecture: LFM2 and NanoCodec
Kani-TTS-2 follows the ‘Audio-as-Language’ philosophy. The model does not use traditional mel-spectrogram pipelines; instead, it converts raw audio into discrete tokens using a neural codec.
The system relies on a two-stage process:
- The Language Backbone: The model is built on LiquidAI’s LFM2 (350M) architecture. This backbone generates ‘audio intent’ by predicting the next audio tokens. Because LFMs (Liquid Foundation Models) are designed for efficiency, they provide a faster alternative to standard transformers.
- The Neural Codec: It uses the NVIDIA NanoCodec to turn these tokens into 22 kHz waveforms.
By using this architecture, the model captures human-like prosody (the rhythm and intonation of speech) without the ‘robotic’ artifacts found in older TTS systems.
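To make the two-stage flow concrete, here is a minimal sketch in Python. The class names (`Lfm2Backbone`, `NanoCodecDecoder`) and method signatures are illustrative assumptions, not the actual Kani-TTS-2 API; only the text-to-tokens-to-waveform structure mirrors the description above.

```python
# Minimal sketch of the 'Audio-as-Language' two-stage pipeline.
# All class/method names are illustrative assumptions, not the real API.
import numpy as np

class Lfm2Backbone:
    """Stage 1 (assumed): an LFM2-style language model that autoregressively
    predicts discrete audio tokens from input text."""
    def generate_audio_tokens(self, text: str, max_tokens: int = 2048) -> list[int]:
        # In the real model this is next-token prediction over a codec
        # vocabulary; here we just return a placeholder token sequence.
        return [0] * min(max_tokens, len(text) * 4)

class NanoCodecDecoder:
    """Stage 2 (assumed): a neural codec that decodes audio tokens into a
    22 kHz waveform, the role NVIDIA NanoCodec plays in Kani-TTS-2."""
    sample_rate = 22_050
    def decode(self, tokens: list[int]) -> np.ndarray:
        # Placeholder: a real codec maps token IDs to waveform samples.
        return np.zeros(len(tokens) * 512, dtype=np.float32)

def synthesize(text: str) -> tuple[np.ndarray, int]:
    backbone, codec = Lfm2Backbone(), NanoCodecDecoder()
    tokens = backbone.generate_audio_tokens(text)   # text -> audio tokens
    waveform = codec.decode(tokens)                 # tokens -> waveform
    return waveform, codec.sample_rate

audio, sr = synthesize("Hello from Kani-TTS-2.")
print(f"{len(audio) / sr:.1f}s of audio at {sr} Hz")
```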
Efficiency: 10,000 Hours in 6 Hours
The training metrics for Kani-TTS-2 are a masterclass in optimization. The English model was trained on 10,000 hours of high-quality speech data.
While that scale is impressive, the speed of training is the real story. The research team trained the model in only 6 hours using a cluster of 8 NVIDIA H100 GPUs, showing that large datasets no longer require weeks of compute time when paired with efficient architectures like LFM2.
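As a rough back-of-the-envelope check (assuming a single pass over the data, which the announcement does not specify), those numbers work out to roughly 200 hours of audio per GPU-hour:

```python
# Back-of-the-envelope training throughput, assuming one pass over the data.
dataset_hours = 10_000   # reported training data
wall_clock_hours = 6     # reported training time
num_gpus = 8             # NVIDIA H100 cluster

audio_hours_per_gpu_hour = dataset_hours / (wall_clock_hours * num_gpus)
print(f"~{audio_hours_per_gpu_hour:.0f} hours of audio per GPU-hour")  # ~208
```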
Zero-Shot Voice Cloning and Performance
The standout feature for developers is zero-shot voice cloning. Unlike traditional models that require fine-tuning for new voices, Kani-TTS-2 uses speaker embeddings.
- How it works: You provide a short reference audio clip.
- The result: The model extracts the distinctive characteristics of that voice and applies them to the generated text instantly (see the sketch after this list).
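A hypothetical usage sketch of that flow is below. The `KaniTTS2` class and its methods are invented for illustration; consult the Hugging Face model card for the real interface.

```python
# Hypothetical sketch of zero-shot cloning via speaker embeddings.
# `KaniTTS2` and its methods are illustrative, not the actual API.
import numpy as np

class KaniTTS2:
    def embed_speaker(self, reference_wav: np.ndarray, sr: int) -> np.ndarray:
        """Extract a fixed-size speaker embedding from a short reference clip."""
        return np.random.randn(256).astype(np.float32)  # placeholder embedding

    def synthesize(self, text: str, speaker_embedding: np.ndarray) -> np.ndarray:
        """Generate speech in the reference voice, conditioned on the embedding."""
        return np.zeros(22_050 * 2, dtype=np.float32)   # placeholder waveform

tts = KaniTTS2()
reference = np.zeros(22_050 * 5, dtype=np.float32)      # ~5s reference clip
embedding = tts.embed_speaker(reference, sr=22_050)     # no fine-tuning needed
speech = tts.synthesize("Cloned in one shot.", speaker_embedding=embedding)
```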
From a deployment perspective, the model is highly accessible:
- Parameter Count: 400M (0.4B) parameters.
- Speed: It features a Real-Time Factor (RTF) of 0.2, meaning it can generate 10 seconds of speech in roughly 2 seconds (the quick calculation after this list spells this out).
- Hardware: It requires only 3GB of VRAM, making it compatible with consumer-grade GPUs like the RTX 3060 or 4050.
- License: Released under the Apache 2.0 license, allowing for commercial use.
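The speed claim is easy to sanity-check: RTF is wall-clock generation time divided by the duration of the audio produced, so values below 1.0 mean faster than real time. This plain-Python snippet (no model required) shows the arithmetic behind the reported figure:

```python
# Real-Time Factor (RTF) = generation time / audio duration.
# RTF < 1.0 means the model runs faster than real time.

def rtf(generation_seconds: float, audio_seconds: float) -> float:
    return generation_seconds / audio_seconds

audio_seconds = 10.0
generation_seconds = audio_seconds * 0.2        # reported RTF of 0.2
print(rtf(generation_seconds, audio_seconds))   # 0.2 -> 10s of speech in ~2s
```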
Key Takeaways
- Efficient Architecture: The model uses a 400M-parameter backbone based on LiquidAI’s LFM2 (350M). This ‘Audio-as-Language’ approach treats speech as discrete tokens, allowing for faster processing and more human-like intonation compared to traditional architectures.
- Rapid Training at Scale: Kani-TTS-2-EN was trained on 10,000 hours of high-quality speech data in just 6 hours using 8 NVIDIA H100 GPUs.
- Instant Zero-Shot Cloning: There is no need for fine-tuning to replicate a specific voice. Given a short reference audio clip, the model uses speaker embeddings to instantly synthesize text in the target speaker’s voice.
- High Performance on Edge Hardware: With a Real-Time Factor (RTF) of 0.2, the model can generate 10 seconds of audio in roughly 2 seconds. It requires only 3GB of VRAM, making it fully functional on consumer-grade GPUs like the RTX 3060.
- Developer-Friendly Licensing: Released under the Apache 2.0 license, Kani-TTS-2 is ready for commercial integration. It offers a local-first, low-latency alternative to expensive closed-source TTS APIs.
Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.


