Alibaba’s Tongyi Lab speech group has launched two new models, Fun-CosyVoice3.5 and Fun-AudioGen-VD, both supporting “FreeStyle” instruction-based voice generation through natural language commands.
According to Alibaba Group, the models allow users to generate and control voice output directly through text prompts, whether fine-tuning vocal expression or designing entirely new timbres and soundscapes from scratch. While both models support natural language-controlled speech synthesis, they target different use cases: Fun-CosyVoice3.5 focuses on multilingual voice cloning and fine-grained expressive control, while Fun-AudioGen-VD centers on voice design and immersive scene-based audio generation.
Fun-CosyVoice3.5 upgrades the company’s Instruct-TTS capabilities, enabling users to generate speech freely with a single sentence of instruction. Users can describe delivery style in natural language, such as “sound more determined,” “lower the pitch slightly and slow the pace,” or “add subtle emotional variation,” and the model interprets and renders the desired effect.
The model now adds support for Thai, Indonesian, Portuguese, and Vietnamese. Alibaba claims that across 13 languages, Fun-CosyVoice3.5 maintains industry-leading performance on Word Error Rate (WER) and speaker similarity (SpkSim) benchmarks. It has also been optimized for rare characters and complex sentences, reducing the mispronunciation rate for uncommon characters from 15.2% to 5.3%, while delivering more stable performance on long-form text.
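For readers unfamiliar with the WER metric cited above, the following is a minimal sketch of how it is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recognized text, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions for illustration; the article does not describe how Alibaba's benchmarks tokenize or normalize text, and languages written without spaces are usually scored at the character level instead.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Illustrative WER: word-level edit distance / number of reference words.

    Assumes whitespace-separated words; not Alibaba's evaluation code.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

In the usage example, one word is dropped from a six-word reference, so the WER is 1/6, or roughly 16.7%. Lower values indicate that synthesized speech is transcribed back more accurately.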

Through reinforcement learning-based fine-tuning, the model improves overall naturalness and expressive layering. On the performance side, its tokenizer frame rate has been halved and first-packet latency reduced by 35%, enabling faster responses and smoother experiences in real-time interaction scenarios.
Fun-AudioGen-VD, meanwhile, allows users to generate not only voices but complete auditory scenes based on natural language descriptions, integrating character and environment into a unified output.
The model supports detailed control over:
- Basic attributes: gender, age, accent, pitch, speech rate
- Timbre qualities: husky, bright, deep, magnetic
- Emotions: anger, sadness, joy, determination
- Role simulation: customer service agent, veteran, child, AI assistant, broadcaster
- Complex psychological states: nuanced expressions such as “calm on the surface but trembling inside”

Beyond voice generation, Fun-AudioGen-VD can create immersive sound environments, including layered background noise (urban streets, cafés, battlefields), spatial reverb effects (cathedrals, metal cells, underwater acoustics), device-style audio filters (vintage radio, walkie-talkie, breathing masks), and dynamic environmental interactions such as fluctuating wind noise or shifting echoes.
Together, the two models signal Alibaba’s continued push into controllable, high-fidelity speech and audio generation, expanding the boundaries of AI-driven voice interaction and immersive media creation.
Source: IT Home

