Alibaba has launched Wan2.2-S2V, an open-source speech-to-video model designed to generate lifelike animated avatars from portrait images and audio clips. Positioned as part of the Wan2.2 video generation series, the model enables professional creators to produce film-quality digital humans capable of speaking, singing, and performing across multiple formats.
Wan2.2-S2V offers flexible framing options, including portrait, bust, and full-body views, while dynamically integrating character movements and environmental elements based on text prompts. By combining text-guided global motion control with audio-driven local movements, the system delivers natural and expressive animations that extend beyond conventional talking-head content.
A key technical advance lies in its frame processing method, which compresses historical frames into compact latent representations. This approach reduces computational demands and improves stability for long-video generation, a long-standing challenge in animated content production. Output resolutions of 480p and 720p further broaden its applicability across social media and professional use cases.
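To make the idea concrete, here is a minimal illustrative sketch (not Wan2.2-S2V's actual implementation) of why compressing past frames into latents helps long-video generation: a fixed-size cache of small latents keeps memory constant no matter how long the video grows. The `encode_to_latent` mean-pooling function is a hypothetical stand-in for a learned encoder, and all sizes are invented for illustration.

```python
# Illustrative sketch only: a bounded cache of compact latents standing in
# for full-resolution frame history during long-video generation.
from collections import deque
import numpy as np

LATENT_SIZE = 16      # hypothetical latent spatial size
CACHE_FRAMES = 8      # hypothetical history window

def encode_to_latent(frame: np.ndarray) -> np.ndarray:
    """Stand-in for a learned encoder: mean-pool an HxWx3 frame
    down to a LATENT_SIZE x LATENT_SIZE x 3 latent."""
    h, w, c = frame.shape
    bh, bw = h // LATENT_SIZE, w // LATENT_SIZE
    cropped = frame[: bh * LATENT_SIZE, : bw * LATENT_SIZE]
    return cropped.reshape(LATENT_SIZE, bh, LATENT_SIZE, bw, c).mean(axis=(1, 3))

# Old latents are evicted automatically, so memory use is bounded.
history = deque(maxlen=CACHE_FRAMES)

for t in range(100):                    # simulate 100 generated frames
    frame = np.random.rand(256, 256, 3)
    history.append(encode_to_latent(frame))

print(len(history))        # stays at CACHE_FRAMES, not 100
print(history[0].shape)    # (16, 16, 3): far smaller than a 256x256x3 frame
```

The point of the sketch is the asymptotics: conditioning on compact latents instead of raw frames keeps the per-step cost flat as the video lengthens, which is the stability and efficiency benefit the article describes.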
To support high-quality performance, the research team built a large-scale audio-visual dataset aligned with film and television production needs. Using multi-resolution training, Wan2.2-S2V adapts across vertical short-form content and traditional horizontal formats, making it suitable for diverse creative workflows.
The model is now available on Hugging Face, GitHub, and Alibaba’s ModelScope platform.

