Wan 2.6
Alibaba's Wan 2.6 generates audio-driven video up to 15 seconds long, with video reference support. It is one of the few models that accepts both audio input and a video reference, giving you close creative control over longer clips.
What is Wan 2.6?
Wan 2.6 is Alibaba's latest video generation model in the Wan series. It combines two input types that are rarely available together: audio input (which drives the video generation) and video reference input (which guides the visual style and motion character). Add up to 15 seconds of clip duration and you have a highly capable tool for long-form, reference-driven audio-video production.
The audio-driven generation means the video responds to the audio you provide — rhythm, energy, texture. This is fundamentally different from models where audio is a toggle for AI-generated sound alongside video. In Wan 2.6, you bring the audio and the model produces video that matches it.
The video reference input adds visual guidance on top of the audio drive. Supply a reference clip to establish the visual language the output should follow, then supply audio to define its rhythm and character. Combined with a text prompt, this gives you three layers of creative direction over the 15-second output.
Audio-driven video
Audio input shapes the output
Video reference
Guide visual style and motion
Up to 15 seconds
Extended long-duration clips
Tri-input control
Audio + video ref + text
How to generate video with Wan 2.6 on project.video
Open the composer
Go to your project.video dashboard. Wan 2.6 is available under Alibaba models in the model selector.
Select Wan 2.6
Choose Wan 2.6. The composer will show you the audio input slot, video reference slot, and text prompt field.
Upload audio and/or video reference
Upload an audio file to drive the generation, a video reference clip to guide the visual style and motion, or both. Both inputs are optional, but each one improves results.
Set duration and generate
Choose a duration of up to 15 seconds, pick an aspect ratio, and write a prompt for visual direction. Generate, then view your output in the gallery.
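If you plan generations in a script or spreadsheet before opening the composer, the settings from the steps above can be sketched as a small data structure. This is only an illustrative sketch: `GenerationSettings` and `validate` are hypothetical names, not part of any project.video or Wan 2.6 API, and the 1-second minimum duration is an assumption. Only the 15-second maximum and the optional audio and video reference inputs come from this page.

```python
from dataclasses import dataclass
from typing import Optional

MAX_DURATION_SECONDS = 15  # upper limit stated for Wan 2.6 on this page
MIN_DURATION_SECONDS = 1   # assumption: the page does not state a minimum


@dataclass
class GenerationSettings:
    """Hypothetical container mirroring the composer's inputs (not a real API)."""
    prompt: str
    duration_seconds: int
    aspect_ratio: str = "16:9"
    audio_path: Optional[str] = None            # optional audio that drives the video
    video_reference_path: Optional[str] = None  # optional style and motion reference


def validate(settings: GenerationSettings) -> list[str]:
    """Return a list of problems; an empty list means the settings look usable."""
    problems = []
    if not settings.prompt.strip():
        problems.append("prompt is empty")
    if not MIN_DURATION_SECONDS <= settings.duration_seconds <= MAX_DURATION_SECONDS:
        problems.append(
            f"duration must be between {MIN_DURATION_SECONDS} "
            f"and {MAX_DURATION_SECONDS} seconds"
        )
    return problems
```

Calling `validate` on a settings object before you upload anything catches an empty prompt or a duration over the 15-second limit early. Audio and video reference stay optional, matching the composer.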
Technical specs
Best use cases
Music video content
Upload a track and a reference clip that matches the artistic direction you want. Wan 2.6 generates video driven by the music's energy and styled according to the reference — an audio-first music video workflow.
Long-form branded content with audio
At 15 seconds, Wan 2.6 can produce complete brand storytelling clips where the audio drives the energy of the piece. Supply brand audio (jingle, voiceover) and let the video respond.
Replicating a visual style from reference
Provide a video reference that captures the visual language you want (cinematography, color, motion) and audio that sets the rhythm. The model generates new content in that established style, set to that audio.
Audio-visual content for streaming
Podcast clips, music previews, and audio-led social content all benefit from an audio-first generation approach. Wan 2.6's 15-second duration covers most social audio clip lengths.
Example prompts
Pair these with audio and optional video reference uploads in the project.video composer.
"Abstract fluid color forms morph and pulse in response to the uploaded electronic music track, deep navy and electric blue palette, smooth organic motion, 16:9"
"Brand film: a product travels from raw material to finished form, cinematic color grading matching reference video, paced to the uploaded audio track's rhythm, 16:9"
"Landscape panoramic sequence driven by uploaded ambient music — clouds, water, and light move responsively to the audio's ebb and flow, golden hour palette, 16:9"