Every video diffusion model released in the last year has followed the same playbook: train bigger, throw more VRAM at inference, charge accordingly. Alibaba's Tongyi Lab just published something structurally different. Wan2.2 is the first open-source video model built on Mixture-of-Experts, and the architecture choice isn't academic — it changes what the model is good at, what hardware it demands, and who can actually run it.

Two Experts, One Denoising Pass

The MoE implementation here is narrower than what you'd find in a language model like Mixtral. Instead of routing tokens across eight or sixteen experts, Wan2.2 uses exactly two, each with roughly 14B parameters, for a combined 27B — but only 14B active at any given step.

The split follows the diffusion timestep. A high-noise expert handles early denoising — global composition, camera placement, scene layout. A low-noise expert takes over for the later steps — sharpening faces, stabilizing textures, getting fabric folds right. One brain blocks out the scene, the other finishes it.
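Mechanically, this kind of routing is just a threshold on how much noise remains at the current step. A minimal sketch of the idea — the boundary value here is an illustrative assumption, not Wan2.2's actual cutoff:

```python
def select_expert(timestep: int, total_steps: int, boundary: float = 0.875) -> str:
    """Route a denoising step to one of two experts by noise level.

    Diffusion steps count down from high noise to low noise; `boundary`
    is a hypothetical cutoff for illustration, not the repo's value.
    """
    # Fraction of noise remaining at this step (1.0 = pure noise).
    noise_level = timestep / total_steps
    return "high_noise_expert" if noise_level >= boundary else "low_noise_expert"

# Early steps (global composition, camera placement) hit the high-noise expert:
print(select_expert(49, 50))   # high_noise_expert
# Later steps (textures, faces) hit the low-noise expert:
print(select_expert(10, 50))   # low_noise_expert
```

Because the switch depends only on the timestep, there's no learned router and no per-token dispatch overhead — the "routing" is a single comparison per denoising step.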

This matters because video diffusion has a known tension: the skills needed to plan a coherent 5-second sequence aren't the same skills needed to render convincing skin pores at 720p. Dense models compromise between these competing demands at every layer. The two-expert split lets each half overfit to its phase of the job.

In practice, Wan2.2 produces noticeably better temporal coherence on complex motion — dance sequences, parkour, figure skating — than its predecessor. Training data grew by 66% for images and 83% for video compared to Wan2.1, but the architectural change does real work on top of that data scaling. Wan-Bench 2.0 evaluations show the A14B variants scoring above several commercial models on motion coherence specifically, even where they trail on texture quality or prompt adherence.

The Hardware Reality

Here's where the open-source story gets complicated. The A14B MoE models — the good ones, the ones that justify the architecture — need 80GB+ VRAM. That means an H100 or A100. Not your gaming rig.

| Model | Total / Active Params | Min VRAM | Resolution | ~Time (5s clip) |
|-----------|-----------|--------|--------------|----------------|
| T2V-A14B | 27B / 14B | 80 GB | 480p / 720p | ~6 min (A100) |
| I2V-A14B | 27B / 14B | 80 GB | 720p | ~7 min (A100) |
| TI2V-5B | 5B / 5B | 24 GB | 720p | ~9 min (4090) |

The TI2V-5B variant is the consumer-hardware escape hatch. A dense 5B model — no MoE — with a high-compression VAE (16×16×4 ratio) that squeezes 720p at 24fps out of a single RTX 4090. Nine minutes for five seconds isn't blazing, but it's usable for prototyping and iteration loops.
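Back-of-envelope math shows why that compression ratio is the whole trick. The frame size and clip length below are illustrative assumptions (1280×720, 121 frames), not repo constants:

```python
# Latent-grid arithmetic for a 16x16 spatial / 4x temporal compression VAE.
# Frame size and clip length are illustrative assumptions.
width, height = 1280, 720          # 720p frame
fps, seconds = 24, 5
frames = fps * seconds + 1         # 121: many video VAEs keep the first frame whole

lat_w = width // 16                # 80
lat_h = height // 16               # 45
lat_t = (frames - 1) // 4 + 1      # 31 latent frames

pixels = width * height * frames
latents = lat_w * lat_h * lat_t
print(lat_w, lat_h, lat_t)         # 80 45 31
print(round(pixels / latents))     # 999 -- roughly 1000x fewer positions to denoise
```

That ~1000× reduction in positions is what lets a dense 5B model fit 720p video generation inside 24 GB of VRAM.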

Memory optimization flags help if you're pushing boundaries: --offload_model True --convert_model_dtype --t5_cpu will wrestle the bigger models into tighter VRAM budgets, though generation time balloons.
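A sketch of what that looks like on the command line, following the general shape of the repo's generate.py invocation — the checkpoint directory and prompt are placeholders, and exact flag names should be checked against the current README:

```shell
# Single-GPU A14B run with the memory-saving flags:
#   --offload_model True   offloads model components to CPU between steps
#   --convert_model_dtype  converts weights to the config's compute dtype
#   --t5_cpu               keeps the T5 text encoder on CPU
# Checkpoint path and prompt are placeholders.
python generate.py --task t2v-A14B --size 1280*720 \
  --ckpt_dir ./Wan2.2-T2V-A14B \
  --offload_model True --convert_model_dtype --t5_cpu \
  --prompt "A figure skater lands a clean triple axel at golden hour"
```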

Where It Wins, Where It Doesn't

Wan2.2 does not dethrone Runway Gen-4.5 for raw visual fidelity. It doesn't match Kling 2.6's speed for social media clips. And it still produces occasional artifacts on static scenes — handheld-style camera drift and slow-motion warping that you'd never see in Veo 3.1's output.

What it does offer:

No subscription. Weights ship under Apache 2.0. Run them forever on your own hardware or a rented instance. For studios cranking out hundreds of clips per month, the cost math shifts fast — a 7-minute A100 render at $2.50/hour costs roughly 30 cents per clip.
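That per-clip figure is straightforward arithmetic; the $2.50/hour A100 rate and the monthly volume are illustrative assumptions, and spot prices vary by provider:

```python
# Per-clip cost of rented-GPU rendering.
# $2.50/hr for an A100 is an illustrative rate; spot prices vary.
gpu_rate_per_hour = 2.50
render_minutes = 7                 # I2V-A14B, 5-second 720p clip

cost_per_clip = gpu_rate_per_hour * render_minutes / 60
print(f"${cost_per_clip:.2f} per clip")            # $0.29 per clip

clips_per_month = 300              # hypothetical studio volume
print(f"${cost_per_clip * clips_per_month:.0f}/month")
```

At that volume the monthly bill sits well under what most commercial video-generation subscriptions charge for equivalent output.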

Full pipeline coverage. Text-to-video, image-to-video, speech-to-video (through the S2V-14B variant with CosyVoice integration), and a dedicated character animation model. Runway gates equivalent features behind separate pricing tiers. Here it's all one repo.

Motion coherence on hard subjects. The MoE split genuinely helps here. Bodies in motion — twisting, jumping, spinning — hold together better than any open-weight alternative I've tested. The high-noise expert plans plausible trajectories, and the low-noise expert doesn't smear the details.

ComfyUI as a first-class citizen. If your pipeline already runs through ComfyUI, Wan2.2 plugs in as a native node. Prompt extension supports either Dashscope's API or a self-hosted Qwen2.5-7B for fully local operation. Multi-GPU scaling uses FSDP with DeepSpeed Ulysses sequence parallelism — real distributed inference, not a hack.

Where it falls short is predictable: cinematic lighting lacks Runway's post-processing polish, text rendering in generated video remains broken (an industry-wide problem, but still), and photorealistic human faces held in frame for more than two seconds will occasionally melt.

Don't Ignore the 5B Model

TI2V-5B doesn't have the MoE architecture. It's a dense model. But it's one of a very small number of checkpoints that generate 720p at 24fps on a single consumer GPU, and it unifies text-to-video and image-to-video inference in one set of weights.

The practical workflow: iterate concepts on TI2V-5B locally, lock down your prompts and compositions, then render final cuts on a rented A100 with the full A14B. At current cloud pricing, each final render lands around 30 cents. That math beats every subscription at volume.

What Two Experts Teach Us

The lasting contribution might not be this specific model. It's proof that Mixture-of-Experts works for video diffusion at all. Language models showed that MoE scales capacity without proportionally scaling compute — the same principle now applies to temporal generation.

Alibaba open-sourcing the architecture means the community can extend it. The two-expert, timestep-based split is simple enough to invite experimentation: four experts for different motion regimes, or temporal specialists for different clip lengths. Expect fine-tuned variants and LoRA adapters within weeks.

For now, the A14B is a serious production tool if you have the hardware. TI2V-5B is a gift to everyone who doesn't.