A video model that ingests nine reference images, three video clips, and three audio tracks in a single generation pass just hit number one on Artificial Analysis. Seedance 2.0 posted an Elo of 1,269 in text-to-video, beating Veo 3, Runway Gen-4.5, and whatever Sora 2 would have scored if OpenAI had submitted it. But the interesting part isn't the leaderboard — it's where ByteDance chose to put it.
The Input System That Actually Matters
Most video generators accept a text prompt. Some accept an image too. Seedance 2.0 treats generation like a production brief. You feed it reference material across three channels, each processed through a separate extraction pathway:
| Input Type | Capacity | What Gets Extracted |
|---|---|---|
| Reference images | Up to 9 | Composition, color palette, subject appearance, style |
| Video clips | Up to 3 | Motion patterns, camera work, timing |
| Audio tracks | Up to 3 | Rhythm, pacing, tonal qualities |
The model fuses these references with your text prompt before generation begins. So instead of writing "cinematic slow-mo shot of a dancer in warm golden light with piano music," you hand it a photo of the dancer, a clip showing the camera movement you want, and the piano track — then describe what's different.
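To make the shape of that brief concrete, here is a minimal sketch of what a multi-reference request could look like. The structure and field names are illustrative assumptions, not ByteDance's actual schema:

```python
# Hypothetical multi-reference generation brief. Field names are
# illustrative assumptions, not ByteDance's actual request schema.
brief = {
    "prompt": "Slow push-in on the dancer; keep the golden-hour grade from the stills.",
    "reference_images": [       # up to 9: composition, palette, subject, style
        "dancer_closeup.jpg",
        "warm_grade_still.jpg",
    ],
    "reference_videos": [       # up to 3: motion patterns, camera work, timing
        "dolly_zoom_example.mp4",
    ],
    "reference_audio": [        # up to 3: rhythm, pacing, tonal qualities
        "piano_track.mp3",
    ],
}
```

The point of the structure: the text prompt only carries the delta, while the references carry everything that's easier to show than describe.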
This matters because text is a terrible interface for describing motion. Explaining a specific dolly zoom in words is harder than showing the model one. ByteDance figured this out and built the whole input system around it.
The audio-video synchronization deserves its own mention. This isn't "generate video then layer on sound." The model synthesizes Foley effects, music, and voice in the same forward pass, with frame-level precision. Impact events trigger audio at the exact visual frame — glass breaking sounds when the glass breaks, not 200ms later. Material-specific sound design (metal, fabric, wood) comes built in. Single-subject lip sync works well enough for dialogue scenes, though multi-person conversations still drift.
Shipped in CapCut, Not a Developer Portal
Here's the strategic move worth paying attention to: ByteDance didn't announce API access first. They pushed Seedance 2.0 into Dreamina and CapCut starting March 24, initially in Indonesia and Brazil, then globally as a limited-time perk included at no extra cost for CapCut's paid users.
The first people using the top-ranked video model are TikTok creators and small production teams — not AI researchers running benchmarks. Runway charges $12/month minimum and targets professional post-production. ByteDance handed its model to the people already editing short-form content, inside the tool they already have open.
What It Costs Through Third-Party APIs
ByteDance's official API still doesn't exist publicly — IP disputes are reportedly delaying it. But third-party access is already live. fal.ai runs six endpoints:
- Text-to-video: $0.30/second at 720p (standard) or $0.24/second (fast)
- Image-to-video: same pricing tiers
- Reference-to-video: $0.30/second, with a 0.6× rate multiplier when video inputs are included
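For a sense of what calling this looks like in practice, here is a sketch using fal.ai's Python client. The `fal_client.subscribe` call is real, but the endpoint ID and argument names are assumptions modeled on fal.ai's published Seedance 1.x endpoints; check fal.ai's docs for the actual 2.0 identifiers:

```python
import fal_client  # pip install fal-client

# Endpoint ID and argument names below are assumptions modeled on
# fal.ai's Seedance 1.x endpoints, not confirmed for 2.0.
result = fal_client.subscribe(
    "fal-ai/bytedance/seedance/v2/text-to-video",  # hypothetical endpoint ID
    arguments={
        "prompt": "Cinematic slow-mo shot of a dancer in warm golden light",
        "resolution": "720p",  # $0.30/second standard, $0.24/second fast
        "duration": 10,        # seconds; 15 is the per-generation ceiling
    },
)
print(result["video"]["url"])  # assumed response shape
```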
Maximum output is 15 seconds per generation at up to 1080p. The model can produce multi-shot sequences with natural cuts and transitions in a single pass — no manual stitching required.
For context, Kling 3.0 charges $0.075/second and delivers 4K at 60fps through a globally available official API. That's roughly a quarter of fal.ai's Seedance pricing at higher resolution. If you're doing volume work, that math matters.
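The per-clip arithmetic is worth spelling out. A quick check using the rates quoted above:

```python
SEEDANCE_PER_SEC = 0.30  # fal.ai, 720p standard tier, USD
KLING_PER_SEC = 0.075    # Kling 3.0 official API, USD
CLIP_SECONDS = 15        # Seedance 2.0's per-generation ceiling

print(f"Seedance via fal.ai: ${SEEDANCE_PER_SEC * CLIP_SECONDS:.2f} per 15s clip")  # $4.50
print(f"Kling 3.0 official:  ${KLING_PER_SEC * CLIP_SECONDS:.3f} per 15s clip")     # $1.125
print(f"Kling's rate: {KLING_PER_SEC / SEEDANCE_PER_SEC:.0%} of Seedance's")        # 25%
```

At a hundred clips a month, that gap is the difference between $450 and about $113.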
What Breaks
ByteDance's own documentation acknowledges the weak spots. Fast-motion scenes lose detail stability — quick pans and action sequences produce artifacts that wouldn't survive a client review. Multi-person lip sync is inconsistent beyond a single speaking character.
The 15-second ceiling is also a constraint. Both Runway and Kling support longer generations. For anything beyond a single shot, you're chaining clips together and managing cross-clip consistency yourself.
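The usual workaround is the image-to-video handoff: seed each new generation with the final frame of the previous clip. A minimal sketch, assuming the clips are available locally (this is a generic chaining technique, not a documented Seedance feature):

```python
import cv2  # pip install opencv-python

def last_frame(video_path: str, out_path: str) -> str:
    """Extract a clip's final frame to seed the next image-to-video generation."""
    cap = cv2.VideoCapture(video_path)
    frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_count - 1)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"could not read final frame of {video_path}")
    cv2.imwrite(out_path, frame)
    return out_path

# Seed the next shot with where the last one ended.
seed_image = last_frame("clip_01.mp4", "clip_01_last.png")
```

This pins composition and lighting at the cut, but it doesn't stop identity drift from accumulating across several handoffs.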
And about that Elo score: Artificial Analysis measures it through blind user voting without model identification. The sample size is still growing — rankings from the first two weeks of a model's inclusion tend to fluctuate as more votes accumulate. Treat it as a strong signal, not a settled verdict.
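For calibration, assuming Artificial Analysis uses the standard Elo formulation, a rating gap converts directly into an expected head-to-head vote share:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score: probability A beats B in a pairwise vote."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Seedance 2.0 (1,269) vs. Veo 3 (1,226): a 43-point gap implies
# only about a 56/44 split in blind pairwise preferences.
print(f"{expected_score(1269, 1226):.2f}")  # ~0.56
```

A 56/44 split is a real lead, but it's close enough that early-sample noise can move the ranking.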
Runway and Kling Have Different Advantages
The leaderboard lead doesn't automatically translate to a production lead. Runway Gen-4.5 has motion brushes, scene consistency tools, and an integrated editing ecosystem that Seedance can't match yet. Professional agencies paying for Runway are buying the workflow, not just generation quality.
Kling 3.0 offers native 4K at 60fps with a stable, globally accessible API at dramatically lower cost. For teams grinding out social content at scale, Kling wins on economics alone.
Google's Veo 3 still arguably handles native audio-video sync better despite a lower Elo of 1,226, and it comes with full Vertex AI enterprise support — compliance, SLAs, the whole package.
What Seedance 2.0 does that nothing else does is the reference-based composition workflow. Nine images, three videos, three audio tracks — all processed simultaneously. If you have specific visual and sonic references and want the model to interpret them rather than invent from a text description, this input stack is unprecedented.
The Distribution Play
ByteDance is betting that the people generating the most video content in the world — short-form creators on TikTok and CapCut — will adopt AI generation if you embed it inside tools they already use daily. They're not trying to sell API minutes to post-production houses. They're trying to make AI video feel as routine as applying a CapCut filter.
Whether Seedance 2.0 holds the top Elo spot by May matters less than whether CapCut's 800 million users start treating it like a feature rather than a novelty.