
Cosmos 3 thinks before it renders

At GTC Taipei, NVIDIA shipped Cosmos 3: a world model that reasons about a scene before it generates one. It folds the generate-and-render core of a synthetic-data pipeline into a single model. The validation half, the part that decides whether the data trains a production model or quietly breaks it, did not move.

datadoo research

Jun 17, 2026 · 6 min read

On June 1, in Taipei, NVIDIA released Cosmos 3. Most of the coverage led with the same word we did last month: open. The weights, the code, the datasets, even the benchmarks now ship under OpenMDW 1.1, the Linux Foundation license NVIDIA adopted days earlier. That is a real shift, and we will come back to it. But the license is not why Cosmos 3 matters. What matters is what the model does differently. It looks at a scene and works out what is happening in it before it generates a single frame. For synthetic data, that reordering matters more than another few billion parameters ever would.

Reason first, render second

Most video models generate and hope. You give them a prompt, they paint something plausible, and whether it obeys physics is whatever happens to fall out of the training data. Cosmos 3 splits the job in two. A reasoning tower reads the scene first. It is an autoregressive vision-language model, built on Qwen3-VL weights, and its only job is to understand: which objects are moving, where their paths might cross, what state probably comes next. Then a diffusion tower paints the frames and the actions, conditioned on what the first tower concluded.

The information only flows one way. The generator leans on the reasoner and cannot talk back. So the part of the model producing your pixels has already been handed a structured read on the scene instead of inferring it mid-render. NVIDIA calls the reasoning tower the model's brain, which is marketing, but the ordering is real, and it is part of why the outputs hold together better than a single-stage model's do.
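To make the ordering concrete, here is a minimal sketch of one-way conditioning. It is not Cosmos 3 code and does not reflect NVIDIA's API: `SceneRead`, `reason`, `generate`, and `reason_then_render` are hypothetical names, and both stages are toy stand-ins. The only thing it illustrates is the shape of the data flow: the first stage finishes and hands over an immutable summary, and the second stage can read it but not change it.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SceneRead:
    """Hypothetical structured output of the reasoning stage (immutable)."""
    moving_objects: Tuple[str, ...]
    predicted_next_state: str


def reason(scene_description: str) -> SceneRead:
    """Toy stand-in for the reasoning tower: parse a comma-separated object list."""
    objects = tuple(p.strip() for p in scene_description.split(",") if p.strip())
    if len(objects) > 1:
        next_state = "paths may cross: " + " / ".join(objects)
    else:
        next_state = "no interaction expected"
    return SceneRead(moving_objects=objects, predicted_next_state=next_state)


def generate(read: SceneRead, num_frames: int) -> List[str]:
    """Toy stand-in for the diffusion tower: every frame is conditioned on the read."""
    subject = ", ".join(read.moving_objects)
    return [
        f"frame {i}: {subject} | {read.predicted_next_state}"
        for i in range(num_frames)
    ]


def reason_then_render(scene_description: str, num_frames: int) -> List[str]:
    read = reason(scene_description)   # stage 1 runs to completion first
    return generate(read, num_frames)  # stage 2 consumes the read, never edits it


if __name__ == "__main__":
    for frame in reason_then_render("car, cyclist", 3):
        print(frame)
```

The frozen dataclass is the design choice doing the work here: any attempt by the generator to write back into the scene read raises an error instead of silently changing what was concluded.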

One model where you used to need a pipeline

The same architecture runs in three directions. Give it a state and an action, it predicts the outcome. Give it the before and after, it infers the action between them. Or ask for an action cold, and it hands you a policy. The outputs are not only video, either. Cosmos 3 emits synchronized audio and raw numerical action data (joint angles, gripper positions, trajectory points) across embodiments that range from a single robot arm to a humanoid to a vehicle.
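The three directions are easier to see as three function signatures over the same state and action types. The sketch below uses a deliberately trivial 1-D world (state is a position, action is a displacement) so that each mode is one line of arithmetic. `ToyWorldModel` and its method names are hypothetical, and the `goal` argument to the policy is our assumption for illustration; none of this is the Cosmos 3 interface.

```python
from dataclasses import dataclass


@dataclass
class ToyWorldModel:
    """Toy 1-D stand-in: state is a position, action is a displacement."""
    max_step: float = 1.0  # largest displacement the policy may propose

    def predict_outcome(self, state: float, action: float) -> float:
        """Forward dynamics: (state, action) -> next state."""
        return state + action

    def infer_action(self, before: float, after: float) -> float:
        """Inverse dynamics: (before, after) -> the action connecting them."""
        return after - before

    def propose_action(self, state: float, goal: float) -> float:
        """Policy: state -> action, here a step toward a goal, clipped to max_step."""
        delta = goal - state
        return max(-self.max_step, min(self.max_step, delta))


if __name__ == "__main__":
    model = ToyWorldModel(max_step=1.0)
    nxt = model.predict_outcome(state=2.0, action=0.5)
    print("predicted next state:", nxt)
    print("inferred action:", model.infer_action(before=2.0, after=nxt))
    print("proposed action:", model.propose_action(state=0.0, goal=3.0))
```

In this toy the forward and inverse modes are exact inverses of each other, which is a convenient sanity check: inferring the action from a predicted outcome should return the action you started with. A learned model only approximates that round trip, which is one place drift can be measured.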

It ships in three sizes. Nano, at 16B, runs on a workstation GPU. Super, at 64B, is built for the datacenter. Edge is smaller and coming later, for on-device inference. Alongside the model, NVIDIA released six synthetic-data datasets and HUE, a human-evaluation benchmark (with an automated-judge variant) that grades generated video on physics, geometry, and visual integrity. That last one is worth watching. Last month we argued the new question to ask a vendor was "show me your evaluation traces." NVIDIA just shipped a way to make them.

What this collapses

If you have built a synthetic data pipeline for physical AI, you know its shape. Curate real footage. Generate scenes, usually with a procedural engine and a heap of domain randomization. Render. Label. Score. Throw away what does not survive. Each stage was its own tool, often its own vendor, held together with glue that nobody enjoyed maintaining.

Cosmos 3 collapses the generate-and-render core into a single model. Reasoning, generation, and action that used to be separate stages now happen in one place, and a lot of that stitching goes away. Need a thousand variations of a near-collision your fleet has seen exactly twice? You no longer stage it on a track, and you no longer hand-build the scene in an engine. You describe it, the model reasons it into being, and the audio and the trajectories come attached. The horizontal plumbing that used to pass for a moat is now a download.

What it does not collapse

Here is the part the launch posts skipped. NVIDIA's own model card lists what the model gets wrong: temporal inconsistency, unstable motion, imprecise physical interactions, audio that drifts out of sync with the video, and what it calls action-state drift, the gap widening over longer sequences. Then it adds the line every physical-AI engineer should pin to the wall:

"Applications involving robotics control, autonomous systems, scientific simulation, or safety-critical planning require additional validation, external constraints, system-level safety analysis, and domain-specific guardrails before deployment." (NVIDIA Cosmos 3 model card)

One of the most capable open generators you can download today ships with its maker telling you, in the documentation, that the model alone is not enough. A frame that looks right is not a frame you can certify. And a clip being plausible says nothing about whether it will make your perception model better or quietly worse. That distance is still crossed by hand, still specific to your domain, and still where projects live or die.

Where the work goes now

So the generation half of the pipeline got dramatically cheaper. The validation half barely moved. HUE grades whether a clip looks physically right; it does not tell you whether that clip matches the road your car will actually drive, or keep the records a safety auditor will ask for. That work is still yours. Picking which of a million clips are true for your domain. Proving the synthetic distribution matches the one your model meets in production. Keeping lineage clean enough that a regulator, or your own safety team, can trace a deployed behavior back to the frames that taught it. None of it comes in the box.
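Two of those chores can be sketched in a few lines, with heavy caveats. The first function below compares one scalar feature (say, ego speed per clip) between production and synthetic data using the population stability index, a common drift score computed over shared histogram bins. The second builds a content-addressed lineage record so a clip can be traced back to its prompt and source footage. Every name here is hypothetical (`population_stability_index`, `lineage_record`, the field names, the generator string), the bin edges and any pass/fail threshold are yours to choose, and a real validation suite would check many features, not one.

```python
import hashlib
import json
import math
from typing import Dict, List, Sequence


def histogram(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Fraction of values per bin; out-of-range values fall into the end bins."""
    if not values:
        raise ValueError("need at least one value")
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = 0
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    return [c / len(values) for c in counts]


def population_stability_index(
    production: Sequence[float],
    synthetic: Sequence[float],
    edges: Sequence[float],
    eps: float = 1e-6,
) -> float:
    """PSI = sum((p - q) * ln(p / q)) over bins; 0.0 means identical histograms."""
    p = histogram(production, edges)
    q = histogram(synthetic, edges)
    psi = 0.0
    for pi, qi in zip(p, q):
        pi, qi = max(pi, eps), max(qi, eps)  # avoid log(0) on empty bins
        psi += (pi - qi) * math.log(pi / qi)
    return psi


def lineage_record(
    clip_bytes: bytes, prompt: str, generator: str, parent_ids: Sequence[str]
) -> Dict[str, object]:
    """Hypothetical lineage entry: hashes the clip, then hashes the whole record."""
    record: Dict[str, object] = {
        "clip_sha256": hashlib.sha256(clip_bytes).hexdigest(),
        "prompt": prompt,
        "generator": generator,          # assumed free-form model/version label
        "parents": sorted(parent_ids),   # IDs of the real footage it derives from
    }
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    record["record_id"] = hashlib.sha256(payload).hexdigest()
    return record


if __name__ == "__main__":
    edges = [0, 10, 20, 30]
    production_speeds = [5, 5, 15, 15, 25, 25]
    synthetic_speeds = [25, 25, 25, 25, 25, 25]
    print(population_stability_index(production_speeds, synthetic_speeds, edges))
    print(lineage_record(b"clip", "near-collision at dusk", "example-generator", ["real_0001"]))
```

A low score on one feature is not proof that the distributions match; it only rules out one way they could differ. The lineage record is similarly minimal: because the ID is derived from the content, any change to the clip, the prompt, or the parents produces a different ID, which is the property an auditor needs.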

We have been building on Cosmos and Omniverse for a while now, and Cosmos 3 makes our generation step faster and cheaper, which we will happily take. It changes nothing about the work that decides whether a physical-AI project ships. NVIDIA opened the stack and handed everyone the same imperfect starting point. What you do with the imperfection is the business.
