Robots Are Shipping. Training Data Is Not.
Humanoid robots are being deployed, $6B flowed into physical AI in Q1 2026, and the bottleneck has shifted from hardware to training data. Physics-accurate synthetic data is now the binding constraint.

Unitree shipped 5,500 humanoid robots in 2025. Agility has the first commercial Robotics-as-a-Service deployment. Figure 03 runs continuous unsupervised operation. Accenture, Vodafone, and SAP have humanoids working in a live warehouse in Duisburg, Germany. The hardware is real, and it is shipping.
The bottleneck has moved. The constraint on scaling physical AI is no longer motors, actuators, or compute. It is training data. Specifically: training data that is physically accurate enough to transfer from simulation to the real world.
The numbers tell the story
In Q1 2026 alone, 27 physical AI startups raised over $6 billion. Physical Intelligence closed a round at an $11 billion valuation, nearly doubling in four months. Eclipse VC raised a $1.3 billion fund dedicated exclusively to physical AI. Jeff Bezos' Project Prometheus is reportedly approaching $10 billion in funding at a $38 billion valuation.
The capital is not speculative. MarketsandMarkets projects the physical AI market will grow from $1.5 billion in 2026 to $15.2 billion by 2032, a 47% compound annual growth rate. Warehouse robotics is the primary driver, followed by autonomous vehicles and industrial automation.
Every one of these companies needs the same thing: a way to train perception and control models that work in the real world, not just in simulation. And every one of them is hitting the same wall.
The sim-to-real gap is the binding constraint
A robot trained in simulation learns to see and act in a simulated world. When it encounters the real world, the differences accumulate: lighting is different, surfaces reflect differently, objects deform differently, sensor noise patterns do not match. This is the sim-to-real gap, and it is the single most important unsolved problem in physical AI deployment.
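The standard mitigation is domain randomization: deliberately varying the simulation parameters that most often diverge from reality so a model cannot overfit to any single rendering of the world. A minimal sketch in Python, where `sim` and its setter methods are hypothetical stand-ins for whatever simulator API you actually use:

```python
import random

# Domain-randomization sketch: resample the parameters that most often differ
# between simulation and deployment (lighting, surface properties, sensor noise,
# contact friction) before each training episode. The `sim` object and its
# setters are hypothetical placeholders, not a real simulator API.
def randomize_episode(sim, rng=random.Random(0)):
    sim.set_light_intensity(rng.uniform(200, 2000))           # lux
    sim.set_light_color_temperature(rng.uniform(2700, 6500))  # kelvin
    sim.set_surface_roughness(rng.uniform(0.05, 0.9))         # microfacet roughness
    sim.set_depth_noise_std(rng.uniform(0.001, 0.01))         # meters
    sim.set_friction_coefficient(rng.uniform(0.3, 1.2))       # dimensionless
    return sim
```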
The industry knows this. NVIDIA's GTC 2026 keynote centered on Cosmos Predict 2.5, Cosmos Transfer 2.5, and Cosmos Reason 2, all designed to make simulation more physically faithful. Their new Physical AI Data Factory Blueprint, released in March 2026, is an open architecture specifically for massive-scale synthetic data generation, reinforcement learning, and evaluation. Jensen Huang's framing was unambiguous: every industrial company will become a robotics company, and every robotics company needs a data factory.
At ATEC2026, a new competition billed as a "Turing Test for embodied AI" launched, testing whether robots can transfer simulated skills to the real world across locomotion, target detection, grasping, and placement in a single continuous run. The fact that this benchmark exists tells you where the field's attention has shifted.
Hardware is outpacing data infrastructure
Consider the state of humanoid deployment. China controls roughly 90% of the global humanoid market. Unitree is the volume leader with confirmed commercial sales. AGIBOT, backed by CATL, is scaling production. In the West, Figure AI's Figure 03 runs unsupervised via its Helix 02 model, and 1X has opened consumer pre-orders for NEO at $20,000.
Meanwhile, ABB sold its robotics division to SoftBank. Audi and BMW are piloting humanoids on production lines. Agility's Digit is the first humanoid deployed under a commercial RaaS model. The Accenture/Vodafone/SAP warehouse pilot trains its humanoids in digital twins built on NVIDIA Omniverse, using imitation learning and reinforcement learning.
All of this hardware needs training data. Not generic image data. Not internet-scraped video. Physics-accurate synthetic data that models the exact lighting, materials, sensor characteristics, and dynamics of the deployment environment. The gap between what the hardware can do and what training pipelines can supply is widening, not narrowing.
Why generic synthetic data does not close the gap
The rise of world foundation models (Cosmos, GAIA, and others) has made it easier to generate visually plausible simulation environments. But visually plausible and physically accurate are not the same thing.
A world model can generate a photorealistic video of a robot picking up a box. But does the box deform correctly under the gripper's pressure? Does the surface reflect light according to its actual BRDF? Does the depth sensor return noise patterns that match the real Intel RealSense or Orbbec camera the robot uses? These details determine whether the model trained on that data will work when deployed.
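To make the depth-sensor point concrete: for a stereo camera, depth error grows roughly with the square of distance, since sigma_z ≈ z² · sigma_d / (f · b), where f is the focal length in pixels, b the stereo baseline, and sigma_d the disparity noise. A sketch of that noise model follows; the default numbers are placeholders to be calibrated against your own camera, not the specifications of any particular RealSense or Orbbec unit:

```python
import numpy as np

# Illustrative stereo-depth noise model: sigma_z = z^2 * sigma_d / (f * b).
# focal_px, baseline_m, and disparity_noise_px are placeholder values, not the
# specs of a specific camera; fit them to residuals logged from your hardware.
def noisy_depth(depth_m, focal_px=640.0, baseline_m=0.05,
                disparity_noise_px=0.1, rng=np.random.default_rng(0)):
    depth_m = np.asarray(depth_m, dtype=float)
    sigma_z = depth_m ** 2 * disparity_noise_px / (focal_px * baseline_m)
    return depth_m + rng.normal(0.0, sigma_z)
```

At these placeholder values a point 0.5 m away gets about 0.8 mm of noise while a point 3 m away gets about 28 mm, which is the kind of distance-dependent structure a flat Gaussian blur over the depth image would miss.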
This is where physics-based rendering separates from generative approximation. A diffusion model learns statistical patterns of what scenes look like. A physics engine computes what scenes are. The training data that closes the sim-to-real gap comes from the engine, not the model.
The data factory is the new moat
NVIDIA's Physical AI Data Factory Blueprint signals an industry-wide recognition: the companies that win in physical AI will be the ones that build the best data infrastructure. Not the ones with the most impressive demo videos. Not the ones with the largest language models. The ones that can generate training data at scale, with the right physics, for the right deployment environment, fast enough to keep up with hardware iteration cycles.
The blueprint covers the full loop: synthetic data generation, domain randomization, model training, evaluation in simulation, and deployment to the edge. Every stage depends on the fidelity of the training data. If the data is wrong, every downstream stage inherits the error.
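One way to picture that loop is as a single function wiring the stages together. This is a schematic sketch only: the stage functions are supplied by the caller, and none of the names below are the Blueprint's actual API.

```python
from typing import Any, Callable, Sequence

# Schematic of the generate -> randomize -> train -> evaluate -> deploy loop.
# Each stage is injected by the caller; this only expresses the wiring and the
# deployment gate, not any particular framework's interface.
def data_factory_iteration(
    generate: Callable[[int], Sequence[Any]],    # physics-based synthetic scenes
    randomize: Callable[[Any], Any],             # domain randomization per scene
    train: Callable[[Any, Sequence[Any]], Any],  # imitation / RL update
    evaluate: Callable[[Any], float],            # success rate on held-out sim scenarios
    policy: Any,
    n_scenes: int = 10_000,
    deploy_threshold: float = 0.95,
) -> tuple[Any, float, bool]:
    scenes = [randomize(s) for s in generate(n_scenes)]
    policy = train(policy, scenes)
    success_rate = evaluate(policy)
    ready_for_edge = success_rate >= deploy_threshold
    return policy, success_rate, ready_for_edge
```

The point of writing it down this way is the one made above: if `generate` produces physically wrong data, `train` and `evaluate` cannot detect it, and the error only surfaces after deployment.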
Sony's Project Ace, which produced the first autonomous system competitive with elite human table tennis players, is a case in point. The system did not reach expert-level performance by collecting millions of hours of real gameplay. It got there through simulation with physically accurate ball dynamics, racket physics, and spin models. The physics was the training signal.
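To make "the physics was the training signal" concrete: a table tennis simulator has to integrate gravity, quadratic drag, and the Magnus force produced by spin, because those terms are what make a topspin shot dip and a backspin shot float. A minimal flight model follows, using rough textbook coefficients rather than anything from Sony's system:

```python
import numpy as np

# Minimal ball-flight model: gravity + quadratic drag + Magnus lift.
# Coefficients are approximate, generic values for a 40 mm table tennis ball,
# not parameters from Project Ace.
RHO_AIR = 1.2          # air density, kg/m^3
BALL_MASS = 0.0027     # kg
BALL_RADIUS = 0.02     # m
AREA = np.pi * BALL_RADIUS ** 2
C_DRAG = 0.4           # drag coefficient (approximate)
C_MAGNUS = 0.6         # Magnus coefficient (approximate)
GRAVITY = np.array([0.0, 0.0, -9.81])

def step(pos, vel, spin, dt=1e-3):
    """One Euler step; pos/vel in meters and m/s, spin is angular velocity in rad/s."""
    speed = np.linalg.norm(vel)
    drag_accel = -0.5 * RHO_AIR * C_DRAG * AREA * speed * vel / BALL_MASS
    magnus_accel = 0.5 * RHO_AIR * C_MAGNUS * AREA * BALL_RADIUS * np.cross(spin, vel) / BALL_MASS
    accel = GRAVITY + drag_accel + magnus_accel
    return pos + vel * dt, vel + accel * dt
```

Get the Magnus coefficient or the spin estimate wrong and every predicted trajectory drifts, which is exactly the kind of systematic error a policy trained on that data carries into the real rally.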
What this means for teams building physical AI
If you are building robots, autonomous vehicles, or industrial automation systems, the data question is no longer optional. It is the primary engineering challenge.
Your simulation environment needs to match the deployment domain in physics, not just appearance.
Your sensor models need to reproduce the noise characteristics of your actual hardware; one simple statistical check for this is sketched below.
Your training data needs to cover the long tail of edge cases that real-world collection cannot reach.
Your iteration cycle needs to be hours, not months. Hardware teams are shipping quarterly. Data pipelines need to keep pace.
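As one concrete version of the sensor-model check above, a two-sample Kolmogorov-Smirnov test comparing simulated depth residuals against residuals logged from the real camera gives a quick divergence signal. The file names and threshold here are placeholders:

```python
import numpy as np
from scipy.stats import ks_2samp

# Compare the error distribution of the simulated depth sensor against errors
# logged from the physical camera on a calibration rig. Residual = measured
# depth minus ground-truth depth, in meters. File paths are placeholders.
real_residuals = np.load("real_depth_residuals.npy")
sim_residuals = np.load("sim_depth_residuals.npy")

stat, p_value = ks_2samp(real_residuals, sim_residuals)
print(f"KS statistic: {stat:.4f}, p-value: {p_value:.4g}")
if stat > 0.1:  # threshold is arbitrary; set it from your own transfer tolerance
    print("Simulated sensor noise diverges from hardware; recalibrate the sensor model.")
```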
The companies that figure this out first will deploy physical AI at scale. The ones that do not will have impressive hardware sitting in labs, waiting for data that is good enough to ship.
The robot race is no longer about who can build the best hardware. It is about who can generate the best training data.