The EU AI Act's Article 10 Is an Argument for Synthetic Data
Article 10 of the EU AI Act demands training data that is representative, complete, and free of errors. Real-world data rarely meets that bar. Synthetic data does.

The regulation nobody is reading carefully enough
On August 2, 2026, the EU AI Act enters full enforcement for high-risk AI systems. That includes autonomous vehicles. Medical imaging. Surgical robotics. Industrial safety systems. Any AI that touches human safety or fundamental rights.
Most of the compliance conversation has focused on documentation requirements, conformity assessments, and risk management systems. These matter. But the provision that should be keeping ML teams awake at night is Article 10, the data governance requirement, because it imposes a standard on training data that most real-world datasets cannot meet.
The irony is that Article 10 is not an obstacle for companies using synthetic data. It is, almost line by line, a description of what synthetic data pipelines already do. If you are building computer vision models for regulated applications, Article 10 does not make your life harder. It makes the case for you.
What Article 10 actually says
Article 10(3) is the operative paragraph. It requires that training, validation, and testing datasets for high-risk AI systems are:
Relevant, sufficiently representative, and to the best extent possible, free of errors and complete in view of the intended purpose.
Four requirements. Relevant. Representative. Error-free. Complete. Each of these is a technical claim about the dataset itself, not about the model's output. The regulation evaluates your training data on its own terms, before your model has seen a single sample.
Article 10(2) specifies the data governance practices required to produce such datasets. Among them:
An assessment of the availability, quantity, and suitability of the data sets that are needed.
An examination of possible biases that are likely to affect the health and safety of persons.
The identification of relevant data gaps or shortcomings that prevent compliance with this Regulation, and how those gaps and shortcomings can be addressed.
That last clause is critical. The regulation does not simply ask whether your data has gaps. It asks what you did about them. A dataset with known gaps and no remediation plan is non-compliant by design.
Article 10(4) adds a contextual dimension: datasets must account for the characteristics or elements that are particular to the specific geographical, contextual, behavioural or functional setting within which the high-risk AI system is intended to be used. A self-driving system trained in Californian daylight and deployed in Nordic winter fails this requirement on its face.
Why real data fails Article 10
Consider the four requirements against any real-world visual dataset.
Representative. Real data is collected where you happen to collect it. If your autonomous vehicle data comes predominantly from sunny, suburban California, your dataset is not representative of the conditions in which the vehicle will operate. Rare scenarios (a pedestrian stepping off a curb at night, heavy snow on a construction zone, a wheelchair user at an intersection) are underrepresented not because you chose to exclude them but because they are statistically rare. You cannot collect what does not happen in front of your cameras. Article 10 does not care why your data is unrepresentative. It asks whether it is.
Complete. Real data has structural holes. You cannot photograph a surgical complication that has not yet occurred. You cannot capture a robotic arm collision that safety protocols prevent. You cannot drive a car into a pedestrian to collect training data for the exact scenario your model must handle. The long tail of dangerous, rare, or ethically uncollectable events is precisely the distribution region where safety-critical AI must perform, and precisely the region where real data is thinnest.
Error-free. Real data requires human annotation. Bounding boxes are drawn by hand. Segmentation masks are painted pixel by pixel. Annotation error rates of five to ten percent are considered normal in production labeling pipelines. For safety-critical applications, normal annotation noise is a compliance liability. Article 10(3) asks for data that is to the best extent possible, free of errors. If there exists a feasible alternative with lower error rates, your current process is not best extent possible.
Relevant. Real data is collected for operational purposes and repurposed for training. The original capture conditions (camera angles, lighting, resolution, sensor configuration) may not match the deployment context. Article 10(2)(b) explicitly requires documenting the origin of data and, in the case of personal data, the original purpose of the data collection. Repurposed operational data carries this provenance burden.
Synthetic data, by construction, does not have these problems. You define the distribution. You control the coverage. Labels are generated by the rendering engine, not by a human annotator. The error rate on a synthetic label is zero: every pixel is accounted for because the scene is fully known. You can generate a pedestrian stepping off a curb at night in Helsinki in December because you build the scene and press render.
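To make that concrete, here is a minimal sketch of how ground truth falls out of a fully specified scene. The types and the helper are hypothetical, not any particular rendering engine's API; the point is only that the label is derived from the scene description rather than annotated after the fact:

```python
from dataclasses import dataclass

@dataclass
class SceneObject:
    # An object placed in the scene at a known image-space box (pixels).
    category: str
    x: int
    y: int
    w: int
    h: int

@dataclass
class SceneSpec:
    # Every generation parameter is declared up front, so labels can be
    # derived from the spec itself instead of drawn by a human annotator.
    location: str
    time_of_day: str
    weather: str
    objects: list

def labels_from_spec(spec: SceneSpec) -> list:
    # Because the scene is fully known, every object's class and box are
    # exact by construction: the annotation error rate is zero.
    return [{"category": o.category, "bbox": (o.x, o.y, o.w, o.h)}
            for o in spec.objects]

# The scenario real data cannot reliably supply: a pedestrian stepping
# off a curb at night, in snow, in Helsinki in December.
night_helsinki = SceneSpec(
    location="Helsinki", time_of_day="night", weather="snow",
    objects=[SceneObject("pedestrian", 412, 230, 48, 120)],
)
print(labels_from_spec(night_helsinki))
```

The same spec that drives the renderer is the label source, which is why the error rate on a synthetic label can be zero in a way no human labeling pipeline can match.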
The regulatory inversion
Article 10(5) contains a provision that makes the regulatory logic explicit. Before a provider of a high-risk AI system may process special categories of personal data (biometric data, health data, data revealing racial or ethnic origin) for bias detection and correction, the regulation requires that:
The bias detection and correction cannot be effectively fulfilled by processing other data, including synthetic or anonymised data.
Synthetic data is named in the text of the law. It is positioned as the preferred alternative, the option that must be exhausted before sensitive real data can be touched. The regulation does not treat synthetic data as a fallback. It treats it as the first resort.
This is a structural inversion from how most AI teams think about synthetic data. The common framing is: use real data as the default, supplement with synthetic when you cannot collect enough real data. Article 10 inverts this for bias work: use synthetic data as the default, escalate to real data only when synthetic provably cannot solve the problem.
The logic extends beyond Article 10(5). If synthetic data can produce datasets that are more representative, more complete, and more accurately labeled than real data (and for computer vision, in many documented cases it can), then Article 10(3)'s best extent possible standard points toward synthetic, not away from it.
Who is affected
The EU AI Act designates high-risk AI systems through the use cases listed in Annex III and through Article 6(1), which captures AI that serves as a safety component of products already regulated at EU level. Three categories map directly to computer vision applications.
Autonomous vehicles. AI systems used as safety components in the management and operation of road traffic, including emergency braking, lane keeping, and collision avoidance, are high-risk. Every autonomous vehicle company deploying in the EU must demonstrate Article 10 compliance for its perception models.
Medical imaging. AI systems that constitute or are embedded in regulated medical devices (Class IIa and above) are automatically classified as high-risk. A model that reads a CT scan, detects a tumor in an X-ray, or segments a surgical field is subject to Article 10 in full.
Industrial robotics. Robots are high-risk when their AI qualifies under one of the Act's high-risk pathways. A surgical robot's perception system falls under the medical device pathway. An industrial safety robot's collision avoidance falls under critical infrastructure. The classification follows the function, not the form.
For these verticals, Article 10 is not optional. It is a condition of market access.
What this means practically
The penalty structure removes any ambiguity about enforcement intent. Violations of high-risk AI system obligations, including Article 10, carry fines of up to fifteen million euros or three percent of global annual turnover, whichever is higher. For startups and SMEs, the calculation is inverted: whichever is lower. The regulation is calibrated to be survivable for small companies and punitive for large ones.
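The cap arithmetic is simple enough to write down. This illustrative helper is my own sketch, not an official formula, but it encodes the two rules the text describes:

```python
def fine_cap_eur(annual_turnover_eur: float, is_sme: bool) -> float:
    # High-risk obligation violations: up to EUR 15 million or 3% of
    # global annual turnover, whichever is HIGHER. For startups and
    # SMEs the comparison flips: whichever is LOWER.
    fixed_cap = 15_000_000
    turnover_cap = 0.03 * annual_turnover_eur
    if is_sme:
        return min(fixed_cap, turnover_cap)
    return max(fixed_cap, turnover_cap)

# A manufacturer with EUR 10B turnover faces a cap of EUR 300M;
# an SME with EUR 20M turnover is capped at EUR 600,000.
print(fine_cap_eur(10_000_000_000, is_sme=False))
print(fine_cap_eur(20_000_000, is_sme=True))
```

The asymmetry is the point: the same violation that is survivable for a small company scales without a hard ceiling for a large one.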
But fines are not the primary risk. The primary risk is market access. A high-risk AI system that cannot demonstrate compliant data governance cannot be placed on the EU market. For an autonomous vehicle manufacturer or a medical device company, losing access to 450 million consumers is not a fine. It is an existential event.
The practical response is not to abandon real data. It is to close the gaps that real data leaves open, and to do so with a method whose provenance, coverage, and label accuracy can be documented and audited. Synthetic data pipelines produce datasets with known distributions, zero annotation error, full label coverage, and complete traceability from scene specification to rendered frame. Every parameter (lighting, weather, sensor noise, object placement) is recorded. Every label is deterministic.
This is what Article 10(2)(h) asks for: identify your data gaps, and show how you addressed them. A synthetic data pipeline is, in regulatory terms, a documented remediation plan for the structural shortcomings of real-world data collection.
The compliance case is the technical case
The argument for synthetic data in regulated computer vision was always technical: better coverage, cleaner labels, cheaper iteration, privacy by design. Article 10 does not change the technical argument. It gives the technical argument legal force.
For companies building perception AI in autonomous vehicles, medical imaging, or robotics, the question is no longer whether synthetic data is good enough. The regulation has reframed the question: is your real data good enough? And if it is not, if it has gaps, biases, annotation noise, or coverage holes, what are you doing about it?
August 2 is less than four months away. The answer had better be documented.


