Apr 23, 2026
Exclusive: Physical AI is coming for the physical world, and it's bigger than robots


Everyone is debating what AI knows. But the more urgent question, the one that actually keeps robotics researchers up at night, is where physical AI learns to move.
by Kasun Illankoon, Editor in Chief at Tech Revolt
The data debate in artificial intelligence has become a familiar one. Model scrapes web. Web owners object. Lawyers get involved. Congress holds a hearing. Repeat. But that entire conversation sits firmly in the territory of language: text, code, images scraped from the open internet, digitised books, Reddit threads, academic papers.
That debate, as loud and consequential as it is, misses something fundamental.
There is another category of AI coming into the world that does not learn from the internet at all. It learns from your street. Your warehouse floor. Your living room. It learns by watching bodies move through physical space, in real time, in conditions that cannot be replicated with a crawl of web pages.
This is physical AI, and the data problem sitting at its core is almost entirely invisible in mainstream conversation.
To understand why this matters, it helps to understand what physical AI actually requires to function.
Large language models, the GPT, Claude, and Gemini family of systems, are trained on text. The data is abundant, relatively cheap to process, and already exists on the internet at planetary scale. The annotation challenges are real but manageable: you can hire workers anywhere in the world with a browser and a reasonable grasp of language.
Robots are different. A robotic arm learning to pick and place objects in a warehouse does not get smarter by reading about warehouses. It needs to see, in precise sensor-captured detail, what it looks like when a human hand reaches for a box, how fingers curl, how weight shifts, what the moment of contact actually looks like in three-dimensional space.
Hood Khizer, CEO and Founder of Trouve Labs, the R&D engine behind AHOY Technology, frames it plainly: "Physical-world data is fundamentally different. It's generated by sensors in homes, streets, warehouses and workplaces, not scraped text. That creates distinct challenges around provenance, ownership, and visibility that don't exist for web corpora."
Provenance. Ownership. Visibility. Three words that, in the context of scraped text, trigger policy arguments. In the context of sensor data captured inside private and semi-private spaces, they trigger something closer to alarm.
Physical AI data is collected through cameras, LiDAR, radar, GPS, inertial sensors, and an expanding constellation of IoT devices embedded in the built environment. It is captured in warehouses and logistics hubs. It is captured on public roads. It is captured, increasingly, inside homes by the wave of domestic robots now entering consumer markets.
The collection is often continuous, high-resolution, and geometrically rich. A single autonomous vehicle can generate between one and forty terabytes of data per hour, depending on its sensor configuration. A humanoid robot learning to fold laundry in a domestic environment is capturing detailed spatial data about your home's layout, your movement patterns, and your daily routine.
Unlike text pulled from a website, this data did not exist in any prior form. It is created specifically through the act of observation. And the question of who owns it, who has rights over it, and who is informed about its collection is, at best, inconsistently answered.
Khizer's language is measured, but the implication is significant. The frameworks we have for thinking about data collection, privacy, and consent were largely built for digital data. Physical-world sensing operates differently, at a different scale, and with a different intimacy.
Beyond collection, there is the question of annotation.
For language data, labelling is difficult but tractable. You can show a human annotator two sentences and ask which one is more coherent, more helpful, more factually accurate. The cognitive load is manageable. The task scales.
For physical AI, particularly for the class of problems involving human and machine movement, labelling becomes a genuinely hard technical problem. The subdiscipline here is kinematics: the study of how bodies move through space, the mathematics of joints, the geometry of reach and rotation. Forward kinematics calculates where a limb ends up given the angles of each joint. Inverse kinematics works backwards: given a target position, what joint angles produce it?
Khizer names this directly as one of the hardest challenges in the field. "Labelling is one of the toughest. For tasks like forward and inverse kinematics the annotation requirements are subtle and specialized; unlike language data, high-quality labeled examples are scarce. That makes supervised collection and sample-efficient learning much more difficult."
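To make the kinematics concrete, here is a minimal sketch of forward and inverse kinematics for a two-joint planar arm. The link lengths and angles are illustrative, and this is a textbook simplification, not anything from Trouve Labs' systems; real robots involve many more joints, three dimensions, and multiple valid solutions per target.

```python
import math

L1, L2 = 1.0, 0.8  # illustrative link lengths (metres)

def forward(theta1, theta2):
    """Forward kinematics: joint angles -> end-effector position."""
    x = L1 * math.cos(theta1) + L2 * math.cos(theta1 + theta2)
    y = L1 * math.sin(theta1) + L2 * math.sin(theta1 + theta2)
    return x, y

def inverse(x, y):
    """Inverse kinematics: target position -> one valid pair of joint
    angles (the 'elbow-down' solution; a mirrored one also exists)."""
    cos_t2 = (x * x + y * y - L1 * L1 - L2 * L2) / (2 * L1 * L2)
    if abs(cos_t2) > 1:
        raise ValueError("target out of reach")
    theta2 = math.acos(cos_t2)
    theta1 = math.atan2(y, x) - math.atan2(
        L2 * math.sin(theta2), L1 + L2 * math.cos(theta2))
    return theta1, theta2

# Round trip: angles recovered by inverse() reproduce the target.
x, y = forward(0.5, 0.7)
t1, t2 = inverse(x, y)
```

Even in this toy form, the annotation problem is visible: a labelled example is not a sentence a crowd worker can judge, but a set of joint angles paired with a measured position in space, which is exactly the kind of data that must be captured by sensors rather than scraped.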
Scarcity of high-quality labelled examples has a downstream consequence that reaches beyond research labs. When labelled data is rare and expensive to produce, it creates economic pressure toward shortcuts: cheaper annotation pipelines, less rigorous quality control, or the use of synthetic data that may not accurately reflect the complexity of real physical environments.
It also creates pressure toward gig-economy labour models in which physically demanding motion capture and annotation work is distributed to workers who may have limited understanding of how their movements are being used or what systems they are ultimately training.
One dimension of this conversation that rarely surfaces in Western technology coverage is the sovereignty dimension: the question of which nations, which institutions, and which companies control the physical-world data that trains the robots of the next decade.
Trouve Labs, where Khizer has built an R&D team focused on real-world intelligence systems, operates explicitly in this space. The company's work spans geo-temporal intelligence, computer vision, and the fusion of telecom, IoT, and probe data for traffic and mobility intelligence. The governments, enterprises, and critical infrastructure operators that depend on this work are not abstract users. They are entities for which the provenance and custody of physical-world data is a strategic question, not merely a product consideration.
"We study these problems directly in the context of physical-AI research," Khizer says, "because the data constraints change the research and engineering tradeoffs compared with cloud or LLM work."
That sentence, modest as it reads, points toward something important. The engineering decisions made in physical AI, around what data to collect, how to label it, how to generalise from it, are not purely technical decisions. They are decisions with consequences for privacy, for labour, for national security, and for the distribution of power between the entities that control sensor infrastructure and those who live within its range.
Trouve Labs is expanding its research agenda into large AI models, deep tech, and biomedical AI, with a stated ambition of moving beyond system optimisation toward direct improvement of human lives. That is a trajectory that requires physical-world data at scales that do not yet exist.
The field, broadly, is moving in the same direction. The next generation of surgical robots, elder-care assistants, construction automation, and last-mile logistics systems will all require dense, high-quality, real-world physical data. The collection infrastructure for that data is being built now, in many cases faster than the regulatory and ethical frameworks designed to govern it.
The internet data debate taught us something useful: the rules written in a technology's early years tend to calcify and persist long before anyone fully understands what they actually permit. The physical AI data debate, if it is not had now, while the infrastructure is still taking shape, risks producing a second generation of such rules: frameworks designed before anyone knew what they were governing.