
The AI Architectures Actually Powering Robots

World models, VLAs, and simulation platforms are enabling robots to learn from the internet. A clear guide to the AI architectures powering the next generation of physical intelligence.


The same AI revolution that produced ChatGPT is now being adapted for physical robots. But the architectures are different. Language models predict the next word. Robotics AI needs to predict what happens when you push a cup, understand a spoken command, and output the precise motor actions to pick it up, all in real time.

This is a reference guide to the foundation models powering that transition: world models that simulate physics, vision-language-action models that connect perception to motion, and the simulation platforms that make training possible. (For context on the humanoid robot companies using these models, see The Humanoid Robot Race.)

The Mental Model: Two Types of Robot Brains

Think of robotics AI as having two complementary approaches:

World Models — AI that learns to simulate reality. Given a scene, it predicts what will happen next: physics, object permanence, cause-and-effect. World models create synthetic training environments where robots can practice millions of scenarios safely before touching the real world.

Vision-Language-Action (VLA) Models — AI that connects seeing, understanding language, and outputting robot actions in a single model. Instead of separate systems for perception, reasoning, and control, VLAs do it all end-to-end.

Most advanced robot systems use both: world models generate training data, VLAs control the robot. Understanding this split helps you navigate the landscape.
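
A toy sketch of that division of labor, under heavily simplified assumptions: the "world model" here is a hand-written one-dimensional physics step rather than a learned model, and the "policy" is a proportional controller standing in for a VLA. Every name (`toy_world_model`, `toy_policy`, `generate_rollout`) is hypothetical and only illustrates how a policy can gather experience inside a simulated world.

```python
from dataclasses import dataclass


@dataclass
class State:
    position: float  # gripper position along one axis (arbitrary units)
    target: float    # where the object sits on that axis


def toy_world_model(state: State, action: float) -> State:
    """Predict the next state. A real world model is learned; this one is hand-coded."""
    return State(position=state.position + action, target=state.target)


def toy_policy(state: State, gain: float = 0.5) -> float:
    """Choose an action. Stands in for a VLA; here it is a proportional controller."""
    return gain * (state.target - state.position)


def generate_rollout(start: State, steps: int) -> list[tuple[State, float]]:
    """Roll the policy forward inside the world model, logging (state, action) pairs."""
    trajectory = []
    state = start
    for _ in range(steps):
        action = toy_policy(state)
        trajectory.append((state, action))
        state = toy_world_model(state, action)
    return trajectory


# Synthetic experience: no physical robot is involved at any point.
rollout = generate_rollout(State(position=0.0, target=1.0), steps=5)
final_state, final_action = rollout[-1]
```

Each logged pair could serve as one synthetic training example; in a real pipeline the states would be images or video and the actions would be motor commands.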


World Models

World models predict what happens next in a physical environment. They don't control robots directly—they simulate the world so robots can learn from synthetic experience.

NVIDIA Cosmos

What it is: NVIDIA's platform of world foundation models, launched at CES 2025. Purpose-built for physical AI development.

Why it matters: Cosmos lets developers generate massive amounts of photoreal, physics-based synthetic data without collecting real-world footage. Collecting that footage physically can be expensive and dangerous: imagine crashing 10,000 cars to train an autonomous vehicle. Cosmos simulates those crashes instead.

Key capabilities:

  • Cosmos Predict 2.5 — Generates up to 30-second videos with multi-camera support. Can create driving scenarios, warehouse environments, factory floors.
  • Cosmos Transfer 2.5 — Adds weather, lighting, and terrain variations to existing simulations. Turn a sunny road into a snowy one.
  • Cosmos Reason — A multimodal model that lets robots "reason" about what steps to take next, using physics understanding and common sense.

Who's using it: 1X, Agility Robotics, Figure AI, Fourier, XPENG, Uber. Most major humanoid companies have adopted Cosmos for training data generation.

Availability: Open model license. Available on NVIDIA NGC, Hugging Face, and the NVIDIA API catalog.

Source: NVIDIA Cosmos


Google DeepMind Genie 3

What it is: A general-purpose world model that generates interactive 3D environments from text prompts. Named one of TIME's Best Inventions of 2025.

Why it matters: Genie 3 is the first real-time interactive world model—it generates environments at 24 fps that you (or a robot) can navigate. Unlike video generation, these are persistent worlds with consistent physics.

Key capabilities:

  • Object permanence — Changes you make (moving objects, opening doors) persist over time. This sounds basic, but most generative models "forget" what they've generated.
  • Consistent physics — Objects fall, collide, and interact realistically without explicit physics programming.
  • Trained on 200,000 hours of internet video — Vast diversity of environments and scenarios.

Robotics application: Google describes Genie 3 as a "vast space to train agents like robots and autonomous systems." They've tested it with their SIMA agent performing tasks like "approach the bright green trash compactor" in generated warehouse environments.

Connection to AGI: DeepMind views world models as key stepping stones toward AGI, since they enable unlimited training scenarios for embodied agents.

Availability: Research preview, not yet publicly available.

Source: Google DeepMind Genie 3


JEPA and AMI Labs (Yann LeCun)

What it is: Joint-Embedding Predictive Architecture (JEPA) is Yann LeCun's alternative to autoregressive models like GPT. Instead of predicting pixels or tokens, JEPA predicts abstract representations of the world.

Why it matters: LeCun argues that language models fundamentally can't understand reality—they just predict text patterns. JEPA is designed to build actual world understanding: physics, causality, common sense. If he's right, this architecture could be foundational for robots that truly reason about their environment.

The approach:

  • I-JEPA — Learns by comparing abstract representations of images, not pixels
  • V-JEPA — Extends to video, predicting what happens next in abstract representation space
  • VL-JEPA — Adds language understanding (released late 2025)
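
To make the "predict representations, not pixels" idea concrete, here is a minimal sketch that computes a prediction error in embedding space and, for contrast, in pixel space. It is an illustration under strong assumptions, not Meta's implementation: the encoder is a fixed random linear projection, the predictor is an identity placeholder, and the training machinery that real JEPA models rely on (learned encoders, safeguards against representation collapse) is omitted. All function names are hypothetical.

```python
import random


def encode(pixels, weights):
    """Toy linear encoder: project a pixel vector to a smaller embedding."""
    return [sum(w * p for w, p in zip(row, pixels)) for row in weights]


def predict(context_embedding):
    """Placeholder predictor: assumes the scene barely changes between frames."""
    return list(context_embedding)


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
N_PIXELS, N_EMBED = 16, 4
weights = [[rng.uniform(-1, 1) for _ in range(N_PIXELS)] for _ in range(N_EMBED)]

context_frame = [rng.random() for _ in range(N_PIXELS)]
# The "future" frame: the same scene plus small per-pixel noise.
future_frame = [p + rng.gauss(0, 0.01) for p in context_frame]

# Pixel-space objective: error is measured over every pixel.
pixel_loss = mse(context_frame, future_frame)

# JEPA-style objective: encode both frames, predict the future embedding
# from the context embedding, and measure error in embedding space.
target_embedding = encode(future_frame, weights)
predicted_embedding = predict(encode(context_frame, weights))
embedding_loss = mse(predicted_embedding, target_embedding)
```

The point of the sketch is only where the loss lives: the second objective compares 4 abstract numbers rather than 16 raw pixel values.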

2025 development: LeCun launched AMI Labs (Advanced Machine Intelligence), a startup reportedly seeking $500M+ at a $3.5B valuation. The company is focused on world model AI as an alternative to LLMs—specifically addressing the hallucination problem by grounding AI in physical reality.

Current status: Research-stage. Meta has released I-JEPA and V-JEPA as open source, but these aren't yet integrated into commercial robotics systems.

Source: Meta AI V-JEPA, TechCrunch on AMI Labs


Vision-Language-Action Models

VLA models are the "brains" that actually control robots. They take in camera images and language commands, and output motor actions—all in a single end-to-end model.

NVIDIA GR00T N1

What it is: An open foundation model for humanoid robots, which NVIDIA bills as the world's first. Announced at GTC in March 2025.

Why it matters: GR00T N1 is designed to be the "GPT for humanoids"—a pre-trained model that robot companies can fine-tune for their specific hardware and tasks, rather than training from scratch.

Architecture: Dual-system design inspired by human cognition:

  • System 2 — A vision-language module that interprets the environment and understands instructions (slow, deliberate thinking)
  • System 1 — A diffusion transformer that generates smooth motor actions in real-time (fast, reflexive actions)

Both systems are trained end-to-end together.

Training scale: Up to 1,024 GPUs, roughly 50,000 H100 GPU hours for the 2B parameter model. Trained on real robot data, human videos, and synthetic data from simulations.

Synthetic data impact: NVIDIA generated 780,000 synthetic trajectories (equivalent to 9 months of human demonstrations) in just 11 hours. Combining synthetic and real data improved performance by 40% compared with using real data alone.
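
A quick back-of-the-envelope check puts those figures in perspective. The only inputs are the numbers quoted above; treating "9 months" as nine 30-day months of round-the-clock demonstration is an assumption made here for the arithmetic.

```python
trajectories = 780_000
generation_hours = 11
human_months = 9
HOURS_PER_MONTH = 30 * 24  # assumption: 30-day months, continuous demonstration

per_hour = trajectories / generation_hours
per_second = per_hour / 3600
human_hours = human_months * HOURS_PER_MONTH
speedup = human_hours / generation_hours

print(f"{per_hour:,.0f} trajectories/hour, {per_second:.1f}/second")
# prints "70,909 trajectories/hour, 19.7/second"
print(f"{human_hours:,} human hours vs {generation_hours} -> about {speedup:.0f}x faster")
# prints "6,480 human hours vs 11 -> about 589x faster"
```

Under that assumption, simulation produced demonstration-equivalent data several hundred times faster than human teleoperation could.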

Who's using it: 1X, Agility Robotics, Boston Dynamics, Mentee Robotics, NEURA Robotics. At GTC, Jensen Huang demonstrated 1X's Neo robot performing household tasks using a policy built on GR00T N1.

Availability: Open model on Hugging Face. Developers can fine-tune for their specific robot hardware.

Source: NVIDIA GR00T N1, Hugging Face Model


Google RT-2

What it is: Robotics Transformer 2, Google DeepMind's vision-language-action model. The foundational work that established the VLA paradigm in 2023.

Why it matters: RT-2 proved that you could take a vision-language model trained on internet data and adapt it to output robot actions—transferring web knowledge to physical control. This was a conceptual breakthrough.

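How it works: Robot actions are represented as text tokens. Each dimension of an action (such as a change in arm position) is discretized into one of 256 bins, so a complete action becomes a short string of integers that the model can emit like any other text. The model is co-trained on internet vision-language tasks (like image captioning) and robot demonstration data. This lets it transfer concepts from the web to robot control.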
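
The action-as-text idea can be sketched in a few lines. The RT-2 paper describes discretizing each action dimension into 256 bins; everything else below is an assumption for illustration, including the [-1, 1] range, the 4-dimensional action, and the helper names.

```python
def discretize(value, low, high, bins=256):
    """Map a continuous value in [low, high] to an integer bin in [0, bins - 1]."""
    clipped = min(max(value, low), high)
    fraction = (clipped - low) / (high - low)
    return min(int(fraction * bins), bins - 1)


def undiscretize(token, low, high, bins=256):
    """Map a bin index back to the center of its interval."""
    return low + (token + 0.5) * (high - low) / bins


def action_to_text(action, low=-1.0, high=1.0):
    """Encode a continuous action vector as a space-separated string of bin indices."""
    return " ".join(str(discretize(v, low, high)) for v in action)


def text_to_action(text, low=-1.0, high=1.0):
    """Decode the string a model emits back into approximate continuous values."""
    return [undiscretize(int(t), low, high) for t in text.split()]


action = [0.3, -0.1, 0.0, 1.0]  # hypothetical 4-D command, not RT-2's real action space
tokens = action_to_text(action)  # "166 115 128 255"
recovered = text_to_action(tokens)
```

Decoding recovers each value to within half a bin width (about 0.004 for this range), which is the precision traded away so that actions fit a language model's vocabulary.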

Key innovation: Chain-of-thought reasoning for robots. You can ask RT-2 "pick up something I could use as a hammer" and it reasons: rocks are hard, hammers hit things, therefore pick up the rock. This emergent reasoning wasn't explicitly programmed.

Performance: On tasks the robot had never seen, RT-2 achieved 62% success vs. 32% for the previous RT-1 model—nearly doubling generalization capability.

Training data: Collected with 13 robots over 17 months in an office kitchen environment, plus internet-scale vision-language data.

Current status: Research model. Google has continued developing this line with newer architectures, but RT-2 remains the seminal VLA paper.

Source: Google DeepMind RT-2


Figure Helix

What it is: Figure AI's proprietary VLA model, powering their Figure 03 humanoid robot. First VLA to output high-rate continuous control of an entire humanoid upper body.

Why it matters: Helix represents the commercial state-of-the-art. It controls perception, movement, and reasoning on-board and in real-time—no cloud connection required. Figure ended its collaboration with OpenAI in 2025, stating that LLMs are "getting smarter yet more commoditized." They're betting on their own AI stack.

Key capabilities:

  • Full upper-body control — Including wrists, torso, head, and individual fingers at high frequency
  • Multi-robot coordination — First VLA to operate two robots simultaneously on shared tasks
  • 3-gram force sensitivity — Fingertip sensors detect the weight of a paperclip, enabling secure grip detection before slipping occurs

Dual-system architecture:

  • System 2 — High-level planning at 7-9 Hz
  • System 1 — Low-level motor control at 200 Hz
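
The two rates imply that the fast controller runs many times for each update from the planner. The loop below is a simplified sketch of that timing relationship, not Figure's implementation: it runs both systems in a single synchronous loop, picks 8 Hz as an assumed midpoint of the 7-9 Hz range, and uses hypothetical placeholder functions for both systems.

```python
SYSTEM1_HZ = 200  # low-level motor control rate quoted above
SYSTEM2_HZ = 8    # assumption: midpoint of the quoted 7-9 Hz range


def slow_planner(tick):
    """Hypothetical stand-in for System 2: returns a high-level goal."""
    return tick / SYSTEM1_HZ  # placeholder: the time in seconds of the latest plan


def fast_controller(goal, tick):
    """Hypothetical stand-in for System 1: turns the latest goal into a motor command."""
    return goal  # placeholder: a real controller would also read sensors


def run(seconds=1.0):
    """Simulate the loop and count how often each system runs."""
    ticks_per_plan = SYSTEM1_HZ // SYSTEM2_HZ  # 25 fast ticks per plan update
    goal = None
    plan_updates = 0
    commands = []
    for tick in range(int(seconds * SYSTEM1_HZ)):
        if tick % ticks_per_plan == 0:
            goal = slow_planner(tick)  # the slow system refreshes the goal
            plan_updates += 1
        commands.append(fast_controller(goal, tick))  # the fast system runs every tick
    return plan_updates, len(commands)


plan_updates, command_count = run(1.0)  # (8, 200)
```

In one simulated second the planner runs 8 times while the controller issues 200 commands, so each plan is reused for 25 control steps.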

Demonstrated tasks: Folding clothes, loading dishwashers, operating washing machines, clearing tables, tossing balls for dogs. Real household tasks, not just lab demos.

Availability: Proprietary to Figure AI. Not publicly available.

Source: Figure AI Helix


Simulation Platforms

Training robots in the real world is slow, expensive, and dangerous. Simulation platforms let developers train in virtual environments first.

NVIDIA Isaac Sim

What it is: A robotics simulation platform built on NVIDIA Omniverse, widely used for high-fidelity robot training.

Why it matters: Isaac Sim provides photorealistic rendering, accurate physics simulation, and integration with NVIDIA's AI training infrastructure. You can generate millions of training scenarios that would be impossible to collect in the real world.

Key capabilities:

  • Domain randomization — Automatically vary lighting, textures, object positions to make trained models robust
  • Sensor simulation — Accurate simulation of cameras, lidar, depth sensors
  • Synthetic data generation — Create labeled training datasets automatically
  • Integration with Cosmos and GR00T — Seamless pipeline from simulation to model training
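
Domain randomization, the first capability in the list above, is simple to illustrate in plain Python. This sketch does not use the Isaac Sim API; the `SceneConfig` fields, value ranges, and texture names are all assumptions chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class SceneConfig:
    light_intensity: float          # relative brightness, 0.2 (dim) to 1.0 (bright)
    texture: str                    # surface material applied to the table
    object_xy: tuple[float, float]  # object position on the table, in meters


TEXTURES = ["wood", "metal", "plastic", "fabric"]


def sample_scene(rng: random.Random) -> SceneConfig:
    """Draw one randomized scene; each training episode would get a fresh draw."""
    return SceneConfig(
        light_intensity=rng.uniform(0.2, 1.0),
        texture=rng.choice(TEXTURES),
        object_xy=(rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)),
    )


rng = random.Random(42)  # seeded so the dataset is reproducible
scenes = [sample_scene(rng) for _ in range(1000)]
```

A policy trained across all 1,000 variations cannot rely on any single lighting condition or texture, which is what makes it more likely to hold up on real hardware.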

Source: NVIDIA Isaac


Newton Physics Engine

What it is: A new open-source physics engine developed by NVIDIA in collaboration with Google DeepMind and Disney Research. Purpose-built for robotics.

Why it matters: Existing physics engines (like PyBullet or MuJoCo) weren't designed for modern robot learning. Newton is optimized for the specific needs of training embodied AI: accurate contact dynamics, fast simulation, and integration with GPU-accelerated training pipelines.

Status: Under development. Announced at GTC 2025 alongside GR00T N1.


Hugging Face LeRobot

What it is: An open-source platform for sharing robotics datasets, trained models, and simulation environments. Think of it as the Hugging Face Hub, but for robots.

Why it matters: The robotics community has historically been fragmented—everyone trains on private data, models aren't shared. LeRobot creates a "data flywheel": researchers contribute datasets and models, others build on them, progress accelerates.

Key features:

  • Dataset hub — Shared robot demonstration datasets
  • Model zoo — Pre-trained policies you can fine-tune
  • Simulation integration — Works with Isaac Sim and other platforms
  • NVIDIA partnership — Joint effort to accelerate open robotics R&D

Source: Hugging Face LeRobot


The Landscape Summary

| Model | Type | Creator | Key Strength | Availability |
|---|---|---|---|---|
| Cosmos | World Model | NVIDIA | Synthetic data for driving/robotics | Open |
| Genie 3 | World Model | Google DeepMind | Real-time interactive environments | Research preview |
| JEPA | World Model | Meta/LeCun | Abstract representation learning | Open (research) |
| GR00T N1 | VLA | NVIDIA | Open foundation model for humanoids | Open |
| RT-2 | VLA | Google DeepMind | Web knowledge transfer | Research |
| Helix | VLA | Figure AI | Commercial humanoid control | Proprietary |

What to Watch

Convergence of world models and VLAs — The next generation of robot AI will likely combine both: world models for planning and imagination, VLAs for real-time control. NVIDIA's Cosmos Reason is an early example.

Open vs. proprietary — NVIDIA and Meta are releasing open models. Google publishes its research but has kept Genie 3 and RT-2 unreleased. Figure is going proprietary. The bet on which strategy wins will shape the industry.

Synthetic data scaling — GR00T N1's 40% improvement from synthetic data suggests we're early in understanding how much simulation can help. Expect massive investment in synthetic data generation.

Yann LeCun's bet — If JEPA-style architectures prove superior to autoregressive models for physical reasoning, it could reshape the entire field. AMI Labs is the venture to watch.


Resources