Definition

A foundation model is a large neural network trained on broad, diverse data at scale, then adapted to specific downstream tasks through fine-tuning, prompting, or in-context learning. The term was coined in 2021 by Stanford's Center for Research on Foundation Models (CRFM), part of the Stanford Institute for Human-Centered AI (HAI), to describe models like GPT, CLIP, and DALL-E that serve as a common base for many applications. In robotics, foundation models aim to provide general visual, semantic, and physical understanding that transfers across robot embodiments, tasks, and environments.

The promise of foundation models for robotics is to escape the single-task paradigm. Traditional robot policies are trained from scratch for each new task, each new robot, and each new environment. A foundation model, by contrast, encodes broad knowledge — what objects look like, how they behave physically, what language instructions mean — that can be quickly adapted to new situations with minimal additional data. This mirrors the success of foundation models in NLP and computer vision, where a single pre-trained model powers hundreds of downstream applications.

In practice, robotics foundation models take several forms: Vision-Language-Action models (VLAs) that directly output robot actions, world models that predict future states for planning, and vision-language models used as perception backbones or reward labelers for reinforcement learning. Each represents a different way to inject broad pre-trained knowledge into the robot learning pipeline.

Types of Foundation Models in Robotics

  • Vision-Language-Action models (VLAs) — Accept images + language instructions and output robot actions. RT-2, OpenVLA, and π0 are the leading examples. These are the most direct application of foundation models to robot control. See the dedicated VLA & VLM glossary entry for details.
  • Generalist policies — Trained on cross-embodiment robot data to provide a base policy that can be fine-tuned to specific robots. Octo (800K episodes, 9 robot types) and RT-X are examples. They may or may not include language conditioning.
  • World models — Predict the future state of the environment given the current state and a candidate action. Used for model-based planning and for generating synthetic training data. UniSim and Genie are early examples. World models for robotics must capture physical dynamics (gravity, friction, contact), which remains challenging.
  • Vision-language backbones — Pre-trained models like CLIP, SigLIP, and DINOv2 used as frozen or fine-tuned feature extractors in robot perception pipelines. They provide rich visual representations without being robot-specific. Most VLAs use one of these as their vision encoder.
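
To make the world-model bullet above concrete, here is a minimal sketch of model-based planning by random shooting: sample candidate action sequences, roll each one forward through the model, and execute the sequence whose predicted end state is closest to the goal. The `world_model` here is a hand-written 2-D point-mass stand-in for a learned dynamics model, and all names and bounds are illustrative assumptions, not any specific system's API.

```python
import numpy as np

def world_model(state, action):
    # Stand-in for a learned dynamics model: a 2-D point mass
    # driven by a velocity command clipped to [-1, 1] per step.
    return state + np.clip(action, -1.0, 1.0)

def plan_random_shooting(state, goal, horizon=5, n_candidates=256, seed=0):
    """Pick the action sequence whose predicted rollout ends nearest the goal."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, 2))
    best_seq, best_cost = None, np.inf
    for seq in candidates:
        s = state.copy()
        for a in seq:                      # roll the model forward
            s = world_model(s, a)
        cost = np.linalg.norm(s - goal)    # distance of predicted end state
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq, best_cost

start, goal = np.zeros(2), np.array([3.0, -2.0])
plan, cost = plan_random_shooting(start, goal)
print(plan.shape, round(float(cost), 3))   # (5, 2) and a small residual distance
```

A real robotics world model replaces the point mass with a learned video or latent-state predictor, and random shooting is usually upgraded to CEM or gradient-based optimization, but the plan-evaluate-select loop is the same.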

How They Differ from Task-Specific Policies

A task-specific policy (e.g., an ACT model trained to fold towels) learns everything from 50–200 demonstrations of that specific task. It has no knowledge of other tasks, other objects, or other robots. If the task changes even slightly (different towel, different table height), the policy may fail and need retraining.

A foundation model starts with knowledge from millions of internet images, billions of text tokens, and potentially hundreds of thousands of robot demonstrations across many embodiments. When fine-tuned on 50–200 demonstrations of towel folding, it retains its broader understanding: it knows what towels look like from many angles, understands the instruction "fold the towel in half," and has seen similar deformable-object manipulation on other robots. This background knowledge enables generalization to new towel colors, sizes, and table configurations without additional data.

The trade-off is clear: foundation models are larger (3–55B parameters vs. 1–50M for task-specific), slower at inference (5–15 Hz vs. 50–200 Hz), and require more compute for fine-tuning (4–8 A100 GPUs vs. a single RTX 4090). For well-defined, high-frequency tasks in controlled environments, task-specific policies remain the practical choice. For diverse, language-conditioned tasks in unstructured environments, foundation models offer a compelling path.

Data Requirements

Pre-training data: Foundation models for robotics require two types of pre-training data. Internet-scale image-text data (billions of pairs, from datasets like LAION and WebLI) provides visual and semantic understanding. Cross-embodiment robot data (hundreds of thousands of episodes from datasets like Open X-Embodiment) provides physical interaction understanding. Collecting the robot data is the bottleneck — the Open X-Embodiment dataset represents a community-wide effort spanning dozens of labs and years of collection.

Fine-tuning data: Adapting a foundation model to a new robot or task requires 100–1,000 teleoperated demonstrations. This is more than a task-specific policy needs (20–200) but the resulting model is far more generalizable. The demonstrations must include language annotations describing the task.

Why internet-scale data matters: A model trained only on robot data has a narrow view of the world: it knows what objects look like from the robot's camera angles, in the robot's workspace. A model that has also seen millions of internet images knows what a "red cup" looks like from every angle, in every lighting condition, in every context. This visual grounding is what enables zero-shot generalization to novel objects.

Key Models Comparison

The landscape of robotics foundation models is evolving rapidly. Here are the principal models as of 2026:

  • RT-2 (Google DeepMind, 2023) — up to 55B parameters. Built on the PaLI-X and PaLM-E VLMs (the 55B variant uses PaLI-X). Demonstrated emergent generalization to novel objects and instructions. Closed-source. Inference at ~3 Hz, limiting real-time control applications.
  • OpenVLA (Stanford/Berkeley, 2024) — 7B parameters. Open-source, based on Prismatic VLM + LLaMA 2. Trained on 970K episodes from Open X-Embodiment. The most accessible VLA for researchers to fine-tune. Inference at ~5–8 Hz on a single A100.
  • Octo (Berkeley, 2024) — 93M parameters. Lightweight generalist policy with transformer backbone and diffusion action head. Designed as a base model for fine-tuning. Supports both language and goal-image conditioning. Fast inference (~20 Hz) makes it practical for real-time control.
  • π0 (Physical Intelligence, 2024) — 3B parameters. Flow-matching action decoder optimized for dexterous manipulation. Demonstrates strong performance on bimanual and contact-rich tasks at higher control frequencies than larger VLAs.
  • GR-2 (ByteDance, 2024) — Video generation world model conditioned on language and action, producing future visual predictions for planning. Represents the world-model approach to foundation models for robotics.

The trend is clear: model sizes are decreasing (from RT-2's 55B to Octo's 93M) as architectures become more efficient, while performance continues to improve through better data curation and training techniques. At SVRC, we support fine-tuning of OpenVLA and Octo on our GPU clusters, with teleoperation data collection services to generate the task-specific demonstrations these models require.

Technical Architecture

Most robotics foundation models share a three-component architecture:

Vision encoder: A pre-trained vision transformer (ViT) or SigLIP model that converts camera images into a sequence of visual tokens. The encoder is typically pre-trained on internet-scale image-text data (LAION, WebLI) and then frozen or fine-tuned during robot training. DINOv2 and SigLIP are the most common choices, providing rich visual features that transfer well to robotic scenes.
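
The front end of any ViT-style encoder is patch tokenization: the image is cut into non-overlapping patches, each flattened and linearly projected into a token. The sketch below shows only this step with numpy; it is a toy illustration, not the actual SigLIP or DINOv2 implementation, and the 256-dimensional embedding is an arbitrary choice.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an H x W x C image into flattened non-overlapping patches."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    gh, gw = H // patch, W // patch
    x = image.reshape(gh, patch, gw, patch, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * C)   # (num_tokens, patch_dim)

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3), dtype=np.float32)  # stand-in camera frame
tokens = patchify(img)                             # (196, 768): 14 x 14 grid
proj = rng.normal(size=(tokens.shape[1], 256)).astype(np.float32)
visual_tokens = tokens @ proj                      # linear patch embedding
print(visual_tokens.shape)                         # (196, 256)
```

A 224x224 image with 16-pixel patches yields 196 visual tokens, which is why VLA context lengths grow quickly with camera count and resolution.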

Language-conditioned backbone: A large language model (LLaMA, PaLM, Gemma) that processes both the visual tokens and a tokenized language instruction. The LLM's pre-trained knowledge of language semantics enables the model to interpret open-ended instructions and generalize to novel task descriptions. During robot fine-tuning, the LLM weights are updated to produce representations useful for action prediction.

Action head: Converts the backbone's hidden representations into robot-executable actions. Three main designs exist: (1) token-based heads discretize continuous actions into bins and predict them as text tokens (RT-2, OpenVLA); (2) diffusion-based heads generate continuous action chunks through an iterative denoising process (Octo); (3) flow-matching heads learn a velocity field that transports noise samples into continuous action chunks (π0). Diffusion and flow-matching heads better capture multimodal action distributions but add inference latency.
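
The token-based design reduces to a uniform binning scheme: each action dimension is mapped to one of (typically) 256 bins and predicted as a token, then mapped back to the bin center at execution time. A minimal round-trip sketch, assuming actions normalized to [-1, 1] (the bounds and bin count are illustrative, not any specific model's configuration):

```python
import numpy as np

N_BINS = 256                      # per-dimension bins, as in RT-2/OpenVLA-style heads
LOW, HIGH = -1.0, 1.0             # assumed normalized action bounds

def discretize(actions):
    """Map continuous actions in [LOW, HIGH] to integer token ids."""
    clipped = np.clip(actions, LOW, HIGH)
    ids = np.floor((clipped - LOW) / (HIGH - LOW) * N_BINS).astype(int)
    return np.minimum(ids, N_BINS - 1)             # HIGH falls into the top bin

def undiscretize(ids):
    """Map token ids back to bin-center continuous values."""
    return LOW + (ids + 0.5) * (HIGH - LOW) / N_BINS

action = np.array([0.25, -0.8, 0.0, 0.999])        # e.g. Δx, Δy, Δz, gripper
ids = discretize(action)
recovered = undiscretize(ids)
print(np.max(np.abs(recovered - action)))          # at most half a bin width
```

The quantization error is bounded by half the bin width (here ~0.004 in normalized units), which is why token-based heads work well for coarse manipulation but lose precision relative to continuous diffusion or flow-matching heads.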

The training process typically follows two stages: pre-training on internet-scale vision-language data (giving the model visual and semantic understanding), followed by robot co-training on a mix of internet data and cross-embodiment robot demonstration data. The robot data is formatted as (image, instruction, action) triplets. Fine-tuning to a specific robot and task requires an additional stage with 100–1,000 target-domain demonstrations.
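
The (image, instruction, action) triplet format described above can be sketched as a simple per-step record; the field names and 7-DoF action layout here are illustrative assumptions, not the schema of any particular dataset.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Transition:
    """One training example: what the robot saw, was told, and did."""
    image: np.ndarray        # (H, W, 3) uint8 camera frame
    instruction: str         # language annotation for the episode
    action: np.ndarray       # e.g. 7-DoF: delta end-effector pose (6) + gripper (1)

def episode_to_transitions(frames, instruction, actions):
    """Flatten one teleoperated episode into per-step triplets."""
    assert len(frames) == len(actions)
    return [Transition(f, instruction, a) for f, a in zip(frames, actions)]

frames = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(3)]
actions = [np.zeros(7, dtype=np.float32) for _ in range(3)]
batch = episode_to_transitions(frames, "fold the towel in half", actions)
print(len(batch), batch[0].instruction)
```

Note that the language annotation is attached to every step of the episode, which is why fine-tuning demonstrations must ship with per-episode task descriptions.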

Current Limitations

Physical grounding: Foundation models pre-trained on internet data understand visual appearance and language but lack deep physical intuition. They know what a glass looks like but may not predict that it will shatter if gripped too hard. Physical understanding must come from robot interaction data, which is orders of magnitude scarcer than internet data.

Inference latency: Models with 3–55B parameters run at 5–15 Hz on current hardware, compared to 50–200 Hz for lightweight task-specific policies. This limits their use in tasks requiring fast reactive control (catching, high-speed assembly). Model distillation and specialized inference hardware are active research areas.

Embodiment gap: A model trained on data from robot type A does not automatically work on robot type B with different kinematics, cameras, and action spaces. Cross-embodiment training helps but does not eliminate this gap entirely. Fine-tuning on the target embodiment remains necessary.

Key Papers

  • Bommasani, R. et al. (2021). "On the Opportunities and Risks of Foundation Models." Stanford HAI. The landmark report that defined the foundation model concept and analyzed its implications across domains, including robotics.
  • Brohan, A. et al. (2023). "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." CoRL 2023. The first large-scale demonstration that internet pre-training transfers to robot control through a VLA architecture.
  • Open X-Embodiment Collaboration (2024). "Open X-Embodiment: Robotic Learning Datasets and RT-X Models." ICRA 2024. Created the data foundation for cross-embodiment robotics foundation models by aggregating 970K episodes across 22 robot types.

Related Terms

  • VLA & VLM — The most common foundation model architecture for robot control
  • Policy Learning — Foundation models are one approach to learning robot policies
  • Sim-to-Real Transfer — Foundation models can reduce the sim-to-real gap through pre-trained visual representations
  • Action Chunking (ACT) — A task-specific alternative to foundation model approaches
  • Diffusion Policy — Used as the action head in some foundation model architectures

See Also

  • SVRC Data Services — Teleoperation data collection for foundation model fine-tuning
  • Robot Leasing — Access diverse robot embodiments for cross-embodiment data collection
  • Data Platform — Manage datasets in Open X-Embodiment and LeRobot formats
  • Hardware Catalog — OpenArm 101, DK1, and Unitree G1 platforms supported for VLA fine-tuning

Deploy Foundation Models at SVRC

Silicon Valley Robotics Center in Mountain View and Allston provides multi-GPU clusters for foundation model fine-tuning, large-scale teleoperation data collection campaigns, and expert guidance on when to use a foundation model versus a lightweight task-specific policy. Our data platform manages datasets in Open X-Embodiment and LeRobot formats for seamless model training. We support OpenVLA and Octo fine-tuning out of the box, with custom model integration available on request.

Explore Data Services   Contact Us