Imitation Learning for Robots: A Practical Guide
Imitation learning has emerged as the dominant paradigm for teaching robots dexterous manipulation skills. Instead of hand-crafting reward functions or writing motion plans, you simply show the robot what to do. This guide explains how it works, which algorithms to use, and what infrastructure you need to get results.
What Is Imitation Learning?
Imitation learning (IL) -- also called learning from demonstration (LfD) or behavioral cloning -- trains a policy to replicate actions captured from a human operator. During data collection, a skilled demonstrator teleoperates the robot through the target task while sensors record joint positions, end-effector poses, camera frames, and any other relevant state. That recorded data becomes the training set for a neural network policy.
The appeal of IL over reinforcement learning is practical: you do not need to engineer a reward signal, run millions of simulated rollouts, or solve a sparse-reward exploration problem. If a human can do the task, the robot can potentially learn it from a few hundred to a few thousand demonstrations. The challenge is generalization -- policies trained on narrow demonstrations can fail when object positions, lighting, or task variations differ from the training distribution.
Modern IL research addresses this through better architectures, larger and more diverse datasets, and pre-trained visual representations. The field has advanced rapidly since 2023, and production-quality imitation learning is now within reach of teams without access to a robotics PhD program.
IL Algorithm Taxonomy: BC, DAgger, and HG-DAgger
Behavioral Cloning (BC) is the simplest form of imitation learning. You collect a static dataset of (observation, action) pairs from demonstrations, then train a neural network to predict actions from observations via supervised learning. BC is easy to implement, fast to train, and requires no online interaction. Its fundamental weakness is compounding error: at deployment, small prediction mistakes shift the observation distribution away from training data, and subsequent predictions degrade further because the policy is extrapolating into states it never saw during training.
The compounding error problem scales with trajectory length. For a per-step error rate of epsilon, the expected total error over T steps grows as O(epsilon * T^2) under the worst-case analysis of Ross and Bagnell (2010). In practice, this means BC works well for short tasks (under 30 steps at the control frequency) but degrades rapidly on longer horizons without mitigation strategies like action chunking.
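The quadratic growth can be made concrete with a toy model (our own illustration, not from the Ross and Bagnell analysis itself): assume the policy errs with probability epsilon at each step and, once off-distribution, never recovers. The expected number of off-distribution steps over a horizon T then grows roughly as epsilon * T^2 / 2 for small epsilon:

```python
# Toy illustration of quadratic compounding error: a per-step error rate
# EPSILON with no recovery yields ~EPSILON * T^2 / 2 expected bad steps.
EPSILON = 0.01  # hypothetical per-step error rate

def expected_off_distribution_steps(horizon: int, epsilon: float = EPSILON) -> float:
    """Expected number of steps spent off-distribution over `horizon` steps."""
    total = 0.0
    for t in range(1, horizon + 1):
        # Probability of having made at least one error by step t.
        total += 1.0 - (1.0 - epsilon) ** t
    return total

for horizon in (30, 100, 300):
    cost = expected_off_distribution_steps(horizon)
    print(f"T={horizon:4d}: expected off-distribution steps = {cost:8.2f}")
```

Note how tripling the horizon much more than triples the expected error, which is why short tasks tolerate plain BC and long tasks do not.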
DAgger (Dataset Aggregation) addresses compounding error through online data collection. After an initial BC phase, the learned policy is deployed, and an expert labels the actions that should have been taken at the states the policy actually visited. This new data is added to the training set, and the policy is retrained. Iterating this process provably converges to a policy that performs as well as the expert in the limit. The practical cost is that DAgger requires repeated access to an expert during training -- the expert must watch the policy run and correct its mistakes in real time.
HG-DAgger (Human-Gated DAgger) is a more practical variant for robot learning. Instead of requiring an expert to label every visited state, HG-DAgger allows the human to intervene only when the policy is about to fail. The operator watches the robot execute the learned policy and takes over control when necessary. The intervention trajectories (which represent corrective demonstrations from the specific failure states the policy encounters) are added to the dataset. This approach is 3-5x more efficient than standard DAgger because the human only provides demonstrations where they are most needed.
For most practical robot learning projects at SVRC, we recommend starting with BC using action chunking (which dramatically reduces compounding error) and graduating to HG-DAgger only if the policy has persistent failure modes that additional BC data does not resolve.
Dataset Size Scaling Laws for Imitation Learning
How many demonstrations do you actually need? The answer depends on task complexity, algorithm, and how much generalization you require. Based on published results and SVRC's internal evaluation data, here are the empirical scaling behaviors:
| Dataset Size | Typical Result (single task, ACT) | When to Stop Adding Data |
|---|---|---|
| 10-25 demos | 20-40% success; policy captures gross motion but misses details | Never -- this is only useful for sanity checks |
| 50-100 demos | 60-80% success on in-distribution objects/positions | Acceptable for a research prototype with fixed conditions |
| 200-500 demos | 80-92% success; handles moderate position/object variation | Sufficient for most single-task deployments |
| 500-2000 demos | 88-95% success; robust to diverse objects, positions, lighting | Diminishing returns unless adding diversity, not volume |
A critical insight from scaling laws research: beyond approximately 200 demonstrations of the same task under the same conditions, adding more identical data produces diminishing returns. The high-leverage move is to add diverse data -- new object instances, new positions, new lighting -- rather than more repetitions. SVRC's data collection protocols are structured around diversity targets for exactly this reason.
State Representation Choices
The observation space fed to your policy significantly affects learning efficiency and generalization. The three practical options, ranked by complexity:
Joint angles + images (recommended default). The observation at each timestep is the vector of robot joint positions (7-dimensional for a 7-DOF arm, plus 1 for the gripper) concatenated with one or more camera images (typically 224x224 or 256x256 RGB). Joint angles provide precise proprioceptive state that the policy can use for closed-loop control. Images provide the visual context needed to identify objects and workspace layout. This is the standard input format for ACT, Diffusion Policy, and most VLA models.
End-effector pose + images. Instead of joint angles, the observation includes the end-effector position (x, y, z) and orientation (quaternion or rotation matrix) in the robot's base frame. This representation is more compact and directly encodes task-relevant spatial information. It works well when the policy only needs to control end-effector motion (as opposed to specific joint configurations). The downside: end-effector representations can create ambiguity in redundant manipulators where multiple joint configurations produce the same end-effector pose.
Joint angles only (no images). For fixed-environment tasks with known object positions (e.g., factory assembly with fixtured parts), pure proprioceptive policies trained on joint angles alone can achieve high success rates with very few demonstrations (as few as 20-50). The policy learns a joint-space trajectory rather than a visuomotor mapping. This approach is fast to train, easy to deploy, and very reliable -- but it offers zero generalization to any visual change.
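As a concrete sketch of the recommended default, here is a hypothetical helper that assembles the "joint angles + images" observation in the nested-key layout that LeRobot-style policies expect (the key names and the 224x224 resolution are illustrative assumptions):

```python
import numpy as np

# Illustrative only: package proprioception and one RGB camera frame into
# the dict-of-arrays observation format used by ACT / Diffusion Policy
# implementations in LeRobot. Key names are assumptions, not an API spec.
def build_observation(joint_pos, gripper_width, image_rgb):
    # 7 joint angles + 1 gripper width -> 8-dim proprioceptive state.
    state = np.concatenate([joint_pos, [gripper_width]]).astype(np.float32)
    assert image_rgb.shape == (224, 224, 3), "expects a 224x224 RGB frame"
    return {
        "observation.state": state,                              # (8,)
        "observation.images.top": image_rgb.transpose(2, 0, 1),  # HWC -> CHW
    }

obs = build_observation(np.zeros(7), 0.04, np.zeros((224, 224, 3), dtype=np.uint8))
print(obs["observation.state"].shape)  # (8,)
```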
Action Space Design: Absolute vs. Delta
How actions are represented matters more than most practitioners realize. The two dominant choices:
Absolute joint positions. Each action is the target joint angle vector for the next timestep. The advantage: actions are interpretable and bounded by joint limits. The disadvantage: the policy must learn the full mapping from observations to absolute joint targets, and small observation changes can require large action changes (since the absolute target may be far from the current position). ACT uses absolute joint targets by default.
Delta (relative) actions. Each action is the change in joint angles or end-effector pose from the current state. The advantage: delta actions are naturally small, making them easier to predict accurately, and they generalize better to starting states not seen during training (because the policy learns relative motions rather than absolute targets). The disadvantage: delta actions can accumulate drift over long trajectories if there is no absolute reference correction. Diffusion Policy typically uses delta end-effector actions.
Practical recommendation: for tasks under 100 steps at 10-20 Hz control, delta end-effector actions with occasional absolute waypoints provide the best combination of generalization and accuracy. For bimanual tasks where joint coordination matters, absolute joint targets (as in ACT) avoid the synchronization issues that delta actions can create between two arms.
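The two parameterizations are easy to see in code. This hypothetical sketch converts a demonstrated absolute trajectory into delta actions and then integrates the deltas back at execution time, which also shows the generalization benefit (and the drift risk) of relative actions:

```python
import numpy as np

# Sketch: delta-action encoding of a demonstrated absolute trajectory.
def to_delta_actions(traj: np.ndarray) -> np.ndarray:
    """traj: (T, dof) absolute joint positions -> (T-1, dof) per-step deltas."""
    return np.diff(traj, axis=0)

def execute_deltas(start: np.ndarray, deltas: np.ndarray) -> np.ndarray:
    """Integrate delta actions open-loop from a (possibly new) starting pose."""
    return start + np.cumsum(deltas, axis=0)

traj = np.linspace(0.0, 1.0, 11).reshape(-1, 1)  # toy 11-step, 1-DOF trajectory
deltas = to_delta_actions(traj)

# Replaying from the demonstrated start recovers the trajectory exactly.
replay = execute_deltas(traj[0], deltas)
# Replaying from a shifted start reproduces the *relative* motion -- the
# generalization benefit, but also the source of drift without an
# absolute reference correction.
shifted = execute_deltas(traj[0] + 0.1, deltas)
print(replay[-1], shifted[-1])  # [1.] [1.1]
```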
ACT: Action Chunking with Transformers
ACT, introduced alongside the ALOHA bimanual robot platform from Stanford, treats robot control as a sequence prediction problem. The policy predicts a chunk of future actions -- typically 50-100 timesteps -- rather than a single next action. This action chunking reduces compounding error, which is the main failure mode of naive behavioral cloning where small prediction mistakes accumulate over a trajectory.
ACT uses a CVAE (Conditional Variational Autoencoder) during training to capture the multimodality of human demonstrations -- the fact that there is often more than one correct way to complete a task. At inference time, the decoder generates action sequences conditioned on the current camera observations and joint state. The result is a policy that handles the natural variation in human-demonstrated tasks without mode-averaging artifacts.
Key implementation details that affect performance: the KL divergence weight in the CVAE loss controls the tradeoff between reconstruction accuracy and latent space regularization. Start with a weight of 10 and sweep [1, 5, 10, 50]. The temporal ensemble technique -- averaging overlapping action chunks with exponential weighting -- smooths transitions between chunks and reduces jitter. Use an exponential weight of 0.01 for the temporal ensemble.
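The temporal ensemble is simple to implement. Following the ALOHA paper's convention, each chunk that covers the current timestep contributes its prediction with weight w_i = exp(-m * i), where i counts from the oldest chunk (so older chunks get slightly more weight); a minimal sketch with the m = 0.01 value above:

```python
import numpy as np

# Sketch of ACT's temporal ensemble: average the actions that several
# overlapping chunks predict for the *current* timestep, with exponential
# weights exp(-m * i), i = 0 for the oldest chunk.
def temporal_ensemble(predictions: np.ndarray, m: float = 0.01) -> np.ndarray:
    """predictions: (k, dof) -- the current-step action from each of the
    k overlapping chunks, ordered oldest first."""
    ages = np.arange(len(predictions))
    weights = np.exp(-m * ages)
    weights /= weights.sum()          # normalize to a weighted average
    return weights @ predictions

# Three chunks disagree slightly about the current action; the ensemble
# smooths them into one command, reducing jitter at chunk boundaries.
preds = np.array([[0.50], [0.52], [0.48]])
print(temporal_ensemble(preds))
```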
ACT is a strong starting point for bimanual manipulation tasks. It requires relatively modest data volumes (50-200 demonstrations per task) and trains on a single GPU in 2-4 hours. If you are working with ALOHA hardware or a similar bimanual setup, ACT should be your first algorithm to try. SVRC's data services include pre-processed ACT-compatible datasets collected on ALOHA-class platforms.
Diffusion Policy: Handling Multimodal Action Distributions
Diffusion Policy applies score-matching diffusion models -- the same class of models that powers Stable Diffusion for images -- to the robot action space. Rather than predicting a single best action, the policy learns the full distribution of actions that a human demonstrator might take. At inference time it runs a denoising process to sample a high-quality action from that distribution.
The key advantage over ACT is how it handles multimodal tasks: scenarios where a human might grasp an object from the left or the right, or approach a target from multiple valid angles. Standard behavioral cloning averages these modes together, producing a policy that goes down the middle and fails. Diffusion Policy samples from the correct mode given the current context, producing more robust behavior on ambiguous tasks.
The tradeoff is inference speed. Diffusion Policy with a UNet backbone requires 100 denoising steps at inference by default (DDPM), which takes approximately 500ms on an RTX 3090 -- too slow for real-time control. The DDIM sampler reduces this to 10 steps (~200ms), and consistency distillation variants achieve single-step generation (~50ms), making real-time operation viable. For a detailed comparison of inference timing and when to use each variant, see our ACT vs. Diffusion Policy decision guide.
Diffusion Policy generally benefits from more demonstrations than ACT but rewards dataset diversity more than raw quantity. A practical rule: use Diffusion Policy when you have 200+ demonstrations and your task has multiple valid strategies that you want the policy to handle.
Vision-Language-Action Models: IL at Scale
VLAs like OpenVLA, pi0, and RT-2 extend imitation learning by pre-training on internet-scale visual and language data before fine-tuning on robot demonstrations. The pre-trained backbone provides a rich representation of objects, scenes, and relationships that transfers powerfully to robot manipulation. Fine-tuning requires far fewer demonstrations than training from scratch -- sometimes as few as 10-50 task-specific examples.
The practical tradeoffs to understand: VLAs require significantly more compute for both training and inference. OpenVLA (7B parameters) requires an A100 or H100 GPU for fine-tuning and runs inference at approximately 3 Hz -- adequate for slow manipulation but not for reactive tasks. Smaller distilled variants are emerging but remain less capable than the full models. For teams that can afford the compute and licensing requirements, VLAs represent the current frontier of IL performance. They generalize better to novel objects, new environments, and language-specified task variations.
SVRC provides fine-tuning datasets and teleoperation infrastructure compatible with the data formats expected by major VLA training pipelines. See our VLA models explained guide for a deeper technical breakdown.
Beyond BC: GAIL, IBC, and Inverse RL
While behavioral cloning and DAgger dominate practical robot IL, several other methods deserve attention for specific scenarios.
GAIL (Generative Adversarial Imitation Learning). GAIL trains a discriminator to distinguish between expert demonstrations and the learned policy's rollouts, then uses the discriminator's output as a reward signal for reinforcement learning. The result is a policy that matches the expert's state-action distribution rather than individual actions, which provides better generalization than BC when the demonstration dataset is small (under 50 episodes). The cost: GAIL requires online rollouts in an environment (simulation or real), making it 10-100x more computationally expensive than BC. Practical use case: tasks where you have very few demonstrations (10-20) but access to a good simulator (e.g., peg insertion, where MuJoCo models the physics accurately).
IBC (Implicit Behavioral Cloning). IBC represents the policy as an energy-based model (EBM) rather than an explicit action predictor. Instead of outputting a single action, the model assigns an energy score to every candidate (observation, action) pair. At inference, the policy finds the action that minimizes energy via gradient-based optimization or Langevin dynamics sampling. The advantage: IBC naturally handles multimodal action distributions without the architectural complexity of diffusion models or CVAEs. The disadvantage: inference is slow (100-500ms per step on an RTX 3090) because it requires iterative optimization at each timestep. IBC has shown strong results on precise contact-rich tasks (insertion, fitting) where the action distribution has sharp, well-separated modes.
Inverse RL (IRL). Rather than learning actions directly, IRL recovers the reward function that the expert was implicitly optimizing, then trains a policy via RL using that recovered reward. This approach generalizes better than BC to novel initial conditions because the policy learns the underlying objective rather than a fixed mapping. The practical barrier: IRL requires repeated RL training in an inner loop, which is computationally expensive and requires either a simulator or extensive real-world interaction. IRL is most practical for autonomous driving and navigation tasks where simulation is mature and reward specification is genuinely difficult. For manipulation, BC with modern architectures is typically sufficient.
Failure Analysis Framework for IL Policies
When a trained policy fails on the real robot, systematic failure analysis saves days of trial-and-error debugging. Use this five-step framework to diagnose and fix policy failures.
Step 1: Classify the failure mode. Watch 10+ failure episodes and categorize each into one of these types:
- Approach failure: The robot does not move toward the object correctly (wrong direction, too fast/slow, stops short).
- Grasp failure: The robot reaches the object but fails to grasp it (misaligned fingers, insufficient force, wrong angle).
- Transport failure: The robot grasps successfully but drops the object during transport (grip loosening, collision with obstacles).
- Placement failure: The robot transports successfully but fails at the final placement (wrong orientation, position overshoot).
- Recovery failure: The robot enters an off-trajectory state (e.g., after a partial grasp) and cannot recover.
Step 2: Check for calibration and hardware issues. Before blaming the policy, verify that cameras are in their calibrated positions (measure with a ruler against reference marks), joint encoders are not drifting (command the arm to a known pose and check visually), and the gripper is closing to the expected width. Hardware issues masquerading as policy failures are a common source of wasted debugging time.
Step 3: Compare failure observations to training distribution. Extract the camera frame at the moment of failure and compare it visually to the training data. Is the object in a position the policy has never seen? Is the lighting drastically different? Is there an occluding object not present during training? If so, the fix is data collection, not hyperparameter tuning.
Step 4: Check for mode averaging. If the robot moves toward a point between two valid targets (e.g., between two objects when it should pick one), the policy is averaging over modes in the demonstration data. Switch to a multimodal architecture (Diffusion Policy or ACT with CVAE enabled) or clean the demonstrations to enforce consistent strategy choice.
Step 5: Plot per-timestep action error. If the policy starts well but degrades after 20-40 steps, compounding error is the primary issue. Increase the action chunk size, add temporal ensembling, or collect HG-DAgger data from the specific failure states.
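Step 5 can be computed in a few lines. This sketch (with synthetic stand-in data; in practice `pred` and `expert` come from your evaluation rollouts) measures mean action error as a function of timestep, so compounding error shows up as a rising curve:

```python
import numpy as np

# Sketch of step 5: per-timestep mean squared action error across episodes.
def per_timestep_error(pred: np.ndarray, expert: np.ndarray) -> np.ndarray:
    """pred, expert: (n_episodes, T, dof) -> (T,) mean error per timestep."""
    return np.mean((pred - expert) ** 2, axis=(0, 2))

rng = np.random.default_rng(0)
T = 80
expert = np.zeros((10, T, 7))
# Synthetic policy whose noise grows with time -- the compounding-error signature.
pred = rng.normal(scale=0.01 * (1 + np.arange(T) / 20.0)[None, :, None],
                  size=(10, T, 7))

err = per_timestep_error(pred, expert)
late_over_early = err[-10:].mean() / err[:10].mean()
print(f"late/early error ratio: {late_over_early:.1f}")  # >> 1 => compounding error
```

A flat curve points at model capacity or data quality; a rising curve points at chunk size, temporal ensembling, or HG-DAgger data as the fix.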
Python Training Snippet: ACT with LeRobot
Here is a minimal working training script for ACT using the Hugging Face LeRobot framework, annotated with the key configuration decisions.
```python
# train_act.py -- Minimal ACT training with LeRobot
import torch

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
from lerobot.common.policies.act.configuration_act import ACTConfig
from lerobot.common.policies.act.modeling_act import ACTPolicy

# 1. Load dataset (collected via lerobot record)
dataset = LeRobotDataset("svrc/openarm-pick-place-v1")
print(f"Episodes: {dataset.num_episodes}, Steps: {len(dataset)}")

# 2. Configure ACT policy
config = ACTConfig(
    chunk_size=100,        # Predict 100 future actions per step
    kl_weight=10.0,        # CVAE KL divergence weight; sweep [1, 5, 10, 50]
    dim_model=512,         # Transformer hidden dimension
    n_heads=8,             # Multi-head attention heads
    n_encoder_layers=4,    # Visual encoder depth
    n_decoder_layers=7,    # Action decoder depth
    input_shapes={
        "observation.images.top": [3, 480, 640],
        "observation.state": [14],  # 7 joints x 2 arms (bimanual)
    },
    output_shapes={
        "action": [14],  # Joint position targets
    },
)

# 3. Train
policy = ACTPolicy(config)
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5, weight_decay=1e-4)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=8, shuffle=True)

for epoch in range(2000):
    for batch in dataloader:
        loss = policy.forward(batch)  # Returns dict with "loss" key
        loss["loss"].backward()
        optimizer.step()
        optimizer.zero_grad()
    if epoch % 100 == 0:
        print(f"Epoch {epoch}: loss={loss['loss'].item():.4f}")

# 4. Export for deployment
policy.save_pretrained("checkpoints/act-pick-place-v1")
```

This script assumes you have collected data using LeRobot's lerobot record command, so that it is already in the LeRobotDataset format. Training time on a single RTX 4090: approximately 2-3 hours for 2,000 epochs on 200 demonstrations. For full hyperparameter sweeps and multi-GPU training, see the LeRobot framework guide.
Common Data Quality Issues and Fixes
| Symptom | Likely Cause | Fix |
|---|---|---|
| Policy jerky, oscillates near grasp | Inconsistent operator technique across demos | Use single operator; filter demos by trajectory smoothness |
| High train loss, low success | Misaligned camera timestamps (>50ms offset) | Re-record with hardware sync; check USB bandwidth |
| Overfit: 95% train, 30% eval | Insufficient pose/lighting diversity | Add color jitter + random crop augmentation; collect 50+ demos with varied object positions |
| Policy ignores one camera | Redundant viewpoints; model shortcuts to easier camera | Random camera dropout during training (p=0.1-0.3) |
| Gripper never closes / always closed | Gripper action normalization error | Verify gripper action range matches hardware limits; check open/close polarity |
| Works day 1, fails day 2 | Camera or arm bumped; lighting changed | Add daily calibration check to deployment routine; mount cameras rigidly |
Task Success Metrics: Measuring Policy Quality
A trained policy's quality should be measured through a structured evaluation protocol, not informal observation. The metrics that matter:
- Task success rate. The primary metric: in what fraction of evaluation trials does the policy complete the full task? Run a minimum of 20 evaluation trials per condition and report 95% confidence intervals -- with 20 trials, the interval is wide (roughly +/- 15 percentage points), so do not over-interpret small differences.
- Partial completion rate. For multi-step tasks, track which subtasks are completed even when the full task fails. A policy that consistently reaches the object but fails at grasping has a different failure mode than one that fails at the approach phase.
- In-distribution vs. out-of-distribution performance. Always report success rates separately for conditions seen during training and conditions not seen during training (held-out objects, positions, or environments). A policy achieving 90% in-distribution and 40% out-of-distribution has fundamentally different deployment readiness than one achieving 80%/70%. See our policy generalization guide for evaluation protocols.
- Trajectory quality metrics. Beyond binary success/failure: mean jerk (smoothness), path length efficiency (actual path length vs. shortest feasible path), and execution time compared to expert demonstrations. Jerky policies that technically succeed will break hardware over time and produce worse data if used for further collection.
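Confidence intervals for a success rate are cheap to compute. A sketch using the Wilson score interval, which behaves better than the normal approximation at small trial counts (the choice of Wilson over other intervals is ours):

```python
import math

# 95% confidence interval for a binomial success rate (Wilson score interval).
def wilson_interval(successes: int, trials: int, z: float = 1.96):
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return center - half, center + half

# 16/20 successes: the interval spans roughly 0.58 to 0.92 -- wide, as
# the text warns, so small differences between policies are not meaningful.
lo, hi = wilson_interval(16, 20)
print(f"observed 0.80, 95% CI [{lo:.2f}, {hi:.2f}]")
```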
Training Monitoring: Validation Loss Patterns to Watch
During policy training, monitor these signals to diagnose problems early:
Validation loss plateau. If validation loss stops decreasing while training loss continues to drop, you are overfitting to your training set. The standard fix: add data augmentation (color jitter, random crop) or collect more diverse demonstrations.
Validation loss oscillation. Large swings in validation loss indicate an unstable training configuration. Reduce the learning rate by 2-5x. For Diffusion Policy, also check that the noise schedule variance is not too aggressive.
KL divergence collapse (ACT only). If the KL term in ACT's loss goes to zero early in training, the CVAE latent space has collapsed and the policy is ignoring the latent variable. Increase the KL weight or use a KL warmup schedule that gradually increases the weight over the first 20% of training.
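One common warmup shape is a linear ramp (the exact schedule is a design choice, not prescribed by ACT): keep the KL weight near zero early so the latent is actually used, then ramp to the target over the first 20% of training.

```python
# Sketch of a linear KL warmup schedule for the CVAE loss term.
def kl_weight_at(step: int, total_steps: int, target: float = 10.0,
                 warmup_frac: float = 0.2) -> float:
    """KL weight ramps linearly from 0 to `target` over the warmup window."""
    warmup_steps = int(total_steps * warmup_frac)
    if step >= warmup_steps:
        return target
    return target * step / max(warmup_steps, 1)

# Inside the training loop this would be used as:
#   loss = reconstruction_loss + kl_weight_at(step, total_steps) * kl_loss
print(kl_weight_at(0, 1000), kl_weight_at(100, 1000), kl_weight_at(500, 1000))
```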
Action prediction error by timestep. Plot per-timestep prediction error across the action chunk. If error increases sharply at later timesteps (e.g., step 60+ in a 100-step chunk), reduce the chunk length. If error is uniformly high, the model capacity may be insufficient -- increase the transformer hidden dimension or the number of attention heads.
Inference Deployment: From Trained Model to Real Robot
Deploying a trained IL policy on real hardware introduces practical challenges that training pipelines do not expose:
Camera calibration drift. If the camera is bumped or re-mounted between training data collection and deployment, even a 2-3 cm shift can degrade policy performance by 10-20%. Always verify camera positions before deployment using a calibration procedure. Log camera intrinsics and extrinsics as part of your deployment checklist (see our deployment checklist).
Control frequency matching. The policy must run at the same control frequency at which the demonstrations were collected. If demonstrations were collected at 50 Hz but your inference pipeline runs at 20 Hz (because the GPU cannot keep up), the policy's action predictions will be temporally misaligned. Either downsample your training data to match the deployment control frequency, or ensure your inference hardware is fast enough. ACT on an RTX 3090 achieves approximately 20 Hz with a single camera; adding a second camera drops throughput to roughly 15 Hz.
Action smoothing and safety limits. Raw policy outputs should be filtered through a safety layer before being sent to the robot: clamp actions to joint limits, apply velocity limits (typically 1.0-1.5 rad/s per joint), and use a low-pass filter (5-10 Hz cutoff) to remove high-frequency prediction noise. This adds 1-2 frames of latency but prevents the hardware damage that unfiltered policy outputs can cause.
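A minimal sketch of that safety layer, with illustrative limits (the joint bounds, 1.2 rad/s velocity cap, 20 Hz control step, and filter coefficient below are placeholder values, not hardware specs):

```python
import numpy as np

# Sketch of a deployment safety layer: joint-limit clamp, per-step
# velocity limit, and a first-order low-pass filter on the command.
class SafetyFilter:
    def __init__(self, lower, upper, max_vel=1.2, dt=0.05, alpha=0.3):
        self.lower, self.upper = np.asarray(lower), np.asarray(upper)
        self.max_step = max_vel * dt  # max allowed change per control step (rad)
        self.alpha = alpha            # low-pass coefficient (higher = less smoothing)
        self.prev = None              # last command actually sent

    def __call__(self, action: np.ndarray) -> np.ndarray:
        a = np.clip(action, self.lower, self.upper)  # 1. joint limits
        if self.prev is not None:
            # 2. velocity limit: bound the change relative to the last command.
            delta = np.clip(a - self.prev, -self.max_step, self.max_step)
            a = self.prev + delta
            # 3. first-order low-pass filter to remove prediction jitter.
            a = self.alpha * a + (1 - self.alpha) * self.prev
        self.prev = a
        return a

f = SafetyFilter(lower=[-3.0], upper=[3.0])
print(f(np.array([0.0])), f(np.array([5.0])))  # the 5.0 spike is clamped and rate-limited
```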
Data Requirements for Imitation Learning
The minimum viable dataset for a single manipulation task is typically 50 demonstrations for ACT, 100-200 for Diffusion Policy, and 20-50 for VLA fine-tuning. These are floor estimates under favorable conditions -- consistent lighting, fixed camera viewpoints, and objects in predictable positions. Real-world deployment requires 3-5x more data to cover the variation your system will encounter in production.
Data quality matters as much as quantity. Demonstrations should be collected by skilled operators who complete the task consistently and cleanly. Failed attempts, hesitations, and corrections that enter the training set as labeled successes will degrade policy performance. SVRC's managed data collection service ($2,500 pilot / $8,000 full campaign) provides trained operators, quality-filtered episode selection, and structured dataset packaging -- saving your engineering team weeks of data pipeline work.
Sensor diversity is also important. Policies trained on a single wrist camera frequently fail when that camera is occluded. Best practice is to collect from at least two camera viewpoints -- one fixed overhead or side view and one wrist-mounted -- and include proprioceptive state (joint angles and velocities) alongside visual observations.
Hardware and Infrastructure for IL Research
The minimal hardware stack for an imitation learning research project includes: a robot arm with sufficient degrees of freedom for your task (at least 6-DOF for general manipulation), a leader-follower or VR-based teleoperation system for data collection, two or more cameras, and a workstation with at least one NVIDIA GPU (RTX 3090 or better for ACT/Diffusion Policy; A100 or H100 recommended for VLA fine-tuning).
SVRC's hardware catalog includes the OpenArm 101 ($4,500), which ships with a compatible teleoperation leader arm and mounting hardware for standard camera configurations. For bimanual research, the DK1 platform provides dual-arm teleoperation with synchronized recording. The SVRC platform provides the software layer: episode recording, dataset management, policy training pipelines, and evaluation tooling. Teams can lease rather than buy hardware for short-term projects through the robot leasing program, which is often the fastest path to a working IL prototype.
For teams that want to start with data before investing in hardware, SVRC offers access to curated multi-task demonstration datasets collected at our Mountain View facility. These datasets cover common manipulation primitives -- picking, placing, pouring, folding, assembly -- and are formatted for direct use with ACT, Diffusion Policy, and Hugging Face LeRobot. Contact our team to discuss dataset access options.
Algorithm Comparison Table: Which IL Method for Which Scenario
| Method | Min Demos | Multimodal? | Inference Speed | GPU Required | Best For |
|---|---|---|---|---|---|
| BC (MLP/ResNet) | 50 | No | ~1ms | RTX 3060+ | Simple short-horizon tasks, proprioception-only |
| ACT | 50 | Yes (CVAE) | ~50ms | RTX 3090+ | Bimanual, long-horizon, ALOHA-class hardware |
| Diffusion Policy | 200 | Yes (diffusion) | ~200ms (DDIM) | RTX 3090+ | Multi-strategy tasks, diverse demonstrations |
| VLA (OpenVLA) | 20 | Yes (language) | ~300ms | A100/H100 | Novel object generalization, language-conditioned tasks |
| GAIL | 10 | N/A (RL-based) | ~1ms | A100 (training) | Few demos + good sim (e.g., peg insertion in MuJoCo) |
| IBC | 100 | Yes (EBM) | ~200ms | RTX 3090+ | Contact-rich precision tasks with sharp modes |
The decision flowchart for most teams: start with ACT if you have bimanual hardware or long-horizon tasks with fewer than 200 demos. Use Diffusion Policy if you have 200+ demos and your task has multiple valid strategies. Use VLA fine-tuning if you need language conditioning or novel object generalization with minimal task-specific data. BC is appropriate only for the simplest, shortest tasks or as a diagnostic baseline.
Multi-Task IL: Training One Policy for Multiple Tasks
Training a single policy to handle multiple manipulation tasks (pick-place, drawer open, pouring, etc.) is increasingly practical with modern architectures. The key considerations:
Language conditioning is essential for multi-task. Without language instructions, the policy has no way to know which task to execute. Language-conditioned policies accept instructions like "pick up the red cup" or "open the top drawer" and route behavior accordingly. Both ACT and Diffusion Policy can be extended with language conditioning by feeding a language embedding into their encoders as an additional input.
Data balancing across tasks. If you collect 500 demos of pick-place but only 50 of pouring, the policy will strongly favor pick-place behavior. Use weighted sampling during training to balance task frequency, or explicitly set task sampling probabilities inversely proportional to dataset size per task.
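Inverse-frequency sampling is one line of arithmetic; a sketch with illustrative episode counts:

```python
# Sketch: sample each task with probability inversely proportional to its
# episode count, so a 500-demo task no longer drowns out a 50-demo task.
def task_sampling_probs(counts: dict) -> dict:
    inv = {task: 1.0 / n for task, n in counts.items()}
    total = sum(inv.values())
    return {task: w / total for task, w in inv.items()}

probs = task_sampling_probs({"pick_place": 500, "pouring": 50})
print(probs)  # pouring is drawn 10x more often than pick_place
```

These probabilities can feed a weighted sampler (e.g. PyTorch's WeightedRandomSampler) so that each training batch is approximately balanced across tasks.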
Multi-task benefits generalization. Counterintuitively, training on 5 tasks with 100 demos each often produces better per-task performance than training on 1 task with 100 demos. The multi-task training acts as implicit regularization, forcing the visual encoder to learn features that are relevant across tasks rather than memorizing task-specific visual shortcuts. The Open X-Embodiment results confirmed this effect at scale.
Expected overhead. Multi-task training requires 2-5x more compute than single-task training (more data, larger batch sizes for stability). Inference speed is unchanged. SVRC's data services support multi-task collection campaigns where operators cycle through multiple task definitions within a single collection session.
Hyperparameter Sensitivity: What Actually Matters
IL practitioners spend too much time tuning hyperparameters that do not matter and too little on the ones that do. Based on SVRC's experience training hundreds of policies across ACT and Diffusion Policy, here is a ranked sensitivity analysis.
| Hyperparameter | Sensitivity | Recommended Default | When to Tune |
|---|---|---|---|
| Action chunk size (ACT) | Very High | 100 steps | Reduce to 20-50 for reactive tasks; increase to 150-200 for slow, smooth tasks |
| KL weight (ACT) | High | 10.0 | Increase (50-100) if multimodal demos; decrease (1-5) if all demos use same strategy |
| Noise schedule (Diffusion) | High | Cosine schedule, 100 diffusion steps | Reduce diffusion steps to 10-20 with DDIM for faster inference |
| Learning rate | Medium | 1e-5 (ACT), 3e-4 (Diffusion) | If training diverges, reduce 5x; if too slow, increase 2x |
| Batch size | Low | 8 (single GPU) | Increase to 16-32 if GPU memory allows for more stable training |
| Number of epochs | Low | 2000 | Use early stopping on validation loss; 2000 is typically sufficient for 200 demos |
| Image resolution | Low | 224x224 or 480x640 | Only increase if task requires fine visual detail (text reading, small-object identification) |
Interaction effects: Action chunk size and KL weight interact strongly in ACT. A large chunk size (150+) with low KL weight (<5) produces overly smooth, averaged trajectories that miss precise motions. A large chunk size with high KL weight (50+) produces sharp, distinct action modes but may oscillate between modes mid-task. The default pairing (chunk=100, KL=10) works for most tasks. Tune these two jointly, never in isolation: if you increase chunk size, increase KL weight proportionally to maintain action diversity within each chunk.
The practical implication: if your policy is underperforming, tune action chunk size and KL weight (ACT) or noise schedule (Diffusion Policy) first. These have 10-20% impact on success rate. Learning rate and batch size have 2-5% impact. Image resolution and number of epochs have <2% impact unless your baseline is severely misconfigured.
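The defaults in the table above can be collected into a starting-point configuration. This is an illustrative sketch: the field names and the helper function are hypothetical, not tied to any particular training framework, but the values mirror the recommended defaults and the joint chunk-size/KL-weight scaling rule.

```python
# Hypothetical starting-point config mirroring the table defaults above.
# Field names are illustrative, not tied to a specific framework.
ACT_DEFAULTS = {
    "chunk_size": 100,        # Very High sensitivity: tune this first
    "kl_weight": 10.0,        # High sensitivity: tune jointly with chunk_size
    "learning_rate": 1e-5,    # Medium: reduce by 5x only if training diverges
    "batch_size": 8,          # Low: raise to 16-32 if GPU memory allows
    "num_epochs": 2000,       # Low: prefer early stopping on validation loss
    "image_size": (224, 224),
}

def scaled_kl_weight(chunk_size, base_chunk=100, base_kl=10.0):
    """Scale KL weight proportionally with chunk size, following the
    interaction-effect guidance above (larger chunks need a higher KL
    weight to preserve action diversity within each chunk)."""
    return base_kl * chunk_size / base_chunk
```

For example, moving from the default chunk of 100 to 200 under this rule would pair with a KL weight of 20.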
DAgger and HG-DAgger: When BC Is Not Enough
Behavioral cloning suffers from compounding error: the policy's own small prediction mistakes push it off the expert's demonstrated trajectory into states it was never trained on, and errors accumulate from there. DAgger (Dataset Aggregation) and its human-guided variant HG-DAgger address this by iteratively collecting data from the learned policy's own state distribution.
Standard DAgger protocol:
- Train an initial BC policy on your demonstration dataset.
- Deploy the policy and let it run autonomously. The human expert watches and records what action they would have taken at each timestep (relabeling the policy's trajectory with expert actions).
- Add the relabeled trajectory to the training set and retrain.
- Repeat for 3-5 iterations, until the policy's rollouts track the expert's behavior.
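The protocol above reduces to a short aggregation loop. In this sketch, `train`, `rollout`, and `expert_relabel` are placeholders for your trainer, robot rollout harness, and expert-labeling tooling; only the dataset-aggregation logic is the point.

```python
# Minimal DAgger loop sketch. `train`, `rollout`, and `expert_relabel`
# are stand-ins for your trainer, robot rollout, and expert-labeling
# tooling; the aggregation step is what matters.
def dagger(initial_demos, train, rollout, expert_relabel, n_rounds=4):
    dataset = list(initial_demos)          # (observation, action) pairs
    policy = train(dataset)                # round 0: plain behavioral cloning
    for _ in range(n_rounds):
        observations = rollout(policy)     # states from the *policy's* distribution
        expert_actions = expert_relabel(observations)
        dataset += list(zip(observations, expert_actions))  # aggregate
        policy = train(dataset)            # retrain on the combined dataset
    return policy
```

The key property is that each round's new training data comes from states the current policy actually visits, which is exactly the distribution plain BC never covers.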
HG-DAgger (Human-Gated DAgger) is the practical variant for real robot learning. Instead of the expert relabeling every timestep, the human watches the policy execute and only intervenes when the policy is about to fail. When the human takes over (gating), the system records the human correction and switches back to the policy once the recovery is complete. This is 3-5x faster than standard DAgger because the expert only acts during the critical failure states.
Expected impact: HG-DAgger with 3 correction rounds (each adding 20-50 correction trajectories) typically improves long-horizon task success rate by 15-25% over pure BC, with the largest improvements on the specific failure states targeted by corrections. Budget 2-4 hours of expert time per DAgger round.
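The gating mechanism itself is simple: record expert corrections only while the human has taken over, and otherwise let the policy command the robot. This sketch assumes a teleop interface exposing a `human_engaged` flag and a human action stream; those names are illustrative.

```python
# Human-gated data recording sketch: log (obs, action) pairs only while
# the human has intervened. `human_engaged`, `human_action`, and
# `policy_action` stand in for your teleop interface and policy outputs.
def hg_dagger_step(obs, human_engaged, human_action, policy_action, corrections):
    if human_engaged:
        corrections.append((obs, human_action))  # record the expert correction
        return human_action                      # human commands the robot
    return policy_action                         # policy stays in control
```

Only the gated timesteps enter the correction dataset, which is why HG-DAgger needs far less expert time than relabeling every step.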
Temporal Ensembling and Action Chunking: Implementation Details
Two techniques are critical for reducing the compounding error that limits BC performance: action chunking (predicting multiple future actions at once) and temporal ensembling (averaging overlapping action predictions). Understanding their interaction is key to getting good results.
Action chunking predicts the next K actions (chunk size K) from each observation. The robot executes all K actions before re-querying the policy. This reduces the number of policy queries per episode from T (episode length) to T/K, which means the policy has K times fewer opportunities to make errors that compound. Typical chunk sizes: K=50-100 for ACT (predicting 1-2 seconds of future actions at 50 Hz). Larger chunks produce smoother motion but respond slower to unexpected events; smaller chunks are more reactive but more susceptible to compounding error.
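The query-count reduction from T to roughly T/K can be seen in a minimal execution loop. Here `policy` is a stand-in that maps one observation to the next K actions, and `env_step` is a stand-in for commanding the robot and reading back the next observation.

```python
# Chunked execution sketch: the policy is queried once per chunk, so an
# episode of length T needs about T/K queries instead of T. `policy`
# and `env_step` are illustrative stand-ins.
def run_chunked(policy, env_step, first_obs, T):
    obs, queries, t = first_obs, 0, 0
    while t < T:
        chunk = policy(obs)          # the policy returns the next K actions
        queries += 1
        for action in chunk[: T - t]:
            obs = env_step(action)   # execute open-loop within the chunk
            t += 1
    return queries
```

With T=100 and a chunk size of 10, the policy is queried 10 times per episode rather than 100, which is the mechanism behind the reduced compounding error.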
Temporal ensembling averages the predictions from multiple overlapping chunks. At timestep t, the policy has made predictions for this timestep from the current chunk and from previous overlapping chunks. The exponentially weighted average of these predictions produces a smoother, more consistent trajectory. The temporal ensembling weight (w=0.01 in ACT's default configuration) controls how much weight recent predictions get vs. older ones. Lower w values produce smoother motion; higher values make the policy more reactive.
Choosing chunk size: The optimal chunk size depends on the task's temporal structure. For tasks with distinct phases (approach, grasp, lift, place), chunk size should be long enough to cover at least one complete phase -- typically 50-100 steps at 50 Hz (1-2 seconds). For reactive tasks where the policy must respond to environmental changes within 200ms (dynamic catching, force-controlled insertion), reduce chunk size to 10-20 steps. A simple heuristic: set chunk size to the median duration of the shortest task phase in your demonstration dataset.
Critical implementation detail: Temporal ensembling must be applied before sending actions to the robot, not after. Some implementations incorrectly average executed actions retrospectively, which provides no benefit. The correct implementation maintains a buffer of pending predicted actions from all active chunks and computes the weighted average at each timestep before commanding the robot.
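The buffer-of-pending-chunks implementation described above can be sketched as follows. This follows the ACT-style convention of exponential weights exp(-m*i) with i=0 indexing the oldest active chunk's prediction; the class and method names are illustrative, and the averaged action is computed before the robot is commanded, per the note above.

```python
import numpy as np

# Temporal ensembling sketch in the style of ACT: keep every still-active
# chunk's prediction for the current timestep and average them with
# exponential weights exp(-m*i), where i=0 is the oldest prediction and
# m is the ensembling coefficient (0.01 in ACT's default configuration).
class TemporalEnsembler:
    def __init__(self, m=0.01):
        self.m = m
        self.chunks = []  # list of (start_step, actions[K, action_dim])

    def add_chunk(self, start_step, actions):
        self.chunks.append((start_step, np.asarray(actions, dtype=float)))

    def action(self, t):
        # Gather every active chunk's prediction for timestep t, oldest first
        # (chunks are appended in query order, so list order is age order).
        preds = [c[t - s] for s, c in self.chunks if s <= t < s + len(c)]
        w = np.exp(-self.m * np.arange(len(preds)))  # w[0] = oldest chunk
        w /= w.sum()
        return np.average(np.stack(preds), axis=0, weights=w)
```

At each control step you would call `action(t)` and send the result to the robot, only then stepping the environment; averaging already-executed actions after the fact is the incorrect variant called out above.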
SVRC's data collection operators are trained in both standard teleoperation and HG-DAgger correction protocols. For teams that have already trained a policy and want to improve it through targeted corrections rather than collecting entirely new demonstrations, HG-DAgger data collection is available as part of our data services.
Multi-Task Imitation Learning: Architecture and Data Considerations
Training a single policy to perform multiple tasks is more sample-efficient than training separate policies, but requires specific architectural and data design choices.
Task conditioning. The policy must know which task to perform. The three proven conditioning mechanisms are:
- Language conditioning: Provide a natural language instruction ("pick up the red cup") as input to the policy. The instruction is encoded by a frozen language model (CLIP text encoder or sentence-BERT) and concatenated with visual features. This is the most flexible approach: the policy can generalize to novel language instructions that combine known concepts. Requires language annotations in the training data.
- Task ID embedding: Assign each task a learnable embedding vector. Simpler than language conditioning and works when the task set is fixed. Does not generalize to new tasks without retraining.
- Goal image conditioning: Provide an image of the desired goal state as additional input. The policy learns to match current observations to the goal. Requires goal images at inference time but does not need language annotations during training.
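Of the three mechanisms above, task ID embedding is the simplest to sketch. The dimensions, random initialization, and function names below are purely illustrative; in a real policy the embedding table would be a learnable layer trained jointly with the rest of the network.

```python
import numpy as np

# Task-ID conditioning sketch: each task gets an embedding row that is
# concatenated with the visual features before the policy head. All
# dimensions and the random init are illustrative; the embedding table
# would be learnable in a real network.
rng = np.random.default_rng(0)
n_tasks, embed_dim, visual_dim = 5, 16, 512
task_embeddings = rng.normal(size=(n_tasks, embed_dim))

def condition(visual_features, task_id):
    """Concatenate the task embedding onto the visual feature vector."""
    return np.concatenate([visual_features, task_embeddings[task_id]])

policy_input = condition(np.zeros(visual_dim), task_id=2)
# policy_input has shape (visual_dim + embed_dim,) = (528,)
```

Language conditioning follows the same concatenation pattern, with the embedding row replaced by the frozen text encoder's output for the instruction.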
Data balance across tasks. If task A has 500 demonstrations and task B has 50, the policy will be heavily biased toward task A. Use temperature-weighted sampling during training: sample each task with probability proportional to N_demos^(1/T), where T=2 works well for moderate imbalance. For severe imbalance (10x+), collect more data for the underrepresented task -- sampling tricks cannot fully compensate for missing diversity.
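The temperature-weighted sampling rule above is a one-liner in practice. This sketch computes per-task sampling probabilities proportional to N_demos^(1/T); T=1 reproduces the raw (imbalanced) frequencies, and larger T flattens the distribution toward uniform.

```python
# Temperature-weighted task sampling sketch: sample task i with probability
# proportional to n_i ** (1/T). T=1 keeps the raw imbalance; larger T
# flattens the distribution toward uniform.
def task_sampling_probs(demo_counts, temperature=2.0):
    weights = [n ** (1.0 / temperature) for n in demo_counts]
    total = sum(weights)
    return [w / total for w in weights]
```

For the 500-vs-50 example above, T=2 reduces the sampling ratio from 10:1 to roughly 3.2:1 (sqrt(10):1), which tempers the bias without starving the larger task.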
Common Training Pitfalls and Their Solutions
| Symptom | Likely Cause | Diagnostic | Fix |
|---|---|---|---|
| Robot moves to average of two positions | Mode averaging from MSE loss | Check if demos have multimodal actions for same observation | Switch to Diffusion Policy or increase ACT KL weight |
| Low validation loss but low success rate | Compounding error (policy drifts off-distribution) | Plot per-step error over rollout; look for divergence after step 20-30 | Increase chunk size, add temporal ensembling, or collect DAgger data |
| Training loss plateaus at high value | Noisy or inconsistent demonstrations | Visualize 20 random demos; check for strategy inconsistency | Filter demos by smoothness; retrain on clean subset |
| Robot overshoots targets consistently | Action space mismatch (delta vs absolute) | Check if demo actions are delta-position or absolute-position | Ensure training and inference use identical action representation |
| Works on one camera but not another | Camera extrinsics changed between collection and deployment | Compare camera mount position to training-time calibration | Recalibrate camera to match training position; or add camera pose to observation |
| Policy freezes mid-task | Observation out of training distribution | Log the observation embedding distance from training mean | Collect more demos in the OOD region; or add data augmentation |
Debugging workflow: When a trained policy fails on the real robot, follow this diagnostic sequence before concluding the model is bad or the data is insufficient:
- Verify action normalization statistics match between training config and deployment config.
- Check camera positions have not moved since data collection (compare visual overlap with a reference image).
- Run the policy on a pre-recorded evaluation episode in "replay mode" to verify the action outputs match expected values.
- Visualize the policy's attention maps or intermediate activations on the current observation -- if attention is on background rather than task-relevant objects, the visual input may be corrupted or misaligned.
- If all above pass, the issue is likely data quality or quantity. Add 20-50 more demonstrations specifically targeting the failure mode before re-training.
This sequence resolves 80% of real-robot deployment failures within 1-2 hours, avoiding premature data recollection.
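Step 1 of the sequence, comparing normalization statistics, is worth automating so it runs on every deployment. This is an illustrative sketch: the stat key names and tolerance are assumptions, not a fixed convention of any particular framework.

```python
# Sketch of diagnostic step 1: verify action normalization statistics
# match between the training config and the deployment config. The key
# names ("action_mean", "action_std") and tolerance are illustrative.
def check_normalization(train_stats, deploy_stats, tol=1e-6):
    mismatches = []
    for key in ("action_mean", "action_std"):
        a, b = train_stats[key], deploy_stats[key]
        if any(abs(x - y) > tol for x, y in zip(a, b)):
            mismatches.append(key)
    return mismatches  # empty list means the stats agree
```

Running this check at policy load time turns a silent, hard-to-diagnose failure into an immediate, named error.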
The most common pitfall for beginners is the action space mismatch: the policy is trained on delta-position actions (move 5mm right) but deployed in absolute-position mode (go to x=0.35), or vice versa. This produces dramatic failure (overshooting or barely moving) that looks like a broken model but is actually a configuration error. Always verify action space conventions before debugging the model.
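The mismatch is easy to see numerically. The positions below are in meters and purely illustrative: a policy trained on deltas emits a small offset, and a controller that misreads it as an absolute target sends the arm toward the origin instead of nudging it 5 mm.

```python
# Illustration of the delta-vs-absolute mismatch. Values are in meters
# and purely illustrative.
current_x = 0.35
delta_action = 0.005                         # "move 5 mm in +x", as trained

correct_target = current_x + delta_action    # delta interpretation: 0.355
wrong_target = delta_action                  # absolute interpretation: 0.005
# The absolute misreading commands a ~345 mm move instead of a 5 mm nudge.
```

The symptom in the pitfalls table ("robot overshoots targets consistently", or barely moves) follows directly from which direction the misinterpretation runs.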
Quick-Start Checklist for Your First IL Project
- Set clear success criteria before you start. Define what "success" means for your task in measurable terms before collecting any data. For pick-and-place: "object is within 2cm of target position and stable (not rolling/falling) for 1 second after release." For insertion: "peg is fully inserted (within 1mm of target depth) without exceeding 30N force." Ambiguous success criteria lead to inconsistent demonstration labels and degrade policy training. Write the success criteria in a shared document that all operators and annotators reference.
- Choose your task. Start with a single-arm, single-object pick-and-place task. This is the "hello world" of imitation learning and will surface all integration issues before you tackle harder problems.
- Set up hardware. Mount your arm, calibrate cameras (intrinsics + extrinsics), and verify teleoperation control works end-to-end before collecting any data.
- Collect 50 demonstrations. Use consistent task setup. Reject failed attempts. This initial dataset is for validating your training pipeline, not for a deployable policy.
- Train ACT. Use the LeRobot training script with default hyperparameters. Monitor validation loss for 100 epochs. Expect training to take 2-4 hours on a single GPU.
- Evaluate. Run 20 trials on the real robot with objects in training positions. Target: 60%+ success rate. If below 40%, debug data quality before collecting more.
- Add diversity. Collect 100-200 more demonstrations with varied object positions, 2-3 object instances, and minor lighting changes. Retrain and evaluate.
- Iterate. Identify failure modes through structured evaluation. Collect targeted data addressing those modes. Repeat until deployment-ready (85%+ success on representative conditions).