Why Machine Learning for Robotics Now

For decades, robots were programmed with explicit instructions: move to coordinate X, close gripper, lift 30mm, move to coordinate Y. This worked for structured factory environments where every object was in a known position. But it failed catastrophically in unstructured settings — homes, labs, warehouses, hospitals — where objects vary, lighting changes, and no two situations are identical.

Machine learning changed this equation by letting robots learn behaviors from data rather than from handwritten rules. Instead of programming every possible scenario, you show the robot examples and let it generalize. The concept is not new — researchers have worked on learning-based robotics since the 1990s. What changed in the last few years is that it finally works well enough to deploy.

Three Breakthroughs That Made It Practical

  • Transformer architectures applied to robotics (2022-2024): The same attention mechanisms that power large language models turned out to be remarkably effective for processing robot observation sequences. Google's RT-2, Stanford's ACT, and the Diffusion Policy work showed that transformer-based models can learn complex manipulation behaviors from relatively modest datasets. This was a step change from prior approaches that required millions of interaction samples.
  • Foundation models and Vision-Language-Action models (2024-2026): Models like OpenVLA, RT-X, and Octo demonstrated that a single pretrained model can generalize across different robot embodiments, tasks, and environments. Rather than training from scratch for each new task, you fine-tune a foundation model with a few hundred demonstrations. This reduced the data requirement from tens of thousands of episodes to hundreds — making real-world robot learning practical for small teams.
  • Affordable capable hardware: A complete robot learning setup — 6-DOF arm with gripper, wrist cameras, and compute — now costs under $5,000 with platforms like OpenArm and SO-101. Five years ago, the equivalent setup required $50,000+ of industrial hardware. The cost reduction opened robot ML research to university labs, startups, and individual practitioners who previously could not afford the hardware.

The result: in 2026, a graduate student or startup engineer can train a robot to perform a new manipulation task in a weekend using imitation learning with 50-200 demonstrations. That was science fiction five years ago. This guide covers the approaches, tools, and practical steps to get you there.

The 4 Main Approaches to Robot ML

Robot machine learning is not a single technique. There are four major families of approaches, each with different strengths, data requirements, and use cases. Understanding when to use each one is the most important strategic decision you will make.

| Approach | How It Works | Data Needed | Best For | Limitations |
|---|---|---|---|---|
| Imitation Learning | Robot learns from human demonstrations (teleoperation recordings) | 50-500 demonstrations per task | Manipulation tasks with a clear human strategy: pick-and-place, assembly, tool use | Cannot exceed demonstrator skill; struggles with tasks requiring exploration |
| Reinforcement Learning | Robot learns by trial and error, maximizing a reward signal | Millions of trials (usually in simulation) | Locomotion, dynamic tasks, behaviors that exceed human capability | Requires careful reward design; sample-inefficient; sim-to-real gap |
| Sim-to-Real Transfer | Train in simulation, transfer learned policy to real robot | Unlimited (simulated), plus domain randomization | Tasks where simulation is accurate: locomotion, grasping simple shapes, navigation | Reality gap for contact-rich manipulation; sim fidelity limits policy quality |
| Foundation Models / VLAs | Large pretrained models fine-tuned for robot control using vision and language | 10-100 demonstrations for fine-tuning (pretrained on millions of diverse robot episodes) | General-purpose manipulation, language-conditioned tasks, multi-task robots | Compute-heavy inference; still maturing; may not match specialist policies on specific tasks |

For most beginners, imitation learning is the right starting point. It has the lowest barrier to entry, produces working policies fastest, and builds the data collection and evaluation skills you will need regardless of which approach you eventually specialize in.

Deep Dive: Imitation Learning

Imitation learning (also called learning from demonstration) is the most practical approach for getting a robot to perform a new manipulation task. The core idea: a human teleoperates the robot through the desired behavior many times, and the robot learns a mapping from observations (camera images, joint positions) to actions (joint velocity or position commands) by training on those demonstrations.

Behavioral Cloning

The simplest form of imitation learning is behavioral cloning (BC): treat the problem as supervised learning. Collect demonstration trajectories (sequences of observation-action pairs), then train a neural network to predict the action given the observation. At inference time, the robot observes the current state, feeds it through the trained network, and executes the predicted action.
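As a minimal illustration of BC-as-supervised-learning, here is a toy sketch with a 1-D observation and a linear policy. All names are illustrative; a real pipeline would use PyTorch and image encoders, but the training loop has the same shape.

```python
import random

# Toy behavioral cloning: fit a linear policy a = w * obs + b to
# demonstration (observation, action) pairs by minimizing squared error.

def collect_demos(n=200):
    """Fake 1-D demonstrations where the 'expert' action is 2*obs + 0.5."""
    demos = []
    for _ in range(n):
        obs = random.uniform(-1.0, 1.0)
        action = 2.0 * obs + 0.5          # the demonstrated behavior
        demos.append((obs, action))
    return demos

def train_bc(demos, lr=0.1, epochs=100):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for obs, action in demos:
            pred = w * obs + b            # forward pass
            err = pred - action           # prediction error
            w -= lr * err * obs           # gradient step on squared error
            b -= lr * err
    return w, b

random.seed(0)
w, b = train_bc(collect_demos())
print(round(w, 2), round(b, 2))  # converges to roughly 2.0 and 0.5
```

The point of the sketch is the structure: demonstrations become (observation, action) pairs, and training is ordinary regression on that dataset.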

Behavioral cloning is easy to implement and fast to train, but it has a fundamental flaw: compounding error. If the robot drifts slightly off the demonstrated trajectory (and it always does), it encounters observations it never saw during training, leading to increasingly poor predictions. This is why naive behavioral cloning often fails on long-horizon tasks. The solutions below address this problem.

ACT (Action Chunking with Transformers)

ACT, introduced by Tony Zhao et al. at Stanford in 2023, addresses compounding error by predicting chunks of future actions rather than a single next action. The model takes a sequence of observations and outputs the next 50-100 action timesteps in one forward pass. This "action chunking" means the robot commits to a trajectory segment, reducing the frequency at which prediction errors accumulate.

ACT uses a transformer encoder to process observation history and a CVAE (Conditional Variational Autoencoder) to handle the multimodality of demonstrations — different operators may perform the same task slightly differently, and the model needs to capture this variation rather than averaging over it. ACT has become one of the most widely reproduced methods in the robotics ML community, particularly for bimanual manipulation with ALOHA-style hardware.
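The execution side of action chunking can be sketched in a few lines. This toy version uses a hand-coded stand-in policy over a 1-D action (the real model is a transformer conditioned on images and joint states) and the exp(-m*i) temporal ensembling weights described in the ACT paper, where the oldest prediction for a timestep gets the highest weight.

```python
import math

# Sketch of ACT-style action chunking with temporal ensembling at execution
# time. The policy below is a stand-in, not a learned model.

CHUNK = 4   # actions predicted per forward pass (ACT uses roughly 50-100)
M = 0.1     # ensembling decay coefficient m

def policy(t):
    """Stand-in policy: predict actions for timesteps t, t+1, ..., t+CHUNK-1."""
    return [float(t + i) for i in range(CHUNK)]

def run(horizon=8):
    executed, chunks = [], {}
    for t in range(horizon):
        chunks[t] = policy(t)  # query the policy at every step, as ACT does
        covering = sorted(
            (start, chunk) for start, chunk in chunks.items()
            if 0 <= t - start < CHUNK  # chunks whose window still covers t
        )
        num, den = 0.0, 0.0
        for i, (start, chunk) in enumerate(covering):  # i = 0 is the oldest
            w = math.exp(-M * i)  # exp(-m*i): oldest prediction weighted most
            num += w * chunk[t - start]
            den += w
        executed.append(num / den)  # weighted-average action for timestep t
    return executed

print([round(a, 6) for a in run()])
```

Because every chunk agrees in this toy setup, the ensemble simply returns t at each step; with a real learned policy, the averaging smooths out disagreement between overlapping predictions.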

For a deeper treatment, see our imitation learning guide.

Diffusion Policy

Diffusion Policy, introduced by Cheng Chi et al. at Columbia in 2023, applies the denoising diffusion framework (the same core idea behind image generation models like Stable Diffusion) to robot action prediction. Instead of directly predicting actions, the model learns to iteratively denoise a random action sequence into a coherent trajectory conditioned on the current observation.

The advantage of diffusion-based action generation is that it handles multimodal action distributions naturally. When there are multiple valid ways to perform a task (pick from the left or right side, approach from above or the side), diffusion models represent this uncertainty explicitly rather than averaging, which produces smoother and more natural robot behavior. Diffusion Policy achieves state-of-the-art performance on many manipulation benchmarks and is particularly effective for contact-rich tasks like insertion, wiping, and folding.

The tradeoff is inference speed: generating actions via iterative denoising takes 5-20 forward passes per action chunk, compared to a single forward pass for ACT. On modern GPUs this runs at 10-30 Hz, which is adequate for most manipulation but may be limiting for high-speed tasks.
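The inference loop itself is simple to picture. In this toy sketch the learned denoiser is replaced by a hand-coded stand-in that nudges the sample toward a known target trajectory, so the iterative structure is runnable; a real Diffusion Policy network would predict the noise to remove, conditioned on camera observations.

```python
import random

# Toy sketch of diffusion-style action generation at inference time.
# TARGET and denoise_step are illustrative stand-ins for the learned model.

TARGET = [0.1, 0.2, 0.3, 0.4]   # the "coherent trajectory" to recover
STEPS = 20                       # denoising iterations (5-20 in practice)

def denoise_step(actions, alpha=0.3):
    """Stand-in for one denoiser call: move each action toward the target."""
    return [a + alpha * (t - a) for a, t in zip(actions, TARGET)]

random.seed(0)
actions = [random.gauss(0.0, 1.0) for _ in TARGET]  # start from pure noise
for _ in range(STEPS):                              # iterative refinement
    actions = denoise_step(actions)

print([round(a, 3) for a in actions])  # converges close to TARGET
```

The cost noted above falls directly out of this structure: each action chunk requires STEPS forward passes rather than one.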

Deep Dive: Reinforcement Learning

Reinforcement learning (RL) takes a fundamentally different approach: instead of showing the robot what to do, you define a reward function that measures how well the robot is doing, and let it discover effective behaviors through trial and error. The robot takes actions, observes outcomes, receives reward signals, and gradually improves its policy to maximize cumulative reward.

Key Algorithms: PPO and SAC

PPO (Proximal Policy Optimization) is the most widely used RL algorithm for robotics. Developed at OpenAI, PPO is a policy gradient method that constrains policy updates to avoid catastrophic performance drops. It is stable, relatively easy to tune, and works well for both continuous and discrete action spaces. PPO is the algorithm behind many of the impressive learned quadruped locomotion demonstrations of recent years (walking, running, parkour) on platforms from companies like Unitree.
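The constraint PPO places on updates is its clipped surrogate objective, which fits in a few lines. This sketch computes the per-sample objective; variable names are generic, not from any particular library.

```python
# PPO's clipped surrogate objective for a single sample.
# ratio = pi_new(a|s) / pi_old(a|s); advantage estimates how much better
# the action was than the baseline. PPO maximizes the pessimistic minimum,
# so a single update cannot move the policy too far from the old one.

def ppo_objective(ratio, advantage, clip_eps=0.2):
    clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)  # clamp the ratio
    return min(ratio * advantage, clipped * advantage)     # pessimistic choice

print(ppo_objective(1.5, 1.0))   # 1.2: a large positive update is clipped
print(ppo_objective(0.5, -1.0))  # -0.8: the worse (clipped) value is kept
```

In practice this objective is averaged over a batch and maximized with gradient ascent, alongside a value-function loss and an entropy bonus.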

SAC (Soft Actor-Critic) is an off-policy algorithm that maximizes both reward and entropy (randomness) in the policy. The entropy bonus encourages exploration and produces policies that are robust to perturbations. SAC is more sample-efficient than PPO for many tasks because it reuses past experience from a replay buffer. It is particularly effective for manipulation tasks trained in simulation.

When to Use RL vs. Imitation Learning

RL excels at tasks where:

  • The optimal behavior is hard for humans to demonstrate (dynamic throwing, in-hand reorientation, agile locomotion)
  • You want the robot to discover strategies that exceed human capability
  • You have a reliable simulation environment where the robot can run millions of episodes
  • The reward function is straightforward to define (reach target, maintain balance, maximize distance)

RL struggles when:

  • The reward function is hard to specify precisely (what does "fold a shirt neatly" look like as a scalar reward?)
  • The task requires contact-rich manipulation where simulation accuracy is poor
  • You need a working policy quickly (RL training typically takes hours to days of GPU time)
  • You do not have a good simulation environment and must train on real hardware (sample efficiency is insufficient for most real-world RL)
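To make the reward-specification point concrete, here is a sketch of the easy case from the list above, a "reach target" reward. The function and thresholds are illustrative, not from any framework: dense negative distance provides shaping, and a sparse bonus marks success.

```python
# Illustrative "reach target" reward: easy to specify as a scalar, unlike
# "fold a shirt neatly". Positions are (x, y, z) tuples in meters.

def reach_reward(ee_pos, target, success_radius=0.02):
    dist = sum((a - b) ** 2 for a, b in zip(ee_pos, target)) ** 0.5
    reward = -dist                       # dense shaping: closer is better
    if dist < success_radius:
        reward += 1.0                    # sparse bonus inside 2 cm of target
    return reward

print(round(reach_reward((0.0, 0.0, 0.0), (0.3, 0.0, 0.0)), 3))   # -0.3
print(round(reach_reward((0.3, 0.0, 0.01), (0.3, 0.0, 0.0)), 3))  # 0.99
```

Contrast this with a folding task: there is no comparably simple scalar that distinguishes a neat fold from a crumpled one, which is exactly where RL becomes difficult.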

Simulation Environments: Isaac Sim and MuJoCo

NVIDIA Isaac Sim is a GPU-accelerated physics simulator built on Omniverse. Its key advantage is massive parallelism: Isaac Sim can run thousands of robot environments simultaneously on a single GPU, generating millions of experience samples per hour. This makes it the preferred platform for RL training of locomotion and manipulation policies. Isaac Sim also provides photorealistic rendering for sim-to-real visual transfer and native integration with Isaac Lab (the successor to Isaac Gym) for RL training.

MuJoCo (Multi-Joint dynamics with Contact), now open-source and maintained by Google DeepMind, is the most widely used physics engine in robot learning research. MuJoCo is fast, numerically stable, and produces accurate contact dynamics for manipulation tasks. It runs on CPU (and increasingly GPU via MJX), integrates with all major RL frameworks, and has the largest ecosystem of pre-built robot models and benchmark tasks. For beginners, MuJoCo is often the easier starting point because of its extensive documentation and the availability of standardized benchmark tasks (Gymnasium robotics environments).

Data Requirements: How Much Is Enough?

The most common question beginners ask is "how many demonstrations do I need?" The honest answer depends on the approach, task complexity, and model architecture. Here are practical guidelines based on published results and SVRC's own experience.

Imitation Learning Data Requirements

  • Simple single-arm pick-and-place (fixed objects, fixed positions): 20-50 demonstrations. ACT and Diffusion Policy both achieve >80% success rates at this scale.
  • Variable pick-and-place (randomized positions, varying objects): 100-200 demonstrations. More variation in object positions and types requires more data to generalize.
  • Contact-rich manipulation (insertion, wiping, folding): 200-500 demonstrations. These tasks have narrow success regions where small errors cause failure, requiring denser data coverage.
  • Multi-step tasks (assembly sequences, cooking steps): 300-1,000 demonstrations. Long-horizon tasks need data at every stage of the sequence.

What Makes Good Robot Training Data

Not all demonstrations are equally valuable. Good robot training data has these properties:

  • Diversity over volume: 100 demonstrations with varied object positions, lighting, and approach angles train better than 500 demonstrations that all look the same. Deliberately vary the initial conditions.
  • Consistent quality: Remove failed demonstrations, hesitation episodes, and demonstrations where the operator made and corrected errors mid-task. Noise in the training data teaches the model to reproduce that noise.
  • Matched distribution: Your training data should match the conditions the robot will face at deployment. If you train with objects on a white table but deploy on a wooden surface, performance will degrade. Capture data in realistic conditions.
  • Multi-camera coverage: Two or three camera angles (wrist + overhead, or wrist + side + overhead) significantly improve policy performance versus a single camera. The model needs to observe the task from viewpoints that reveal depth, contact, and spatial relationships.
  • Proper synchronization: Camera frames, joint states, and action commands must be temporally aligned. Even 50ms of desynchronization degrades performance. Use hardware timestamps, not software arrival times.
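The synchronization point above can be sketched as a nearest-timestamp matcher. This is illustrative (real pipelines do this inside their data recorder), but it shows the essential operations: match each camera frame to the closest joint-state reading by hardware timestamp, and reject pairs further apart than a tolerance.

```python
import bisect

# Align camera frames to joint-state readings by nearest hardware timestamp.
# Timestamps are in seconds; tol=0.05 is the 50 ms threshold discussed above.

def align(frame_ts, state_ts, tol=0.05):
    """Return (frame_index, state_index) pairs matched by nearest timestamp."""
    pairs = []
    for i, t in enumerate(frame_ts):
        j = bisect.bisect_left(state_ts, t)          # binary search for t
        candidates = [k for k in (j - 1, j) if 0 <= k < len(state_ts)]
        best = min(candidates, key=lambda k: abs(state_ts[k] - t))
        if abs(state_ts[best] - t) <= tol:           # reject stale matches
            pairs.append((i, best))
    return pairs

frames = [0.00, 0.033, 0.066, 0.200]       # 30 Hz camera (truncated)
states = [0.00, 0.01, 0.02, 0.03, 0.04]    # 100 Hz joint states (truncated)
print(align(frames, states))  # the frame at 0.200 s has no state within 50 ms
```

Note that the timestamps here must be hardware capture times; matching on software arrival times reintroduces exactly the desynchronization this step is meant to remove.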

For detailed guidance on collecting high-quality robot data, see our robot data collection guide.

Key Frameworks and Tools

You do not need to build everything from scratch. These frameworks provide tested implementations of the algorithms and infrastructure you need.

LeRobot (Hugging Face)

LeRobot is the most beginner-friendly framework for robot imitation learning. Developed by Hugging Face, it provides end-to-end tooling: data collection scripts for common robot arms, dataset management with standardized formats, training implementations of ACT and Diffusion Policy, and evaluation utilities. LeRobot integrates with the Hugging Face Hub for dataset sharing and model distribution. If you are starting your first robot ML project, LeRobot is the recommended starting point. It supports SO-101, ALOHA, and custom arms out of the box.

RoboMimic (Stanford)

RoboMimic is a research framework for studying imitation learning algorithms. It provides standardized simulation benchmarks, multiple BC algorithm implementations (BC-RNN, HBC, IRIS), and careful evaluation protocols. RoboMimic is less turnkey than LeRobot for real-robot deployment, but it is the standard benchmarking framework cited in most imitation learning papers. Use RoboMimic when you want to rigorously compare algorithm variants or reproduce published results.

ROS2 (Robot Operating System 2)

ROS2 is the de facto middleware for robot software. It handles communication between sensors, actuators, and compute nodes via a publish-subscribe message system. ROS2 is not an ML framework, but it is the glue that connects your ML policy to real robot hardware. You need ROS2 (or a simpler alternative like direct serial/USB communication) to send action commands to motors and receive observation data from cameras and encoders. ROS2 Humble (LTS) on Ubuntu 22.04 is the recommended version for new projects.

NVIDIA Isaac Sim

Isaac Sim provides GPU-accelerated simulation for both RL training and synthetic data generation. Its parallel environment execution makes RL training feasible at the scale required (millions of episodes). Isaac Sim also supports domain randomization — automatically varying textures, lighting, and physics parameters to improve sim-to-real transfer. The learning curve is steeper than MuJoCo, but Isaac Sim is the right choice when you need large-scale RL training or photorealistic synthetic data.

Stable Baselines3

Stable Baselines3 (SB3) is a PyTorch library providing reliable implementations of standard RL algorithms: PPO, SAC, TD3, A2C, and DQN. SB3 is designed for ease of use — you can train an RL policy on a Gymnasium environment in under 10 lines of code. It is the recommended starting point for learning RL concepts before moving to more specialized frameworks like Isaac Lab or rl_games for production RL training.

Hardware to Get Started

You need a robot to do robot ML. Here are three hardware options at different price points, each with a proven ecosystem for learning and development.

OpenArm (Budget: $2,000-$4,000)

OpenArm is SVRC's open-source 6-DOF robot arm designed specifically for ML research and education. At under $4,000 for a complete system (arm + gripper + wrist camera + controller), it is the most accessible entry point for hands-on robot ML. OpenArm runs LeRobot natively, supports both teleoperation data collection and autonomous policy execution, and has an active community contributing datasets and pretrained models. The arm uses Dynamixel servos with 400mm reach and 500g payload — sufficient for tabletop manipulation tasks that are the bread and butter of imitation learning research.

SO-101 (Budget: $500-$1,000)

The SO-101 (a refinement of the earlier SO-100) is an ultra-low-cost 6-DOF arm built from commodity servo motors and 3D-printed components. A complete leader-follower pair costs under $1,000. The SO-101 has become the most popular arm for LeRobot community projects due to its low cost and the extensive documentation from Hugging Face. The tradeoff is lower precision, payload capacity, and build quality compared to OpenArm or commercial arms. For a first project or a classroom setting, SO-101 is hard to beat on price.

Franka FR3 (Budget: $25,000-$35,000)

The Franka FR3 (successor to the Franka Emika Panda) is the most widely used research-grade robot arm in academic ML labs. It offers 7-DOF with excellent torque sensing, sub-millimeter repeatability, and native integration with ROS2 and MuJoCo. The Panda/FR3 line is the arm used in many landmark papers (ACT, Diffusion Policy, RT-X). If you are at a university or well-funded startup doing publishable research, the FR3 is the standard platform. The higher cost buys you precision, reliability, and the ability to directly reproduce results from published work.

Budget Breakdown for a Complete Setup

| Component | Budget Option | Mid-Range Option | Research-Grade |
|---|---|---|---|
| Robot arm + gripper | SO-101 pair: $800 | OpenArm: $3,500 | Franka FR3: $30,000 |
| Cameras (2-3x) | Logitech C920: $150 | RealSense D405: $600 | RealSense D435i: $900 |
| Compute (training) | Cloud GPU rental: $50/mo | RTX 4070 workstation: $1,800 | RTX 4090 workstation: $3,500 |
| Compute (inference) | Laptop with GPU: $0 (existing) | Same as training: $0 | Dedicated inference PC: $2,000 |
| Total | ~$1,000 + cloud | ~$6,000 | ~$36,000 |

Learning Path: From Zero to Deployed Policy

Here is a four-stage curriculum that takes you from fundamentals to deploying a trained policy on real hardware. Each stage builds on the previous one. Do not skip ahead — the foundational skills matter.

Stage 1: Fundamentals (2-4 weeks)

Before touching a robot, build your ML foundations:

  • Learn PyTorch basics: tensors, autograd, training loops, dataset/dataloader patterns. The official PyTorch tutorials are sufficient.
  • Understand convolutional neural networks (CNNs) for image processing — your robot's cameras produce images that the policy must process.
  • Learn the transformer architecture at a conceptual level: attention, positional encoding, sequence processing. You do not need to implement a transformer from scratch, but you need to understand what it does.
  • Read the original ACT paper (Zhao et al., 2023) and Diffusion Policy paper (Chi et al., 2023). Focus on understanding the problem formulation and evaluation methodology, not every mathematical detail.

Stage 2: Simulation (2-4 weeks)

Practice in simulation before touching real hardware. Simulation costs nothing and you can experiment freely.

  • Install MuJoCo and run the standard Gymnasium robotics environments (FetchReach, FetchPush, FetchPickAndPlace).
  • Train a basic RL policy using Stable Baselines3 on FetchReach. This teaches you the RL training loop without any hardware complexity.
  • Use RoboMimic to run a behavioral cloning experiment on the Lift and Can tasks. This teaches you the imitation learning pipeline: dataset loading, model training, evaluation.
  • Experiment with hyperparameters: learning rate, batch size, number of training epochs, action chunk size. Develop intuition for what affects performance.

Stage 3: Real Hardware (4-8 weeks)

Now transfer your skills to a physical robot:

  • Set up your robot arm (OpenArm, SO-101, or whatever you have access to) with LeRobot.
  • Collect 50 teleoperation demonstrations of a simple pick-and-place task. Focus on data quality: consistent demonstrations, good camera angles, varied object positions.
  • Train an ACT policy on your collected data using LeRobot's training scripts.
  • Evaluate the policy on the real robot. Measure success rate over 20 trials. You will likely get 40-70% success on your first attempt.
  • Iterate: collect more demonstrations targeting the failure modes you observed, retrain, and re-evaluate. The goal is to understand the data collection → training → evaluation loop deeply.
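The evaluation step above is simple bookkeeping, but it is worth doing properly. This sketch reports a success rate with a 95% Wilson score interval, which makes visible how much uncertainty 20 trials still leaves; the numbers are illustrative.

```python
import math

# Success-rate summary for n policy rollouts with a 95% Wilson score
# interval. No robot-specific code; just honest reporting of uncertainty.

def wilson_interval(successes, n, z=1.96):
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

successes, n = 13, 20                      # e.g. 13 successes in 20 rollouts
lo, hi = wilson_interval(successes, n)
print(f"{successes}/{n} = {successes / n:.0%}, 95% CI ({lo:.0%}, {hi:.0%})")
```

A 13/20 result looks like "65% success", but the interval spans roughly 43% to 82%, which is why consistent protocols and repeated evaluation matter more than any single number.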

Stage 4: Deployment and Generalization (Ongoing)

Advance to harder tasks and robust deployment:

  • Try more complex tasks: multi-step assembly, bimanual manipulation, tasks with variable objects.
  • Experiment with Diffusion Policy and compare results to ACT on the same task and dataset.
  • Explore foundation model fine-tuning: use a pretrained VLA model and fine-tune on your task with fewer demonstrations.
  • Work on robustness: test your policy under different lighting, with unseen objects, and after physical perturbation. Deploy policies that work reliably, not just in controlled conditions.
  • Contribute back: share your datasets on Hugging Face, publish your results, and participate in the SVRC and LeRobot communities.

6 Common Mistakes Beginners Make

We see these mistakes repeatedly from teams starting their robot ML journey. Each one is avoidable with awareness.

1. Skipping Data Quality for Data Quantity

The mistake: Collecting hundreds of sloppy demonstrations because "more data is always better." Demonstrations with hesitation, partial failures, and inconsistent strategies inject noise that confuses the model.

The fix: Curate ruthlessly. Delete any demonstration where the operator hesitated, made a mistake, or used an inconsistent strategy. 100 clean demonstrations consistently outperform 500 noisy ones.

2. Training Without Evaluation Rigor

The mistake: Training a policy, running it on the robot a few times, seeing it work once, and calling it done. One success does not validate a policy.

The fix: Evaluate every policy over a minimum of 20 rollouts with randomized initial conditions. Report success rate as X/20, not "it works." Track improvements across training runs with consistent evaluation protocols.

3. Ignoring Camera Placement

The mistake: Placing cameras wherever is convenient rather than where they provide the most useful information. A camera that cannot see the gripper-object contact point gives the policy no information about the most critical phase of the task.

The fix: Use a wrist-mounted camera (provides close-up view of manipulation) plus an overhead or side camera (provides global context). Ensure cameras can see the task workspace clearly with no occlusion during critical contact phases.

4. Overcomplicating the First Task

The mistake: Trying to learn a complex multi-step task as your first project. When the policy fails (and it will), you cannot diagnose whether the issue is the algorithm, the data, the hardware, or the task complexity.

The fix: Start with single-step pick-and-place of a single object. Get this working reliably (>90% success). Then add complexity incrementally: multiple objects, varied positions, longer sequences. Each increment tells you exactly what broke.

5. Neglecting Action Space Design

The mistake: Using joint-space actions when Cartesian-space actions would be more natural, or vice versa. Using absolute positions when delta (incremental) actions would produce smoother motion. The action space representation has a dramatic effect on learning difficulty.

The fix: For most manipulation tasks, delta end-effector position control (Cartesian dx, dy, dz + gripper open/close) is the easiest action space to learn. Joint-space control is better for tasks requiring specific arm configurations or when inverse kinematics is unreliable. Experiment with both and compare.
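As an illustration of the delta representation, this hypothetical preprocessing step converts absolute end-effector positions from a demonstration into delta actions. The names and shapes here are assumptions for the sketch, not any framework's API; gripper state stays absolute, since "open" and "closed" are not naturally incremental.

```python
# Convert absolute end-effector positions in a demonstration into delta
# (incremental) actions: (dx, dy, dz, gripper). Positions are in meters.

def to_delta_actions(poses, grippers):
    """poses: list of (x, y, z); grippers: list of 0/1 open-close flags."""
    actions = []
    for prev, curr, grip in zip(poses, poses[1:], grippers[1:]):
        dx, dy, dz = (c - p for c, p in zip(curr, prev))  # incremental motion
        actions.append((dx, dy, dz, grip))
    return actions

poses = [(0.0, 0.0, 0.1), (0.0, 0.0, 0.05), (0.02, 0.0, 0.05)]
grippers = [1, 1, 0]  # 1 = open, 0 = closed
print(to_delta_actions(poses, grippers))  # descend, then close while sliding
```

Running both representations through the same training pipeline, as the text suggests, is usually a cheap experiment: only this preprocessing step changes.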

6. Not Matching Train and Deploy Conditions

The mistake: Collecting training data with one set of lighting conditions, camera positions, and table surfaces, then deploying the policy in a different environment and wondering why it fails.

The fix: Deploy in the same conditions you trained in, or deliberately vary conditions during data collection to build in robustness. If you must deploy in different conditions, collect supplementary demonstrations in the target environment and mix them into training.

Essential Resources

Key Papers

  • ACT — Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (Zhao et al., 2023): Introduced action chunking with transformers and the ALOHA hardware platform. The foundational paper for modern imitation learning.
  • Diffusion Policy — Visuomotor Policy Learning via Action Diffusion (Chi et al., 2023): Applied denoising diffusion models to robot action prediction. State-of-the-art on many manipulation benchmarks.
  • RT-X — Open X-Embodiment: Robotic Learning Datasets and RT-X Models (Open X-Embodiment Collaboration, 2024): Demonstrated cross-embodiment transfer learning using data from 22 robot platforms. Established that robot learning benefits from large-scale diverse data, analogous to language model scaling.
  • OpenVLA — An Open-Source Vision-Language-Action Model (Kim et al., 2024): An open-weight VLA model that can be fine-tuned for specific robot tasks. Demonstrated that pretrained foundation models significantly reduce per-task data requirements.

Courses and Tutorials

  • CS 224R: Deep Reinforcement Learning (Stanford, available on YouTube): Comprehensive graduate-level coverage of RL for robotics.
  • LeRobot Getting Started Tutorial (Hugging Face documentation): Step-by-step guide to collecting data and training your first policy with LeRobot.
  • Spinning Up in Deep RL (OpenAI): Excellent self-contained introduction to RL algorithms with clear implementations.
  • MuJoCo Tutorials (DeepMind documentation): Physics simulation basics for robot learning research.

Community

  • SVRC Community: Join the SVRC Academy for hands-on workshops, shared hardware access, and a community of robotics ML practitioners in the Bay Area.
  • LeRobot Discord: Active community for LeRobot users sharing datasets, troubleshooting, and project ideas.
  • Robotics Stack Exchange: Q&A for technical robotics questions.