What Is Mobile Manipulation?
Mobile manipulation refers to robots that combine locomotion (the ability to move through an environment) with manipulation (the ability to grasp, move, and interact with objects). This combination is what gives a robot human-like utility — the ability to navigate to where the work is, not just work on objects brought to a fixed location.
The combination is also what makes it hard. A fixed-base arm operates in a known, calibrated workspace. A mobile manipulator must grasp from an imprecisely known base position, on a potentially moving platform, in an environment that may differ from any training configuration. Grasping from a non-fixed base can increase effective grasp error severalfold compared to fixed-base manipulation, because base localization and platform motion errors add on top of the arm's own inaccuracy.
Architecture Taxonomy: Decoupled vs. End-to-End
The fundamental architectural choice in mobile manipulation is whether to treat navigation and manipulation as separate subsystems or as a unified end-to-end policy.
Decoupled (navigate-then-manipulate): The classical approach. A navigation planner moves the base to a pre-computed pose within the arm's workspace, then a manipulation policy takes over. The navigation module and manipulation module are separate systems with separate models, separate training data, and a handoff protocol between them. This is the dominant approach in industrial deployments because it is modular, debuggable, and each component can be validated independently. The limitation: the base pose is computed without considering the manipulation task, which can result in sub-optimal approach configurations. Re-positioning costs 5-15 seconds per attempt and introduces localization error.
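The navigate-then-manipulate control flow can be sketched in a few lines. This is an illustrative skeleton, not any specific stack: the function names are hypothetical, and the pre-manipulation pose is computed purely geometrically, without regard to the manipulation task, which is exactly the limitation described above.

```python
import math

# Illustrative decoupled pipeline (all names hypothetical). Stage 1 plans a
# base pose; stage 2 would run the arm policy from that fixed pose. There is
# no feedback from manipulation back to navigation.

def plan_pre_manipulation_pose(target_xy, standoff=0.6):
    """Place the base a fixed standoff distance from the target, facing it
    from the origin side -- chosen without considering the grasp itself."""
    tx, ty = target_xy
    heading = math.atan2(ty, tx)
    bx = tx - standoff * math.cos(heading)
    by = ty - standoff * math.sin(heading)
    return (bx, by, heading)

def run_decoupled_task(target_xy):
    base_pose = plan_pre_manipulation_pose(target_xy)
    # navigate_to(base_pose)       # stage 1: navigation stack (hypothetical)
    # success = manipulate(...)    # stage 2: arm policy from the fixed base
    return base_pose

pose = run_decoupled_task((2.0, 1.0))
```

The handoff is one-directional: once `plan_pre_manipulation_pose` returns, the manipulation stage has no way to request a better base pose.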
End-to-end (unified policy): A single neural network takes sensor inputs (cameras, lidar, proprioception) and outputs both base velocities and arm joint commands. Mobile ALOHA (Zipeng Fu et al., Stanford, 2024) demonstrated that an ACT policy can learn whole-body bimanual mobile manipulation from 50 teleoperation demonstrations — the policy simultaneously controls the wheels and both arms. End-to-end approaches produce smoother, more coordinated motions but are harder to debug, require more demonstrations, and offer fewer safety guarantees.
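The defining property of the unified approach is that base and arm commands come out of one network. The toy forward pass below shows the interface shape only, with a random two-layer network standing in for a trained ACT-style policy; dimensions and chunk length are illustrative, not Mobile ALOHA's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

OBS_DIM = 32        # fused camera features + proprioception (illustrative)
ACT_DIM = 2 + 14    # base (vx, wz) + two 7-DOF arms in a bimanual setup
CHUNK = 8           # ACT-style action chunk: several future steps at once

# Toy two-layer network standing in for the learned policy.
W1 = rng.standard_normal((OBS_DIM, 64)) * 0.1
W2 = rng.standard_normal((64, CHUNK * ACT_DIM)) * 0.1

def policy(obs):
    """One forward pass: observation in, a chunk of whole-body actions out.
    Base and arm commands come from the same network -- there is no
    separate navigation module to hand off to."""
    h = np.tanh(obs @ W1)
    chunk = (h @ W2).reshape(CHUNK, ACT_DIM)
    base_cmds, arm_cmds = chunk[:, :2], chunk[:, 2:]
    return base_cmds, arm_cmds

obs = rng.standard_normal(OBS_DIM)
base_cmds, arm_cmds = policy(obs)
```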
Hierarchical (task-level planning + learned skills): A high-level planner (potentially a foundation model or TAMP system) decomposes the task into a sequence of skills ("navigate to kitchen," "open fridge," "grasp milk"), and each skill is executed by a specialized policy. SayCan (Google, 2022) and TidyBot (Wu et al., Princeton, 2023) use this architecture. It combines the flexibility of learned policies with the interpretability of explicit planning.
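The hierarchical pattern reduces to a dispatcher: the planner emits a sequence of skill names, and each name is routed to its specialized policy. A minimal sketch, with placeholder skills standing in for learned policies (the registry and skill signatures are hypothetical, not SayCan's or TidyBot's actual interfaces):

```python
# Hypothetical skill dispatcher: a high-level planner emits (skill, argument)
# pairs; each skill name maps to a specialized low-level policy.
SKILLS = {}

def skill(name):
    def register(fn):
        SKILLS[name] = fn
        return fn
    return register

@skill("navigate")
def navigate(arg): return f"navigated to {arg}"

@skill("open")
def open_(arg): return f"opened {arg}"

@skill("grasp")
def grasp(arg): return f"grasped {arg}"

def execute(plan):
    """Run the planner's skill sequence, logging each step for debugging --
    the interpretability benefit of explicit plans."""
    return [SKILLS[name](arg) for name, arg in plan]

log = execute([("navigate", "kitchen"), ("open", "fridge"), ("grasp", "milk")])
```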
Key Platforms in 2025
| Platform | Price | Type | Payload | Arms | Key Strength |
|---|---|---|---|---|---|
| Hello Robot Stretch 3 | $28,000 | Wheeled + telescoping arm | 1.5 kg | 1 (4-DOF) | ROS2, elder care, affordable |
| Mobile ALOHA (Stanford) | $32K + base | Wheeled + bimanual | 3 kg each | 2 (6-DOF ViperX) | Bimanual, open-source, ACT-ready |
| Spot + Arm (Boston Dynamics) | $100,000+ | Legged + arm | 4 kg | 1 (6-DOF) | Rough terrain, enterprise support |
| Unitree G1 | $16,000 | Humanoid (bipedal) | 3 kg each | 2 (7-DOF) | Lowest cost humanoid, growing ecosystem |
| Unitree H1 | $90,000 | Humanoid (bipedal) | 5 kg each | 2 (7-DOF) | Strongest humanoid arms, whole-body control |
| Fetch Robotics (Zebra) | $150,000+ | Wheeled + arm | 6 kg | 1 (7-DOF) | Warehouse logistics, enterprise AMR |
| TIAGo Pro (PAL Robotics) | $80,000+ | Wheeled + arm | 3 kg | 1-2 (7-DOF) | ROS2 native, RoboCup standard |
Key Papers and Systems
Mobile ALOHA (Stanford, 2024)
Mobile ALOHA demonstrated that end-to-end imitation learning can produce surprisingly capable whole-body mobile manipulation. The system uses two ViperX 300 arms on a wheeled base, with a human teleoperator who walks behind the robot and controls both arms plus base movement simultaneously. The ACT (Action Chunking with Transformers) policy trained on just 50 teleoperation demos learned to cook shrimp, open cabinets, clean kitchen counters, and push chairs — tasks that require coordinated base-arm movement.
Key insight: co-training with static ALOHA datasets (from the fixed-base ALOHA) improved mobile manipulation success by 30-50% even though the static data contains no base movement. The shared manipulation skills transfer, and the policy learns to compose them with base movement from the mobile demos alone. This reduces the real data collection burden significantly.
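Co-training requires the static and mobile datasets to share one action space. The sketch below shows one simple alignment scheme, zero-padding base velocities onto static episodes so both sources can be mixed in a single batch; this is an assumption for illustration, and the exact formatting in Mobile ALOHA's pipeline may differ.

```python
import numpy as np

rng = np.random.default_rng(1)

ARM_DIM, BASE_DIM = 14, 2   # bimanual joints + (linear vel, angular vel)

# Toy episode stores: mobile demos carry base actions, static demos do not.
mobile_acts = [rng.standard_normal((50, ARM_DIM + BASE_DIM)) for _ in range(5)]
static_acts = [rng.standard_normal((50, ARM_DIM)) for _ in range(20)]

def sample_batch(batch_size=8, p_mobile=0.5):
    """Co-training batch: mix mobile and static episodes. Static actions are
    zero-padded with base velocities so both share one action space."""
    batch = []
    for _ in range(batch_size):
        if rng.random() < p_mobile:
            ep = mobile_acts[rng.integers(len(mobile_acts))]
        else:
            ep = static_acts[rng.integers(len(static_acts))]
            ep = np.concatenate([ep, np.zeros((len(ep), BASE_DIM))], axis=1)
        t = rng.integers(len(ep))
        batch.append(ep[t])
    return np.stack(batch)

batch = sample_batch()
```

Note the asymmetry in dataset sizes above: static demos are cheaper to collect, which is exactly why co-training pays off.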
TidyBot (Princeton, 2023)
TidyBot tackled the long-horizon tidying problem: given a messy room, pick up all misplaced objects and put them in their correct locations. It uses a large language model for high-level reasoning ("the cup goes on the shelf," "the shirt goes in the closet") and a learned manipulation policy for the actual picking and placing. Navigation uses a classical SLAM + planner stack. The key contribution is showing that language models can provide the common-sense reasoning needed for task planning in open-ended household tasks, while low-level execution is handled by reliable learned skills.
SayCan (Google, 2022)
SayCan connected language models to grounded robot affordances. Given a natural language instruction ("I spilled my drink, can you help?"), the LLM proposes a plan as a sequence of primitive skills ("find sponge," "pick up sponge," "navigate to spill," "wipe surface"). Each skill has an associated success probability estimated from the robot's current observation. The product of the LLM's plan probability and the skill success probability selects the best action. SayCan ran on a mobile Everyday Robot with a single 7-DOF arm, demonstrating that LLM-based task planning can handle the open-ended instruction following that mobile manipulation requires.
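The SayCan selection rule described above is just an argmax over a product of two scores. A minimal sketch with made-up numbers (the probabilities here are illustrative, not from the paper): the LLM score says what is useful for the instruction, and the affordance score says what is feasible from the current observation.

```python
# Illustrative SayCan-style scoring: the chosen skill maximizes
# p_LLM(skill | instruction) * p_affordance(skill | observation).
# All numbers below are made up for the example.
llm_scores = {        # how relevant the LLM thinks each skill is
    "find sponge": 0.60,
    "pick up sponge": 0.25,
    "navigate to spill": 0.10,
    "wipe surface": 0.05,
}
affordance = {        # value function: can the robot do it from here?
    "find sponge": 0.9,
    "pick up sponge": 0.2,   # no sponge in view yet
    "navigate to spill": 0.8,
    "wipe surface": 0.1,
}

def select_skill():
    return max(llm_scores, key=lambda s: llm_scores[s] * affordance[s])

best = select_skill()   # "find sponge" wins: 0.60 * 0.9 = 0.54
```

Note how the affordance term overrides a plausible-sounding LLM choice: "pick up sponge" is useless until a sponge is actually visible.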
Navigation-Manipulation Coordination Challenges
Arm-base coordination: When should the base move and when should the arm extend? Whole-body control optimization for humanoid platforms (H1, G1) treats base and arm as a unified kinematic chain. Wheeled platforms typically use decoupled planning — navigate base to within arm reach, then execute arm motion — which is simpler but suboptimal for dynamic tasks.
Dynamic stability during manipulation: Applying force through the arm creates reaction forces on the base. A legged robot applying 20N horizontal force through its arm during a constrained task must simultaneously adjust foot contacts to maintain stability. This whole-body force coordination is an active research problem. Wheeled platforms have an advantage here: reaction forces are absorbed by ground friction at the wheels, which is much simpler to model and control than foot contact dynamics.
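For a wheeled base, the stability question reduces to a back-of-envelope statics check: the base tips about its wheel contact line when the applied moment exceeds the restoring moment from gravity acting through the support polygon. The numbers below are illustrative, not any specific platform.

```python
# Quasi-static tipping check for a wheeled base pushing with its arm.
# Tips when F * h > m * g * d, where h is the force application height
# and d is the horizontal distance from the center of mass to the wheel line.

G = 9.81  # m/s^2

def tips_over(force_n, arm_height_m, mass_kg, half_track_m):
    applied = force_n * arm_height_m            # N*m about the wheel line
    restoring = mass_kg * G * half_track_m      # N*m from gravity
    return applied > restoring

# 25 kg base, wheels 0.25 m from the center line, pushing 20 N at 1.2 m:
heavy_base = tips_over(20.0, 1.2, 25.0, 0.25)   # 24 N*m vs ~61 N*m: stable
light_base = tips_over(20.0, 1.2, 5.0, 0.25)    # 24 N*m vs ~12 N*m: tips
```

The same 20 N push is safe for the heavier base and tips the lighter one, which is why payload specs alone do not determine how much force a mobile manipulator can exert.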
Grasp pose estimation from mobile base: Perception from a moving base introduces localization uncertainty into grasp estimation. A 2cm base position error propagates to a grasp point error that may exceed grasp tolerance for precision tasks. Mobile manipulation systems need either high-precision base localization or grasp policies that are robust to base position uncertainty.
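The error propagation is easy to quantify with a small-angle approximation: a heading error theta displaces a point at arm reach r by roughly r * theta, on top of the base translation error. A worked example with the 2 cm figure from above:

```python
import math

# Worst-case propagation of base pose error to the grasp point
# (small-angle approximation; translation and rotation errors add).

def grasp_point_error(trans_err_m, heading_err_deg, reach_m):
    angular = reach_m * math.radians(heading_err_deg)  # r * theta
    return trans_err_m + angular

# 2 cm base translation error, 1 degree heading error, 0.8 m arm reach:
err = grasp_point_error(0.02, 1.0, 0.8)   # ~0.034 m at the gripper
```

A 3.4 cm worst-case error comfortably exceeds the tolerance of most precision grasps, which is why either tighter base localization or error-tolerant grasp policies are needed.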
The handoff problem: In decoupled architectures, the navigation system delivers the robot to a "pre-manipulation" pose, then hands off to the manipulation controller. This handoff is a failure point: if the base pose is not within the manipulation policy's training distribution, manipulation fails even if navigation succeeded. The handoff needs to be bidirectional — the manipulation controller should be able to request base repositioning if the current pose is not suitable. This feedback loop is poorly addressed in most current systems.
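The bidirectional handoff can be sketched as a retry loop: the manipulation side checks the delivered base pose against its training distribution and requests repositioning instead of failing outright. All interfaces below are hypothetical; the in-distribution check is reduced to a simple reach test for illustration.

```python
# Sketch of a bidirectional handoff (hypothetical interfaces): the
# manipulation controller can reject the delivered base pose and ask the
# navigation stack for a new one.

def in_training_distribution(base_pose, target, max_dist=0.7):
    """Stand-in check: is the target within the pose range the
    manipulation policy was trained from?"""
    dx, dy = target[0] - base_pose[0], target[1] - base_pose[1]
    return (dx * dx + dy * dy) ** 0.5 <= max_dist

def attempt_task(base_poses, target, max_retries=3):
    """base_poses: poses the navigation stack delivers, one per
    (re)positioning request."""
    poses = iter(base_poses)
    for _ in range(max_retries):
        pose = next(poses)
        if in_training_distribution(pose, target):
            return ("manipulate", pose)   # hand off to the arm policy
        # otherwise: request repositioning and loop
    return ("abort", None)

# First delivered pose is too far from the target; the second is accepted.
outcome = attempt_task([(0.0, 0.0), (1.5, 0.9)], target=(2.0, 1.0))
```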
Long-horizon state estimation: Over a multi-minute mobile manipulation task (e.g., tidying a room), the robot's SLAM-based localization accumulates drift. If the robot picks up an object at minute 1 and needs to place it at a specific location at minute 5, the localization error may exceed the placement tolerance. Loop closure, relocalization against known landmarks, and visual odometry updates during manipulation are all necessary but add computational overhead.
Planning Approaches Comparison
| Approach | Task Horizon | Generalization | Compute | Data Needed | Maturity |
|---|---|---|---|---|---|
| Whole-body control (WBC) | Single skill | Low (task-specific) | High (1kHz QP) | Dynamics model | Production-ready for locomotion |
| Decoupled nav + manip | Single skill | Moderate | Low | SLAM map + manip demos | Most deployed approach |
| End-to-end IL (ACT) | Multi-step skill | High (within task) | Moderate (GPU) | 50-500 teleop demos | Research demos (Mobile ALOHA) |
| TAMP (Task and Motion Planning) | Multi-room | High (combinatorial) | Very high | Domain specification | Academic demos, limited production |
| LLM + skill library (SayCan) | Open-ended | Very high (language) | High (LLM inference) | Skill demos + affordance model | Emerging (TidyBot, SayCan) |
| VLA end-to-end (pi0-style) | Multi-step | Very high | Very high (7B+ model) | Large-scale diverse demos | Early commercial (Physical Intelligence) |
Data Collection for Mobile Manipulation
Collecting teleoperation demonstrations for mobile manipulation is harder than for fixed-base manipulation, because the operator must control base movement and arm movement simultaneously. Three data collection strategies are used in practice:
- Walk-behind teleoperation (Mobile ALOHA style): The operator walks behind the robot, controlling arm movements via leader arms while the base follows the operator's walking motion. Most intuitive for the operator but requires a large physical space and limits base speed to walking pace (~1.5 m/s). Best for tasks where the base trajectory matters (e.g., navigating around furniture while carrying an object).
- VR teleoperation: The operator uses a VR headset to view from the robot's perspective and controls base + arms through hand controllers and a joystick. Enables remote operation but adds latency (20-50ms network + 10ms rendering) that degrades demonstration quality for contact tasks. Best for navigation-heavy tasks where manipulation precision is moderate.
- Waypoint-based collection: Separate the base trajectory from the manipulation demonstration. First, manually drive the robot to a pre-manipulation pose (or use an autonomous navigation stack). Then collect the manipulation demonstration from the fixed base position. Simpler for the operator but does not capture coordinated base-arm motions. Best for tasks where navigation and manipulation do not overlap temporally.
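Whichever strategy is used, the recorded episodes must carry base odometry alongside arm state on a shared time axis, or coordinated base-arm behavior cannot be learned from them. The layout below is an illustrative in-memory sketch of such an episode (key names and dimensions are assumptions, not a specific HDF5/RLDS schema; images are downsampled to keep the example small):

```python
import numpy as np

# Illustrative mobile manipulation episode: every stream shares one time
# axis so base and arm signals stay synchronized in training.
T = 100   # timesteps

episode = {
    "obs/images/head_cam": np.zeros((T, 96, 96, 3), dtype=np.uint8),
    "obs/qpos":            np.zeros((T, 14), dtype=np.float32),  # arm joints
    "obs/base_odom":       np.zeros((T, 3), dtype=np.float32),   # x, y, yaw
    "action/arm":          np.zeros((T, 14), dtype=np.float32),
    "action/base_vel":     np.zeros((T, 2), dtype=np.float32),   # v, omega
}

def validate(ep):
    """All streams must have the same number of timesteps."""
    lengths = {v.shape[0] for v in ep.values()}
    assert len(lengths) == 1, "streams disagree on episode length"
    return lengths.pop()

n = validate(episode)
```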
SVRC's data collection infrastructure supports all three approaches. Our Mountain View facility has 800+ sq ft of configurable floor space for mobile manipulation data collection, with motion capture-grade localization for ground truth base tracking. The Allston facility supports corridor and multi-room scenarios.
Real Deployment Examples
- Diligent Robotics Moxi: Wheeled mobile manipulator deployed in hospital wards for supply delivery and linen transport. Commercial deployment in 20+ US hospitals as of 2025. Uses decoupled navigation + manipulation with structured environments (known ward layouts, labeled storage locations).
- 6 River Systems (Shopify): AMR-based warehouse picking system that assists human workers. Handles structured bin picking, not full mobile manipulation. Demonstrates that constrained mobile manipulation (known objects, known locations) is commercially viable today.
- Hello Robot Stretch clinical trials: University of Washington and Georgia Tech trials of Stretch for in-home elder care assistance tasks (fetching objects, opening doors). Published results show 65-80% success on daily living tasks. The telescoping arm design is well-suited to home environments where a traditional 6-DOF arm would not fit.
- Google DeepMind RT-2 on Everyday Robots: The RT-2 VLA model was deployed on a fleet of mobile manipulators in Google office buildings, performing kitchen cleanup and desk tidying tasks. The VLA backbone enabled instruction following ("pick up the apple near the monitor and put it in the compost bin") with zero-shot generalization to novel objects, though manipulation precision was limited to tasks with 5mm+ tolerance.
The Mobile Manipulation Data Gap
The biggest bottleneck for mobile manipulation research is data scarcity. Fixed-base manipulation datasets (Open X-Embodiment, DROID, BridgeData V2) contain millions of episodes, but mobile manipulation datasets are 10-100x smaller because collection is harder and slower. This data gap means that policies for mobile manipulation are undertrained relative to fixed-base policies, especially for long-horizon tasks.
Closing this gap requires purpose-built data collection infrastructure — spaces large enough for navigation, hardware that captures base odometry alongside arm state, and operators trained in whole-body teleoperation. SVRC's data collection services address this gap directly, with standardized mobile manipulation collection protocols and output in HDF5/RLDS format compatible with ACT, Diffusion Policy, and VLA fine-tuning pipelines. Pilot projects start at $2,500; full mobile manipulation campaigns at $8,000+.