Gemini Robotics
Notes after reviewing the Gemini Robotics technical report.
Two years after introducing RT-2, Google DeepMind revealed another breakthrough in robotics foundation models: Gemini Robotics. Gemini Robotics demonstrated enhanced capabilities compared to RT-2: better generalization, planning capabilities, and general embodied reasoning.
Gemini Robotics - Embodied Reasoning (ER)
Gemini Robotics-ER is an advanced vision-language model (VLM) that enhances Gemini’s spatial reasoning capabilities, which are necessary for robotics.
- Architecture: Gemini Robotics-ER uses Gemini 2.0 Flash as its backbone and is fine-tuned for spatial reasoning. However, the technical report does not make clear how exactly it acquired its improved spatial reasoning capabilities.
- Enhanced spatial reasoning: The authors emphasize four categories of improved out-of-the-box embodied reasoning (a parsing sketch for these output formats follows this list):
    - Object Detection: Identifying objects in images via 2D bounding boxes. Represented as $(y_0, x_0, y_1, x_1)$ quadruples.
    - Pointing: Identifying points in images that correspond to specific objects or object parts. Represented as $(y, x)$ tuples.
    - Trajectory Prediction: Predicting object or action trajectories given a description of the motion. Represented as a sequence of $(y, x)$ points connecting a start point and an end point.
    - Grasp Prediction: Predicting top-down grasps (i.e. where to grab and at what angle). Represented as $(y, x, \theta)$ triples, where $\theta$ is the rotation angle.
- Connecting embodied reasoning to robot control: Using Gemini Robotics-ER’s exceptional spatial reasoning capabilities, it is possible to control robots in a zero-shot or few-shot manner. Zero-shot control leverages Gemini 2.0’s innate language and coding capabilities through code generation against a robot API: given an image, a prompt, and the API specification for the robot, the model can generate its plan and a sequence of API calls to achieve the task without ever being fine-tuned (a minimal sketch of this pattern also follows below). Gemini Robotics-ER can perform more dexterous tasks through few-shot control, where it is presented with high-quality reference demonstrations.
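Since these embodied reasoning outputs are just coordinate tuples, a thin parsing layer is enough to turn them into pixel-space targets. Below is a minimal sketch, assuming the model returns JSON with coordinates in $(y, x)$ order normalized to a 0-1000 grid (as in other Gemini 2.0 detection examples); the JSON keys (`box_2d`, `point`) and the normalization scale are my assumptions, not details confirmed by the report.

```python
# Minimal sketch of parsing Gemini Robotics-ER style spatial outputs.
# Assumptions: JSON output, (y, x) coordinate order, values normalized to 0-1000.
import json
from dataclasses import dataclass


@dataclass
class Box:
    y0: float
    x0: float
    y1: float
    x1: float  # pixel coordinates


def denorm(value: float, size: int, scale: float = 1000.0) -> float:
    """Map a coordinate normalized to [0, scale] back to pixels."""
    return value / scale * size


def parse_boxes(raw: str, width: int, height: int) -> list[Box]:
    """Parse output like '[{"label": "mug", "box_2d": [y0, x0, y1, x1]}, ...]'."""
    boxes = []
    for item in json.loads(raw):
        y0, x0, y1, x1 = item["box_2d"]
        boxes.append(Box(denorm(y0, height), denorm(x0, width),
                         denorm(y1, height), denorm(x1, width)))
    return boxes


def parse_points(raw: str, width: int, height: int) -> list[tuple[float, float]]:
    """Parse output like '[{"label": "mug handle", "point": [y, x]}, ...]'."""
    return [(denorm(p["point"][0], height), denorm(p["point"][1], width))
            for p in json.loads(raw)]
```

Trajectories and grasps would be handled the same way, as lists of $(y, x)$ points and $(y, x, \theta)$ triples respectively.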
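And here is a rough sketch of the zero-shot, code-generation control pattern: the prompt carries the image, the task, and the API specification, and the generated API calls are parsed and dispatched to the robot. The `robot.*` function names are hypothetical illustrations, not the actual API used in the report’s experiments.

```python
# Hedged sketch of zero-shot control via code generation. The API names below
# are hypothetical, not the actual API used in the Gemini Robotics-ER experiments.
import ast
import re

API_SPEC = """
robot.detect_object(name) -> (y, x)   # pointing query against the current image
robot.move_gripper(y, x, z)
robot.rotate_gripper(theta)
robot.close_gripper()
robot.open_gripper()
"""


def build_prompt(instruction: str) -> str:
    """Image + instruction + API spec in; plan + API calls out."""
    return (
        "You control a robot arm through the API below. Given the attached "
        "workspace image, write a short plan, then one API call per line.\n"
        f"API:\n{API_SPEC}\nTask: {instruction}\n"
    )


def execute(plan_text: str, robot) -> None:
    """Dispatch generated lines like robot.close_gripper() to a robot object."""
    for line in plan_text.splitlines():
        match = re.match(r"robot\.(\w+)\((.*)\)\s*$", line.strip())
        if not match:
            continue  # skip the natural-language plan lines
        name, args = match.groups()
        values = ast.literal_eval(f"({args},)") if args.strip() else ()
        getattr(robot, name)(*values)
```

Few-shot control would presumably condition on the high-quality reference demonstrations in the same context window before asking for the calls.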
Gemini Robotics
Gemini Robotics is described as an advanced vision-language-action (VLA) model built upon Gemini Robotics-ER; the key difference is that it can directly output actions, similar to RT-2.
- Architecture: Gemini Robotics is Gemini Robotics-ER fine-tuned to directly output actions. However, the technical report omits what exactly the actions look like. One can presume that they look similar to the prior work, RT-2, where a predefined set of tokens represents robotic actions. The system takes a cloud backbone + local decoder approach, where only the small action decoder runs on the robot due to size and latency constraints (a rough sketch of this split follows this list). However, it is unclear what data is communicated between the backbone and the local model.
- Generality: With the ability to directly output actions, Gemini Robotics can perform various robotic tasks requiring planning and dexterity, such as folding laundry, stacking plates, opening a folder, and picking up a shoelace. It also achieves remarkable generality in three areas: visual (unseen backgrounds or lighting conditions), instruction (paraphrasing and typos), and action (different object placements, colors, or shapes). The authors present a reimplemented $\pi_0$ and a multi-task diffusion policy as baselines, and Gemini Robotics outperforms both on generalization benchmarks.
- Specialization: Gemini Robotics can be specialized to perform more dexterous tasks with long-horizon planning. It can also be specialized to adapt to different robot types (multi-embodiment). This is done by fine-tuning the model on a smaller set of high-quality action data demonstrating complicated tasks. As a result, it can perform impressively complicated tasks requiring a high level of dexterity, such as packing a lunch box, making a salad, and playing cards.
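For the cloud backbone + local decoder split mentioned in the Architecture bullet, here is a structural sketch of how such a control loop could be organized. Since the report does not say what actually flows between the two models, the `latent_plan` handed from backbone to decoder is an assumption, and every class and method name here (`CloudBackbone`, `robot.camera_image()`, etc.) is hypothetical.

```python
# Structural sketch of a cloud backbone + on-robot action decoder loop.
# The latent_plan interface between the two models is an assumption; the report
# does not specify what is actually communicated.
import time


class CloudBackbone:
    """Large VLA backbone hosted off-robot, queried at a low frequency."""

    def infer(self, image, instruction):
        # Placeholder for a remote call; assumed to return some intermediate plan.
        return {"instruction": instruction, "image_features": None}


class LocalActionDecoder:
    """Small decoder running on the robot, producing low-level actions."""

    def decode(self, latent_plan, proprioception):
        # Placeholder: map the latest plan + robot state to a joint command.
        return [0.0] * 7


def control_loop(robot, backbone, decoder, instruction,
                 backbone_period_s=0.5, control_hz=50):
    latent_plan = None
    last_query = 0.0
    while not robot.task_done():
        now = time.monotonic()
        if latent_plan is None or now - last_query > backbone_period_s:
            # Slow, high-level loop: re-query the cloud model occasionally.
            latent_plan = backbone.infer(robot.camera_image(), instruction)
            last_query = now
        # Fast, low-level loop: the local decoder keeps the control rate up.
        action = decoder.decode(latent_plan, robot.proprioception())
        robot.apply(action)
        time.sleep(1.0 / control_hz)
```

The point of this split is that the slow, expensive backbone only needs to be queried occasionally, while the small on-robot decoder keeps the control frequency high despite network latency.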
Limitations
Here are the areas where Gemini Robotics falls short or shows room for improvement.
- Highly dexterous tasks such as inserting shoelaces.
- “Grounding spatial relationships across long videos” (== very short memory)
- Numerical precision in pointing / object detection.
- Zero-shot cross-embodiment transfer.