Google is enhancing its robots’ navigation and task-completion skills using Gemini AI. In a new research paper, the DeepMind robotics team detailed how Gemini 1.5 Pro’s long context window, which determines the amount of information an AI model can process, improves user interactions with RT-2 robots through natural language instructions.
The process involves filming a video tour of a specific area, such as a home or office, and using Gemini 1.5 Pro to let the robot “watch” the video and learn about the environment. The robot can then carry out commands, given verbally and/or with images, based on what it has observed. For instance, it can guide users to a power outlet after being shown a phone and asked, “Where can I charge this?” According to DeepMind, the Gemini-powered robot had a 90 percent success rate across more than 50 user instructions in a 9,000-plus-square-foot area.
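DeepMind’s robot stack isn’t public, but the underlying idea of grounding questions in a recorded tour can be sketched with the publicly available Gemini API. The snippet below is a minimal illustration, assuming the google-generativeai Python SDK; the file name, prompt, and polling loop are illustrative assumptions, not the researchers’ code.

```python
# Minimal sketch: ask Gemini 1.5 Pro a question grounded in a video tour.
# Assumes the google-generativeai SDK; file name and prompt are hypothetical.
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder credential

# Upload the recorded walkthrough of the space (hypothetical file name).
tour = genai.upload_file(path="office_tour.mp4")

# Wait for the uploaded video to finish server-side processing.
while tour.state.name == "PROCESSING":
    time.sleep(5)
    tour = genai.get_file(tour.name)

model = genai.GenerativeModel("gemini-1.5-pro")

# Ask a navigation-style question answered from the tour footage.
response = model.generate_content([
    tour,
    "Based on this walkthrough, where could someone charge a phone? "
    "Describe how to get there from the entrance.",
])
print(response.text)
```

A real robot would additionally need to turn an answer like this into motion commands; this sketch only shows the long-context, video-understanding half of the loop that the paper highlights.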
Researchers also discovered “preliminary evidence” that Gemini 1.5 Pro enables robots to plan beyond simple navigation. For example, if a user with many Coke cans on their desk asks the robot if their favorite drink is available, Gemini “knows that the robot should navigate to the fridge, inspect if there are Cokes, and then return to the user to report the result.” DeepMind intends to further investigate these findings.
Google’s video demonstrations showcase the technology’s potential, though the obvious cuts after each request conceal the 10 to 30 seconds the robot needs to process an instruction, according to the research paper. While it may take some time before advanced environment-mapping robots become common in homes, these developments suggest they might soon help find missing keys or wallets.