Forget AI that can draw pictures; Google’s latest AI model can control a robot.
On Friday, Google introduced Robotics Transformer 2, or RT-2, a vision-language-action (VLA) model that takes text and images as input and outputs robotic actions.
“Just like language models are trained on text from the web to learn general ideas and concepts, RT-2 transfers knowledge from web data to inform robot behavior,” Vincent Vanhoucke, Head of Robotics at Google DeepMind, explained in a blog post. “In other words, RT-2 can speak robot.”
Vanhoucke says that while chatbots can be trained by feeding them information about a topic, robots need to take things a step further and get “grounding” in the real world. The example he gave was a red apple. You could simply explain to a chatbot what an apple is, but a robot not only needs to know what an apple is, it also has to distinguish it from a similar item, such as a red ball, and learn how to actually pick that apple up.
RT-2 takes things a step further than Google's RT-1 and other earlier models by drawing on data from the web. With a previous model, if you wanted a robot to throw something away, you would first need to train it on what trash is and how to dispose of it. With RT-2, the robot can draw on web data, so even if you haven’t explicitly explained what trash is or how to get rid of it, it can figure that out on its own.
With RT-2, robots can learn and then apply that knowledge to future situations. That said, Google notes a current limitation of the model: RT-2 can only help a robot get better at physical tasks it already knows how to do, not learn them from scratch.
Still, it’s a huge step forward and shows us what might be possible in the future. Google goes into detail on how RT-2 works on its DeepMind blog.