Robots' ChatGPT moment is here: large models enter the real world in a heavyweight breakthrough from DeepMind
We know that after mastering language and images on the Internet, large models will eventually enter the real world, and "embodied intelligence" looks like the next direction of development.
Connect a large model to a robot, use plain natural language instead of complex instructions to form concrete action plans, and do it without extra data or training: the vision is appealing, but it has always seemed a bit far off. After all, robotics is notoriously difficult.
However, AI is evolving faster than we thought.
On Friday, Google DeepMind announced the launch of RT-2: the world's first vision-language-action (VLA) model for controlling robots.
Complex instructions are no longer needed; the robot can now be directed as naturally as talking to ChatGPT.
How intelligent is RT-2? DeepMind researchers demonstrated it with a robotic arm: told to pick the "extinct animal", the arm reached out, opened its gripper, lowered it, and grabbed the dinosaur doll.
Before that, robots couldn't reliably understand objects they'd never seen, much less reason about things like linking "extinct animals" to "plastic dinosaur dolls."
Tell the robot to give Taylor Swift the Coke can:
Clearly, this robot is a true fan, which is good news for humans.
The development of large language models like ChatGPT is setting off a revolution in robotics. Google has now put its most advanced language models into robots, finally giving them an artificial brain.
In a recently submitted paper, DeepMind researchers state that RT-2 is trained on both web and robot data, drawing on progress in large language models such as Bard and combining that with robot data. The new model can also understand instructions in languages other than English.
Google executives describe RT-2 as a quantum leap in the way robots are built and programmed. "Because of this change, we had to rethink our entire research plan," says Vincent Vanhoucke, director of robotics at Google DeepMind. "A lot of things that I did before are completely useless."
**How is RT-2 implemented?**
The name RT-2 expands to Robotic Transformer 2: a Transformer model for robots.
Getting robots to understand human speech and operate capably the way they do in science-fiction movies is no easy task. Unlike a virtual environment, the real physical world is messy and unstructured, and robots usually need complex instructions to do things that are simple for humans, who instinctively know what to do.
Previously, training a robot took a long time, and researchers had to build a separate solution for each task; with RT-2, the robot can analyze more information on its own and infer what to do next.
RT-2 builds on vision-language models (VLMs) and introduces a new concept: the vision-language-action (VLA) model, which learns from web and robot data and translates that knowledge into general instructions for robot control. The model can even use chain-of-thought prompting, for example to reason about which drink would be best for a tired person (an energy drink).
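To make the chain-of-thought idea concrete, here is a small, hypothetical example of the kind of exchange involved; the field names, the "Plan:" phrasing, and the action-token string are assumptions for illustration, not the exact prompt format RT-2 uses.

```python
# A made-up illustration of a chain-of-thought style query to a VLA model.
# The field names, "Plan:" phrasing, and action tokens are illustrative only.
prompt = {
    "image": "table_with_drinks.jpg",   # current camera view of the scene
    "instruction": "Pick the drink that would best help a tired person.",
}

expected_output = (
    "Plan: the energy drink is the best choice for a tired person. "
    "Action: 1 128 91 241 5 101 127 217"   # action tokens, format explained below
)
```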
RT-2 architecture and training process
In fact, Google introduced RT-1 as early as last year. With only a single pre-trained model, RT-1 can generate instructions from different sensory inputs (such as vision and text) to execute many kinds of tasks.
As a pre-trained model, it naturally needs a large amount of data for self-supervised learning. RT-2 builds on RT-1 and uses RT-1 demonstration data collected by 13 robots over 17 months in an office-kitchen environment.
The VLA model created by DeepMind
As mentioned above, RT-2 is built on VLMs that have been trained on web-scale data and can perform tasks such as visual question answering, image captioning, and object recognition. The researchers adapted two previously proposed VLMs, PaLI-X (Pathways Language and Image model) and PaLM-E (Pathways Language model Embodied), as backbones for RT-2; the resulting vision-language-action versions are called RT-2-PaLI-X and RT-2-PaLM-E.
For the vision-language model to control a robot, it still has to drive motion. The study took a very simple approach: it represents robot actions as another "language", text tokens, and trains on them together with web-scale vision-language datasets.
The motion encoding for the robot is based on the discretization method proposed by Brohan et al. for the RT-1 model.
As shown in the figure below, this research represents a robot action as a text string: a sequence of action-token numbers such as "1 128 91 241 5 101 127 217".
The string begins with a flag indicating whether the robot should continue or terminate the current episode, followed by tokens for the change in position and rotation of the end effector and a command for the gripper.
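As a rough sketch of how such a discretization could work (following the general recipe described above, with 256 uniform bins per dimension; the value ranges, bin boundaries, and dimension ordering here are assumptions for illustration):

```python
import numpy as np

NUM_BINS = 256  # each continuous action dimension is discretized into 256 bins

def encode_action(terminate, delta_xyz, delta_rpy, gripper, low=-1.0, high=1.0):
    """Encode one robot action as a string of 8 integer tokens:
    [episode flag, dx, dy, dz, droll, dpitch, dyaw, gripper]."""
    continuous = np.concatenate([delta_xyz, delta_rpy, [gripper]])
    clipped = np.clip(continuous, low, high)  # assume actions live in [low, high]
    bins = np.round((clipped - low) / (high - low) * (NUM_BINS - 1)).astype(int)
    return " ".join(str(t) for t in [int(terminate)] + bins.tolist())

def decode_action(token_string, low=-1.0, high=1.0):
    """Invert encode_action: parse a token string back into continuous values."""
    tokens = [int(t) for t in token_string.split()]
    terminate, bins = tokens[0], np.array(tokens[1:], dtype=float)
    continuous = bins / (NUM_BINS - 1) * (high - low) + low
    return terminate, continuous[:3], continuous[3:6], continuous[6]

# Example: the same string format as the one quoted above.
print(decode_action("1 128 91 241 5 101 127 217"))
```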
Because actions are represented as text strings, executing an action command is as straightforward for the model as handling any other string. With this representation, existing vision-language models can be fine-tuned directly and converted into vision-language-action models.
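One way to picture this fine-tuning is as a single text-to-text training stream that mixes ordinary web vision-language examples with robot demonstrations whose targets are action-token strings. The sketch below is a simplified illustration of that idea; the field names, the 50/50 mixing ratio, and the helper function are assumptions, not DeepMind's actual pipeline.

```python
import random

# Two kinds of training examples, both phrased as text targets so that a single
# vision-language backbone (e.g. PaLI-X or PaLM-E) can be fine-tuned on both.
web_example = {
    "image": "web_photo.jpg",
    "prompt": "What is in the image?",
    "target": "A plastic dinosaur toy on a table.",  # ordinary VQA answer
}

robot_example = {
    "image": "robot_camera.jpg",
    "prompt": "What should the robot do to pick up the toy?",
    "target": "1 128 91 241 5 101 127 217",          # action encoded as text tokens
}

def sample_batch(web_data, robot_data, robot_fraction=0.5, batch_size=32):
    """Build one mixed batch of web vision-language data and robot demos."""
    batch = []
    for _ in range(batch_size):
        source = robot_data if random.random() < robot_fraction else web_data
        batch.append(random.choice(source))
    return batch

batch = sample_batch([web_example], [robot_example])
```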
During inference, the text tokens are decoded back into robot actions, enabling closed-loop control.
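Closed-loop control then amounts to repeatedly querying the model with the latest camera image and executing the decoded action until the flag token signals the end of the episode. Below is a minimal sketch of such a loop, reusing `decode_action` from the earlier sketch; the `vla_model` and `robot` interfaces are hypothetical placeholders, not a real API.

```python
def run_episode(vla_model, robot, instruction, max_steps=100):
    """Hypothetical closed loop: observe, predict action tokens, act, repeat."""
    for _ in range(max_steps):
        image = robot.get_camera_image()                      # latest observation
        token_string = vla_model.predict(image, instruction)  # e.g. "0 131 90 ..."
        terminate, dxyz, drpy, gripper = decode_action(token_string)
        if terminate:                          # flag token signals episode end
            break
        robot.move_end_effector(dxyz, drpy)    # relative position/rotation change
        robot.set_gripper(gripper)
```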
**Experimental results**
The researchers performed a series of qualitative and quantitative experiments on the RT-2 model.
The figure below demonstrates RT-2's performance on semantic understanding and basic reasoning. For the task "put the strawberry into the correct bowl", RT-2 must not only recognize strawberries and bowls but also reason about the scene to know that the strawberry belongs with similar fruits. For the task of picking up a bag about to fall off a table, RT-2 must understand the physical properties of the bag to disambiguate between the two bags and identify the object in an unstable position.
It should be noted that none of the interactions tested in these scenarios appear in the robot training data.
The figure below shows that the RT-2 model outperforms the previous RT-1 and vision pretrained (VC-1) baselines on four benchmarks.
RT-2 preserves the robot's performance on its original tasks and improves performance on previously unseen scenarios from RT-1's 32% to 62%.
A series of results show that the vision-language model (VLM) can be transformed into a powerful vision-language-action (VLA) model, and the robot can be directly controlled by combining VLM pre-training with robot data.
As with ChatGPT, if this capability is applied at scale, the world will likely change considerably. For now, Google has no immediate plans to deploy RT-2 robots, saying only that its researchers believe robots that can understand human speech will not stop at merely demonstrating capabilities.
Just imagine a robot with a built-in language model that can be placed in a warehouse, grab your medicine for you, or even be used as a home assistant—folding laundry, removing items from the dishwasher, and tidying up around the house.
It may genuinely open the door to using robots in human environments, and all kinds of work requiring manual labor could be taken over. In other words, the jobs that OpenAI's earlier report predicting ChatGPT's impact on employment judged to be beyond the reach of large models may now be covered as well.
**Embodied intelligence, not far from us?**
Embodied intelligence has recently become a direction that many researchers are exploring. This month, Fei-Fei Li's team at Stanford University demonstrated new results: by combining a large language model with a vision-language model, AI can analyze and plan in 3D space and guide robot actions.
Agibot, the general-purpose humanoid robot startup founded by Zhihui Jun, released a video last night that likewise demonstrated robots automatically programming and executing tasks with the help of large language models.
The company is expected to present some of its recent progress publicly in August.
Clearly, in the field of large models, big things are still on the way.