Google's big AGI robotics move: a 54-person team worked quietly for 7 months on strong generalization and strong reasoning, a new result following the merger of DeepMind and Google Brain

Original source: Qubit

The large-model boom is reshaping Google DeepMind's robotics research.

One of the latest results is the robot project RT-2, which took them seven months to build and quickly became popular online:

How effective is it?

Give it a command in plain human language, and the little guy in front of you will wave its robotic arm, think it over, and complete its "master's task".

For example, handing water to pop singer Taylor Swift, or identifying the logo of a famous team:

It can even reason on its own: asked to "pick up the extinct animal", it accurately selects the dinosaur from among three plastic toys, a lion, a whale, and a dinosaur.

As netizens put it, don't underestimate this ability: it is a logical leap from "extinct animal" to "plastic dinosaur".

What is even more "frightening" is that it easily handles the multi-stage reasoning problem of "choosing a drink for a tired person", which requires chain-of-thought reasoning: as soon as it hears the command, its little hand goes straight for the Red Bull. Almost too smart.

Some netizens lamented after reading:

Can't wait. Fast-forward to it washing dishes for humans (tongue in cheek).

Reportedly, this Google DeepMind result was produced jointly by 54 researchers and took seven months from start to finish before it became the "so easy" demo we see today.

According to the New York Times, Vincent Vanhoucke, Director of Robotics at Google DeepMind, believes that large models have completely changed the research direction of their department:

Because of this (large-model) change, we had to rethink our entire research program. Many of the things we studied before are now entirely invalid.

So, what kind of effects can RT-2 achieve, and what exactly is this research about?

Plugging a multimodal large model into a robotic arm

The robot project, called RT-2 (Robotic Transformer 2), is an "evolutionary version" of the RT-1 released at the end of last year.

Compared with other robotics research, RT-2's core advantage is that it can not only understand "human language" but also reason over it and convert it into instructions the robot can execute, completing tasks in stages.

Specifically, it has three major capabilities: symbol understanding, reasoning, and human recognition.

The first capability, symbol understanding, directly extends knowledge from large-model pre-training to data the robot has never seen. For example, even though "Red Bull" never appears in the robot's own data, the model knows what a Red Bull looks like from its pre-training knowledge and can grasp and handle the object.

The second capability, reasoning, is also RT-2's core advantage; it requires the robot to master three skills: math, visual reasoning, and multilingual understanding.

Skill one, mathematical and logical reasoning, e.g. the command "place the banana at the sum of 2+1":

Skill two, visual reasoning, e.g. "put the strawberry into the correct bowl":

Skill three, multilingual understanding: it can follow instructions even when they are not in English, for example a Spanish command to "pick out the most distinctive one from a pile of items":

The third capability is human recognition, i.e. accurately recognizing and understanding people. The "handing water to Taylor Swift" example at the beginning is one demonstration of this.

So, how are these three abilities realized?

Put simply, the idea is to combine the reasoning, recognition, and math abilities of a vision-language model (VLM) with a robot's manipulation abilities.

To achieve this, the researchers add a "robot action" modality directly to the vision-language model (VLM), turning it into a vision-language-action model (VLA).

The originally very concrete robot action data is then converted into text tokens.

For example, values such as the rotation angle and the target placement coordinates are converted into text like "place at such-and-such position".

This way, the robot data can be mixed into the vision-language dataset for training. At inference time, the text output is converted back into robot actions, enabling the model to control the robot.
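Conceptually, this action tokenization works something like the minimal sketch below. The bin count, value range, and helper names are illustrative assumptions, not the actual RT-2 implementation:

```python
# Minimal sketch of action tokenization, NOT the actual RT-2 code.
# Assumption: each continuous action dimension is discretized into 256 bins
# and written out as integer tokens appended to the model's text output.

NUM_BINS = 256

def tokenize_action(action, low=-1.0, high=1.0):
    """Map continuous action values (e.g. dx, dy, dz, roll, pitch, yaw, gripper)
    to a space-separated string of bin indices the language model can emit."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)
        bin_idx = int((clipped - low) / (high - low) * (NUM_BINS - 1))
        tokens.append(str(bin_idx))
    return " ".join(tokens)

def detokenize_action(token_str, low=-1.0, high=1.0):
    """Invert the mapping: turn the model's emitted token string back into
    continuous values that the robot controller can execute."""
    return [low + int(tok) / (NUM_BINS - 1) * (high - low)
            for tok in token_str.split()]

# Example: a 7-DoF end-effector command becomes plain text for training,
# and the model's text output becomes a command again at inference time.
action = [0.1, -0.25, 0.0, 0.0, 0.5, -0.5, 1.0]
text = tokenize_action(action)       # "140 95 127 127 191 63 255"
recovered = detokenize_action(text)  # approximately the original action
```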

That's right, it really is that simple and brute-force (tongue in cheek).

In this work, the team mainly built the "upgrade" on a series of Google's own base models, including PaLI-X at 5 billion and 55 billion parameters, PaLI at 3 billion parameters, and PaLM-E at 12 billion parameters.

To improve the capabilities of the large model itself, the researchers also put in considerable effort, drawing on recently popular techniques such as chain-of-thought prompting, vector databases, and gradient-free architectures.

These steps give RT-2 many new advantages over the RT-1 released last year.

Let's take a look at the specific experimental results.

Up to three times the performance of RT-1

RT-2 is trained on the data of the previous-generation robot model RT-1 (in other words, the data is unchanged; only the method differs).

The data was collected over a period of 17 months using 13 robots in a kitchen environment set up in the office.

In real-world tests (6,000 trials in total), the authors gave RT-2 many previously unseen objects, requiring it to perform semantic understanding beyond its fine-tuning data to complete the tasks.

It handled them all quite well:

These range from simple recognition of letters, national flags, and characters, to picking out land animals among toy figures and selecting the one in a different color, and even complex commands such as picking up a snack that is about to fall off the table.

Across the three sub-capabilities of symbol understanding, reasoning, and human recognition, the two RT-2 variants perform far better than RT-1 and another visual pre-training method, VC-1, with up to 3x the performance.

As mentioned earlier, the two variants are trained on PaLM-E with 12 billion parameters and PaLI-X with 55 billion parameters, respectively.

In the generalization evaluation, fine-grained tests across multiple categories against several baseline models showed RT-2 improving performance by roughly 2x.

(Unfortunately, it has not yet been compared with other teams' latest LLM-based robotics methods.)

To better understand how different RT-2 settings affect generalization, the authors designed two categories of evaluation:

First, model size: the RT-2 PaLI-X variant alone is trained at both 5 billion and 55 billion parameters;

Second, training method: training the model from scratch vs. fine-tuning vs. co-fine-tuning (sketched below).
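To make "co-fine-tuning" concrete, here is a minimal sketch of the idea: each training batch mixes web vision-language examples with robot-action examples so the model keeps its VLM knowledge while learning to emit action tokens. The mixing ratio, dataset shapes, and function names are assumptions for illustration, not Google's training code:

```python
import random

WEB_RATIO = 0.5  # assumed mixing ratio, for illustration only

def make_cofinetune_batch(web_data, robot_data, batch_size=32):
    """Sample a mixed batch: web examples keep text answers as targets,
    robot examples use tokenized actions (as in the earlier sketch) as targets."""
    batch = []
    for _ in range(batch_size):
        if random.random() < WEB_RATIO:
            batch.append(random.choice(web_data))    # (image, question, text answer)
        else:
            batch.append(random.choice(robot_data))  # (image, instruction, action tokens)
    return batch
```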

The final results show that the VLM's pre-trained weights are important, and that the model's generalization ability tends to increase with model size.

In addition, the authors evaluated RT-2 on the open-source Language-Table benchmark, where it achieved SOTA results in simulation (90% vs. the previous 77%).

Finally, since the RT-2 PaLM-E variant is a vision-language-action model that acts as an LLM, a VLM, and a robot controller within a single neural network, RT-2 can also perform chain-of-thought reasoning over its actions.

In the five reasoning tasks shown in the figure below (the last one is especially interesting: choose an object that could substitute for a hammer), it first outputs the natural-language steps after receiving the command, and then emits the specific action tokens.
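The model's output in these cases looks roughly like the snippet below. The "Plan:"/"Action:" wording and the token format are assumptions for illustration, not the exact RT-2 output syntax:

```python
# Illustrative parse of a chain-of-thought style output: the model first writes
# a natural-language plan step, then the discretized action tokens it will execute.
model_output = "Plan: pick up the Red Bull can. Action: 1 128 91 241 5 101 127 217"

plan_part, action_part = model_output.split("Action:")
plan = plan_part.replace("Plan:", "").strip()          # "pick up the Red Bull can."
action_tokens = [int(t) for t in action_part.split()]  # [1, 128, 91, 241, ...]
# action_tokens would then be de-tokenized into a robot command,
# as in the earlier tokenization sketch.
```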

In summary, this latest RT-2 model not only adapts better to scenes the robot has never seen and generalizes better, but also, thanks to the backing of a stronger large model, has mastered difficult new abilities such as reasoning.

One More Thing

Google's focus on large-model-driven robotics research does not seem to have come out of nowhere.

Just in the past two days, a paper co-authored with Columbia University on using large models to help robots acquire more manipulation skills has also been making the rounds:

The paper proposes a new framework that not only lets the robot work well with the large model, but also preserves the robot's original manipulation and control capabilities:

Unlike RT-2, this project is open source:

It really does look like large models are driving an upgrade of the entire robotics department.

Together with the recent embodied-intelligence results from Fei-Fei Li's team, it is fair to say that using large models to drive robots has become a research trend, and we are seeing a wave of very promising progress.

What are your expectations for this research direction?
