Jensen Huang talks to seven authors of the Transformer paper: We are stuck with the original model and need a more powerful new architecture

![Jensen Huang talks to seven authors of the Transformer paper: We are stuck with the original model and need a more powerful new architecture](https://cdn-img.panewslab.com//panews/2022/3/23/images/3209736c2376bc78f33a30e387cc4e77.jpeg)

Author: Guo Xiaojing

Source: Tencent News

In 2017, a landmark paper, "Attention Is All You Need", was published. It introduced the Transformer model, built on the self-attention mechanism, for the first time. This innovative architecture shed the constraints of traditional RNNs and CNNs: by computing attention in parallel, it effectively overcame the problem of long-range dependencies and dramatically sped up the processing of sequence data. The Transformer's encoder-decoder structure and multi-head attention mechanism set off a storm in the field of artificial intelligence, and the wildly popular ChatGPT is built on this architecture.
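To make the mechanism concrete, below is a minimal NumPy sketch of scaled dot-product attention, the core operation described above. The toy dimensions and the reuse of a single matrix for queries, keys, and values are simplifications for illustration only; the actual Transformer adds learned projections, multiple heads, positional information, and masking.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # pairwise similarity between positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the key positions
    return weights @ V                                # each output mixes all values at once

# Toy "sentence" of 4 tokens, each represented by an 8-dimensional vector.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
# In a real Transformer, Q, K, and V come from learned linear projections of x;
# here we reuse x directly just to show the mechanics.
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8): every position attends to every other position in parallel
```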

Imagine the Transformer as your brain in conversation with a friend: it pays attention to every word the other person says at once and understands the connections between those words. It gives computers something close to human language understanding. Before it, RNNs were the mainstream method for processing language, but they handled information slowly, like an old-fashioned tape player that has to play word by word. The Transformer is more like a skilled DJ, controlling multiple tracks at the same time and quickly picking out the key information.

The emergence of the Transformer model has greatly improved the ability of computers to process language, making tasks such as machine translation, speech recognition, and text summarization more efficient and accurate. This is a huge leap for the entire industry.

This innovation resulted from the joint efforts of eight AI scientists then working at Google. Their initial goal was simple: improve Google's machine translation service. They wanted machines to read and understand entire sentences, rather than translating word by word in isolation. That idea became the starting point of the Transformer architecture: the self-attention mechanism. Building on it, the eight authors pooled their expertise and published "Attention Is All You Need" in December 2017, describing the Transformer architecture in detail and opening a new chapter in generative AI.

In the world of generative AI, the Scaling Law is a core principle: as a Transformer model grows, its performance improves, but that also demands more powerful computing resources to support larger models and deeper networks, along with high-performance computing services. NVIDIA has thus become a key player in this AI wave.
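The scaling behavior referred to here is often summarized, in later empirical studies rather than in the Transformer paper itself, as a power law relating test loss to model size. The form below is that commonly cited relationship; the constants are empirical fits, not values taken from this article.

```latex
% Commonly cited empirical scaling-law form (N_c and \alpha_N are fitted constants,
% not values from this article): test loss L falls as a power law in parameter count N.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N > 0
```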

At this year's GTC conference, NVIDIA's Jensen Huang invited the authors of the Transformer paper (Niki Parmar was unable to attend at the last minute) to a ceremonial roundtable discussion. It was the first time the seven authors had appeared together in public to discuss their work.

They also made some impressive points during the conversation:

  • The world needs something better than the Transformer, and I think all of us here hope it will be replaced by something that takes us to a new plateau of performance.
  • We did not achieve our original goal. Our original intention in starting the Transformer was to model how tokens evolve: not just a linear generation process, but text or code evolving step by step.
  • A simple problem like 2+2 may end up going through the trillion parameters of a large model. I think adaptive computation is one of the next things that has to happen, where we know how much compute to spend on a particular problem.
  • I think current models are too cheap and too small. At around $1 per million tokens, they are roughly 100 times cheaper than going out and buying a paperback book.

The following is the full conversation:

Jensen Huang: Over the past sixty years, computer technology does not seem to have undergone fundamental change, at least not since I was born. The computer systems we use today, with their multitasking, separation of hardware and software, software compatibility, data backup, and the programming skills software engineers rely on, are essentially built on the design principles of the IBM System/360: a central processor, I/O subsystems, multitasking, hardware/software separation, software compatibility, and so on.

I don't think modern computing has fundamentally changed since 1964, even though in the 1980s and 1990s computers transformed into the form we know today. Over time, though, the marginal cost of computing kept falling: roughly ten times every ten years, a thousand times over fifteen years, ten thousand times over twenty. In two decades the cost of computing dropped by almost 10,000 times, and that change gave society enormous power.

Try to imagine everything expensive in your life costing one ten-thousandth of what it once did; the car you bought for $200,000 twenty years ago would now cost $1. Can you imagine that change? The decline in computing costs didn't happen overnight, though. It built gradually until it reached a critical point, and then the trend of falling costs suddenly stalled. Things still improved a little each year, but the rate of change stagnated.

We began to explore accelerated computing, but accelerated computing is not easy. You have to design it from scratch, bit by bit. In the past we might have solved a problem by following established steps one by one; now we need to redesign those steps, recasting the old serial rules as parallel algorithms. It is effectively a new field of science.

We recognized this and believed that if we could accelerate even the 1% of code that consumes 99% of the running time, there would be applications that benefit enormously. Our goal is to make the impossible possible, and to make things that are already possible far more efficient. That is what accelerated computing is about.

Looking back at the company's history, you can see our ability to accelerate all kinds of applications. We first achieved significant acceleration in gaming, so effectively that people mistook us for a gaming company. In fact our goal was much broader, but the gaming market was huge, big enough to fund incredible technological progress. That combination is rare, but we found a special case.

To make a long story short, in 2012 AlexNet lit the spark, the first collision between artificial intelligence and NVIDIA GPUs. That marked the start of our remarkable journey in this field. A few years later, we found the perfect application scenario that laid the foundation for where we are today.

In short, these achievements laid the foundation for generative AI. Generative AI can not only recognize images but also turn text into images and even create brand-new content. We now have the technical ability to understand pixels, to recognize them, and to understand the meaning behind them, and from that meaning we can create new content. AI's ability to understand the meaning behind data is a huge change.

We have reason to believe this is the beginning of a new industrial revolution, one in which we are creating something that has never existed before. In the previous industrial revolution, water was the source of energy: water went into the machines we built, generators spun, water came in and electricity came out, like magic.

Generative AI is a brand-new kind of "software" that can itself create software, and it relies on the joint efforts of many scientists. Imagine feeding the AI its raw material, data, into a "building", the machine we call a GPU, and out comes something magical. It is reshaping everything; we are witnessing the birth of "AI factories".

This change amounts to a new industrial revolution. We have never really experienced anything like it, and it is now slowly unfolding before us. Don't miss the next ten years, because in those ten years we will create enormous productivity. The pendulum has been set in motion, and our researchers are already acting.

Today we have invited the creators of the Transformer to discuss where generative AI will take us in the future.

They are:

Ashish Vaswani: Joined the Google Brain team in 2016. In April 2022, he co-founded Adept AI with Niki Parmar, left the company in December of the same year, and co-founded another artificial intelligence startup, Essential AI.

Niki Parmar: Worked at Google Brain for four years before co-founding Adept AI and Essential AI with Ashish Vaswani.

Jakob Uszkoreit: Worked at Google from 2008 to 2021, then left to co-found Inceptive, a company focused on AI for the life sciences that uses neural networks and high-throughput experiments to design the next generation of RNA molecules.

Illia Polosukhin: Joined Google in 2014 and was one of the first people to leave among the eight-person team. In 2017, he co-founded the blockchain company NEAR Protocol.

Noam Shazeer: Worked at Google from 2000 to 2009 and from 2012 to 2021. In 2021, he left Google and co-founded Character.AI with former Google engineer Daniel De Freitas.

Llion Jones: Worked at Delcam and YouTube before joining Google in 2012 as a software engineer. He later left Google to found the artificial intelligence startup Sakana AI.

Lukasz Kaiser: Formerly a researcher at the French National Center for Scientific Research. Joined Google in 2013. In 2021, he left Google and became a researcher at OpenAI.

Aidan Gomez: Graduated from the University of Toronto in Canada. He was still an intern on the Google Brain team when the Transformer paper was published, and was the second member of the eight-person team to leave Google. In 2019, he co-founded Cohere.

![Jensen Huang talks to seven authors of the Transformer paper: We are stuck with the original model and need a more powerful new architecture](https://cdn-img.panewslab.com//panews/2022/3/23/images/e2cb0168e261ffba0c0ea67a5502acf8.png)

**Jensen Huang: As you sit here today, please fight for your chance to speak. There is no topic that can't be discussed here, and you can even jump out of your chair to argue a point. Let's start with the most basic question: what problems were you facing at the time, and what inspired you to create the Transformer?**

Illia Polosukhin: If you want to ship models that can actually read search results, that can process piles of documents, you need models that can digest that information quickly. The recurrent neural networks (RNNs) of the time simply couldn't meet that need.

Indeed, although recurrent neural networks and some early attention mechanisms were attracting interest at the time, they still had to read word by word, which was not efficient.

Jakob Uszkoreit: The rate at which we were generating training data far outstripped our ability to train the most advanced architectures on it. In fact, we were using simpler architectures, such as feed-forward networks with n-grams as input features, and these often outperformed the more complex, more advanced models simply because they trained faster, at least on training data at Google scale.

Powerful RNNs, especially long short-term memory networks (LSTMs), already existed at the time.

Noam Shazeer: It felt like a burning issue. We had started noticing these scaling laws around 2015: as you make the model bigger, it gets smarter. And it felt like the best problem in the history of the world, because it's so simple. You're just predicting the next token, yet the model becomes so smart and can do a million different things, and you just want to scale it up and make it better.

A huge frustration was that RNNs were just too painful to work with. Then I overheard these guys saying, hey, let's replace the RNN with convolutions or an attention mechanism, and I thought, great, let's do that. I like to compare the Transformer to the leap from the steam engine to the internal combustion engine. We could have gotten through the industrial revolution on steam, but it would have been painful, and the internal combustion engine made everything better.

Ashish Vaswani: I learned some hard lessons in graduate school, especially while working on machine translation. I realized, hey, I'm never going to learn all those complicated rules of language. Gradient descent, the way we train these models, is a better teacher than I am. So I'm not going to learn the rules; I'll just let gradient descent do all the work for me. That was my second lesson.

What I learned the hard way is that general architectures that can scale will ultimately win out in the long run. Today it's tokens; tomorrow it might be the actions we take on computers, and models will start to mimic our activity and automate a lot of the work we do. As we've discussed, the Transformer, and especially its self-attention mechanism, has very broad applicability, and it also makes gradient descent work better. The other thing is the physics of it: one thing I learned from Noam is that matrix multiplication is a good idea.

Noam Shazeer: That pattern keeps recurring. Every time you add a bunch of rules, gradient descent ends up learning those rules better than you could. That's been the story of deep learning: we were building AI models in the shape of the GPU, and now we are building them in the shape of the supercomputer. In a very real sense the supercomputer is now the model, and, to put it the other way around, we are building the supercomputer in the shape of the model.

**Jensen Huang: So what problem were you trying to solve?**

Lukasz Kaiser: Machine translation. Thinking back five years, the process seemed so hard: you had to gather data, maybe translate it by hand, and the result might be only marginally correct. The state of the art was still very basic. Now these models learn to translate almost without explicit supervision. You just provide text in one language and in another, and the model learns to translate on its own; the ability emerges naturally, and it works remarkably well.

Llion Jones: And that's where the intuition of "attention is all you need" came from. I came up with the title, and basically it happened while we were looking for one.

We had just been doing ablations, throwing away pieces of the model to see how much worse it would get. To our surprise, it started getting better, including when we threw away all the convolutions. That's where the title came from.

Ashish Vaswani: What's interesting is that we actually started with a very basic framework and then added things. We added convolutions, and later, I guess, we took them away again. And there were many other very important pieces, like multi-head attention.

**Jensen Huang: Who came up with the name Transformer? Why is it called Transformer?**

Jakob Uszkoreit: We liked the name. We picked it almost at random and thought it fit: the model transforms the representation of the data it is given, and by that logic almost all of machine learning is a transformer of some kind.

Noam Shazeer: We hadn't settled on the name early on. I think it's a simple name, and a lot of people think it's a good one. We went through many candidates before finally landing on "Transformer", which describes the principle of the model: it transforms the entire signal. By that logic, almost all of machine learning gets transformed.

Llion Jones: The name stuck not only because of the translation work, but because we wanted to describe this kind of transformation in the most general way. I don't think we did a perfect job with it, but as a change-maker, a driver, an engine, it made sense, and everyone can grasp the idea of a large language model that way. Architecturally, that was still a very early period.

But we did realize that we were trying to build something very, very general, something that could really turn anything into anything else. And I don't think we predicted how well the Transformer would work on images; that was a bit surprising. It may sound obvious to you now: you can split an image into patches and tokenize each little piece. That idea was there very early in the architecture.

So when we were building the Tensor2Tensor library, what we really focused on was scaling up autoregressive training, and not just for language but for image and audio components too.

Lukasz said his problem was translation; I think he's underselling himself. All of these ideas are patterns we're now starting to see come together, and they all feed into the model.

But really, everything was there early on; the ideas had been percolating, and it just took some time. Lukasz's goal was: we have all these academic datasets, image to text, text to image, audio to text, text to text, so we should train on everything.

That idea really drove the scaling work, and eventually it paid off. It was fascinating that we could translate images to text, text to images, and text to text.

You can use it to study biology, or "biological software", which is similar to computer software in that it starts as the specification of a program and is then compiled into something that can run, in the computer's case on a GPU.

The life of biological software begins with the specification of certain behaviors. Say you want a cell to produce a particular protein. You then use deep learning to convert that specification into an RNA molecule that actually exhibits those behaviors once it gets into your cells. So the idea really goes far beyond translating into English.

**Jensen Huang: Did you create a large lab to produce all of this?**

Aidan Gomez: A lot of data is available, and much of it remains public because it is still largely publicly funded. But in reality you still need data that clearly captures the phenomenon you're trying to model.

If you're trying to model a particular phenomenon, say protein expression or mRNA vaccines and things like that, then yes: in Palo Alto we have a bunch of robots and people in lab coats, both deep learning researchers and former biologists.

We see ourselves as pioneering something new, actually generating this data and validating the models that design these molecules. But the original idea was still translation.

**Jensen Huang: So the original idea was machine translation. What I want to ask is: what key milestones have there been in strengthening and extending the architecture, and what impact have they had on the Transformer's design?**

Aidan Gomez: You've seen it along the way. Has there really been a big additional contribution on top of the basic Transformer design? On the inference side there's been a lot of work to speed these models up and make them more efficient.

Honestly, it's a little disturbing to me how similar the model still is to its original form. I think the world needs something better than the Transformer, and I think all of us here want it to be replaced by something that takes us to a new plateau of performance.

I want to ask everyone here: what do you think comes next? That's the exciting step, because what we have now is so similar to what existed six or seven years ago, right?

Llion Jones: Yeah, I think people would be surprised at how similar it still is, right? People do like to ask me what comes next because I'm an author of the paper, as if I could wave a magic wand and say what happens. What I want to point out is the principle at work here: it's not enough to be better, you have to be demonstrably better.

Because if it's only slightly better, that isn't enough to move the whole AI industry onto something new. So we're stuck with the original model, even though technically it's probably not the most powerful thing we could have right now.

But everyone knows what they want from these tools: better context windows, faster token generation. Well, I'm not sure you'll like this answer, but these models use too much compute right now. I think a lot of computation is wasted, and we're working hard to make it more efficient. Thank you.

**Jensen Huang: I think we're making that more efficient, thank you!**

Jakob Uszkoreit: But I think it's mainly about how resources are distributed, rather than how many resources are consumed in total. For example, we don't want to spend too much on an easy problem, or too little on a problem that's too hard and end up with no solution at all.

Illia Polosukhin: Take the example of 2+2: if you feed it into one of these models, it goes through a trillion parameters. So I think adaptive computation is one of the things that has to come next, where we know how much compute should be spent on a particular problem.
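A minimal sketch of the "adaptive computation" idea Polosukhin describes: estimate how hard a query is and spend compute accordingly, so that trivial arithmetic never touches the large model. The difficulty heuristic and the `small_model`/`large_model` functions below are hypothetical placeholders for illustration, not anything the panelists have built.

```python
def estimate_difficulty(prompt: str) -> float:
    """Hypothetical difficulty score in [0, 1]; a real router might be a small learned model."""
    return min(len(prompt.split()) / 50.0, 1.0)

def small_model(prompt: str) -> str:
    return f"[small model] {prompt}"      # placeholder for a cheap, fast model

def large_model(prompt: str) -> str:
    return f"[large model] {prompt}"      # placeholder for the full trillion-parameter model

def answer(prompt: str) -> str:
    stripped = prompt.replace(" ", "")
    # Cheapest path: pure arithmetic like "2+2" goes to a calculator, not a model.
    # (Input is restricted to digits and operators before eval, since this is only a sketch.)
    if stripped and all(c.isdigit() or c in "+-*/()." for c in stripped):
        return str(eval(stripped))
    # Otherwise spend compute in rough proportion to estimated difficulty.
    if estimate_difficulty(prompt) < 0.3:
        return small_model(prompt)
    return large_model(prompt)

print(answer("2+2"))  # -> "4", with no model call at all
```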

Aidan Gomez: We know how much compute these models currently spend regardless of the problem, and I think that's the next issue to focus on. It will be a real game-changer, and it's the direction things are heading.

Lukasz Kaiser: This idea predates the Transformer, and parts of it made it into the model. In fact, I'm not sure everyone here knows that we did not achieve our original goal. Our intention in starting this project was to model how tokens evolve: not just a linear generation process, but text or code evolving step by step. You iterate, you edit. That would let us not only imitate how humans develop text, but also make humans part of the process, because if you could generate content the way humans do, they could actually give feedback along the way, right?

All of us had read Shannon's paper, and the original idea was to just focus on language modeling and perplexity, but that's not how it turned out. I think that's also where we can go further. It's also about organizing compute intelligently, and that now applies to image processing as well. I mean, diffusion models have this interesting property of continually refining and improving their output through iteration, and we don't yet have that capability in these models.

I mean, there's this fundamental question: which knowledge should be built into the model, and which should live outside it? Do you use a retrieval model? Retrieval-augmented generation (RAG) is one example. Likewise there's the question of reasoning: which reasoning tasks should be performed externally by symbolic systems, and which directly inside the model? It's very much a discussion about efficiency. I do believe large models will eventually learn how to do calculations like 2+2, but if you compute 2+2 by crunching through a giant neural network, that's obviously inefficient.
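As an illustration of the retrieval-augmented pattern Kaiser mentions, here is a bare-bones sketch in which factual knowledge lives in an external store and only the retrieved snippets are handed to the model. The crude lexical retriever and the `generate` placeholder are assumptions for this example, not any particular product's API; real systems typically use a vector database and embedding-based search.

```python
from typing import List

# Toy external knowledge store; in practice this would be a vector database.
DOCUMENTS = [
    "The Transformer architecture was introduced in the 2017 paper 'Attention Is All You Need'.",
    "Scaled dot-product attention lets every position attend to every other position in parallel.",
]

def retrieve(query: str, docs: List[str], k: int = 1) -> List[str]:
    """Crude lexical retrieval: rank documents by word overlap with the query."""
    def overlap(doc: str) -> int:
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(docs, key=overlap, reverse=True)[:k]

def generate(prompt: str) -> str:
    """Placeholder for a language-model call."""
    return f"[model answer conditioned on]\n{prompt}"

def rag_answer(query: str) -> str:
    context = "\n".join(retrieve(query, DOCUMENTS))
    # The model sees only the retrieved context plus the question,
    # rather than having to memorize every fact in its weights.
    return generate(f"Context:\n{context}\n\nQuestion: {query}")

print(rag_answer("When was the Transformer introduced?"))
```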

**Jensen Huang: If the AI only needs to compute 2+2, it should just use a calculator and do it with the least energy possible, because we know a calculator is the most efficient tool for 2+2. But if someone asks the AI, "How did you arrive at 2+2? How do you know 2+2 is the right answer?", would that consume a lot of resources?**

![Jensen Huang talks to seven authors of the Transformer paper: We are stuck with the original model and need a more powerful new architecture](https://cdn-img.panewslab.com//panews/2022/3/23/images/943398d349cf0e17db81b1469281b267.png)

Noam Shazeer: Exactly. You mentioned an example before, but I'm also convinced that the artificial intelligence systems that everyone here develops are smart enough to actively use calculators.

Systems today already do exactly that. But I think current models are too cheap and too small. The reason they're cheap is technology like NVIDIA's, thanks to the sheer output of compute.

An operation costs roughly 10⁻¹⁸ dollars; that's the order of magnitude, thanks to all the computing resources you've created. If you look at a model with 500 billion parameters doing about a trillion operations per token, that works out to about a dollar per million tokens, which is 100 times cheaper than going out, buying a paperback book, and reading it. And the applications are worth a million times or more than that cost of computation on a giant neural network; certainly things like curing cancer are, but it goes well beyond that.
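Shazeer's back-of-the-envelope figure can be checked directly from the numbers he cites: roughly 10⁻¹⁸ dollars per operation and about a trillion operations per generated token for a model on the order of 500 billion parameters. The small calculation below just reproduces that estimate; the inputs are his rough figures, not measured prices.

```python
cost_per_op = 1e-18      # dollars per operation (Shazeer's rough figure)
ops_per_token = 1e12     # ~1 trillion operations per token for a ~500B-parameter model
tokens = 1_000_000       # one million generated tokens

total = cost_per_op * ops_per_token * tokens
print(f"${total:.2f} per million tokens")   # -> $1.00, roughly 1/100 the price of a paperback
```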

Ashish Vaswani: I think making these models smarter is about how they get feedback from the world, and whether we can achieve multitasking and parallelism across many threads of work. If you really want to build such a model, that feedback loop is a great way to guide its design.

**Jensen Huang: Can you quickly share why you started your company?**

Ashish Vaswani: At our company, the goal is to build models that solve new tasks: to understand what a job requires and adapt the model to meet the customer's needs. Starting around 2021, I realized the biggest problem is that you can't just make models smarter in isolation; you also need the right people using and interpreting them. We want the world and the model to be intertwined, so the model keeps getting bigger and better. Some of the progress required in that learning process simply can't happen in the vacuum of a laboratory.

Noam Shazeer: We co-founded the company in 2021. We had this great technology, but it wasn't reaching many people. There are billions of people with different tasks they need to get done, and that's what deep learning is for: the technology keeps improving, helped along by what Jensen is building, and our ultimate goal is to help people all over the world. You have to test things, and we need to ship faster so that far more people can use these applications. At first not everyone was using them, and a lot of people used them just for fun, but they did work, they really did work.

Jakob Uszkoreit: Thanks. I want to talk about the software ecosystem we're creating. I co-founded the company in 2021 with the goal of working on problems with real scientific impact. We had been dealing with fairly abstract things before, but when my first child was born, the way I saw the world changed. We want to make human life better and contribute to protein research. Especially after having children, I want to improve the existing structures of medicine and make sure technological progress has a positive impact on people's lives. Protein structure work has already been transformed to some extent, but right now we lack data, and we have to ground this effort in data. I feel that not only as a duty but as a father.

**Jensen Huang: I like your point of view. I've always been interested in drug design, in having computers learn how to develop and generate new medicines. If new drugs could be learned and designed, and a lab could test them, you could determine whether such a model really works.**

Llion Jones: Yeah, I'm the last one to share. The company we co-founded is called Sakana AI; "sakana" is Japanese for "fish". We chose the name because a school of fish is the image we have in mind: nature-inspired intelligence, where combining many simple elements can create something complex and beautiful. People may not always follow the specifics of how that works, but our core internal philosophy is "learning always wins".

Whatever problem you want to solve, learning is always what wins in the end, and in generative AI it's the learned part that wins. As a researcher, I'd remind everyone that what gives AI models real meaning is that they can genuinely help us understand the mysteries of the universe. I also wanted to mention that we're about to announce something new that we're very excited about. We now have a solid body of research to build on, and we're going through a transformative period in which these models are being organized in ways that let people truly engage with them. We want to make these models more practical, and to use these large, transformative models to change how people understand the world and the universe. That is our goal.

Aidan Gomez: My reason for starting the company was similar to Noam's. I think computing is entering a new paradigm that is changing existing products and the way we work. Everything runs on computers, and this technology changes all of it to some degree. So what is our role? We're bridging the gap, bridging the chasm. We see ourselves building the platform that lets every company adopt and integrate these products and put them directly in front of users. That's how we push the technology forward and make it cheaper and more ubiquitous.

**Jensen Huang: What I particularly enjoy is that while Noam Shazeer seems especially calm, you look very excited. The contrast between your personalities is striking. Now, over to Lukasz Kaiser.**

Lukasz Kaiser: My experience at OpenAI has been quite a departure. The company is a lot of fun, and we crunch enormous amounts of data and compute, but at the end of the day my role is still that of a data cruncher.

Illia Polosukhin: I was the first one to leave. I firmly believed we would make significant progress and that software would change the entire world. The most direct way is to teach machines to write code, making programming accessible to everyone.

At NEAR, although we've only gone so far, we are committed to bringing together human intelligence and the data it produces, for instance by getting people to see that we need a basic methodology for this. These models are a foundational development: large models are now used all over the world, in aerospace and many other fields, underpinning communication and interaction everywhere and genuinely giving us new capabilities. As usage deepens we see more and more models appearing, and so far there have not been many disputes about copyright.

We are now in a new generative era, an era that celebrates innovation and innovators, and we want to actively participate and embrace change, so we looked for different ways to help build a really cool model.

**Jensen Huang: That kind of positive feedback loop is very good for the overall economy; we can now design for it. Someone asked: in an era when GPT models are trained on datasets of billions of tokens, what comes next? What will the new modeling techniques be? What do you want to explore, and where will your data come from?**

Illia Polosukhin: Our starting point is vectors and displacements. We need models with real economic value, models people can evaluate, so that your techniques and tools ultimately get put into practice and make the whole model better.

**Jensen Huang: How do you train the model for a specific domain? What were the initial interactions and interaction patterns? Is it communication between models, or generative models and techniques?**

Illia Polosukhin: In our team, everyone has their own technical expertise.

Jakob Uszkoreit: The next step is reasoning. We all recognize how important it is, but a lot of that work is still done by hand by engineers. We're effectively teaching models to answer in an interactive question-and-answer format, and we want them to understand the "why" and produce strong reasoning patterns together. We want the model to generate the content we're after, and that way of generating is what we're pursuing. Whether it's video, text, or 3D information, it should all be integrated.

Lukasz Kaiser: I wonder whether people realize that reasoning actually comes from data. When we reason, we start from a set of data and ask why this data looks the way it does, and then we learn that many applications are really built on that process of reasoning over data. Thanks to the power of computers and systems like these, we can build further from there: we can reason about the relevant content and run experiments.

Much of this is derived from data. Reasoning is evolving very quickly, data models matter enormously, and there will be much more interactive content in the near future. We haven't trained on this nearly enough; it isn't yet the central element, and we need much richer data.

Noam Shazeer: Designing data, say for a teaching machine, may involve anywhere from hundreds to hundreds of millions of different tokens.

Ashish Vaswani: The point I want to make is that many partners in this space have hit real milestones. What does the best automation look like? In practice, it means breaking real-world tasks down into their parts. The model matters too: it helps us gather data and check that the data is in the right place. On one hand it helps us focus on the data; on the other, that data gives us high-quality models for completing abstract tasks. So we see measuring this progress as itself a creative act, a way of doing science, and a way of advancing automation.

**Jensen Huang: You can't do great work without a good measurement system. Do you have any questions for each other?**

Illia Polosukhin: Nobody really wants to spell out every step they took. But what we want is to understand and explore what we're doing, gather enough data and information, and make reasonable inferences. For example, a task might take six steps, but with reasoning you can skip one and do it in five; sometimes you don't need six steps, and sometimes you need more. So how do you reproduce a scenario like that? What do you need in order to move beyond the token level?

Lukasz Kaiser: My personal view is that reproducing something like this in a large model is a very complicated process. Systems will evolve, but essentially you have to devise a method. Humans are creatures of recurrence: throughout history we have reproduced successful patterns again and again.

**Jensen Huang: It's been a great pleasure talking with you, and I hope you'll have chances to talk with each other and produce some indescribable magic. Thank you for joining this conversation, thank you very much!**
