The hottest large language models all love to "talk nonsense." Which one has the worst "hallucination" problem?

Source: Wall Street News

Author: Du Yu

Arthur AI, a New York-based artificial intelligence startup and machine learning monitoring platform, released its latest research report on Thursday, August 17, comparing how prone the large language models (LLMs) from Microsoft-backed OpenAI, "metaverse" company Meta, Google-backed Anthropic, and Nvidia-backed AI unicorn Cohere are to "hallucinating" (in other words, talking nonsense).

Arthur AI regularly updates this research program, dubbed "Generative AI Test Evaluation," to rank the strengths and weaknesses of industry-leading and open-source LLM models.

The latest tests selected OpenAI's GPT-3.5 (175 billion parameters) and GPT-4 (1.76 trillion parameters), Anthropic's Claude-2 (parameter count undisclosed), Meta's Llama-2 (70 billion parameters), and Cohere's Command (50 billion parameters), and posed challenging questions to these top LLM models both quantitatively and qualitatively.

In the "AI Model Hallucination Test," the researchers examined the answers given by different LLM models with questions in categories as diverse as combinatorics, U.S. presidents, and Moroccan political leaders. Multiple steps of reasoning about the information are required."

The study found that, overall, OpenAI's GPT-4 performed the best of all the models tested, producing fewer "hallucinations" than its predecessor GPT-3.5; for example, it hallucinated 33% to 50% less on the math question categories.

Meanwhile, Meta's Llama-2 performed in the middle of the five models tested, and Anthropic's Claude-2 ranked second, behind only GPT-4. Cohere's LLM model was the most prone to "talking nonsense," "very confidently giving wrong answers."

Specifically, on complex math problems GPT-4 ranked first, followed by Claude-2; on questions about U.S. presidents, Claude-2 was the most accurate and GPT-4 came second; and on questions about Moroccan politics, GPT-4 returned to the top spot, while Claude-2 and Llama-2 almost entirely declined to answer.
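To make the testing approach concrete, the sketch below shows one hypothetical way to tally per-category hallucination rates by grading a model's answers against reference answers. It is purely illustrative and is not Arthur AI's code; `ask_model` and `is_factually_consistent` are placeholder functions the evaluator would have to supply.

```python
# Hypothetical sketch of a per-category hallucination tally; not Arthur AI's actual code.
# Each test item has a category, a question, and a known reference answer.
from collections import defaultdict

test_items = [
    {"category": "combinatorics", "question": "How many ways can 5 books be arranged on a shelf?", "reference": "120"},
    {"category": "us_presidents", "question": "Who was the 16th U.S. president?", "reference": "Abraham Lincoln"},
    # ... more items per category
]

def ask_model(question: str) -> str:
    """Placeholder: call the LLM under test and return its answer."""
    raise NotImplementedError

def is_factually_consistent(answer: str, reference: str) -> bool:
    """Placeholder: human or automated grading of the answer against the reference."""
    raise NotImplementedError

def hallucination_rates(items):
    """Return the fraction of hallucinated answers per question category."""
    counts = defaultdict(lambda: {"total": 0, "hallucinated": 0})
    for item in items:
        answer = ask_model(item["question"])
        counts[item["category"]]["total"] += 1
        if not is_factually_consistent(answer, item["reference"]):
            counts[item["category"]]["hallucinated"] += 1
    return {cat: c["hallucinated"] / c["total"] for cat, c in counts.items()}
```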

The researchers also tested the extent to which the AI models would "hedge" their answers with irrelevant warning phrases in order to avoid risk, a common example being "As an AI model, I cannot provide an opinion."

GPT-4 showed a 50% relative increase in hedging warnings compared with GPT-3.5, which the report says "quantifies the more frustrating experience users have cited with GPT-4." Cohere's AI model provided no hedging at all on any of the three question categories.
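As a hypothetical illustration of how such hedging could be quantified (the article does not describe the exact method), one can simply count answers that contain known hedging phrases:

```python
# Hypothetical sketch: count how often a model's answers contain hedging phrases.
# The phrase list and the answer sets are illustrative placeholders.
HEDGE_PHRASES = [
    "as an ai model, i cannot provide an opinion",
    "as an ai language model",
    "i cannot provide an opinion",
]

def hedge_rate(answers: list[str]) -> float:
    """Fraction of answers that contain at least one hedging phrase."""
    if not answers:
        return 0.0
    hedged = sum(
        any(phrase in answer.lower() for phrase in HEDGE_PHRASES)
        for answer in answers
    )
    return hedged / len(answers)

# Example: a 50% relative increase means hedge_rate(gpt4_answers)
# would be 1.5x hedge_rate(gpt35_answers).
```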

By contrast, Anthropic's Claude-2 was the most reliable in terms of "self-awareness," that is, the ability to accurately gauge what it does and does not know, and to answer only questions backed by its training data.

Adam Wenchel, co-founder and CEO of Arthur AI, noted that this is the industry's first report to "comprehensively understand the incidence of hallucinations in artificial intelligence models," rather than simply publishing a single number to rank the different LLMs:

"The most important takeaway from this kind of testing for users and businesses is that you can test exact workloads, and it's critical to understand how LLM performs what you want to accomplish. Many previous LLM-based metrics are not what they are in real life way of being used."

On the same day the report was published, Arthur also launched Arthur Bench, an open-source AI model evaluation tool that can be used to evaluate and compare the performance and accuracy of various LLMs. Enterprises can add custom criteria to suit their own business needs; the goal is to help businesses make informed decisions when adopting AI.
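As a purely illustrative sketch of the general idea of pluggable, custom scoring criteria (this is not Arthur Bench's actual API), a comparison harness might accept any scoring function and average it over each model's outputs:

```python
# Illustrative only: a generic pattern for plugging custom scoring criteria into an
# LLM comparison harness. This is NOT Arthur Bench's actual API.
from typing import Callable

# A criterion maps (candidate_answer, reference_answer) to a score in [0, 1].
Criterion = Callable[[str, str], float]

def exact_match(candidate: str, reference: str) -> float:
    """Score 1.0 only when the candidate matches the reference exactly."""
    return float(candidate.strip().lower() == reference.strip().lower())

def contains_reference(candidate: str, reference: str) -> float:
    """Looser custom criterion: score 1.0 when the reference appears in the candidate."""
    return float(reference.strip().lower() in candidate.lower())

def compare_models(outputs_by_model: dict[str, list[str]],
                   references: list[str],
                   criterion: Criterion) -> dict[str, float]:
    """Average the chosen criterion over each model's outputs."""
    return {
        model: sum(criterion(c, r) for c, r in zip(candidates, references)) / len(references)
        for model, candidates in outputs_by_model.items()
    }
```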

"AI hallucinations" (hallucinations) refer to chatbots completely fabricating information and appearing to spout facts in response to user prompt questions.

In a February promotional video for its generative AI chatbot Bard, Google made an untrue claim about the James Webb Space Telescope. In June, ChatGPT cited "bogus" cases in a filing in New York federal court, and the lawyers who submitted the filing may face sanctions.

OpenAI researchers reported in early June that they had found a possible remedy for "AI hallucinations": training the AI model to reward itself for each correct step of reasoning toward the answer, rather than only rewarding it once the correct final conclusion is reached. This "process supervision" strategy is meant to encourage AI models to reason in a more human-like, step-by-step way.

OpenAI acknowledged in the report:

"Even state-of-the-art AI models are prone to lie generation, and they exhibit a tendency to fabricate facts in moments of uncertainty. These hallucinations are especially problematic in domains that require multi-step reasoning, where a single logical error can be enough to destroy a more Big solution."

Investment tycoon George Soros also published a column in June arguing that artificial intelligence is most likely to aggravate the polycrisis the world currently faces, one reason being the serious consequences of AI hallucinations:

"AI destroys this simple model (Wall Street notes: using facts to tell right from wrong) because it has absolutely nothing to do with reality. AI creates its own reality when the artificial reality does not correspond to the real world (this often happens), the AI illusion is created. This makes me almost instinctively against AI, and I completely agree with the experts that AI needs to be regulated. But AI regulations must be enforced globally, because the incentive to cheat is too great, and those who evade the regulations will gain an unfair advantage. Unfortunately, global regulation is out of the question. Artificial intelligence is developing so fast that it is impossible for ordinary human intelligence to fully understand it. No one can predict where it will take us. ...that's why I'm instinctively against AI, but I don't know how to stop it. With a presidential election in the US in 2024, and likely in the UK, AI will undoubtedly play an important role that will not be anything but dangerous. AI is very good at creating disinformation and deepfakes, and there will be many malicious actors. What can we do about it? I don't have an answer. "

Earlier, Geoffrey Hinton, regarded as the "godfather of artificial intelligence," left Google and has repeatedly warned in public about the risks AI poses, including the possibility that it could destroy human civilization, predicting that "it will take artificial intelligence only 5 to 20 years to surpass human intelligence."
