The hottest large language models all love to "talk nonsense": which one has the worst "hallucination" problem?
Source: Wall Street News
Author: Du Yu
Arthur AI, a New York-based artificial intelligence startup and machine learning monitoring platform, released its latest research report on Thursday, August 17, comparing the tendency of large language models (LLMs) from Microsoft-backed OpenAI, "metaverse" company Meta, Google-backed Anthropic, and Nvidia-backed generative AI unicorn Cohere to "hallucinate" (that is, to talk nonsense).
Arthur AI regularly updates the aforementioned research program, dubbed "Generative AI Test Evaluation," to rank the strengths and weaknesses of industry leaders and other open-source LLM models.
The latest round of tests covered OpenAI's GPT-3.5 (175 billion parameters) and GPT-4 (1.76 trillion parameters), Anthropic's Claude-2 (parameter count undisclosed), Meta's Llama-2 (70 billion parameters), and Cohere's Command (50 billion parameters), posing challenging questions to these top LLMs both quantitatively and qualitatively.
In the "AI model hallucination test," the researchers examined the answers different LLMs gave to questions in categories as diverse as combinatorial mathematics, U.S. presidents, and Moroccan political leaders, questions designed to require multiple steps of reasoning about the information.
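To make the setup concrete, here is a minimal, hypothetical sketch in Python of a category-based hallucination check of the kind described: a handful of questions with known answers per category, a stand-in `ask_model` function for whatever LLM is being tested, and a per-category error rate. The questions, the substring grading rule, and the `ask_model` placeholder are illustrative assumptions, not Arthur AI's actual test set or methodology.

```python
# Hypothetical sketch of a category-based hallucination check.
# `ask_model` stands in for any function that sends a prompt to an LLM
# and returns its text answer; the questions and grading logic here are
# illustrative only.
from typing import Callable, Dict, List, Tuple

# (category, question, expected answer substring) -- illustrative examples
QUESTIONS: List[Tuple[str, str, str]] = [
    ("combinatorics", "How many ways can 3 people sit in a row of 5 chairs?", "60"),
    ("us_presidents", "Who was the 16th president of the United States?", "Lincoln"),
    ("moroccan_politics", "Who is the current King of Morocco?", "Mohammed VI"),
]

def hallucination_rates(ask_model: Callable[[str], str]) -> Dict[str, float]:
    """Return the fraction of wrong (hallucinated) answers per category."""
    wrong: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for category, question, expected in QUESTIONS:
        answer = ask_model(question)
        total[category] = total.get(category, 0) + 1
        if expected.lower() not in answer.lower():
            wrong[category] = wrong.get(category, 0) + 1
    return {cat: wrong.get(cat, 0) / n for cat, n in total.items()}

if __name__ == "__main__":
    # Dummy model that always answers the same thing, so every category scores 1.0.
    rates = hallucination_rates(lambda q: "I think the answer is 42.")
    print(rates)  # {'combinatorics': 1.0, 'us_presidents': 1.0, 'moroccan_politics': 1.0}
```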
The study found that, overall, OpenAI's GPT-4 performed the best of all the models tested, hallucinating less than its predecessor GPT-3.5; on the math question category, for example, its hallucinations were reduced by 33% to 50%.
Meanwhile, Meta's Llama-2 landed in the middle of the five models tested, and Anthropic's Claude-2 ranked second, behind only GPT-4. Cohere's LLM was the most prone to "talking nonsense," "very confidently giving wrong answers."
Specifically, on complex math problems GPT-4 ranked first, followed by Claude-2; on questions about U.S. presidents, Claude-2 ranked first in accuracy, with GPT-4 second; and on questions about Moroccan politics, GPT-4 returned to the top spot, while Claude-2 and Llama-2 almost entirely declined to answer such questions.
The researchers also tested the extent to which the AI models would "hedge" their answers with irrelevant caveats to avoid risk, common phrases including "As an AI model, I cannot provide an opinion."
GPT-4 showed a 50% relative increase in such hedging compared with GPT-3.5, which the report says "quantifies the more frustrating experience users have cited with GPT-4." Cohere's AI model, meanwhile, did not hedge at all on the three question categories above.
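As an illustration of how such a hedging rate could be computed, the sketch below counts responses that contain boilerplate disclaimer phrases. The phrase list, the `hedge_rate` helper, and the sample responses are hypothetical; the article does not describe Arthur AI's actual counting method.

```python
# Hypothetical sketch: measure how often a model "hedges" by counting
# responses that contain boilerplate disclaimer phrases.
from typing import List

HEDGE_PHRASES = [
    "as an ai model, i cannot",
    "as an ai language model",
    "i cannot provide an opinion",
]

def hedge_rate(responses: List[str]) -> float:
    """Fraction of responses containing at least one hedging phrase."""
    if not responses:
        return 0.0
    hedged = sum(
        any(phrase in r.lower() for phrase in HEDGE_PHRASES) for r in responses
    )
    return hedged / len(responses)

# Example: one of three answers hedges, so the rate is ~0.33.
print(hedge_rate([
    "The 16th president was Abraham Lincoln.",
    "As an AI model, I cannot provide an opinion on that.",
    "There are 60 possible seatings.",
]))
```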
By contrast, Anthropic's Claude-2 was the most reliable in terms of "self-awareness," that is, accurately gauging what it does and does not know and answering only questions it has training data to support.
Adam Wenchel, co-founder and CEO of Arthur AI, pointed out that this is the first report in the industry to "comprehensively understand the incidence of hallucinations in artificial intelligence models," rather than simply publishing a single number to rank the different LLMs:
On the same day the report was published, Arthur also launched Arthur Bench, an open-source AI model evaluation tool that can be used to evaluate and compare the performance and accuracy of various LLMs. Enterprises can add customized evaluation criteria to suit their own business needs; the goal is to help businesses make informed decisions when adopting AI.
"AI hallucinations" (hallucinations) refer to chatbots completely fabricating information and appearing to spout facts in response to user prompt questions.
In a February promotional video for its generative AI chatbot Bard, for example, Google made an untrue claim about the James Webb Space Telescope. In June, ChatGPT cited a "bogus" case in a filing in New York federal court, and the lawyers involved in the filing could face sanctions.
OpenAI researchers reported in early June that they had found a potential remedy for "AI hallucinations": training AI models to reward themselves for each correct step in reasoning toward an answer, rather than rewarding only a correct final conclusion. This "process supervision" strategy is meant to encourage AI models to reason in a more human-like, step-by-step way.
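The sketch below contrasts the two reward signals at a toy level: outcome supervision rewards only a correct final answer, while process supervision gives partial credit for each correct intermediate step. The functions and per-step labels are illustrative assumptions; real process supervision trains a reward model on human step-level feedback rather than using known-correct labels.

```python
# Illustrative contrast between outcome supervision and process supervision.
# The per-step correctness labels are assumed to be known here, purely for
# demonstration; in practice a learned reward model would score each step.
from typing import List

def outcome_reward(final_answer_correct: bool) -> float:
    """Outcome supervision: a single reward only if the final answer is right."""
    return 1.0 if final_answer_correct else 0.0

def process_reward(step_correct: List[bool]) -> float:
    """Process supervision: reward every correct intermediate reasoning step."""
    if not step_correct:
        return 0.0
    return sum(step_correct) / len(step_correct)

# A reasoning chain where the first two steps are right but the last is wrong:
steps = [True, True, False]
print(outcome_reward(final_answer_correct=False))  # 0.0 -- no learning signal at all
print(process_reward(steps))                       # ~0.67 -- partial credit guides learning
```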
OpenAI acknowledged in the report:
Investment tycoon George Soros also published a column in June arguing that artificial intelligence could most severely aggravate the polycrisis the world currently faces, with one reason being the serious consequences of AI hallucinations:
Earlier, Geoffrey Hinton, regarded as the "godfather of artificial intelligence," who had left Google, repeatedly and publicly warned of the risks posed by AI, which he said could even destroy human civilization, and predicted that "artificial intelligence may surpass human intelligence in as little as 5 to 20 years."