AI now outperforms humans in nearly all performance metrics

Stanford University’s Institute for Human-Centered Artificial Intelligence (HAI) has published the latest edition of its comprehensive AI Index report, authored by a diverse team of academic and industry specialists.

This edition has more content than previous editions, reflecting the rapid evolution of AI and its growing significance in our everyday lives. It examines everything from which sectors use AI the most to which country is most nervous about losing jobs to AI. But one of the most salient takeaways from the report is AI’s performance when pitted against humans.

For people who haven’t been paying attention, AI has already beaten us in a frankly shocking number of significant benchmarks. In 2015, it surpassed us in image classification, then basic reading comprehension (2017), visual reasoning (2020), and natural language inference (2021).

AI is getting so clever, so fast, that many of the benchmarks used to this point are now obsolete. Indeed, researchers in this area are scrambling to develop new, more challenging benchmarks. To put it simply, AIs are getting so good at passing tests that now we need new tests – not to measure competence, but to highlight areas where humans and AIs are still different, and find where we still have an advantage.

It’s worth noting that the results below reflect testing with these old, possibly obsolete, benchmarks. But the overall trend is still crystal clear:

AI has already surpassed many human performance benchmarks

Look at those trajectories, especially how the most recent tests are represented by a close-to-vertical line. And remember, these machines are virtual toddlers.

The new AI Index report notes that in 2023, AI still struggled with complex cognitive tasks like advanced math problem-solving and visual commonsense reasoning. However, ‘struggled’ here might be misleading; it certainly doesn’t mean AI did badly.

Performance on MATH, a dataset of 12,500 challenging competition-level math problems, improved dramatically in the two years since its introduction. In 2021, AI systems could solve only 6.9% of problems. By contrast, in 2023, a GPT-4-based model solved 84.3%. The human baseline is 90%.
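To make the size of that jump concrete, here is a minimal Python sketch that works only from the three percentages quoted above (6.9%, 84.3%, and the 90% human baseline). The function name and the output keys are illustrative and do not come from the report.

```python
def summarize_progress(start: float, end: float, human: float) -> dict:
    """Summarize a benchmark jump using scores given in percent."""
    return {
        # Absolute gain, in percentage points
        "gain_points": round(end - start, 1),
        # How many times higher the later score is than the earlier one
        "improvement_factor": round(end / start, 1),
        # Remaining distance to the human baseline, in percentage points
        "gap_to_human_points": round(human - end, 1),
    }


# Figures quoted in the article: 2021 score, 2023 score, human baseline
math_summary = summarize_progress(6.9, 84.3, 90.0)
print(math_summary)
```

On these figures, the 2023 score is roughly 12 times the 2021 score and sits less than six percentage points short of the human baseline.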

And we’re not talking about the average human here; we’re talking about the kinds of humans who can solve test questions like this:

An example MATH question asked of the AI. Yikes!

That’s where things are at with advanced math in 2024, and we’re still very much at the dawn of the AI era.

Then there’s visual commonsense reasoning (VCR). Beyond simple object recognition, VCR assesses how AI uses commonsense knowledge in a visual context to make predictions. For example, when shown an image of a cat on a table, an AI with VCR should predict that the cat might jump off the table or that the table is sturdy enough to hold it, given its weight.

The report found that between 2022 and 2023, performance on the VCR benchmark rose 7.93% to a score of 81.60, against a human baseline of 85.
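The article gives the 2023 score and the size of the increase, but not the 2022 score. As a rough sketch, and assuming the 7.93% is a relative increase rather than a gain in percentage points (an assumption, since the text does not say), the earlier score can be backed out like this. The function name is illustrative.

```python
def previous_score(current: float, relative_increase_pct: float) -> float:
    """Back out the earlier score, ASSUMING the increase is relative."""
    return current / (1 + relative_increase_pct / 100)


# Figures quoted in the article
vcr_2023 = 81.60
human_baseline = 85.0

# Implied 2022 score under the relative-increase assumption
vcr_2022_implied = round(previous_score(vcr_2023, 7.93), 2)

# Remaining distance between the 2023 score and the human baseline
gap_to_human = round(human_baseline - vcr_2023, 2)

print(vcr_2022_implied, gap_to_human)
```

Under that assumption, the implied 2022 score is about 75.6, and the 2023 score is 3.4 points short of the human baseline.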

A sample question used to test an AI’s visual commonsense reasoning

Nowadays, AI generates written content across many professions. But, despite a great deal of progress, large language models (LLMs) are still prone to ‘hallucinations,’ a very charitable term pushed by companies like OpenAI, which roughly translates to “presenting false or misleading information as fact.”

Last year, AI’s propensity for ‘hallucination’ was made embarrassingly plain for Steven Schwartz, a New York lawyer who used ChatGPT for legal research and didn’t fact-check the results. The judge hearing the case quickly picked up on the legal cases the AI had fabricated in the filed paperwork and fined Schwartz US$5,000 (AU$7,750) for his careless mistake. His story made worldwide news.

HaluEval was used as a benchmark for hallucinations. Testing showed that for many LLMs, hallucination is still a significant issue.

Truthfulness is another thing generative AI struggles with. In the new AI Index report, TruthfulQA was used as a benchmark to test the truthfulness of LLMs. Its 817 questions (about topics such as health, law, finance and politics) are built around commonly held misconceptions, the kind of questions we humans often get wrong.

GPT-4, released in early 2023, achieved the highest performance on the benchmark with a score of 0.59, almost three times higher than a GPT-2-based model tested in 2021. Such an improvement indicates that LLMs are progressively getting better when it comes to giving truthful answers.

What about AI-generated images? To understand the exponential improvement in text-to-image generation, check out Midjourney’s efforts at drawing Harry Potter since 2022:

How text-to-image generation has improved with progressive versions of Midjourney

Using the Holistic Evaluation of Text-to-Image Models (HEIM), text-to-image models were benchmarked across 12 key aspects important to the “real-world deployment” of images.

Humans evaluated the generated images, finding that no single model excelled in all criteria. For text-image alignment, or how well the image matched the input text, OpenAI’s DALL-E 2 scored highest. The Stable Diffusion-based Dreamlike Photoreal model was ranked highest on quality (how photo-like), aesthetics (visual appeal), and originality.

Next year’s report is going to be bananas

You’ll note this AI Index Report cuts off at the end of 2023 – which was a wildly tumultuous year of AI acceleration and a hell of a ride. In fact, the only year crazier than 2023 has been 2024, in which we’ve seen – among other things – the releases of cataclysmic developments like Suno, Sora, Google Genie, Claude 3, Channel 1, and Devin.

Each of these products, and several others, has the potential to flat-out revolutionize entire industries. And over them all looms the mysterious spectre of GPT-5, which threatens to be such a broad and all-encompassing model that it could well consume all the others.

AI isn’t going anywhere, that’s for sure. The rapid rate of technical development seen throughout 2023, evident in this report, shows that AI will only keep evolving and closing the gap between humans and technology.

We know this is a lot to digest, but there’s more. The report also looks into the downsides of AI’s evolution and how it’s affecting global public perceptions of its safety, trustworthiness, and ethics. Stay tuned for the second part of this series, in the coming days!

Source: New Atlas
