As large language models (LLMs) gain momentum worldwide, there’s a growing need for reliable ways to measure their performance. Benchmarks that evaluate LLM outputs allow developers to track ...
For Android app developers who rely on AI to write code, picking the right model can be tricky. Not all models are built the same, and many are not specifically trained for Android development workflows. To ...
In updated tests published to the Humanity's Last Exam website, the Gemini 3.1 Pro model achieved 45.9 percent accuracy, with a ...
Large language models (LLMs), artificial intelligence (AI) systems that can process human language and generate texts in ...
Mainstream chatbots presented varying levels of resistance to deliberate requests for fabrication, study finds ...
The rivalry between Qwen 3.5 and Sonnet 4.5 highlights the shifting priorities in large language model development. Qwen 3.5, ...
AI development is often framed as a race among countries, companies and academic researchers. But figuring out who’s actually ...
February brought new coding models, and vision-language models impress with OCR. Open Responses aims to establish itself as a ...
A duplex speech-to-speech model changes the premise: the intelligence layer consumes audio and produces audio directly. The ...
The new Mercury 2 AI model uses diffusion reasoning to generate 1,000 tokens per second; it runs about 5x faster than Haiku, ...
Anthropic's new AI Exposure Index ranks computer programmers as the most vulnerable to LLM automation, with 75% of tasks ...
Researchers debut "Humanity’s Last Exam," a benchmark of 2,500 expert-level questions that current AI models are failing.