When AI Flunks Humanity’s Hardest Test

How smart is today’s artificial intelligence, really?  Not in marketing terms, not in sci-fi language, but in the sober light of difficult questions like these: How many tendons attach to a tiny bone in a hummingbird’s tail?  Which syllables in a Biblical Hebrew verse are “closed” according to the latest specialist scholarship?  Those are not trivia questions; they are examples from “Humanity’s Last Exam,” a new benchmark that is reshaping how we think about AI progress.[1]

The benchmark comes from a Nature paper, “A benchmark of expert-level academic questions to assess AI capabilities,” and is unpacked in a plain-language TechXplore article, “AI is failing ‘Humanity’s Last Exam’ so what does that mean for machine intelligence?”  Together, they tell a story that is less about AI “getting close to human” and more about how we measure, misunderstand, and sometimes overhype what these systems can do.

To see why this matters, it helps to start with what a benchmark actually is.  In simple terms, an AI benchmark is like a standardized test for models.  You collect a set of questions or tasks, you define what counts as a correct answer, and you see how different systems score.  Benchmarks give us a common yardstick for judging whether Model A outperforms Model B in mathematics, biology, programming, or reading comprehension.  Over the past few years, models have raced up the scoreboards on popular benchmarks, often scoring above 90 percent on tests like MMLU, a huge exam covering many school and university subjects.
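To make the mechanics concrete, here is a minimal sketch of a benchmark harness in Python.  The ask_model callable and the sample item are hypothetical placeholders, not taken from HLE or any real test set; the point is only that a benchmark is, at bottom, a question list, an answer key, and a scoring rule.

def exact_match(prediction: str, reference: str) -> bool:
    # Count an answer as correct only if it matches the answer key
    # after trimming whitespace and ignoring case.
    return prediction.strip().lower() == reference.strip().lower()

def score_model(ask_model, questions: list[dict]) -> float:
    # Return the fraction of benchmark questions the model answers correctly.
    correct = sum(
        exact_match(ask_model(item["question"]), item["answer"])
        for item in questions
    )
    return correct / len(questions)

# Illustrative item only; real HLE questions are far harder and expert-written.
sample_questions = [{"question": "What is 17 * 24?", "answer": "408"}]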

The problem is that once a test becomes familiar, AI developers start training and tuning their systems to do well on that specific test.  It is akin to teaching to the test in schools.  Scores increase, but you are no longer sure whether you are measuring deep ability or merely clever test preparation.  Humanity’s Last Exam, or HLE, was created precisely to escape that trap and to probe what is still beyond the reach of current systems.

HLE is a collection of 2,500 questions across more than 100 subjects, ranging from advanced mathematics and physics to classics, linguistics, ecology, and computer science.  Nearly 1,000 experts from over 500 institutions worldwide, most of them professors, researchers, or holders of graduate degrees, contributed the questions.  The questions are not scraped from textbooks.  They are original, precise, and often sit at the frontier of current human knowledge.  Many require graduate-level expertise or very specialized domain knowledge.

Here is where HLE really differs from earlier benchmarks.  Before a question was accepted, it was tested against leading AI models.  If the models could already answer it correctly, the question was thrown out.  Only questions that caused frontier models to fail were permitted.  That is the opposite of how we usually think about test design for humans, where we want a spread of easy, medium, and hard questions. For HLE, the point was to build a test that sits just beyond the current AI frontier, so that any improvement in scores would really mean new capability, not just familiarity.
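A rough sketch of that adversarial filtering step, under the same assumptions as before (hypothetical ask_model callables standing in for frontier models), might look like this:

def answered_correctly(ask_model, item: dict) -> bool:
    # True if this model already produces the reference answer.
    prediction = ask_model(item["question"])
    return prediction.strip().lower() == item["answer"].strip().lower()

def filter_candidates(candidates: list[dict], frontier_models: list) -> list[dict]:
    # Keep a candidate question only if every frontier model fails it,
    # so the surviving set sits just beyond the current AI frontier.
    return [
        item for item in candidates
        if not any(answered_correctly(model, item) for model in frontier_models)
    ]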

The results were brutal, and that was by design.  When HLE was first released in early 2025, GPT-4o scored about 2.7%, Claude 3.5 Sonnet about 4.1%, and OpenAI’s reasoning-focused model o1 about 8%.  Even as newer models arrived, early scores remained in the single digits.  The benchmark had done its job.  It showed that beneath the impressive chat interfaces and polished demonstrations, there remained a substantial gap between AI performance and expert human knowledge on tightly defined, verifiable academic questions.

At this point, some commentators started to talk about HLE as a stepping stone toward artificial general intelligence, or AGI, the idea of systems that can perform any task at human or superhuman levels.  The logic is tempting. If a model can eventually ace a test built from the hardest questions experts can think of, does that not mean it has become “like us”?  The authors of the TechXplore article argue that this is a mistake, and the HLE paper itself is careful on this point.  High scores on HLE would show expert-level performance on closed, exam-style questions, but not autonomous research ability or general intelligence.

The key distinction is between performance and understanding.  When a human passes the bar exam, we infer that they have learned the law in a way that transfers to real practice.  They can reason with clients, navigate messy situations, and exercise judgment.  When an AI model passes the same exam, all we know is that it can produce answers that match the marking scheme.  It lacks a body, experiences, or goals. It does not care about justice or consequences.  It has learned patterns in text, not lived reality.  Treating its test score as evidence of “being like a lawyer” confuses output with inner competence.

This is why benchmarks that only measure performance can mislead us about intelligence.  Human intelligence is grounded in a lifetime of interaction with the world and with other people.  Language is a tool we use to express that deeper intelligence. For large language models, language is all there is.  Their “intelligence” is the ability to predict plausible next words from large-scale training data.  There is nothing underneath in the human sense.  Thus, when we use human exams as AI benchmarks, we are borrowing a tool designed to measure something very different.

The HLE researchers did several things to make their benchmark as rigorous as possible.  They recruited domain experts to write questions, enforced strict rules about clarity and non-searchability, required detailed solutions, and ran a multistage human review process to refine and approve each item.  They also evaluated multiple state-of-the-art models, measured not only accuracy but also calibration, and analyzed how performance changed as models generated longer chains of reasoning.
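The calibration part of that evaluation can be illustrated with a standard metric such as expected calibration error, sketched below.  The inputs are hypothetical: each entry pairs a model’s self-reported confidence with whether its answer was actually correct.

def expected_calibration_error(confidences, correct_flags, n_bins=10):
    # Group predictions into confidence bins, then compare average confidence
    # with actual accuracy in each bin; a well-calibrated model keeps the gap small.
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct_flags):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))
    total = len(confidences)
    error = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        error += (len(bucket) / total) * abs(avg_conf - accuracy)
    return error

# A model that reports 90 percent confidence but is right only half the time
# shows a large gap between what it claims and what it delivers.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, True, False]))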

Their findings are enlightening.  First, accuracy is low across the board, even for the strongest models.  Second, models are badly calibrated.  They often respond with high confidence even when they are wrong, which is especially concerning in expert domains where overconfident errors can be costly.  Third, more “thinking” in the form of longer reasoning traces helps up to a point, but then starts to hurt, suggesting that simply giving models more compute is not a panacea for better answers.

Since HLE was published online, scores have climbed.  Newer systems, such as Gemini 3 Pro Preview and GPT-5, now achieve accuracy in roughly the 20 to 40 percent range.  That sounds impressive until you remember how the benchmark was built.  Once a test is public, developers can optimize for it.  The TechXplore authors describe this as AI “cramming” for the exam.  The models are getting better at the kinds of questions HLE contains, but that does not mean they are converging on human-like intelligence.  It means the benchmark has become another target in the optimization game.

What should we take from all this?  First, HLE is a valuable reality check.  It punctures the illusion that current AI systems are already “almost there” in terms of general intelligence.  Second, it highlights the need for benchmarks that are more closely aligned with real-world work.  OpenAI’s GDPval, for example, tries to measure how useful models are on tasks drawn from actual documents, analyses, and deliverables in professional settings, rather than exam-style questions.  That is a step toward evaluating what matters in practice, not just what looks impressive on a leaderboard.

For organizations and individuals using AI, the practical message is not to be dazzled by benchmark scores, even on something as ambitious as Humanity’s Last Exam.  A model that shines on HLE might still struggle with your specific mix of writing, coordination, customer interaction, or domain-specific judgment.  The most useful “benchmark” you can run is your own, built from the tasks you care about, with success criteria that reflect your real constraints and risks.
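As a sketch of what such an in-house evaluation might look like, the snippet below assumes a hypothetical ask_model callable and two invented tasks with crude automatic pass checks.  In practice, the tasks, prompts, and success criteria would come from your own workflows, and many checks would need human review rather than simple string tests.

internal_tasks = [
    {
        "prompt": "Summarize this week's incident report in three bullet points.",
        "passes": lambda output: output.count("\n") >= 2 and len(output) < 600,
    },
    {
        "prompt": "Draft a polite reply declining a refund request in under 120 words.",
        "passes": lambda output: len(output.split()) <= 120,
    },
]

def run_internal_eval(ask_model, tasks):
    # Score the model on the tasks you actually care about,
    # using each task's own pass/fail check.
    results = [task["passes"](ask_model(task["prompt"])) for task in tasks]
    return sum(results) / len(results)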

Looking ahead, HLE points toward a more mature conversation about AI.  It reminds us that intelligence is not a single ladder that machines are climbing rung by rung until they reach us at the top.  It is a landscape of different abilities, some of which current systems handle remarkably well, and others where they still fail basic expert tests.  The real work now is to design evaluations that keep us honest about those gaps, to focus on usefulness rather than myth, and to ensure that as AI systems become more capable, they do so in ways that genuinely serve human needs rather than just scoring higher on the next big exam.

 

This article is shared at no charge for educational and informational purposes only.

Red Sky Alliance is a Cyber Threat Analysis and Intelligence Service organization.  We provide indicators of compromise information via a notification service (RedXray) or an analysis service (CTAC).  For questions, comments, or assistance, please get in touch with the office directly at 1-844-492-7225 or feedback@redskyalliance.com    

Weekly Cyber Intelligence Briefings:
REDSHORTS - Weekly Cyber Intelligence Briefings
https://register.gotowebinar.com/register/5207428251321676122

 

[1] https://six3ro.substack.com/p/when-ai-flunks-humanitys-hardest
