Artificial intelligence has been breaking records everywhere—from writing essays to solving complex problems—but what happens when scientists design a test so tough that even the smartest machines struggle?
This isn’t just another benchmark; it’s arguably the hardest AI test ever created, built to push algorithms far beyond their comfort zone. And the results didn’t go as expected: some systems failed spectacularly, while others revealed strengths no one saw coming.
If you’ve ever wondered how far AI can really go, this story uncovers the limits, the breakthroughs, and the shocking twists behind the toughest challenge in AI history.
Let’s dive into what makes this test so extraordinary.
*Image: Solving the AI brain puzzle - the hardest AI test ever made.*
Scientists Built the Toughest AI Test Ever — And the Results Are Surprising
Artificial intelligence has made remarkable progress over the past few years. Modern AI systems can write essays, solve math problems, translate languages, and even pass exams that once challenged university students. But as these systems began achieving extremely high scores on traditional benchmarks, researchers realized something important: many of the tests used to measure AI intelligence were no longer difficult enough.
If an AI can easily pass an exam designed to test intelligence, does that mean the AI truly understands the subject? Or does it simply recognize patterns in data?
To answer this question, nearly 1,000 researchers from around the world collaborated to create what may be the most challenging AI benchmark ever built — Humanity’s Last Exam (HLE). This massive test includes 2,500 expert-level questions across many disciplines.
Early results show that even the most advanced AI models struggle significantly, revealing that the gap between AI capabilities and deep human expertise is still surprisingly large.
Why Traditional AI Benchmarks Are No Longer Enough
For years, researchers have relied on academic benchmarks to measure how intelligent AI systems are.
One of the most widely used tests is Massive Multitask Language Understanding (MMLU), which evaluates models across dozens of academic subjects.
Initially, benchmarks like MMLU were extremely difficult for machines. Early AI systems could answer only a small fraction of the questions correctly. However, rapid improvements in large language models changed that situation dramatically.
Modern AI models began scoring extremely high on these exams, sometimes outperforming human students. At first glance, this looked like proof that AI was approaching human-level intelligence.
But researchers soon realized the problem: the tests themselves had become outdated. AI systems were trained on vast amounts of internet data, including educational materials similar to those found in benchmark exams.
As a result, high scores might reflect pattern recognition rather than genuine understanding. This realization pushed scientists to design a far more demanding evaluation capable of truly testing the limits of artificial intelligence.
The Creation of “Humanity’s Last Exam”
To solve the benchmarking problem, researchers designed a new type of evaluation called Humanity’s Last Exam (HLE). The goal was simple but ambitious: create a test so challenging that even the most advanced AI models would struggle to solve it.
Nearly 1,000 experts from universities and research institutions around the world contributed to the project. The test ultimately grew into a massive assessment containing 2,500 carefully crafted questions.
Unlike traditional exams, HLE focuses on deep academic expertise rather than general knowledge. It includes subjects such as advanced mathematics, linguistics, natural sciences, humanities, and historical studies.
One key contributor was Dr. Tung Nguyen, an instructional associate professor in computer science who helped write and refine many questions.
The researchers intentionally designed the exam to capture areas where AI typically struggles — complex reasoning, specialized knowledge, contextual interpretation, and domain-specific expertise.
The result is one of the most comprehensive and difficult AI benchmarks ever developed.
A Global Collaboration of Nearly 1,000 Experts
One reason Humanity’s Last Exam stands out is the sheer scale of collaboration behind it. Experts from many disciplines contributed questions, reviewed answers, and ensured the problems were academically rigorous.
The contributors included historians, physicists, linguists, medical researchers, mathematicians, and computer scientists. This interdisciplinary approach was essential because intelligence cannot be measured through a single subject area.
Each contributor focused on their area of expertise, creating questions that reflected real academic challenges. These were not textbook problems designed for undergraduate students. Instead, many questions required knowledge typically found in advanced research or niche academic fields.
This diversity helped expose the limitations of modern AI systems. While AI can perform well on common tasks such as translation or essay writing, it often struggles when dealing with specialized or obscure knowledge.
Ironically, the collaboration itself highlights something uniquely human: the ability of experts from different backgrounds to combine their knowledge to solve complex problems.
What Makes the Questions So Difficult?
The difficulty of Humanity’s Last Exam comes from the depth and specificity of its questions. Instead of testing basic facts or general reasoning, the exam challenges AI systems with highly specialized academic problems.
For example, some questions involve translating ancient inscriptions written in Palmyrene, a rarely studied script from ancient Syria. Others require identifying tiny anatomical structures in birds, a task that demands expert-level biological knowledge.
Some linguistic questions ask participants to analyze detailed features of Biblical Hebrew pronunciation, while others require advanced mathematical reasoning.
To ensure fairness and clarity, every question was designed to have one correct and verifiable answer. At the same time, the questions were carefully constructed so that quick internet searches could not easily reveal the solution.
This approach forces AI systems to rely on genuine reasoning and understanding rather than memorized patterns or searchable information.
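As a rough illustration, here is what "one correct and verifiable answer" can look like in code. This is only a minimal sketch using normalized exact-match grading; the question format, the normalization rules, and the example question are assumptions made for illustration, not the actual HLE grading pipeline.

```python
# Minimal sketch: grading a question that has a single verifiable answer.
# The normalization rules and the example question are illustrative only.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())

def is_correct(model_answer: str, reference_answer: str) -> bool:
    """A question is 'verifiable' if grading reduces to a check like this."""
    return normalize(model_answer) == normalize(reference_answer)

# Hypothetical example with one unambiguous answer.
question = {
    "prompt": "In which ancient city was the Palmyrene script primarily used?",
    "answer": "Palmyra",
}

print(is_correct("  palmyra ", question["answer"]))  # True
print(is_correct("Damascus", question["answer"]))    # False
```

A single unambiguous reference answer is what keeps grading this simple, even when the question itself demands deep expertise.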
Removing Questions AI Could Solve
One of the most interesting aspects of Humanity’s Last Exam is how the researchers filtered the questions. After creating the initial set, they tested them using several leading AI models.
If any AI system successfully answered a question, that question was removed from the final exam.
This process ensured the remaining questions were genuinely difficult for current AI technologies. The goal was not to trick AI systems but to identify the boundary between machine capabilities and human expertise.
The filtering process required repeated testing and refinement. Researchers ran questions through multiple AI models, carefully analyzing whether the answers were correct.
By eliminating questions that AI could already solve, the team ensured that the final 2,500-question exam would remain just beyond the reach of today’s AI systems.
This makes Humanity’s Last Exam a powerful benchmark for measuring future progress in artificial intelligence.
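To make the filtering step concrete, here is a minimal sketch of the logic described above: any candidate question that at least one current model answers correctly is dropped. The `ask_model` helper, the model list, and the grading rule are hypothetical placeholders; only the keep-or-drop logic mirrors the described process.

```python
# Sketch of the filtering step: drop any candidate question that at least one
# current model already answers correctly. `ask_model` is a hypothetical
# stand-in for calling a real model API; grading is a simple normalized
# exact match for illustration.

def ask_model(model_name: str, prompt: str) -> str:
    """Placeholder: in practice this would query the model and return its answer."""
    raise NotImplementedError

def grade(answer: str, reference: str) -> bool:
    """Normalized exact-match check, as in the grading sketch earlier."""
    return " ".join(answer.lower().split()) == " ".join(reference.lower().split())

def filter_questions(candidates: list[dict], models: list[str]) -> list[dict]:
    kept = []
    for q in candidates:
        solved = any(grade(ask_model(m, q["prompt"]), q["answer"]) for m in models)
        if not solved:
            kept.append(q)  # only questions no tested model can solve survive
    return kept
```

In practice this kind of loop would be run repeatedly as questions and models were refined, which is why the researchers describe the filtering as an iterative process.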
Early Results From Leading AI Models
When researchers finally tested advanced AI systems on Humanity’s Last Exam, the results were surprising.
Despite their impressive performance on older benchmarks, the models struggled significantly with the new test.
Some early scores included:
- GPT-4o: about 2.7% accuracy
- Claude 3.5 Sonnet: about 4.1% accuracy
- OpenAI’s o1 model: roughly 8% accuracy
Even the most advanced systems currently available perform far below expert human levels. The strongest models so far have achieved around 40% to 50% accuracy on certain portions of the exam.
These results highlight a critical insight: while AI can excel at many tasks, there remains a substantial gap between machine capabilities and deep human expertise.
The findings also remind researchers that high scores on traditional benchmarks do not necessarily mean AI truly understands the material.
Why Accurate AI Benchmarks Matter
Benchmarks are more than just technical tests. They play a crucial role in shaping how researchers, companies, and policymakers understand artificial intelligence.
Without reliable evaluation tools, it becomes easy to misinterpret AI progress. If a model performs well on outdated tests, people might assume it possesses deeper intelligence than it actually does.
Accurate benchmarks help researchers identify both strengths and weaknesses in AI systems. This knowledge guides improvements in model design, training methods, and safety measures.
Benchmarks also influence real-world decisions. Governments and organizations rely on AI assessments when considering regulations, investments, and adoption strategies.
By creating Humanity’s Last Exam, researchers hope to provide a more realistic measurement of AI capabilities.
This allows the scientific community to track progress more accurately while avoiding exaggerated claims about what AI systems can truly achieve.
Humanity’s Last Exam Is Not About Replacing Humans
Despite its dramatic name, Humanity’s Last Exam does not suggest that humans are becoming obsolete.
Instead, the project highlights the vast amount of knowledge that remains uniquely human.
AI systems are incredibly powerful tools, but they still struggle with deep reasoning, specialized academic knowledge, and complex contextual understanding. These are areas where human experts continue to excel.
Researchers emphasize that the exam should not be seen as a competition between humans and machines. Rather, it is a method for identifying where AI performs well and where it still needs improvement.
Understanding these limitations helps developers build safer and more reliable AI technologies.
At the same time, the results reinforce the importance of human expertise in science, medicine, education, and many other fields.
A Long-Term Benchmark for Future AI Systems
Humanity’s Last Exam is designed to remain relevant for many years as AI technology continues to evolve.
To prevent AI models from simply memorizing answers, researchers have only released a small portion of the questions publicly. Most of the exam remains hidden, ensuring it can still challenge future systems.
This strategy allows the benchmark to track AI progress over time. As models improve, researchers can periodically test them against the exam to measure real advances in reasoning and expertise.
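The sketch below shows the general shape of such a held-out benchmark, under a few assumptions: a small public sample is released, the remainder stays private, and each new model's score on the private portion is logged over time. The split ratio, file name, and `evaluate` helper are illustrative placeholders, not the project's actual infrastructure.

```python
import json
import random
from datetime import date

# Illustrative held-out benchmark: release a small public sample, keep the rest
# private, and log each model's score on the private set over time.

def split_questions(questions: list[dict], public_fraction: float = 0.1, seed: int = 0):
    """Return (public_sample, private_set); the 10% split is an assumed ratio."""
    rng = random.Random(seed)
    shuffled = questions[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * public_fraction)
    return shuffled[:cut], shuffled[cut:]

def evaluate(model_name: str, questions: list[dict]) -> float:
    """Placeholder: grade the model on these questions and return accuracy in [0, 1]."""
    raise NotImplementedError

def log_score(model_name: str, accuracy: float, path: str = "hle_history.jsonl") -> None:
    """Append a dated record so progress can be tracked across model generations."""
    record = {"date": date.today().isoformat(), "model": model_name, "accuracy": accuracy}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Whatever the real infrastructure looks like, the key idea is the same: as long as most questions stay private, a rising score reflects genuine capability rather than memorized answers.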
The project also demonstrates the importance of international scientific collaboration. Experts from many countries and disciplines worked together to build the exam, combining their knowledge to create an exceptionally rigorous evaluation.
For now, Humanity’s Last Exam clearly shows that despite rapid progress, artificial intelligence still has a long way to go before matching the depth and breadth of human intelligence.
