Artificial intelligence (AI) continues to advance rapidly, yet researchers are struggling to design assessments that fully gauge these systems' capabilities. That challenge is epitomized by the newest evaluation, Humanity's Last Exam, which seeks to test AI well beyond existing benchmarks.
Developed by the Center for AI Safety (CAIS) and Scale AI, Humanity's Last Exam emerged from concerns about benchmark saturation—where AI models routinely achieve high scores on standard tests, potentially skewing perceptions of their actual abilities. "We wanted problems to test the capabilities of the models at the frontier of human knowledge and reasoning," explained Dan Hendrycks, co-founder of CAIS.
The exam features approximately 3,000 multiple-choice and short-answer questions spanning fields such as mathematics, the humanities, and the natural sciences. The questions were crowdsourced from nearly 1,000 experts worldwide, including researchers and professors at more than 500 institutions.
Through this testing methodology, CAIS and Scale AI aimed to address the blind spots of earlier assessments, which newer AI models had largely outgrown. The researchers collected more than 70,000 candidate questions and filtered them down to the hardest ones, those that current AI systems could not answer. "We can't predict how quickly the models will advance," Hendrycks pointed out, citing the unexpectedly rapid progress seen on previous tests.
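For readers curious what that winnowing step might look like in practice, the sketch below shows one plausible approach: keep only the candidate questions that every tested model gets wrong. It is illustrative only; the helper `query_model`, the exact-match grading, and the model names are assumptions, not details from the actual pipeline.

```python
# Illustrative sketch only, not the authors' actual pipeline: keep candidate
# questions that every tested model answers incorrectly. `query_model` is a
# hypothetical stand-in for a real model API call.

from typing import Dict, List


def query_model(model: str, question: str) -> str:
    """Placeholder for a call to a hosted model; returns its answer as text."""
    raise NotImplementedError("plug in a real model API here")


def is_correct(predicted: str, reference: str) -> bool:
    """Rough exact-match grading; real grading would be far more careful."""
    return predicted.strip().lower() == reference.strip().lower()


def filter_stumpers(candidates: List[Dict[str, str]], models: List[str]) -> List[Dict[str, str]]:
    """Keep only the questions that none of the tested models answers correctly."""
    stumpers = []
    for item in candidates:
        answers = [query_model(m, item["question"]) for m in models]
        if not any(is_correct(a, item["answer"]) for a in answers):
            stumpers.append(item)
    return stumpers


# Hypothetical usage:
# hard_questions = filter_stumpers(candidate_questions, ["model-a", "model-b"])
```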
Questions range from hummingbird anatomy to complex physics problems. One notable query on hummingbird anatomy asked: "Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number." The physics problems, likewise, demand detailed responses that test the models' logic and analytical capabilities.
In testing, leading AI models such as OpenAI’s o1 and Google’s Gemini 1.5 Pro posted dismally low scores, with accuracy rates below 10%. As with earlier benchmarks at their debut, the results underscore persistent gaps in AI reasoning capabilities. Reflecting on the test, Hendrycks commented, “There are still some expert closed-ended questions models are not able to answer. We will see how long this lasts.”
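To make the headline figure concrete, here is a minimal, hypothetical sketch of how such accuracy numbers could be tallied, assuming each graded record carries a subject label, the reference answer, and the model's reply. The field names and exact-match grading are assumptions for illustration, not the benchmark's actual scoring code.

```python
# Hypothetical scoring sketch: per-subject accuracy over graded records.
# Field names ("subject", "answer", "prediction") and exact-match grading
# are assumptions for illustration only.

from collections import defaultdict
from typing import Dict, List


def score_by_subject(records: List[Dict[str, str]]) -> Dict[str, float]:
    """Return per-subject accuracy as a fraction between 0 and 1."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for r in records:
        totals[r["subject"]] += 1
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            correct[r["subject"]] += 1
    return {subject: correct[subject] / totals[subject] for subject in totals}


# Toy data only, to show the shape of the output:
toy = [
    {"subject": "math", "answer": "4", "prediction": "4"},
    {"subject": "math", "answer": "7", "prediction": "5"},
    {"subject": "physics", "answer": "12 Hz", "prediction": "10 Hz"},
]
print(score_by_subject(toy))  # {'math': 0.5, 'physics': 0.0}
```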
Despite these results, researchers remain optimistic about future progress. Hendrycks anticipates that scores could rise quickly as new model iterations arrive, potentially surpassing 50% accuracy by the end of the year. Should that happen, AI models might come to be regarded as "world-class oracles," with knowledge and reasoning capabilities outstripping those of human experts.
Summer Yue, Director of Research at Scale AI, noted that the exam is intended to provide a roadmap for future work: "By identifying the gaps in AI's reasoning capabilities, Humanity's Last Exam not only benchmarks current systems but also provides guidance for future research and development." Developing such tests becomes increasingly important as society grapples with the ethical and practical implications of AI systems taking on more complex tasks traditionally reserved for humans.
AI progress today is uneven, underscoring the need for assessment methods that go beyond routine testing. Hendrycks previously developed the Massive Multitask Language Understanding (MMLU) benchmark; Humanity's Last Exam grew out of a desire not to rest on the high scores models now achieve, but to keep challenging them across a wide range of academic subjects.
Merging questions from so many fields exemplifies the collaborative effort behind the exam, and tests like this one illuminate where AI models excel and where they falter. The jagged progress AI has made calls for continued research into creative ways of tracking advancement, methods not confined solely to examinations.
Key contributors to Humanity’s Last Exam received financial compensation for high-quality questions, reflecting the team-driven approach used to develop the test. Incentives ranged from $500 to $5,000 for the top-rated questions, and contributors were also offered co-authorship on the resulting academic paper.
The AI community, with voices from around the globe, remains engaged and committed to making the next big leap. With the right insights and efforts, benchmark assessments like Humanity’s Last Exam may eventually lead to breakthroughs serving both AI technologies and their human counterparts.
Humanity now faces the compelling question of how far AI can go when it is tested effectively. With exams designed to push boundaries, we inch closer to understanding and embracing AI's changing role in society.