The Hao AI Lab at the University of California San Diego recently tested artificial intelligence models in live Super Mario Bros. gameplay. The results were surprising:
- Anthropic’s Claude 3.7 performed the best, followed by Claude 3.5.
- Google’s Gemini 1.5 Pro and OpenAI’s GPT-4o struggled to keep up.
However, it wasn’t exactly the same Super Mario Bros. from 1985.
The researchers ran the game in an emulator, using a custom-built system called GamingAgent to let the AI take control of Mario, TechCrunch reports.
How AI Tried to Master Super Mario Bros.
The GamingAgent system, developed in-house by the Hao Lab team, provided AI models with basic instructions, such as:
“If an obstacle or enemy is near, move/jump left to dodge.”
The AI also received real-time game screenshots to analyze the environment and generate commands using Python code to control Mario.
But it wasn’t easy. The AI models had to “learn” how to plan complex movements and develop gameplay strategies.
“Reasoning” models like OpenAI’s o1 performed worse than “non-reasoning” models, even though they typically score higher on most other benchmarks.
“One of the biggest issues is that reasoning models take too long—usually seconds—to decide on an action. In real-time games like Super Mario Bros., timing is everything. A one-second delay can mean the difference between a successful jump and falling to your death,” the researchers explained.
Are Video Games Really a Good AI Benchmark?
For decades, video games have been used as benchmarks to evaluate AI performance. But some experts question whether gaming truly reflects AI’s real-world capabilities.
- Video games are controlled environments that are far simpler than the complexity of the real world.
- AI can train on an almost infinite amount of data within a game, unlike real-world scenarios where data is limited.
The recent wave of flashy gaming benchmarks has sparked debate about how we should evaluate AI’s true intelligence. Andrej Karpathy, a research scientist and founding member of OpenAI, described what he called an “evaluation crisis” in AI testing.
“I don’t really know what [AI] metrics to look at right now. TL;DR: my reaction is I don’t really know how good these models are right now,” Karpathy wrote on X (formerly Twitter).
As AI gets better at playing Super Mario Bros., one big question remains: How much do these skills actually translate to real-world intelligence?