Did you know?
The very first Pokémon games, Red and Green, were released in Japan in 1996. They were so successful that they saved Game Freak, the studio behind the series, from bankruptcy. Fast forward to today, and Pokémon is the highest-grossing media franchise in the world!
🎮 Pokémon vs AI
Everything Pokémon Taught Us About the Problems With AI Benchmarks
Yes, you read that right: even Pokémon has become a way to test how smart AI models are.
A recent post on X went viral after claiming that Google’s Gemini was outperforming Anthropic’s Claude at the original Pokémon games. According to the post, Gemini had made it farther, reaching Lavender Town during a livestream, while Claude was still struggling to get through an earlier area called Mount Moon.

Sounds cool, right? But of course, there’s a twist.
Gemini Had a Secret Advantage
People on Reddit quickly pointed out that Gemini wasn’t playing fair: the developer behind Gemini’s stream had added a custom mini-map. This map told Gemini exactly where it was in the game world, letting it spot things like the trees that needed to be cut down.
That’s a bigger deal than it sounds. Without the map, Gemini would have had to study each screenshot, work out the environment on its own, and only then make a decision. The map removed most of that challenge.
Claude, on the other hand, got no such help. It had to play the hard way: looking at raw screenshots and figuring everything out on its own.
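To make the difference concrete, here’s a minimal sketch of the two setups in Python. Everything in it is hypothetical: `query_model`, the prompts, and the mini-map format are illustrative stand-ins, not the actual code behind either stream.

```python
# Hypothetical sketch only: query_model(), the prompts, and the
# mini-map format are illustrative stand-ins, not real APIs.

def query_model(prompt: str) -> str:
    """Stand-in for a real LLM API call; returns a button press."""
    return "A"  # stub so the sketch runs end to end

def claude_style_step(screenshot_png: bytes) -> str:
    # Raw-screenshot setup: the model must infer its position,
    # the obstacles, and the goal from pixels alone before acting.
    prompt = (
        f"Here is the current game screen as a PNG "
        f"({len(screenshot_png)} bytes). Which button do you press?"
    )
    return query_model(prompt)

def gemini_style_step(minimap: dict) -> str:
    # Mini-map setup: a structured summary of the world state is
    # handed over, so the hard perception problem is already solved.
    prompt = (
        f"You are at {minimap['player']}. Cuttable trees: {minimap['trees']}. "
        f"Exit: {minimap['exit']}. Which button do you press?"
    )
    return query_model(prompt)

screenshot = b"\x89PNG..."  # raw frame grabbed from the emulator
minimap = {"player": (4, 7), "trees": [(5, 7)], "exit": (9, 2)}
print(claude_style_step(screenshot))  # must perceive *and* plan
print(gemini_style_step(minimap))     # mostly just has to plan
```

Same model call in both cases, but the second one has the hard perception work done for it before the prompt is even written.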
🧪 What This Means for AI Testing
It might seem silly to use games like Pokémon as AI benchmarks, but there’s a good reason for it. Games are a fun, engaging way to test how well an AI can make decisions, remember what it has seen, and act in a complex environment.
But here’s the problem: if the test isn’t fair, the results don’t mean much.
This isn’t just about Pokémon. Companies pull similar tricks on more serious AI benchmarks, too.
For example:
- Anthropic’s Claude 3.7 scored 62.3% on a well-known coding test (SWE-bench Verified).
- But with help from a system called a “custom scaffold”, it scored 70.3% (the sketch below shows what a scaffold can do).
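So what does a scaffold actually do? Details vary, but one common pattern is to let a harness around the model generate several candidate answers and keep one that passes the tests. The sketch below is a hypothetical illustration of that pattern in Python, not Anthropic’s actual setup; `generate_patch` and `run_tests` are made-up stand-ins.

```python
# Hypothetical sketch only: generate_patch() and run_tests() are
# made-up stand-ins for a model call and a repo's test suite.
import random

def generate_patch(issue: str) -> str:
    """Stand-in for one model attempt at fixing the issue."""
    return f"candidate patch {random.randint(0, 999)} for: {issue}"

def run_tests(patch: str) -> bool:
    """Stand-in for running the test suite against a patch."""
    return random.random() < 0.3  # stub: pretend ~30% of attempts pass

def bare_attempt(issue: str) -> str:
    # Plain setup: one shot, no feedback. The score reflects raw,
    # single-try ability.
    return generate_patch(issue)

def scaffolded_attempt(issue: str, tries: int = 10) -> str:
    # Scaffolded setup: sample several candidates and keep one that
    # passes the tests. Same model underneath; the harness does the
    # extra work, and the headline number goes up.
    patch = generate_patch(issue)
    for _ in range(tries - 1):
        if run_tests(patch):
            return patch
        patch = generate_patch(issue)
    return patch  # fall back to the last attempt

print(bare_attempt("fix issue #123"))
print(scaffolded_attempt("fix issue #123"))
```

The model underneath is identical in both functions; only the harness changed, and so did the score.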
Meta did something similar with its Llama 4 model. The standard version scored lower at first, but after a variant was fine-tuned specifically for one benchmark, the scores jumped.
Why This Matters
AI benchmarks are supposed to be neutral tests, kind of like a standardized exam, so we can easily compare one AI model to another. But when a model gets its own “cheat sheet” or behind-the-scenes help, the comparison becomes meaningless.
It’s like letting one student bring notes into the exam while telling the other to rely on memory alone.
Companies use these scores to decide which AI model to trust, invest in, or buy, so inflated numbers have real consequences.
Final Thought
Pokémon might not be the most serious AI test, but it’s shown us something important: benchmarks only matter if everyone plays by the same rules.
Until that happens, take flashy AI performance claims with a grain of salt, especially if they involve Pikachu.