Nerd Notes

When a pelican decides to ride

A quirky challenge that reveals the real limits of AI

Every 6 weeks, we run the Pelican Test to assess the leading AI models and see where they excel and where they fall short.

"Generate an SVG of a pelican riding a bike."

The Pelican Test is a simple challenge in which the popular AI models (GPT-5, Claude, Gemini, etc.) have to do something odd yet specific. It was inspired by Simon Willison's blog post.

The task sounds quirky, but that's exactly the point: producing this scene is a non-trivial task for an LLM, and the Pelican Test reveals at a glance where each model struggles.
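
For context, a passing answer is just a short piece of SVG markup. The sketch below is purely illustrative and is not any model's actual output: a hypothetical Python snippet that writes a crude "pelican on a bike" scene (two wheels, a frame, an oval body, a long beak) to a file so it can be opened in a browser. Every shape and coordinate is invented for the sake of the example.

```python
# Illustrative only: hand-rolled SVG markup of the rough kind the test
# asks a model to generate. All shapes and coordinates are made up.
svg = """<svg xmlns="http://www.w3.org/2000/svg" width="400" height="300">
  <!-- bike: two wheels, a frame, a seat post -->
  <circle cx="110" cy="220" r="45" fill="none" stroke="black" stroke-width="4"/>
  <circle cx="290" cy="220" r="45" fill="none" stroke="black" stroke-width="4"/>
  <path d="M110 220 L180 150 L290 220 M180 150 L250 150 M250 150 L250 120"
        fill="none" stroke="black" stroke-width="4"/>
  <!-- pelican: oval body over the saddle, neck, head, long beak, wing -->
  <ellipse cx="185" cy="110" rx="45" ry="30" fill="white" stroke="black" stroke-width="3"/>
  <path d="M220 100 Q238 75 242 62" fill="none" stroke="black" stroke-width="3"/>
  <circle cx="244" cy="55" r="12" fill="white" stroke="black" stroke-width="3"/>
  <path d="M254 52 L305 60 L254 66 Z" fill="orange" stroke="black" stroke-width="2"/>
  <path d="M180 112 Q215 128 248 120" fill="none" stroke="black" stroke-width="3"/>
</svg>"""

# Write the markup to disk so it can be inspected in any browser.
with open("pelican_on_a_bike.svg", "w", encoding="utf-8") as f:
    f.write(svg)
```

The markup itself is trivial; what the test probes is whether a model can choose shapes like these and place them in a spatially sensible relationship without ever seeing the picture.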

Pelican Test Results

Excel (top row: GPT-5, GPT-5-high, Claude-opus-4, Hunyuan-t1)

These models produce something that reads as both a pelican and a bike, with recognisable balance and proportions: an intact pelican, a coherent bike, and the two concepts integrated believably. GPT-5-high even achieves smoother forms and better symmetry, suggesting that the extra reasoning effort translates into more deliberate composition.

However, they all miss placing the pelican's wings on the handlebars. And with the partial exception of GPT-5-high, which appears to at least attempt it, none positions the bird's bottom on the saddle in a natural cruising stance.

Pass (middle row: Gemini-2.5-pro, Deepseek-3.1)

These models meet the basic requirement but reveal clear weaknesses in spatial reasoning.

The Gemini-2.5-pro pelican is not riding but walking alongside the bike, undermining the core idea of the prompt. Deepseek-3.1 makes a more dynamic attempt, placing the duo in motion on the road beneath the sun - an adorable effort. Yet the bike is flipped front-to-back, and the bird's shape is so oversimplified that it resembles a duck more than a pelican - possibly a reflection of the training data, where ducks may appear far more often than pelicans.

So-so (bottom row: Grok-3, Qwen3, Llama-4, Magistral, Magistral-medium)

These models exhibit the greatest struggles, either failing to integrate the pelican and bike as one coherent scene or misrepresenting key relationships.

Grok-3 barely combines the two, placing them awkwardly and noticeably off-centre on the canvas. Qwen3 produces a surreal image: a bird-like figure strapped into a boxy cart made of loosely assembled bike parts. Llama-4 simplifies excessively, rendering blocky shapes that only faintly suggest a pelican and a bike, while also introducing an unintended extra 'pelican' observing from the corner. Both Magistral models drift into cartoon territory, producing abstract pelican shapes with backwards-facing beaks perched on a tandem-like bike - more a playful collage than a meaningful interpretation.

Overall insight

The Pelican Test does more than amuse - it reveals, in a single glance, the gradient of capability across models. It pushes LLMs to combine several faculties at once: reason through an unusual request, arrange elements in space with intent, and translate all of it into working code.

Because we repeat the test every 6 weeks, we get a living snapshot of progress. A model that once failed hilariously can suddenly leap forward and produce something exquisite, showing just how fast the field is moving. That rhythm of improvement is far from uniform: at times it advances steadily, at others it jumps forward dramatically.
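
Rerunning a test like this takes very little machinery. The sketch below is one hypothetical way to automate it, assuming an OpenAI-compatible chat-completions endpoint; the model names, environment variables and output folder are placeholders, not a description of our actual setup.

```python
import datetime
import os
import pathlib

import requests  # assumes the requests package is installed

PROMPT = "Generate an SVG of a pelican riding a bike."
# Placeholder model identifiers; swap in whatever you actually test.
MODELS = ["gpt-5", "claude-opus-4", "gemini-2.5-pro", "deepseek-3.1"]

def run_once(endpoint: str, api_key: str, out_dir: str = "pelican_runs") -> None:
    """Send the same prompt to every model and save the raw replies, dated."""
    folder = pathlib.Path(out_dir) / datetime.date.today().isoformat()
    folder.mkdir(parents=True, exist_ok=True)
    for model in MODELS:
        resp = requests.post(
            endpoint,  # e.g. an OpenAI-compatible /v1/chat/completions URL
            headers={"Authorization": f"Bearer {api_key}"},
            json={"model": model,
                  "messages": [{"role": "user", "content": PROMPT}]},
            timeout=120,
        )
        resp.raise_for_status()
        reply = resp.json()["choices"][0]["message"]["content"]
        # Save whatever came back; judging the pictures stays a human, visual step.
        (folder / f"{model}.svg").write_text(reply, encoding="utf-8")

if __name__ == "__main__":
    run_once(endpoint=os.environ["CHAT_COMPLETIONS_URL"],
             api_key=os.environ["API_KEY"])
```

Running the same script on each cycle produces the dated snapshots the comparison is built on; ranking the results remains a matter of looking at the images.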

Finally, while playful, the test points to something serious. It mirrors a real-world challenge: the art of combining disparate elements into one coherent whole. A filmmaker weaving together story, visuals and sound. A teacher turning raw information into understanding that clicks. A salesperson aligning product and audience so that need and solution naturally connect. In that sense, success in the Pelican Test is far more than a novelty - it is a proxy for broader skills that every future model will need to master.
