Every 6 weeks, we run the Pelican Test on the leading AI models to see where they ace it and where they miss the mark.
"Generate an SVG of a pelican riding a bike."
The Pelican Test is a simple challenge in which the popular AI models (GPT-5, Claude, Gemini, etc.) have to do something odd yet specific. It was inspired by Simon Willison's blog post here↗.
The task sounds quirky, but that's exactly the point. The Pelican Test matters because:
- there is unlikely to be any prior training data containing an SVG of this exact scene, so the model is truly composing the elements from scratch
- it tests multiple skills at once: reasoning (understanding an odd request), geometric and spatial planning (e.g. is the pelican's bottom actually on the saddle?), and accurate coding (one slip and the SVG won't render at all)
- it is low-cost and repeatable: we use this simple script ↗ to automate the test across multiple models, compare the results side by side, and rerun the whole process regularly; a rough sketch of the idea follows this list
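The linked script isn't reproduced here, but the core loop is small. The sketch below is a minimal illustration of the idea, not our actual harness: it assumes an OpenAI-compatible gateway that routes requests to each model by name, and the gateway URL and model list are placeholders.

```python
# Minimal sketch of the test harness (illustrative only): ask each model for the
# SVG, check that the reply contains well-formed markup, and save it to disk.
import re
import xml.etree.ElementTree as ET

from openai import OpenAI

PROMPT = "Generate an SVG of a pelican riding a bike."
MODELS = ["gpt-5", "claude-opus-4", "gemini-2.5-pro"]  # placeholder subset

# Hypothetical OpenAI-compatible gateway; the API key is read from the environment.
client = OpenAI(base_url="https://example-gateway.invalid/v1")

def extract_svg(text: str) -> str:
    """Pull the first <svg>...</svg> block out of a model reply."""
    match = re.search(r"<svg\b.*?</svg>", text, re.DOTALL | re.IGNORECASE)
    if match is None:
        raise ValueError("no SVG found in response")
    return match.group(0)

for model in MODELS:
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    )
    svg = extract_svg(reply.choices[0].message.content)
    ET.fromstring(svg)  # raises ParseError if the markup is not well-formed
    with open(f"pelican_{model}.svg", "w", encoding="utf-8") as out:
        out.write(svg)
    print(f"{model}: saved {len(svg)} characters of well-formed SVG")
```

Rendering the saved files side by side is what produces the comparison grid discussed below.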
These are non-trivial tasks for LLMs, and the Pelican Test reveals where they struggle at a glance:

Excel (top row: GPT-5, GPT-5-high, Claude-opus-4, Hunyuan-t1)
These models produce images recognisable as both a pelican and a bike, with believable balance and proportions: an intact pelican, a coherent bike, and the two concepts integrated in a convincing way. GPT-5-high even captures smoother form and symmetry, suggesting a more advanced level of imaginative reasoning and greater processing bandwidth.
However, they all fail to place the pelican's wings on the handlebars, and, with the partial exception of GPT-5-high, which at least appears to attempt it, none position the bird's bottom on the saddle in a natural cruising stance.
Pass (middle row: Gemini-2.5-pro, Deepseek-3.1)
These models meet the basic requirement but reveal clear weaknesses in spatial reasoning.
The Gemini-2.5-pro pelican is not riding but walking alongside the bike, undermining the core idea of the prompt. Deepseek-3.1 makes a dynamic attempt, placing the duo in motion on the road beneath the sun - an adorable effort. Yet, the bike is flipped front-to-back, and the bird's shape is oversimplified, resembling a duck more than a pelican - possibly a reflection of the training data, where ducks may appear far more often than pelicans.
So-so (bottom row: Grok-3, Qwen3, Llama-4, Magistral, Magistral-medium)
These models exhibit the greatest struggles, either failing to integrate the pelican and bike as one coherent scene or misrepresenting key relationships.
Grok-3 barely combines the two, placing them awkwardly on the canvas and noticeably off-centre. Qwen3 produces a surreal image: a bird-like figure strapped into a boxy cart made of loosely assembled bike parts. Llama-4 simplifies excessively, rendering blocky shapes that only faintly suggest a pelican and a bike, while also introducing an unintended extra 'pelican' observing from the corner. Both Magistral models drift into comic-strip territory, producing abstract pelican shapes with backwards-facing beaks perched on a tandem-like bike, more a playful collage than a meaningful interpretation.
Overall insight
The Pelican Test does more than amuse - it reveals, at a single glance, the capability gradient across models. It pushes LLMs to combine several faculties at once: reasoning through an unusual request, arranging elements in space with intent, and translating all of it into working code.
Because we repeat the test every 6 weeks, we get a living snapshot of progress. A model that once failed hilariously can suddenly leap forward and produce something exquisite, showing us just how fast this field is moving. That rhythm of improvement is far from uniform: at times it advances steadily, at others it lunges forward drastically.
Finally, while playful, the test points to something serious. It mirrors a real-world challenge: the art of combining disparate elements into one coherent whole. A filmmaker weaving story, visuals and sound. A teacher turning raw information into understanding that clicks. A salesperson aligning product and audience so that need and solution naturally connect. In that sense, success in the Pelican Test is far more than a novelty - it is a proxy for broader skills that all future models need to master.