METHODOLOGY
the rigor is real. the subject is cats.
The task
Every model receives the identical system prompt and six escalating assignments: a minimal flat-design cat, a realistic sitting cat, a cat riding a bicycle, an origami-style cat, a recognizable cat in at most 12 SVG elements, and a cat whose tail sways using SMIL or CSS animation only. Four attempts per assignment, temperature pinned to 1.0 where the provider supports it. Twenty-four attempts per model, no retries for quality. A refusal is a result.
The pipeline
Generation runs through OpenRouter against each model's public API. Every returned SVG passes a programmatic gate: it must parse as XML, stay under 500 KB, and contain no scripts, event handlers, embedded raster images, external references, or DOCTYPE entity tricks. Valid SVGs are rendered to 800px PNGs with resvg, deterministically, with no browser involved. Degenerate geometry that would hang a renderer is rejected. The pipeline is resumable and every artifact is committed to the repository: any score on this site can be audited back to the raw model output that produced it.
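As a rough illustration of what a gate like this checks, here is a minimal sketch in Python using only the standard library. It is not the harness's actual code: the function names, the reading of the size limit as 500 KiB, and the choice to reject every image element and every attribute starting with "on" are assumptions made for the example. It also does not cover CSS url() references or the degenerate-geometry check.

```python
import xml.etree.ElementTree as ET

MAX_BYTES = 500 * 1024  # assumption: the size limit is read as 500 KiB


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix ElementTree puts on qualified names."""
    return tag.rsplit("}", 1)[-1].lower()


def gate(svg_text: str) -> list[str]:
    """Return the reasons an SVG fails; an empty list means it passes."""
    if len(svg_text.encode("utf-8")) >= MAX_BYTES:
        return ["too large"]
    # Reject DOCTYPE/ENTITY declarations before handing text to the parser.
    upper = svg_text.upper()
    if "<!DOCTYPE" in upper or "<!ENTITY" in upper:
        return ["doctype or entity declaration"]
    try:
        root = ET.fromstring(svg_text)
    except ET.ParseError:
        return ["not well-formed XML"]

    problems = []
    for el in root.iter():
        name = local_name(el.tag)
        if name == "script":
            problems.append("script element")
        if name == "image":
            # Assumption: any <image> is treated as an embedded raster.
            problems.append("embedded image")
        for attr, value in el.attrib.items():
            attr_name = local_name(attr)
            if attr_name.startswith("on"):
                problems.append(f"event handler {attr_name}")
            # Same-document fragments like "#tail" are fine; anything else is external.
            if attr_name == "href" and not value.startswith("#"):
                problems.append("external reference")
    return problems
```

A caller would treat any non-empty list as a rejection and record the reasons alongside the raw output, which keeps the audit trail described above intact.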
The judges
Each rendered cat goes to a panel of three vision models, which score four axes from 0–10: cat_likeness (is this recognizably a cat?), aesthetic (is it pleasing?), technique (structural quality of the SVG source, which judges also receive), and prompt_fidelity (did it do the assignment?). A sample's axis score is the median across judges. Judges never see model names. SVG comments are stripped before judging so models can't smuggle instructions to the panel; remaining injection surface (title and description text) is disclosed here rather than pretended away.
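The two mechanical steps in this section, stripping comments and taking a per-axis median across the panel, can be sketched as follows. The function names and the shape of the judge output (one dict of axis scores per judge) are assumptions for illustration, and the regex-based comment removal is a simplification rather than the harness's actual method.

```python
import re
from statistics import median

AXES = ("cat_likeness", "aesthetic", "technique", "prompt_fidelity")


def strip_comments(svg_text: str) -> str:
    """Remove XML comments so they never reach the judges."""
    return re.sub(r"<!--.*?-->", "", svg_text, flags=re.DOTALL)


def axis_medians(panel: list[dict[str, float]]) -> dict[str, float]:
    """Collapse one score dict per judge into a single median per axis."""
    return {axis: median(judge[axis] for judge in panel) for axis in AXES}


# Hypothetical scores from a three-judge panel for one sample.
panel = [
    {"cat_likeness": 8, "aesthetic": 5, "technique": 4, "prompt_fidelity": 10},
    {"cat_likeness": 6, "aesthetic": 7, "technique": 4, "prompt_fidelity": 7},
    {"cat_likeness": 9, "aesthetic": 6, "technique": 9, "prompt_fidelity": 8},
]
print(axis_medians(panel))
```

With three judges the median is simply the middle score, so a single outlier judge cannot move an axis on its own.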
The meowscore
Per sample: the mean of the four axis medians. Per assignment: the median of the four attempts (best-of-four is also recorded). The headline meowscore is the mean of the six per-assignment medians, scaled to 0–100. Invalid or refused samples score zero, so models don't get to skip the hard ones. Sample counts are published per assignment so partial runs are visible.
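The aggregation above can be sketched in a few lines of Python. The names are hypothetical, None stands in for an invalid or refused sample, and the 0–100 scaling is assumed to be a plain multiplication by 10, since each axis runs 0–10.

```python
from statistics import mean, median
from typing import Optional

Sample = Optional[dict[str, float]]  # axis medians, or None if invalid/refused


def sample_score(sample: Sample) -> float:
    """Mean of the four axis medians; invalid or refused samples score zero."""
    if sample is None:
        return 0.0
    return mean(list(sample.values()))


def assignment_score(samples: list[Sample]) -> float:
    """Median over the attempts recorded for one assignment."""
    return median(sample_score(s) for s in samples)


def meowscore(run: dict[str, list[Sample]]) -> float:
    """Mean of the per-assignment medians, scaled from 0-10 to 0-100."""
    return mean(assignment_score(samples) for samples in run.values()) * 10


def flat(score: float) -> dict[str, float]:
    """Helper for the example: a sample scoring the same on every axis."""
    return {"cat_likeness": score, "aesthetic": score,
            "technique": score, "prompt_fidelity": score}


# Hypothetical run: every assignment has attempts scoring 8, 6, a refusal, and 4.
attempts = [flat(8.0), flat(6.0), None, flat(4.0)]
run = {name: attempts for name in
       ("flat", "realistic", "bicycle", "origami", "twelve", "tail")}
print(meowscore(run))  # 50.0
```

In the example the refusal drags each assignment's median down from 6.0 to 5.0 (the midpoint of 4 and 6), which is the intended effect of scoring refusals as zero.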
The crowd
The arena shows two anonymous cats from the same assignment; you pick the better one. Ratings use Elo with K=32 from a 1500 start, updated atomically per vote. The crowd column is deliberately separate from the meowscore. The judges and the crowd are allowed to disagree, and the disagreement is the interesting part. Votes are rate-limited to 10 per minute per IP; the address is stored only as a salted hash, never raw. The rate limit is a soft cap: a burst can slip a vote or two past it, which we consider acceptable for a cat-drawing leaderboard and disclose anyway.
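For readers unfamiliar with Elo, a single vote update looks roughly like this. The K-factor and starting rating come from the text; the 400-point logistic scale is the conventional Elo choice and is assumed here, and the function names are illustrative. The sketch leaves out storage, so it says nothing about how the update is made atomic.

```python
K = 32          # from the text
START = 1500.0  # from the text
SCALE = 400.0   # assumption: the conventional Elo scale


def expected(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / SCALE))


def apply_vote(winner: float, loser: float) -> tuple[float, float]:
    """Return the new (winner, loser) ratings after one vote."""
    delta = K * (1.0 - expected(winner, loser))
    return winner + delta, loser - delta


# Two fresh cats: the winner gains exactly what the loser gives up.
print(apply_vote(START, START))  # (1516.0, 1484.0)
```

Because the gain shrinks as the winner's rating advantage grows, an upset moves ratings more than an expected result does.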
Reproduce it
The whole harness is open source. Clone the repository, set an OpenRouter key, and run pnpm -F @meowbench/harness cli run --run-dir runs/my-run --estimate to see what it would cost before spending a cent. Current run on this site: 2026-07-04_run-001.
Known limitations
- Cost estimates are flat-rate approximations; the estimator is a floor, not a ceiling.
- Refusal detection is English-only; non-English refusals are counted as "no SVG" (same score: zero).
- Judges may share training lineage with contestants; panel composition is disclosed per run.
- The animation assignment is judged from a static frame until motion verification ships.