An experiment to understand what AI models can actually do — and where they still need a human.
Vibe Arcade is a one-person experiment, started in early 2026, to find out what current AI models can actually do autonomously across long-running, real-world software work — and, just as importantly, where the rough edges are and where a human still has to stay in the loop.
The setup: an overnight pipeline plans a new HTML5 game, writes it, runs lint + security + a playability rubric, and commits if every gate passes. If anything fails, I triage it in the morning. 30+ playable games have come out of this so far. Every one of them is free in the browser — no signup, no install, no ads, no email capture.
The games themselves are a side product. The real output is a set of running notes — what the AI surprised me with, where it failed quietly, what got better between model versions, what kinds of tasks I now trust it with versus the kinds I still keep tight human review on. I write these up as how-we-built posts on the blog. Each one is a data point in the larger experiment.
Benchmarks are easy to game; this site is one attempt at a real-world tracking signal instead.
Three open questions drive the experiment:
Game by game, what fraction of the work clears every gate without human intervention? Eight weeks in, more nights ship cleanly than don't — but the failure modes are still informative when they happen.
Some categories of failure recur (content quality that passes structural lint, novel mechanics that need many human QA passes, cross-environment iframe issues). Tracking them over model versions tells me what's actually getting better in practice, not on benchmarks.
Auto-merge clears security, lint, and the playability rubric. I keep manual review on for anything that touches core architecture, anything content-quality-sensitive, and anything that could fail silently in production. The line between those buckets keeps moving as the rubric and lint gates get better.
The pipeline is the experimental apparatus. Each game runs through the same gate sequence; results compound across runs because every game is built against the same shared infrastructure (CSS, leaderboard widget, integration lint).
A planning model writes a detailed spec from the concept: genre, theme, scoring formula, integration checklist, naming, trademark notes. I either author the concept myself or pull a top-voted submission from the on-site idea board.
Implementation models build the game in iterations. Each iteration runs a 60-point QA rubric (playability, visuals, fun, integration, mobile, code) and writes feedback for the next pass. Most games clear the 56-point ship threshold within 3–4 iterations.
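The iterate-and-score loop described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the `build` and `review` callables, the `Review` class, and the even split of 10 points per category are all assumptions. Only the six category names, the 60-point total, and the 56-point ship threshold come from the text.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable

MAX_SCORE = 60        # total rubric points (from the text)
SHIP_THRESHOLD = 56   # ship threshold (from the text)
# Category names from the text; 10 points each is an assumption.
CATEGORIES = ("playability", "visuals", "fun", "integration", "mobile", "code")


@dataclass
class Review:
    """Hypothetical result of one QA rubric pass."""
    scores: dict[str, int]   # points awarded per category
    feedback: str            # notes handed to the next build pass

    @property
    def total(self) -> int:
        return sum(self.scores.values())


def iterate(
    build: Callable[[str], str],
    review: Callable[[str], Review],
    max_iterations: int = 4,
) -> tuple[bool, list[int]]:
    """Build, score, and feed the feedback forward until the threshold is met.

    Returns (shipped, score_history).
    """
    feedback = ""
    history: list[int] = []
    for _ in range(max_iterations):
        game = build(feedback)          # implementation model pass
        result = review(game)           # rubric pass
        history.append(result.total)
        if result.total >= SHIP_THRESHOLD:
            return True, history
        feedback = result.feedback      # carry notes into the next pass
    return False, history
```

Capping the loop at a fixed number of iterations keeps a game that never converges from burning tokens all night; it simply falls through to manual review.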
Structural rules: leaderboard wiring, canvas sizing, schema tags, no banned imports, etc. Grep is roughly 10× cheaper than a model pass and catches a surprising amount — so the pipeline runs lint first and only spends model tokens on the things grep can't check.
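As a rough sketch of the lint-first idea, the snippet below runs cheap pattern checks and only calls the model when they all pass. The rule names and regexes here are invented for illustration (including the hypothetical `submitScore` leaderboard call); the real rule set is not described in detail.

```python
import re
from typing import Callable

# Hypothetical structural rules: patterns every game must contain.
REQUIRED = {
    "leaderboard wiring": re.compile(r"submitScore\s*\("),
    "canvas element": re.compile(r"<canvas\b"),
}

# Hypothetical banned patterns: things no game may contain.
BANNED = {
    "eval call": re.compile(r"\beval\s*\("),
    "remote script import": re.compile(r"<script[^>]+src=[\"']https?://"),
}


def lint(source: str) -> list[str]:
    """Return a list of structural failures; an empty list means clean."""
    failures = [
        f"missing: {name}"
        for name, pattern in REQUIRED.items()
        if not pattern.search(source)
    ]
    failures += [
        f"banned: {name}"
        for name, pattern in BANNED.items()
        if pattern.search(source)
    ]
    return failures


def run_gates(source: str, model_review: Callable[[str], list[str]]) -> list[str]:
    """Run lint first; spend a model pass only if lint is clean."""
    failures = lint(source)
    if failures:
        return failures            # skip the expensive model call entirely
    return model_review(source)    # model handles what patterns cannot check
```

Ordering the gates this way means a build with a missing leaderboard hook fails in milliseconds instead of after a full model review.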
Universal categories first (the OWASP-style classes), then a project-specific tier that targets the exact infrastructure this site uses. The project-specific tier catches more real issues than the universal one — generic models can miss framework-specific patterns.
Final scoring pass. If the score clears the threshold and every gate is green, the build auto-merges. If anything fails, the run goes into manual review in the morning.
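The merge decision itself reduces to a small predicate. The sketch below assumes a hypothetical `RunResult` record with one flag per gate; the field names are illustrative, and the default threshold of 56 is the ship threshold mentioned earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Hypothetical summary of one overnight run."""
    lint_ok: bool
    security_ok: bool
    rubric_score: int


def decide(result: RunResult, threshold: int = 56) -> str:
    """Auto-merge only when every gate is green and the score clears the bar."""
    all_green = result.lint_ok and result.security_ok
    if all_green and result.rubric_score >= threshold:
        return "auto-merge"
    return "manual-review"   # anything else waits for morning triage
```

For example, `decide(RunResult(True, True, 57))` yields `"auto-merge"`, while a run with a high score but a failed security gate still goes to manual review.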
A separate improvement pipeline runs a couple of times a week to deepen existing games — that's how games grow past their initial overnight build.
Eight weeks in. Concrete observations rather than abstractions:
One change to the shared main.css fixed tablet tap responsiveness across all 20+ games in a single commit instead of 20 per-game edits. Moving from a chat interface (where each game was built independently) to a unified pipeline is what made shared CSS, a universal leaderboard widget, and structural conventions consistent across every new build.
If you're interested in the longer write-ups, every game has a how-we-built post on the blog covering what surprised me, what broke, and what changed across iterations.
A few honest disclosures since the framing of "experiment" can be ambiguous: