About Vibe Arcade — An Experiment in Autonomous AI Coding

What This Is

Vibe Arcade is a one-person experiment, started in early 2026, to find out what current AI models can actually do autonomously across long-running, real-world software work — and, just as importantly, where the rough edges are and where a human still has to stay in the loop.

The setup: an overnight pipeline plans a new HTML5 game, writes it, runs lint + security + a playability rubric, and commits if every gate passes. If anything fails, I triage it in the morning. 30+ playable games have come out of this so far. Every one of them is free in the browser — no signup, no install, no ads, no email capture.

The games themselves are a side product. The real output is a set of running notes — what the AI surprised me with, where it failed quietly, what got better between model versions, what kinds of tasks I now trust it with versus the kinds I still keep tight human review on. I write these up as how-we-built posts on the blog. Each one is a data point in the larger experiment.

Benchmarks are easy to game; this site is one attempt at a real-world tracking signal instead.

What I'm Trying to Figure Out

Three open questions drive the experiment:

1. How much of a non-trivial software project can the AI handle autonomously?

Game by game, what fraction of the work clears every gate without human intervention? Eight weeks in, more nights ship cleanly than don't — but the failure modes are still informative when they happen.

2. Where do the rough edges show up — and how do they shift as models improve?

Some categories of failure recur: content-quality problems that pass structural lint, novel mechanics that need many human QA passes, and cross-environment iframe issues. Tracking them across model versions tells me what's actually getting better in practice, not on benchmarks.

3. Where does a human still have to stay in the loop?

Auto-merge requires a clean pass on security, lint, and the playability rubric. I keep manual review on for anything that touches core architecture, anything content-quality-sensitive, and anything that could fail silently in production. The line between those buckets keeps moving as the rubric and lint gates get better.

How the Pipeline Works

The pipeline is the experimental apparatus. Each game runs through the same gate sequence; results compound across runs because every game is built against the same shared infrastructure (CSS, leaderboard widget, integration lint).

1. Spec

A planning model writes a detailed spec from the concept: genre, theme, scoring formula, integration checklist, naming, trademark notes. I either author the concept myself or pull a top-voted submission from the on-site idea board.

2. Scaffold + iterate

Implementation models build the game in iterations. Each iteration runs a 60-point QA rubric (playability, visuals, fun, integration, mobile, code) and writes feedback for the next pass. Most games clear the 56-point ship threshold within 3–4 iterations.
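
As an illustration only (the real pipeline is private), the build-review-feedback loop might look roughly like the sketch below. The six category names, the 60-point maximum, and the 56-point threshold come from the description above; the even 10-points-per-category split, the `Review` class, the `iterate` function, and the `max_iterations` cap are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Category names are from the text; 10 points each is an ASSUMED split of the 60.
CATEGORIES = ("playability", "visuals", "fun", "integration", "mobile", "code")
MAX_PER_CATEGORY = 10
SHIP_THRESHOLD = 56  # from the text


@dataclass
class Review:
    """One QA pass: a score per category plus feedback for the next build."""
    scores: dict[str, int]
    feedback: str

    @property
    def total(self) -> int:
        return sum(self.scores[c] for c in CATEGORIES)


def iterate(
    build: Callable[[str], object],      # hypothetical: feedback in, game build out
    review: Callable[[object], Review],  # hypothetical: game build in, Review out
    max_iterations: int = 6,             # assumed cap, not stated in the text
) -> tuple[Optional[int], list[int]]:
    """Rebuild until the rubric total clears the threshold or the cap is hit.

    Returns (iteration that cleared the threshold, or None; totals per iteration).
    """
    feedback = ""
    history: list[int] = []
    for i in range(1, max_iterations + 1):
        artifact = build(feedback)
        result = review(artifact)
        history.append(result.total)
        if result.total >= SHIP_THRESHOLD:
            return i, history
        feedback = result.feedback  # the next pass sees what the rubric flagged
    return None, history
```

The important design point is that each pass receives the previous pass's feedback, so the rubric steers the next build instead of only grading the last one.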

3. Lint gate (grep-based)

Structural rules: leaderboard wiring, canvas sizing, schema tags, no banned imports, etc. Grep is roughly 10× cheaper than a model pass and catches a surprising amount — so the pipeline runs lint first and only spends model tokens on the things grep can't check.
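
A grep-style gate can be very small. The sketch below is a hedged illustration, not the site's actual lint: the three rules, their names, and the `leaderboard.js` filename are hypothetical stand-ins for the kinds of structural checks listed above.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    name: str
    pattern: str
    must_match: bool  # True: pattern is required; False: pattern is banned


# HYPOTHETICAL rules for illustration; the real rule set is not published.
RULES = [
    Rule("leaderboard-wired", r"<script[^>]+leaderboard\.js", True),
    Rule("canvas-present", r"<canvas\b", True),
    Rule("no-eval", r"\beval\s*\(", False),
]


def lint(source: str, rules: list[Rule] = RULES) -> list[str]:
    """Return the names of the rules the source violates (empty list = pass)."""
    failures = []
    for rule in rules:
        found = re.search(rule.pattern, source) is not None
        if found != rule.must_match:
            failures.append(rule.name)
    return failures
```

Because every rule is a plain regular expression, a pass like this costs no model tokens, which is what makes it reasonable to run before any model-based check.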

4. Security gate (two-tier)

Universal categories first (the OWASP-style classes), then a project-specific tier that targets the exact infrastructure this site uses. The project-specific tier catches more real issues than the universal one — generic models can miss framework-specific patterns.
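
One way to picture the two tiers is as ordered groups of checks, with each finding tagged by the tier that raised it. Everything in this sketch is an assumption made for illustration: the check names, the regular expressions, and especially the "project" check, which is a made-up example of a site-specific rule and not one of the real ones.

```python
import re
from typing import Callable

Check = Callable[[str], bool]  # returns True when the source looks risky


def matches(pattern: str) -> Check:
    """Build a check that flags any source containing the pattern."""
    return lambda src: re.search(pattern, src) is not None


# Both tiers are ILLUSTRATIVE placeholders, not the site's real checks.
TIERS: list[tuple[str, dict[str, Check]]] = [
    ("universal", {
        "dynamic-code-execution": matches(r"\b(eval|Function)\s*\("),
        "unsanitized-html-sink": matches(r"\.innerHTML\s*="),
    }),
    ("project", {
        # Hypothetical site-specific rule: postMessage with a wildcard target origin.
        "wildcard-postmessage": matches(r"postMessage\([^)]*,\s*['\"]\*['\"]"),
    }),
]


def security_scan(source: str) -> list[tuple[str, str]]:
    """Run the universal tier, then the project tier; return (tier, check) findings."""
    findings = []
    for tier, checks in TIERS:
        for name, check in checks.items():
            if check(source):
                findings.append((tier, name))
    return findings
```

Tagging findings by tier is what would make it possible to compare how many real issues each tier catches over time.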

5. Playability rubric + ship decision

Final scoring pass. If the score clears the threshold and every gate is green, the build auto-merges. If anything fails, the run goes into manual review in the morning.
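
The ship rule itself reduces to a single conjunction. In the sketch below, the 56-point threshold is from the text; the `GateResults` shape, the function name, and reading "clears the threshold" as "greater than or equal to" are assumptions.

```python
from dataclasses import dataclass

SHIP_THRESHOLD = 56  # from the text, out of 60


@dataclass(frozen=True)
class GateResults:
    """Hypothetical summary of one overnight run."""
    lint_failures: list[str]
    security_findings: list[str]
    rubric_score: int


def ship_decision(results: GateResults) -> str:
    """Auto-merge only if every gate is green AND the rubric clears the threshold."""
    gates_green = not results.lint_failures and not results.security_findings
    if gates_green and results.rubric_score >= SHIP_THRESHOLD:
        return "auto-merge"
    return "manual-review"  # anything else waits for morning triage
```

Note that a high rubric score cannot compensate for a red gate: a 60-point game with one security finding still goes to manual review.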

A separate improvement pipeline runs a couple of times a week to deepen existing games — that's how games grow past their initial overnight build.

What I've Found So Far

Eight weeks in. Concrete observations rather than abstractions:

The AI picks structural architecture without being asked, and often picks well. Path Runner shipped with a segmented procedural track model I never specified — segments spawn ahead of the player and despawn behind. It turns out that's exactly the right answer for an endless runner because it sidesteps browser memory limits. Mini Cross shipped iter-4 with a six-tier streak celebration ladder I never asked for. Deadlock proposed Castle-Doombad-style narrow-room layouts in a single commit when the spec said almost nothing about geometry.
Content-quality bugs slip past structural lint. Mini Cross iter-1 shipped crossword answers that weren't real English words. They validated as data and the lint passed; only a human, or the QA gate's solve-the-puzzle pass, could catch the problem. Lint and security can verify shape; they can't verify meaning.
Fix-once-globally pays compound interest. A single CSS rule in main.css fixed tablet tap responsiveness across all 20+ games in one commit instead of 20 per-game edits. Moving from a chat interface (where each game was built independently) to a unified pipeline is what made shared CSS, a universal leaderboard widget, and structural conventions consistent across every new build.
Genuinely novel mechanics still need a lot of human QA. When I'm specific about what a game should look like and how its rules work, the loop usually lands in 3–4 iterations. Anything where the rules need to be invented takes many more passes — and human playtesting is the only thing that catches the "this is technically working but it isn't fun" failure mode.
"Vibe coding" oversells the prompts and undersells the rules. The pipeline is roughly 90% scaffolding and guardrails. The AI is fast hands; the rules I wrote (the rubric, the lint, the structural conventions) are what keep it from shipping junk. Vibe coding implies vibes-first; this is rules-first, AI-executes-fast.

If you're interested in the longer write-ups, every game has a how-we-built post on the blog with what surprised me, what broke, and what changed across iterations.
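
For readers curious about the segmented track idea mentioned in the Path Runner observation, here is a minimal sketch of the general technique. It is not the game's actual code: the class names, segment length, and lookahead distance are all assumptions chosen for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Segment:
    start: float
    length: float

    @property
    def end(self) -> float:
        return self.start + self.length


class Track:
    """Keeps only the segments near the player, however far the run goes."""

    def __init__(self, segment_length: float = 100.0, lookahead: float = 300.0):
        self.segment_length = segment_length  # assumed value
        self.lookahead = lookahead            # assumed value
        self.segments: deque[Segment] = deque()
        self._next_start = 0.0

    def update(self, player_x: float) -> None:
        # Spawn ahead: make sure the track extends past the lookahead distance.
        while self._next_start < player_x + self.lookahead:
            self.segments.append(Segment(self._next_start, self.segment_length))
            self._next_start += self.segment_length
        # Despawn behind: drop segments the player has fully passed.
        while self.segments and self.segments[0].end < player_x:
            self.segments.popleft()
```

With these numbers, the number of live segments stays at a handful whether the player has travelled 100 units or 10,000, which is the property that keeps memory use flat in an endless runner.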

Free, Solo, Closed-Source

A few honest disclosures since the framing of "experiment" can be ambiguous:

Free for players. No signup, no account, no email capture, no ads, no in-app purchases, no paywall. Leaderboards work with a chosen display name on a per-game basis.
One person. No team, no studio, no investors. The "we" in earlier versions of this page was aspirational; corrected.
Pipeline is closed-source. The pipeline is the experimental apparatus, and sharing it would replace independent reproductions with one shared codebase — collapsing the comparison signal the experiment depends on. Architecture is described above and on the blog; the implementation is private.
What I'll do with what I learn. Use it to decide where I lean on AI in my day-to-day work: more on the tasks it's actually good at, and with more care on work where the stakes are higher and the failure mode is harder to detect.