The Overnight Pipeline: How AI Ships a Browser Game While I Sleep

Vibe Arcade has shipped 25 games in about six weeks. Not because anyone is working around the clock, but because the pipeline is. Here's exactly how it works.

By Vibe Arcade

Tags: engineering, AI pipeline, vibe coding, behind the scenes, automation

The premise sounds like a gimmick: kick off a build before bed, wake up to a new browser game ready for review. But this is the actual workflow behind Vibe Arcade — roughly 25 games shipped in six weeks, three to five successful builds per week. Some of them are genuinely good. Some get scrapped in the morning. The interesting part isn't the AI. It's all the scaffolding around the AI that keeps it from shipping junk.

This is a walkthrough of that scaffolding: what runs, in what order, and what breaks.

Where Game Ideas Come From

Every pipeline run starts with a concept. Ours come from two sources.

The first is a curated list of genre gaps: game types that are underrepresented in the free browser game space or that would fill a hole in our catalog. What are people searching for? What classic mechanics haven't gotten a modern HTML5 treatment? The list is a living document that keeps getting revised, not a fixed backlog.

The second source is community submissions. Players suggest game ideas directly, and other users vote on them. The top-voted ideas get promoted into the pipeline's input queue. The people who play the games have a direct channel to influence what gets built next. More on that loop later.
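The promotion step is simple enough to sketch. This is an illustration only, not the site's actual code: the `Idea` type, the `promote_top_ideas` function, the number of slots, and the sample titles are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    title: str
    votes: int


def promote_top_ideas(submissions, slots):
    """Return the `slots` highest-voted ideas, highest first.

    sorted() is stable, so ideas with equal votes keep submission order.
    """
    ranked = sorted(submissions, key=lambda idea: idea.votes, reverse=True)
    return ranked[:slots]


# Hypothetical submissions, made up for this example.
submissions = [
    Idea("Neon pinball", 12),
    Idea("Typing racer", 31),
    Idea("Gravity golf", 19),
]
queue = promote_top_ideas(submissions, slots=2)
print([idea.title for idea in queue])
```

A real queue would probably also need deduplication and some moderation before an idea reaches the planner; the sketch leaves both out.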

Phase 1: Planning (~10 Minutes)

The first phase uses an advanced planning model — a slower, more capable AI that's good at reasoning through design problems. Its job is to produce a detailed spec, not code: game mechanics, control scheme, visual design direction, difficulty curve, UI layout, and leaderboard integration requirements.

The spec is deliberately detailed because the next phase uses a different, faster model that needs clear instructions. Ambiguity in the spec becomes bugs in the build. A spec that says "difficulty increases over time" is worse than one that says "enemy speed increases by 8% every 30 seconds, capping at 2x base speed." The planning model's job is to eliminate ambiguity, not to be creative.
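To show how directly the precise version of that rule turns into code, here is a minimal sketch. It assumes the 8% compounds on each 30-second step; the sentence could also be read as adding 8% of base speed per step, which is exactly the kind of detail a spec has to pin down. The function and parameter names are illustrative, not taken from a real Vibe Arcade spec.

```python
def enemy_speed(base_speed, elapsed_seconds,
                step_seconds=30.0, growth=0.08, cap_multiplier=2.0):
    """Enemy speed after `elapsed_seconds` of play.

    Assumption: growth compounds once per completed step and the
    multiplier is clamped at `cap_multiplier` times the base speed.
    """
    completed_steps = int(elapsed_seconds // step_seconds)
    multiplier = min((1.0 + growth) ** completed_steps, cap_multiplier)
    return base_speed * multiplier


for t in (0, 30, 150, 300, 600):
    print(t, round(enemy_speed(100.0, t), 2))
```

Under the compounding reading the cap is reached at the tenth step, five minutes in, because 1.08 to the ninth power is still just under 2.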

Phase 2: Implementation (1-4 Hours)

A faster implementation model takes the spec and builds the game. For straightforward games, this is a single-pass build that takes an hour or two. For complex games, multiple sub-agents work on different modules in parallel: core game loop, UI layer, leaderboard integration.

The output is a complete HTML5 game — HTML, CSS, JavaScript — running in a browser with no dependencies. Every game includes mobile support, keyboard and touch input, a start screen, a game-over flow, and leaderboard wiring. The implementation model follows authoring rules that encode everything we've learned about what makes a Vibe Arcade game work. Those rules are iterated constantly — the model on day one was working from a much thinner set of instructions than the one running today.

The Quality Gates

A finished build doesn't ship. It goes through four gates — three automated, one human. The AI is impressive but unreliable. The gates are what make the pipeline trustworthy.

Gate 1: Structural Lint

The first automated check is a set of pattern-matching rules that catch structural and wiring mistakes — a grep-based linter specifically designed for our game format. Every game must call leaderboard integration correctly, include certain meta elements, and follow specific patterns for input handling and score submission.

These rules are cheap to run and they catch the most common bugs — the kind where the game works fine in isolation but doesn't integrate correctly with the site. The lint set grows over time: every new class of wiring bug gets a new rule. The first version had maybe five rules; the current version has substantially more. This is the single most effective quality gate by volume of real bugs caught. Structural checks are unglamorous and underrated.
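A pattern-matching linter of this kind can be very small. The sketch below is an assumption-heavy illustration, not our real rule set: the rule names, the regular expressions, and the `submitScore` function are all hypothetical stand-ins.

```python
import re

# Each rule: (name, pattern, must_match).
# must_match=True  -> the pattern has to appear somewhere in the source.
# must_match=False -> the pattern must not appear at all.
# All three rules are hypothetical examples.
RULES = [
    ("viewport-meta", re.compile(r'<meta\s+name="viewport"'), True),
    ("leaderboard-submit", re.compile(r"\bsubmitScore\s*\("), True),
    ("no-alert", re.compile(r"\balert\s*\("), False),
]


def lint(source):
    """Return the names of the rules that `source` violates, in rule order."""
    failures = []
    for name, pattern, must_match in RULES:
        found = pattern.search(source) is not None
        if found != must_match:
            failures.append(name)
    return failures


good = '<meta name="viewport" content="width=device-width"><script>submitScore(42)</script>'
bad = '<script>alert("game over")</script>'
print(lint(good))
print(lint(bad))
```

Adding a rule for a newly discovered class of wiring bug is then one more tuple in `RULES`, which is what keeps this gate cheap to grow.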

Gate 2: Security Checks

Automated security checks run against every build. We also have additional manual security review processes. I'm going to leave the details deliberately vague here — describing what the checks look for is counterproductive. They exist, they run on every build, and a human reviews their output.

Gate 3: Playability Rubric

The playability rubric is a scoring system, roughly 60 points across eight categories, that evaluates whether the game is worth playing. Three of those categories cover visual polish, audio cues, and leaderboard motivation. A game needs to clear a minimum score threshold to pass.

Here's the weakness, and I want to be straightforward about it: the AI scores its own work. Same class of model that built the game evaluates whether the game is good. This is asking the student to grade their own exam. A game that a human would honestly rate at 40/60 might score itself a 55. The rubric catches genuinely broken games but it's less reliable at distinguishing "technically functional but boring" from "actually fun." We've tightened the criteria several times and added negative-scoring conditions for specific failure modes. But today, the rubric is the weakest automated gate, and we know it.
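The mechanics of the rubric (category points, negative-scoring conditions, a pass threshold) can be sketched as follows. Only the three category names mentioned in this post are used; the point values, the penalty name, and the threshold are invented for illustration and do not reflect the real rubric.

```python
def score_build(category_scores, penalties, threshold):
    """Sum category points, subtract penalties, compare to the threshold.

    Returns (total, passed).
    """
    total = sum(category_scores.values()) - sum(penalties.values())
    return total, total >= threshold


# Hypothetical numbers; the real rubric has eight categories.
scores = {"visual_polish": 6, "audio_cues": 5, "leaderboard_motivation": 7}
penalties = {"no_restart_button": 4}  # hypothetical negative-scoring condition

total, passed = score_build(scores, penalties, threshold=12)
print(total, passed)
```

Nothing in this arithmetic fixes the self-grading problem. The numbers going in are still the model's own opinion, which is why the threshold alone cannot separate "functional" from "fun".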

Gate 4: The Human

A human reviews every pull request before it merges. Every single one. This is the most important gate in the pipeline and it is not optional.

The morning review: open the PR, launch the game, play it. Does the difficulty curve feel fun, not just present? Do the controls work on a phone? Then it goes to real users — family, kids, friends who don't know or care how it was built. "This is boring after 30 seconds" is more valuable than any automated score.

Based on that review, a game ships, gets sent back for targeted fixes, or gets discarded. The discard rate is real. The pipeline is cheap enough that throwing away a failed build is fine. The human's job isn't to fix the AI's work; it's to judge whether the work is worth fixing.

Auto-Merge vs. Morning Triage

If all three automated gates pass above their thresholds, the build auto-merges. If anything flags, it waits for morning triage — fix it, redirect it to a different concept, or discard it. The real cadence: three to five successful games per week. "Successful" means passed all gates, survived human review, and shipped. The total number of pipeline runs is higher; some builds fail automation, some pass automation but get rejected by a human.
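The routing rule in that paragraph boils down to a few lines. This is a hedged sketch with hypothetical names (`GateResult`, `route_build`, the gate labels), not the pipeline's actual merge logic.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    name: str
    passed: bool


def route_build(results):
    """Decide where a build goes after the automated gates.

    Returns ("auto-merge", []) when every gate passed, otherwise
    ("morning-triage", [names of the gates that flagged]).
    """
    flagged = [result.name for result in results if not result.passed]
    if flagged:
        return "morning-triage", flagged
    return "auto-merge", []


clean = [GateResult("lint", True), GateResult("security", True), GateResult("rubric", True)]
shaky = [GateResult("lint", True), GateResult("security", True), GateResult("rubric", False)]
print(route_build(clean))
print(route_build(shaky))
```

Returning the list of flagged gates alongside the decision gives the morning reviewer a starting point instead of a bare "failed".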

Fix Once, Fix Everywhere

This is the principle that makes the pipeline get better over time rather than just get bigger.

Every bug fix gets captured in a shared place so it can't recur. Three canonical locations: global styles that every game inherits, lint rules that every future build gets checked against, and the authoring instructions the implementation model follows.

Example: early games had a touch-input issue on tablets — buttons required a double-tap. The fix was a single CSS rule in the global stylesheet. One commit, every game on the site got the fix. No per-game patches. Similarly, when we found games occasionally wiring up leaderboard submission incorrectly, we added a lint rule. Every future build now gets that check automatically.

This compounding effect is the real force multiplier. Game number 25 is built on all the lessons from games 1 through 24. The authoring rules are thicker, the lint set is larger, the global styles handle more edge cases. Each game is slightly better than the last not because the AI is improving, but because the guardrails are.

What Doesn't Work

Honesty requires listing the things this pipeline cannot do well, at least not yet.

Game art. The pipeline produces CSS-based visuals — geometric shapes, particle effects, the neon aesthetic. It cannot produce original sprite art or character illustrations. Every game looks like it belongs to the same family, which is simultaneously a brand strength and an artistic limitation.

Novel mechanics. The AI is excellent at executing known game mechanics — snake, breakout, tower defense, typing games. It's less reliable at inventing genuinely new ones. When the spec asks for something that doesn't map to an existing genre, the implementation model falls back on familiar patterns. Innovation still requires human design thinking.

Multi-session projects. The pipeline is optimized for single-overnight builds. Games that need multiple build-test-iterate cycles over several days are possible but awkward — the pipeline loses context between sessions. Our most complex games were built partially outside the pipeline with more hands-on human direction.

Self-assessment. The pipeline cannot reliably tell you whether a game is fun. It can tell you the game is structurally correct, secure, and meets a checklist of design criteria. "Fun" is a human judgment. I'm not sure it can be automated.

The Community Loop

The part of this system I find most interesting isn't the pipeline — it's the feedback cycle with the people who play the games.

Users suggest ideas. Others vote. The highest-voted ideas enter the pipeline. The pipeline builds them overnight. Users play the results, which sparks new ideas, which get submitted and voted on. The loop: suggest, vote, build, play, suggest again.

This solves the hardest problem in the pipeline: deciding what to build. The AI can build almost anything you spec. The question is whether anyone wants to play it. Community voting is an imperfect proxy for demand, but better than guessing. And when users see their suggestions become real games within a week, the suggestion rate goes up. The loop accelerates itself.

The pipeline's real contribution isn't speed — it's that the gap between "someone has an idea" and "someone is playing the result" is short enough to sustain a feedback loop. Fast enough to throw away a bad build and try a different approach tomorrow night.

The System Builds; The Human Judges

AI-assisted game development is not AI-generated games. The pipeline is roughly 90% scaffolding and guardrails — lint rules, security checks, authoring instructions, the playability rubric, the global fix infrastructure. The AI provides the fast hands. The accumulated rules written by a human are what keep it from shipping junk.

The pipeline is impressive. But the most important part is the person who plays the game in the morning, shows it to their kids, and decides whether it's actually good. That part doesn't automate.


Related reading: What Is Vibe Coding? · How We Built Neon Snake With AI · Vibe Coding Tools: From Chatbots to AI IDEs · Building a Universal Leaderboard