Compound Engineering vs. gstack vs. Karpathy’s Autoresearch vs. Superpowers vs. Recursive Self-Improvement

May 23, 2026


These five things get lumped together, but they sit at different layers of the stack: Compound Engineering is a philosophy, gstack a role library, Autoresearch research automation, Superpowers an agentic skills framework, and Recursive Self-Improvement a theoretical capability. Here’s how they actually differ and when to reach for each.

1. Compound Engineering (Kieran Klaassen / Every)

A philosophy and workflow, not a tool. Every unit of work is treated as an investment so the next unit is cheaper. The loop is Brainstorm → Plan → Work → Review → Compound, with the compound step explicitly codifying patterns, decisions, and failures into files (AGENTS.md, skills, docs) that future agents consult automatically.

Form factor: A reference plugin for Claude Code (27 agents, 14 skills, 20 commands, MCP server) plus a written methodology.
Time split: ~80% planning + review, ~20% coding
Learning mechanism: Human triggers /workflows:compound after a feature; agent reflects and writes lessons into project knowledge files.
Strength: Durable institutional memory for a real codebase shipped by humans + agents.
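
The compound step can be pictured as an append to a project knowledge file. What follows is a minimal Python sketch, not the plugin's actual implementation: the function name `compound`, the heading format, and the duplicate check are assumptions made for illustration. Only the idea of writing lessons into a file such as AGENTS.md comes from the description above.

```python
from datetime import date
from pathlib import Path


def compound(knowledge_file: Path, feature: str, lessons: list[str]) -> int:
    """Append lessons from a finished feature to a project knowledge file.

    Hypothetical helper: skips lessons already recorded as bullets and
    returns how many new lessons were written.
    """
    existing = knowledge_file.read_text(encoding="utf-8") if knowledge_file.exists() else ""
    known = {line.strip() for line in existing.splitlines()}
    # dict.fromkeys drops duplicates within this batch while keeping order.
    fresh = [item for item in dict.fromkeys(lessons) if f"- {item}" not in known]
    if not fresh:
        return 0
    # Assumed layout: one dated section per feature, one bullet per lesson.
    block = [f"\n## Lessons from {feature} ({date.today().isoformat()})\n"]
    block += [f"- {lesson}\n" for lesson in fresh]
    with knowledge_file.open("a", encoding="utf-8") as fh:
        fh.writelines(block)
    return len(fresh)
```

Calling `compound(Path("AGENTS.md"), "checkout flow", ["Pin the API version"])` after a feature would add one dated section. Because the file only grows, every later agent run that reads it starts with the accumulated lessons, which is the sense in which the work compounds.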

2. gstack (Garry Tan)

An opinionated personal skill pack for Claude Code — 23 tools that turn the agent into a virtual org: CEO, Eng Manager, Designer, Reviewer/QA Lead, Security Officer (OWASP/STRIDE), Release Engineer, Doc Engineer, Retro, etc. Tan reports producing 10–20k LOC/day with it while running YC.

Form factor: Slash commands + skills you drop into Claude Code.
Philosophy: Role-based delegation — one human plays general contractor across multiple specialist personas.
Learning mechanism: Implicit, lives in the prompts/skills themselves; the user updates the skills as their taste evolves.
Strength: Fast cold-start. You inherit Tan’s opinions on shipping discipline (design polish, QA in a real browser, security review, release hygiene) without inventing them.

3. Karpathy’s Autoresearch

A fundamentally different game. It’s an autonomous ML research harness where an AI agent mutates train.py, runs a 5-minute training experiment, scores the result on a single scalar metric, keeps the change or reverts it via git, and loops. At 5 minutes per run, that is ~12 experiments/hour, or ~100 overnight.

Form factor: Three files: program.md (the research direction in plain English), train.py (the only thing the agent edits), prepare.py (locked).
Result: ~700 experiments over 2 days yielded ~20 real improvements and an 11% speedup on already-tuned nanochat. Tobi Lütke trained a 0.8B model overnight that beat his prior 1.6B by 19%.
Learning mechanism: Git history is the memory; branch log is read before each next hypothesis.
Strength: Fits any problem with one editable file and one measurable scalar, such as model training, prompt tuning, or kernel optimization.
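
The keep-or-revert loop described above is a greedy hill climb. The sketch below shows that control flow in Python under loud assumptions: it is not Karpathy's code, a dictionary stands in for train.py, `toy_score` stands in for a 5-minute training run, `toy_mutate` stands in for the agent's edit, "lower is better" is assumed for the metric, and an in-memory log stands in for git history. All names are hypothetical.

```python
import random
from typing import Callable

Config = dict[str, float]

# Throughput implied by a 5-minute experiment budget.
MINUTES_PER_EXPERIMENT = 5
EXPERIMENTS_PER_HOUR = 60 // MINUTES_PER_EXPERIMENT  # 12 per hour


def keep_or_revert(
    initial: Config,
    mutate: Callable[[Config, random.Random], Config],
    score: Callable[[Config], float],
    n_experiments: int,
    seed: int = 0,
) -> tuple[Config, float, list[tuple[int, float, bool]]]:
    """Greedy loop: try a mutation, keep it only if the score improves."""
    rng = random.Random(seed)
    best, best_score = dict(initial), score(initial)
    log: list[tuple[int, float, bool]] = []  # stand-in for the git history
    for i in range(n_experiments):
        candidate = mutate(best, rng)   # the "edit to train.py"
        s = score(candidate)            # the "training run"
        kept = s < best_score           # assumption: lower is better
        if kept:
            best, best_score = candidate, s  # analogous to committing
        # Otherwise the candidate is discarded, analogous to reverting.
        log.append((i, s, kept))
    return best, best_score, log


def toy_score(cfg: Config) -> float:
    """Stand-in for a training run: pretend loss is minimized at lr = 0.1."""
    return (cfg["lr"] - 0.1) ** 2


def toy_mutate(cfg: Config, rng: random.Random) -> Config:
    """Stand-in for the agent's edit: nudge one hyperparameter."""
    return {**cfg, "lr": cfg["lr"] + rng.gauss(0.0, 0.05)}
```

Running `keep_or_revert({"lr": 0.5}, toy_mutate, toy_score, n_experiments=100)` returns the best configuration found, its score, and the per-experiment log. The real harness differs in the part that matters most: the mutation is proposed by an agent that has read the log, not drawn at random.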

4. Superpowers (obra)

An agentic skills framework for Claude Code that enforces test-driven, spec-driven development, with each sub-agent getting a fresh context window per task. Skills cover brainstorm → write spec → write plan → execute (with checkpoints) → code review.

Form factor: Plugin marketplace install for Claude Code.
Philosophy: Process discipline (TDD, checkpointed sub-agent execution) over personas.
Learning mechanism: The skill library itself is the curated knowledge; sub-agents prevent context pollution between tasks.
Strength: Best when you want guardrails: specs before code, tests before implementation, and fresh context per task to prevent drift.
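
Fresh context per task and checkpointed execution can be sketched as a small orchestrator. This is an illustrative Python sketch, not Superpowers' implementation: `TaskContext`, `run_plan`, and the `worker` and `checkpoint` callables are hypothetical names, and a plain function call stands in for dispatching a sub-agent.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class TaskContext:
    """Everything one sub-agent is allowed to see for one task."""
    spec: str
    task: str
    scratch: list[str] = field(default_factory=list)  # per-task working notes


def run_plan(
    spec: str,
    tasks: list[str],
    worker: Callable[[TaskContext], str],
    checkpoint: Callable[[str, str], bool],
) -> list[str]:
    """Run tasks in order, each in a new context, stopping at a failed checkpoint."""
    completed: list[str] = []
    for task in tasks:
        # A new context per task: nothing from earlier tasks leaks in.
        ctx = TaskContext(spec=spec, task=task)
        result = worker(ctx)
        # Checkpoint: a review gate before moving on to the next task.
        if not checkpoint(task, result):
            break
        completed.append(result)
    return completed
```

The design choice to notice is that `scratch` is created inside the loop. Whatever a worker accumulates while solving one task is thrown away before the next begins, so only the spec and the task description are shared.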

5. Recursive Self-Improvement (RSI)

The umbrella theoretical capability, not a product: a system that improves the very mechanism that produces its improvements. Karpathy publicly framed autoresearch as a small, scoped step toward RSI — nanoGPT was first a teaching repo, then a llm.c port target, then a research harness, and now a benchmark for LLM coding agents that optimize their own training.

Form factor: A property a system exhibits, not a repo. Autoresearch is an instance constrained to one file and one metric; broader RSI would let the agent modify its own evaluation, tooling, and goals.
Relation to the others: Compound Engineering is human-in-the-loop RSI at the workflow level (lessons from one loop feed the next). Autoresearch is bounded RSI at the model-training level (the agent rewrites train.py and nothing else). gstack and Superpowers are not RSI; they’re fixed scaffolding around an unchanging agent.
Strength of the concept: Useful as a design lens. For any system, ask: what is the loop, what gets codified, and who closes it?
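
One way to apply the lens is to write the three questions down as a record and fill it in for each system. The Python sketch below does that using only what the sections above state. The `LoopLens` type and its field names are assumptions made for illustration, and `None` marks something the descriptions above do not supply.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LoopLens:
    name: str
    learning_loop: Optional[str]  # what is the loop? None if there is no feedback loop
    codified_in: str              # what gets codified, and where?
    updated_by: Optional[str]     # who closes it? None if not stated above


SYSTEMS = [
    LoopLens(
        "Compound Engineering",
        "Brainstorm -> Plan -> Work -> Review -> Compound",
        "project knowledge files (AGENTS.md, skills, docs)",
        "human, via /workflows:compound",
    ),
    LoopLens("gstack", None, "prompts and skills", "user, editing skills by hand"),
    LoopLens(
        "Autoresearch",
        "mutate train.py -> 5-minute run -> score -> keep or revert",
        "git history",
        "agent",
    ),
    LoopLens("Superpowers", None, "curated skill library", None),
]


def with_learning_loop(systems: list[LoopLens]) -> list[str]:
    """Names of the systems whose output feeds back into their next iteration."""
    return [s.name for s in systems if s.learning_loop is not None]
```

Filtering on `learning_loop` reproduces the split drawn above: Compound Engineering and Autoresearch feed results back into the next iteration, while gstack and Superpowers change only when a person edits them.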

Quick Comparison

Continue reading this post for free, courtesy of Ken Huang.


Agentic AI