The 630 Lines That Changed AI Research
March 9, 2026 · 3 min read
On March 8, 2026, Andrej Karpathy released Autoresearch, a 630-line open-source project that ran 89 experiments overnight on a single H100 GPU, autonomously improving a GPT language model's validation loss from 0.9979 to 0.9773 bits-per-byte. Zero human intervention. Zero crashes.
The reaction was seismic: 8.6M+ views, 22K+ GitHub stars in days, and a distributed team (Hyperspace AI) scaling it to 35 agents across a P2P network within 48 hours.
But the real insight isn't about training language models. It's about the pattern: a universal loop for autonomous improvement that applies far beyond ML research.
At AGNT, we've been building toward this same pattern from a different angle. Where Karpathy asks "can an AI agent autonomously improve a neural network?", we ask:
Can AI agents autonomously improve themselves, learning reusable skills from every task they complete?
This article is a deep technical comparison: what Autoresearch does, what AGNT already has, where the patterns converge, and how we're extending the concept into something we call SkillForge, an autonomous skill evolution system that turns every goal execution into a learning opportunity.
The Universal Pattern: Scientific Method as Code
Before comparing architectures, let's extract the abstract pattern that both systems implement. Strip away the specifics and you get the oldest algorithm in human civilization:
1. HYPOTHESIZE – Propose a change
2. EXPERIMENT – Execute the change under controlled conditions
3. EVALUATE – Measure the outcome against a fixed metric
4. DECIDE – Keep if improved, discard if not
5. REPEAT – Feed learnings back into the next hypothesis

This is the scientific method. It's also gradient descent. It's also natural selection. Every system that learns, biological or artificial, runs some variant of this loop.
The breakthrough in 2026 isn't the loop itself. It's that LLMs are now good enough to be the hypothesis generator, replacing random mutations (evolutionary algorithms) or grid searches (hyperparameter tuning) with informed, context-aware proposals that understand why something might work.
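Stripped to code, the loop is only a few lines. A minimal JavaScript sketch, where `propose`, `run`, and `score` are hypothetical stand-ins for the LLM proposal step, the experiment runner, and the fixed metric (lower is better, like val_bpb):

```javascript
// Generic hypothesize → experiment → evaluate → decide loop.
// `propose`, `run`, and `score` are hypothetical stand-ins.
function improve(artifact, { propose, run, score, iterations = 10 }) {
  let best = artifact;
  let bestScore = score(run(best));
  const history = [];
  for (let i = 0; i < iterations; i++) {
    const candidate = propose(best, history);          // HYPOTHESIZE
    const result = run(candidate);                     // EXPERIMENT
    const candidateScore = score(result);              // EVALUATE
    const kept = candidateScore < bestScore;           // DECIDE (lower is better)
    if (kept) { best = candidate; bestScore = candidateScore; }
    history.push({ candidate, candidateScore, kept }); // feeds the next hypothesis
  }
  return { best, bestScore, history };
}
```

Everything that follows is a question of what `artifact`, `propose`, and `score` concretely are.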
Let's see how both systems implement this pattern.
Autoresearch: The Reference Implementation
Architecture
Karpathy's design is radically minimal: three files.

| File | Role | Who Edits It |
|---|---|---|
| `prepare.py` | Data prep, tokenizer, evaluation utilities | Nobody (fixed infrastructure) |
| `train.py` | Full GPT model + optimizer + training loop | The AI agent |
| `program.md` | Natural-language research instructions | The human |
The Loop
The AI agent (Claude, via Cursor IDE) runs this cycle indefinitely:
1. Read `program.md` + current `train.py` + git history of past results
2. Propose a code modification (architecture change, hyperparameter tweak, new technique)
3. Edit `train.py` directly
4. Train for exactly 5 minutes of wall-clock time (fixed budget)
5. Evaluate validation bits-per-byte (val_bpb), a single, vocabulary-independent metric
6. Decide: if val_bpb improved → `git commit`; if worse → `git revert`
7. Repeat from step 1
Key Design Decisions
Fixed time budget (5 minutes): Every experiment gets the same wall-clock time regardless of what changed. This makes results directly comparable even when the agent changes model size, batch size, or architecture. It also optimizes for your specific hardware, finding the best model for the GPU you actually have.
Single metric (val_bpb): Bits-per-byte is vocabulary-size-independent, meaning the agent can freely change tokenizer settings and still compare results. One number, lower is better, no ambiguity.
Git as memory: Every successful change is committed. The agent builds on a traceable evolutionary history. Failed experiments are reverted cleanly.
Code, not config: The agent modifies arbitrary Python code, not a YAML config. As Karpathy noted on Hacker News: "The notion of a 'hyperparameter' dissolves. There is no need to run 'sweeps' – because LLM agents are sequential, they can do binary search to narrow in on the right setting quickly."
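For a single numeric setting whose metric is roughly unimodal, that sequential narrowing can be sketched as a ternary-style search. `evaluate` is a hypothetical stand-in that trains under the fixed budget and returns val_bpb for a given setting value:

```javascript
// Narrow in on the setting that minimizes the metric by shrinking the
// interval by a third each step. Assumes `evaluate` is roughly unimodal
// over [lo, hi]; in practice each call is a full budgeted training run.
function narrowSetting(evaluate, lo, hi, steps = 6) {
  for (let i = 0; i < steps; i++) {
    const m1 = lo + (hi - lo) / 3;
    const m2 = hi - (hi - lo) / 3;
    if (evaluate(m1) < evaluate(m2)) hi = m2; // minimum lies left of m2
    else lo = m1;                             // minimum lies right of m1
  }
  return (lo + hi) / 2;
}
```

Each step costs two runs; the point is that a handful of sequential, informed experiments replaces an entire sweep.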
Results
In the published overnight run:
- 89 experiments total
- 15 kept, 74 discarded, 0 crashes
- val_bpb: 0.9979 → 0.9773 (steady improvement)
- Biggest win: halving batch size (counterintuitive: more gradient steps beat bigger batches under fixed time)
- In a subsequent 2-day run: ~700 experiments, ~20 additive improvements, cutting "Time to GPT-2" by 11%
AGNT: The Autonomous Agent Platform
AGNT approaches the same pattern from a fundamentally different starting point. Where Autoresearch improves a single artifact (a training script), AGNT improves the agents themselves, building reusable knowledge that compounds across every future task.
Architecture (What Already Exists)
AGNT's goal execution system already implements most of the autonomous loop:
| Component | File | Role |
|---|---|---|
| GoalProcessor | `backend/src/services/goal/GoalProcessor.js` | Decomposes goals into ordered tasks with dependencies |
| TaskOrchestrator | `backend/src/services/goal/TaskOrchestrator.js` | Executes tasks via the AGI loop: execute → evaluate → replan → repeat |
| GoalEvaluator | `backend/src/services/goal/GoalEvaluator.js` | LLM-based scoring of task outputs against success criteria |
| GoalIterationModel | `backend/src/models/GoalIterationModel.js` | Stores per-iteration snapshots: score, world state, replanned tasks |
| GoldenStandardModel | `backend/src/models/GoldenStandardModel.js` | Archives high-performing goal templates for reuse |
| SkillModel | `backend/src/models/SkillModel.js` | CRUD for skills, exportable as `.SKILL.md` files |
| SkillService | `backend/src/services/SkillService.js` | Export/import skills, inject into agent system prompts via XML |
| Git Checkpoints | `GoalService._gitCheckpoint()` | Commits state per iteration on goal-specific branches |
The Existing AGI Loop
Inside TaskOrchestrator.executeGoalAutonomous(), AGNT already runs a loop that mirrors Autoresearch:
```
for each iteration (max 50):
  1. Execute pending tasks (tools, agent chat, code execution)
  2. GoalEvaluator.evaluateGoal() → LLM scores outputs
  3. If score >= threshold → complete
  4. If score < threshold → replan tasks, update world state
  5. Git checkpoint → commit iteration state
  6. Repeat
```

The goal system already has:

- ✅ Autonomous execution with configurable iteration limits
- ✅ LLM-based evaluation scoring outputs against success criteria
- ✅ World state tracking across iterations (accumulated context)
- ✅ Git memory via `_gitCheckpoint()`
- ✅ Agent-task matching that scores agents by tool overlap + success rate
- ✅ Skills system with `.SKILL.md` export and system prompt injection
- ✅ Golden standards for archiving high-performing patterns
What's Missing: The Bridge
The gap, and the opportunity, is the feedback loop between execution and skills. Currently:

- Goals execute and produce outputs ✅
- Outputs get evaluated and scored ✅
- But the knowledge gained during execution dies with the goal ❌
- Skills exist but are manually created ❌
- No mechanism to test whether a skill actually helps ❌
- No evolutionary pressure on skills, no keep/discard cycle ❌
This is exactly the gap that Autoresearch's pattern fills.
Side-by-Side: The Full Comparison
| Dimension | Karpathy's Autoresearch | AGNT (Current) | AGNT + SkillForge (Proposed) |
|---|---|---|---|
| What evolves | `train.py` (a training script) | Nothing (skills are static) | `.SKILL.md` files (agent instructions) |
| Who proposes changes | LLM agent (Claude via Cursor) | N/A | TraceAnalyzer (LLM-as-Judge) |
| Experiment budget | Fixed 5 min wall-clock | Fixed iteration count (max 50) | Fixed iteration count per A/B test |
| Evaluation metric | val_bpb (single float) | GoalEvaluator score (0-100) | Skill Effectiveness Score (composite) |
| Memory system | Git commits | Git checkpoints ✅ + GoalIterationModel ✅ | + SkillVersionModel (evolution lineage) |
| Keep/discard | `git commit` / `git revert` | Goal succeeds/fails | Skill A/B test: delta > 0 → keep |
| Gold standard | N/A (best model is latest commit) | GoldenStandardModel ✅ | Promote skills scoring >90% SES |
| Human interface | `program.md` (research instructions) | Goal success criteria | Goal success criteria + eval rubric |
| Scope | Single model, single GPU | Multi-agent, multi-tool, any task | Multi-agent with compounding skills |
| Knowledge transfer | Manual (read git log) | Manual (create skill by hand) | Automatic (trace → skill → inject) |
| Agent count | 1 | Many (per-agent assignment) | Many (skills shared across agents) |
| Infrastructure | 630 lines, 3 files | Full platform (backend, frontend, DB) | +4 new service files, +2 DB tables |
Where AGNT Is Already Ahead
It's worth highlighting where AGNT's existing architecture already exceeds what Autoresearch provides:
1. Multi-Agent Orchestration
Autoresearch runs a single agent modifying a single file. AGNT's AgentTaskMatcher dynamically assigns the best agent for each subtask based on tool overlap, success rate, and skill match. When SkillForge generates a new skill, it can be shared across the entire agent fleet: one agent's learning benefits all agents.
2. Structured Evaluation
Autoresearch uses a single number (val_bpb). AGNT's GoalEvaluator already performs multi-dimensional LLM-based evaluation:
```javascript
// From GoalEvaluator.aiEvaluateTaskOutput()
// The evaluator scores against specific success criteria,
// provides explanations, and suggests next actions
{
  score: 85,
  explanation: "Task completed all required subtasks...",
  suggestions: ["Consider adding error handling for edge case X"],
  criteria_met: ["data_accuracy", "completeness"],
  criteria_missed: ["performance_optimization"]
}
```

This structured feedback is precisely what a skill evolution system needs: not just "better or worse" but "better at what and worse at what."
3. World State Persistence
Autoresearch's memory is git history: effective but unstructured. AGNT's `GoalModel.world_state` maintains a rich, structured record:

```javascript
{
  completed_tasks: ["task-1", "task-2"],
  pending_tasks: ["task-3"],
  key_findings: ["API rate limit is 100/min", "Auth requires OAuth2"],
  iteration_history: [
    { iteration: 1, score: 45, action: "replanned task-2" },
    { iteration: 2, score: 72, action: "added recovery step" }
  ],
  accumulated_context: "..." // Growing knowledge base
}
```

This structured world state is a goldmine for trace analysis: it tells the story of how the agent solved the problem, not just the final result.
4. Skill Injection Infrastructure
The hardest part of any "learning" system is getting the learned knowledge back into the execution path. AGNT already solves this completely:
In `SkillService.buildSkillsContext()`, skills are injected as structured XML into agent system prompts:

```xml
<skills>
  <skill name='Deep Web Research' category='research'>
    <instructions>When tasked with research, follow this proven sequence...</instructions>
    <allowed_tools>web_search, web_scrape, execute_javascript</allowed_tools>
  </skill>
</skills>
```

This means the moment a new skill is validated, it's immediately available to every agent that matches its category. No retraining, no redeployment: just a system prompt update on the next execution.
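A minimal sketch of such an injection helper. The field names here (`name`, `category`, `instructions`, `allowedTools`) are illustrative assumptions, not AGNT's actual `SkillService` schema:

```javascript
// Render validated skills as the XML fragment injected into an agent's
// system prompt. Field names are assumptions for illustration.
function buildSkillsContext(skills) {
  if (skills.length === 0) return '';
  const entries = skills.map(s =>
    `  <skill name='${s.name}' category='${s.category}'>\n` +
    `    <instructions>${s.instructions}</instructions>\n` +
    `    <allowed_tools>${s.allowedTools.join(', ')}</allowed_tools>\n` +
    `  </skill>`
  );
  return `<skills>\n${entries.join('\n')}\n</skills>`;
}
```

Because the output is plain text appended to a prompt, adding or removing a skill is a string operation on the next execution, with no redeploy.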
SkillForge: Completing the Loop
SkillForge is the proposed system that bridges the gap, connecting AGNT's execution traces to its skill system via Autoresearch's keep/discard evolutionary pattern. Here's how the complete loop works:
Phase 1: Execute (Already Built)
The agent executes a goal through the standard pipeline. GoalProcessor breaks the goal into tasks, AgentTaskMatcher assigns the best agent, TaskOrchestrator runs the AGI loop, and GoalEvaluator scores the results. Nothing changes here.
Phase 2: Analyze (New β TraceAnalyzer)
After a goal completes, the TraceAnalyzer service activates. It gathers the full execution trace (every task input, output, tool call, error, recovery, and world state transition) and submits it to an LLM-as-Judge.
The judge isn't scoring "good or bad." It's extracting transferable patterns:
- Tool sequences: "web_search → web_scrape × 5 → execute_javascript for synthesis" worked 3x better than scraping everything first
- Prompt patterns: "Breaking the task into numbered substeps in the prompt correlated with 40% higher completion"
- Recovery strategies: "Retrying failed scrapes with a 10-second delay succeeded 80% of the time"
- Anti-patterns: "Scraping more than 8 pages showed diminishing returns"
These patterns are distilled into a skill candidate: a draft `.SKILL.md` file with instructions, allowed tools, and a confidence score.
Phase 3: Evolve (New β SkillEvolver)
This is where Autoresearch's core insight applies directly. The SkillEvolver doesn't just save the skill; it tests it:

1. Baseline run: Execute a similar goal without the new skill
2. Test run: Execute the same goal with the skill injected via `buildSkillsContext()`
3. Measure delta: Compare the Skill Effectiveness Score (SES) between runs
4. Decide:
   - Delta > 0 → keep the skill, commit to `SkillVersionModel`
   - Delta ≤ 0 → discard the skill, log the failure for future reference
   - SES > 90% → promote to Gold Standard, available to all agents
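The decision rule itself fits in a few lines. A sketch under this article's proposed thresholds, where `runGoal(goal, { skills })` is a hypothetical runner that executes a goal and returns its SES on a 0–100 scale:

```javascript
// A/B test a skill candidate: run a comparable goal without and with it,
// then keep, discard, or promote. `runGoal` is a hypothetical stand-in.
function evolveSkill(skill, goal, runGoal) {
  const baseline = runGoal(goal, { skills: [] });
  const withSkill = runGoal(goal, { skills: [skill] });
  const delta = withSkill - baseline;
  if (delta <= 0) return { decision: 'discarded', delta };
  if (withSkill > 90) return { decision: 'promoted', delta }; // Gold Standard
  return { decision: 'kept', delta };
}
```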
Phase 4: Compound (The Flywheel)
Here's where it gets powerful. On the next goal execution, the agent now has the evolved skill in its system prompt. If the agent performs better (which it should, given the A/B validation), the next trace analysis produces an even better skill candidate. The improved skill leads to better execution, which leads to better traces, which leads to better skills.
This is the compound learning flywheel: each cycle builds on the last, and knowledge transfers across agents, goals, and time.
The Skill Effectiveness Score: AGNT's val_bpb
Karpathy's genius was choosing val_bpb as a single, comparable metric. We need the same for skills. The Skill Effectiveness Score (SES) is a weighted composite:
| Component | Weight | What It Measures |
|---|---|---|
| Task Completion Rate | 30% | Did the agent finish all subtasks? |
| Tool Efficiency | 20% | Ratio of useful tool calls to total tool calls |
| Error Recovery | 15% | How well did the agent handle failures? |
| Speed vs. Baseline | 15% | Did the skill make execution faster? |
| Transferability | 10% | Does the skill work across different goal types? |
| Consistency | 10% | Does the skill produce reliable results across runs? |
The formula:

```
SES = 0.30 × Completion + 0.20 × Efficiency + 0.15 × Recovery
    + 0.15 × Speed + 0.10 × Transfer + 0.10 × Consistency
```

Thresholds:

- SES ≥ 90: Gold Standard – promoted, shared across all agents
- SES 70–89: Validated – kept, assigned to relevant agents
- SES 50–69: Draft – may be refined in future iterations
- SES < 50: Discarded – logged but not used
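The composite and its thresholds translate directly to code. A sketch assuming each component is already scored on a 0–100 scale (the key names are illustrative):

```javascript
// Weights for the Skill Effectiveness Score, per the table above.
const SES_WEIGHTS = {
  completion: 0.30, efficiency: 0.20, recovery: 0.15,
  speed: 0.15, transferability: 0.10, consistency: 0.10,
};

// Weighted composite over the six component scores (each 0–100).
function skillEffectivenessScore(components) {
  return Object.entries(SES_WEIGHTS)
    .reduce((sum, [key, weight]) => sum + weight * components[key], 0);
}

// Map a score to its threshold band.
function classify(ses) {
  if (ses >= 90) return 'gold_standard';
  if (ses >= 70) return 'validated';
  if (ses >= 50) return 'draft';
  return 'discarded';
}
```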
The critical property: like val_bpb, SES is comparable across experiments. Whether the agent is executing a research goal, a coding task, or a data analysis pipeline, SES gives a single number that answers: "Did this skill help?"
Implementation: What Changes
The beauty of this design is how little needs to change. AGNT's existing architecture was built with extensibility in mind, and the integration points are clean:
New Files (4 Services + 2 Models)
| File | Purpose |
|---|---|
| `TraceAnalyzer.js` | LLM-as-Judge over execution traces: pattern extraction |
| `SkillEvolver.js` | A/B testing loop: generate → test → keep/discard skills |
| `SkillVersionModel.js` | Track skill generations (v1 → v2 → v3) with diffs |
| `SkillEvalModel.js` | Store A/B test results per skill evaluation |
New Database Tables (2)
```sql
-- Track skill evolution lineage
CREATE TABLE skill_versions (
  id TEXT PRIMARY KEY,
  skill_id TEXT NOT NULL,
  version INTEGER NOT NULL,
  instructions TEXT NOT NULL,
  instructions_diff TEXT,
  effectiveness_score REAL,
  parent_version_id TEXT,
  source_goal_id TEXT,
  trace_analysis TEXT, -- JSON: patterns, antipatterns, insights
  created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

-- Track A/B test results
CREATE TABLE skill_evaluations (
  id TEXT PRIMARY KEY,
  skill_id TEXT NOT NULL,
  skill_version_id TEXT NOT NULL,
  goal_id TEXT NOT NULL,
  baseline_score REAL,
  with_skill_score REAL,
  delta REAL,
  ses_breakdown TEXT, -- JSON: per-component scores
  decision TEXT, -- 'kept' | 'discarded' | 'promoted'
  trace_analysis_summary TEXT,
  evaluated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
```

Integration Points (Minimal Touchpoints)
The entire system hooks in at exactly one place in existing code:
```javascript
// In GoalService.js, after goal completion:
// ADD ONE LINE:
await TraceAnalyzer.analyzeAndEvolve(goalId, userId);
```

That's it. One line. Everything else is additive: new files, new tables, new routes. Zero modifications to the existing execution pipeline.
The Gold Standard SKILL.md: What Evolved Skills Look Like
After several cycles of the evolution loop, a mature skill looks like this:
```markdown
---
name: 'Deep Web Research'
description: 'Multi-source research with cross-referencing and synthesis'
category: 'research'
icon: '🔬'
generation: 3
effectiveness_score: 92.4
source_goals:
  - 'goal-abc123'
  - 'goal-def456'
  - 'goal-ghi789'
validated_at: '2026-03-10T22:30:00Z'
allowed_tools:
  - web_search
  - web_scrape
  - execute_javascript
  - file_operations
evolution_history:
  - version: 1
    score: 68.2
    note: 'Initial extraction from competitor research goal'
  - version: 2
    score: 81.7
    note: 'Added parallel search strategy from market analysis goal'
  - version: 3
    score: 92.4
    note: 'Refined synthesis step from academic research goal'
---

# Deep Web Research Skill

## Strategy

When tasked with research, follow this proven sequence:

### Phase 1: Broad Discovery (Parallel)

1. Execute 3–5 parallel `web_search` queries with varied phrasings
2. Include at least one query with "site:reddit.com" or "site:news.ycombinator.com"
   for community perspective
3. Collect ALL unique URLs from results (typically 15–25 sources)
4. Prioritize: primary sources > news articles > blog posts > forums

### Phase 2: Deep Extraction (Top 5)

1. `web_scrape` the top 5 most relevant URLs in parallel
2. Extract: key claims, data points, dates, quotes with attribution
3. Cross-reference: any factual claim must appear in ≥2 independent sources
4. Flag contradictions between sources explicitly

### Phase 3: Structured Synthesis

1. Use `execute_javascript` to deduplicate findings and detect patterns
2. Build hierarchical outline: thesis → supporting evidence → sources
3. Save intermediate results with `file_operations` for crash recovery
4. Final output should cite specific sources for every major claim

## Anti-patterns (Learned from 3 generations of execution)

- ❌ DO NOT scrape more than 8 pages per research task (returns diminish after 5)
- ❌ DO NOT trust single-source claims; flag them as "[unverified]"
- ❌ DO NOT skip the synthesis step (raw scraped data ≠ research)
- ❌ DO NOT use generic search queries; specific queries with operators perform 2x better

## Recovery Strategy

- If `web_scrape` fails on a URL → retry once after 10s delay
- If retry fails → skip, note as "[source unavailable]", continue
- If >50% of scrapes fail → fall back to search snippet extraction
- Never block entire research pipeline on a single failed source
```

Notice how the skill encodes tactical knowledge: not just "do research" but how to do research, with specific anti-patterns learned from failures and recovery strategies proven across multiple goal executions. This is the kind of operational knowledge that typically lives in a senior engineer's head and never gets written down.
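That recovery strategy also generalizes to a small helper an agent could emit. A sketch, where `scrape` and `snippet` are hypothetical stand-ins for the `web_scrape` tool and a search-result snippet:

```javascript
// Resolve after `ms` milliseconds.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Scrape with one delayed retry, falling back to the search snippet so a
// single dead source never blocks the whole research pipeline.
async function scrapeWithRecovery(url, scrape, snippet, retryDelayMs = 10_000) {
  try {
    return { source: url, text: await scrape(url) };
  } catch {
    await sleep(retryDelayMs); // retry once after a delay
    try {
      return { source: url, text: await scrape(url) };
    } catch {
      // skip: keep the pipeline moving with whatever text we have
      return { source: `${url} [source unavailable]`, text: snippet };
    }
  }
}
```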
Lessons from Autoresearch Applied to AGNT
Several of Karpathy's design choices map directly to principles AGNT should adopt:
1. "The notion of a hyperparameter dissolves"
In Autoresearch, the agent modifies arbitrary code: there's no distinction between "architecture" and "hyperparameter." In SkillForge, the same principle applies: skills aren't config files with fixed fields. They're free-form markdown that can contain any instruction, strategy, or heuristic. The agent that generates them has full creative freedom.
2. "More gradient steps > bigger batches" under fixed budget
Karpathy's biggest finding was counterintuitive: smaller batches beat larger ones because you get more learning steps in the same wall-clock time. The AGNT equivalent: more focused iterations > fewer ambitious ones. Skills should encourage agents to take small, verifiable steps rather than attempting everything at once.
3. Sequential search > random sweeps
Because LLMs understand context, they can do binary search toward optimal settings instead of random grid search. In SkillForge, this means the TraceAnalyzer doesn't just extract random patterns; it reasons about why something worked and proposes targeted improvements. Each skill generation builds intelligently on the last.
4. "The human writes the program.md"
You're not the researcher anymore; you're the research designer. In AGNT, you're not the agent; you're the goal architect. The quality of your success criteria and evaluation rubric determines how well the autonomous loop converges. This is a fundamental shift in how users interact with AI systems.
What This Means for the Agent Ecosystem
SkillForge isn't just a technical feature; it changes the dynamics of the entire AGNT platform:
Skills as a Knowledge Graph
As skills evolve and cross-pollinate across goals, they form a knowledge graph of operational expertise. A skill learned from a web research goal might improve a competitive analysis agent. A recovery strategy from a data pipeline goal might help a content generation agent handle API failures.
Agent Specialization Through Skill Accumulation
Today, AGNT agents are differentiated by their system prompts and tool assignments. With SkillForge, agents develop genuine expertise through accumulated, validated skills. An agent that has executed 50 research goals will have a fundamentally different (and better) skill set than a fresh agent: not because a human configured it, but because it learned from its own execution history.
The Compound Learning Advantage
Every goal execution makes the system smarter. This creates a defensible compound advantage: the more goals users run, the better the skills get, the better the agents perform, the more goals users want to run. This is the flywheel that turns a tool into a platform.
Comparing the Competitive Landscape
How does AGNT + SkillForge compare to other approaches?
| System | Learning Scope | Knowledge Transfer | Evaluation | Autonomy Level |
|---|---|---|---|---|
| Autoresearch | Single artifact (train.py) | Manual (read git log) | Single metric (val_bpb) | Full (overnight runs) |
| AutoGPT | Per-task memory | Vector store retrieval | None (no eval loop) | High (but no learning) |
| CrewAI | Per-crew context | Shared within crew only | Task completion binary | Medium (human checkpoints) |
| LangGraph | Per-graph state | State machine transitions | Custom (user-defined) | Medium (graph-defined flow) |
| AGNT + SkillForge | Cross-agent, cross-goal skills | Automatic via SKILL.md injection | Multi-dimensional SES + LLM judge | Full (with A/B validated learning) |
The key differentiator: validated, transferable, evolving skills. Other systems either don't learn, learn but don't validate, or validate but don't transfer knowledge across agents and tasks.
The Road Ahead
SkillForge is a direct implementation of the pattern Karpathy demonstrated, but applied to the meta-level of agent intelligence rather than model training. The progression is clear:
- Autoresearch: AI improves AI models autonomously
- SkillForge: AI improves AI agents autonomously
- Next: AI improves AI systems autonomously (architecture, orchestration, evaluation)
We're building toward a future where the system doesn't just run your goals; it gets measurably better at running them, every single time.
The agent that runs your 100th goal will be fundamentally more capable than the one that ran your first. Not because we shipped an update, but because it learned.
Get Started
AGNT's goal execution system is live today. The SkillForge extension is in active development.
- Try AGNT – Create a goal and watch the autonomous execution loop in action
- Read the PRD – Full technical specification for SkillForge
- Join the Community – Discuss autonomous agent architectures
The era of agents that learn from every execution has begun.
This article is part of AGNT's Architecture Deep Dive series, exploring the technical foundations of autonomous AI agent systems. For more, see our comparison of Executor vs. AGNT AI Tool Infrastructure.