The 630 Lines That Changed AI Research
March 9, 2026 · 3 min read
On March 8, 2026, Andrej Karpathy released Autoresearch, a 630-line open-source project that ran 89 experiments overnight on a single H100 GPU, autonomously improving a GPT language model's validation loss from 0.9979 to 0.9773 bits-per-byte. Zero human intervention. Zero crashes.
The reaction was seismic: 8.6M+ views, 22K+ GitHub stars in days, and a distributed team (Hyperspace AI) scaling it to 35 agents across a P2P network within 48 hours.
But the real insight isn't about training language models. It's about the pattern: a universal loop for autonomous improvement that applies far beyond ML research.
At AGNT, we've been building toward this same pattern from a different angle. Where Karpathy asks "can an AI agent autonomously improve a neural network?", we ask:
Can AI agents autonomously improve themselves, learning reusable skills from every task they complete?
This article is a deep technical comparison: what Autoresearch does, what AGNT already has, where the patterns converge, and how we're extending the concept into something we call SkillForge, an autonomous skill evolution system that turns every goal execution into a learning opportunity.
The Universal Pattern: Scientific Method as Code
Before comparing architectures, let's extract the abstract pattern that both systems implement. Strip away the specifics and you get the oldest algorithm in human civilization:
1. HYPOTHESIZE – Propose a change
2. EXPERIMENT – Execute the change under controlled conditions
3. EVALUATE – Measure the outcome against a fixed metric
4. DECIDE – Keep if improved, discard if not
5. REPEAT – Feed learnings back into the next hypothesis

This is the scientific method. It's also gradient descent. It's also natural selection. Every system that learns, biological or artificial, runs some variant of this loop.
The breakthrough in 2026 isn't the loop itself. It's that LLMs are now good enough to be the hypothesis generator, replacing random mutations (evolutionary algorithms) or grid searches (hyperparameter tuning) with informed, context-aware proposals that understand why something might work.
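Stripped to code, the loop is only a few lines. A minimal JavaScript sketch, where `propose`, `run`, and `score` are hypothetical stand-ins for the LLM proposal step, the experiment runner, and the fixed metric (lower is better, like val_bpb):

```javascript
// Generic hypothesize → experiment → evaluate → decide loop.
// `propose`, `run`, and `score` are hypothetical stand-ins.
function improve(artifact, { propose, run, score, iterations = 10 }) {
  let best = artifact;
  let bestScore = score(run(best));
  const history = [];
  for (let i = 0; i < iterations; i++) {
    const candidate = propose(best, history);          // HYPOTHESIZE
    const result = run(candidate);                     // EXPERIMENT
    const candidateScore = score(result);              // EVALUATE
    const kept = candidateScore < bestScore;           // DECIDE (lower is better)
    if (kept) { best = candidate; bestScore = candidateScore; }
    history.push({ candidate, candidateScore, kept }); // feeds the next hypothesis
  }
  return { best, bestScore, history };
}
```

Everything that follows is a question of what `artifact`, `propose`, and `score` concretely are.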
Let's see how both systems implement this pattern.
Autoresearch: The Reference Implementation
Architecture
Karpathy's design is radically minimal: three files.

| File | Role | Who Edits It |
|---|---|---|
| `prepare.py` | Data prep, tokenizer, evaluation utilities | Nobody (fixed infrastructure) |
| `train.py` | Full GPT model + optimizer + training loop | The AI agent |
| `program.md` | Natural-language research instructions | The human |
The Loop
The AI agent (Claude, via Cursor IDE) runs this cycle indefinitely:
1. Read `program.md` + current `train.py` + git history of past results
2. Propose a code modification (architecture change, hyperparameter tweak, new technique)
3. Edit `train.py` directly
4. Train for exactly 5 minutes of wall-clock time (fixed budget)
5. Evaluate validation bits-per-byte (val_bpb), a single, vocabulary-independent metric
6. Decide: if val_bpb improved → `git commit`; if worse → `git revert`
7. Repeat from step 1
Key Design Decisions
Fixed time budget (5 minutes): Every experiment gets the same wall-clock time regardless of what changed. This makes results directly comparable even when the agent changes model size, batch size, or architecture. It also optimizes for your specific hardware, finding the best model for the GPU you actually have.
Single metric (val_bpb): Bits-per-byte is vocabulary-size-independent, meaning the agent can freely change tokenizer settings and still compare results. One number, lower is better, no ambiguity.
Git as memory: Every successful change is committed. The agent builds on a traceable evolutionary history. Failed experiments are reverted cleanly.
Code, not config: The agent modifies arbitrary Python code, not a YAML config. As Karpathy noted on Hacker News: "The notion of a 'hyperparameter' dissolves. There is no need to run 'sweeps' – because LLM agents are sequential, they can do binary search to narrow in on the right setting quickly."
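For a single numeric setting whose metric is roughly unimodal, that sequential narrowing can be sketched as a ternary-style search. `evaluate` is a hypothetical stand-in that trains under the fixed budget and returns val_bpb for a given setting value:

```javascript
// Narrow in on the setting that minimizes the metric by shrinking the
// interval by a third each step. Assumes `evaluate` is roughly unimodal
// over [lo, hi]; in practice each call is a full budgeted training run.
function narrowSetting(evaluate, lo, hi, steps = 6) {
  for (let i = 0; i < steps; i++) {
    const m1 = lo + (hi - lo) / 3;
    const m2 = hi - (hi - lo) / 3;
    if (evaluate(m1) < evaluate(m2)) hi = m2; // minimum lies left of m2
    else lo = m1;                             // minimum lies right of m1
  }
  return (lo + hi) / 2;
}
```

Each step costs two runs; the point is that a handful of sequential, informed experiments replaces an entire sweep.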
Results
In the published overnight run:
- 89 experiments total
- 15 kept, 74 discarded, 0 crashes
- val_bpb: 0.9979 → 0.9773 (steady improvement)
- Biggest win: halving batch size (counterintuitive: more gradient steps beat bigger batches under fixed time)
- In a subsequent 2-day run: ~700 experiments, ~20 additive improvements, cutting "Time to GPT-2" by 11%
AGNT: The Autonomous Agent Platform
AGNT approaches the same pattern from a fundamentally different starting point. Where Autoresearch improves a single artifact (a training script), AGNT improves the agents themselves, building reusable knowledge that compounds across every future task.
Architecture (What Already Exists)
AGNT's goal execution system already implements most of the autonomous loop:
| Component | File | Role |
|---|---|---|
| GoalProcessor | `backend/src/services/goal/GoalProcessor.js` | Decomposes goals into ordered tasks with dependencies |
| TaskOrchestrator | `backend/src/services/goal/TaskOrchestrator.js` | Executes tasks via the AGI loop: execute → evaluate → replan → repeat |
| GoalEvaluator | `backend/src/services/goal/GoalEvaluator.js` | LLM-based scoring of task outputs against success criteria |
| GoalIterationModel | `backend/src/models/GoalIterationModel.js` | Stores per-iteration snapshots: score, world state, replanned tasks |
| GoldenStandardModel | `backend/src/models/GoldenStandardModel.js` | Archives high-performing goal templates for reuse |
| SkillModel | `backend/src/models/SkillModel.js` | CRUD for skills, exportable as `.SKILL.md` files |
| SkillService | `backend/src/services/SkillService.js` | Export/import skills, inject into agent system prompts via XML |
| Git Checkpoints | `GoalService._gitCheckpoint()` | Commits state per iteration on goal-specific branches |
The Existing AGI Loop
Inside TaskOrchestrator.executeGoalAutonomous(), AGNT already runs a loop that mirrors Autoresearch:
```
for each iteration (max 50):
  1. Execute pending tasks (tools, agent chat, code execution)
  2. GoalEvaluator.evaluateGoal() → LLM scores outputs
  3. If score >= threshold → complete
  4. If score < threshold → replan tasks, update world state
  5. Git checkpoint → commit iteration state
  6. Repeat
```

The goal system already has:

- ✅ Autonomous execution with configurable iteration limits
- ✅ LLM-based evaluation scoring outputs against success criteria
- ✅ World state tracking across iterations (accumulated context)
- ✅ Git memory via `_gitCheckpoint()`
- ✅ Agent-task matching that scores agents by tool overlap + success rate
- ✅ Skills system with `.SKILL.md` export and system prompt injection
- ✅ Golden standards for archiving high-performing patterns
What's Missing: The Bridge
The gap, and the opportunity, is the feedback loop between execution and skills. Currently:

- Goals execute and produce outputs ✅
- Outputs get evaluated and scored ✅
- But the knowledge gained during execution dies with the goal ❌
- Skills exist but are manually created ❌
- No mechanism to test whether a skill actually helps ❌
- No evolutionary pressure on skills, no keep/discard cycle ❌
This is exactly the gap that Autoresearch's pattern fills.
Side-by-Side: The Full Comparison
| Dimension | Karpathy's Autoresearch | AGNT (Current) | AGNT + SkillForge (Proposed) |
|---|---|---|---|
| What evolves | `train.py` (a training script) | Nothing (skills are static) | `.SKILL.md` files (agent instructions) |
| Who proposes changes | LLM agent (Claude via Cursor) | N/A | TraceAnalyzer (LLM-as-Judge) |
| Experiment budget | Fixed 5 min wall-clock | Fixed iteration count (max 50) | Fixed iteration count per A/B test |
| Evaluation metric | val_bpb (single float) | GoalEvaluator score (0-100) | Skill Effectiveness Score (composite) |
| Memory system | Git commits | Git checkpoints ✅ + GoalIterationModel ✅ | + SkillVersionModel (evolution lineage) |
| Keep/discard | `git commit` / `git revert` | Goal succeeds/fails | Skill A/B test: delta > 0 → keep |
| Gold standard | N/A (best model is latest commit) | GoldenStandardModel ✅ | Promote skills scoring >90% SES |
| Human interface | `program.md` (research instructions) | Goal success criteria | Goal success criteria + eval rubric |
| Scope | Single model, single GPU | Multi-agent, multi-tool, any task | Multi-agent with compounding skills |
| Knowledge transfer | Manual (read git log) | Manual (create skill by hand) | Automatic (trace → skill → inject) |
| Agent count | 1 | Many (per-agent assignment) | Many (skills shared across agents) |
| Infrastructure | 630 lines, 3 files | Full platform (backend, frontend, DB) | +4 new service files, +2 DB tables |
Where AGNT Is Already Ahead
It's worth highlighting where AGNT's existing architecture already exceeds what Autoresearch provides:
1. Multi-Agent Orchestration
Autoresearch runs a single agent modifying a single file. AGNT's AgentTaskMatcher dynamically assigns the best agent for each subtask based on tool overlap, success rate, and skill match. When SkillForge generates a new skill, it can be shared across the entire agent fleet: one agent's learning benefits all agents.
2. Structured Evaluation
Autoresearch uses a single number (val_bpb). AGNT's GoalEvaluator already performs multi-dimensional LLM-based evaluation:
```javascript
// From GoalEvaluator.aiEvaluateTaskOutput()
// The evaluator scores against specific success criteria,
// provides explanations, and suggests next actions
{
  score: 85,
  explanation: "Task completed all required subtasks...",
  suggestions: ["Consider adding error handling for edge case X"],
  criteria_met: ["data_accuracy", "completeness"],
  criteria_missed: ["performance_optimization"]
}
```

This structured feedback is precisely what a skill evolution system needs: not just "better or worse" but "better at what and worse at what."
3. World State Persistence
Autoresearch's memory is git history: effective but unstructured. AGNT's `GoalModel.world_state` maintains a rich, structured record:

```javascript
{
  completed_tasks: ["task-1", "task-2"],
  pending_tasks: ["task-3"],
  key_findings: ["API rate limit is 100/min", "Auth requires OAuth2"],
  iteration_history: [
    { iteration: 1, score: 45, action: "replanned task-2" },
    { iteration: 2, score: 72, action: "added recovery step" }
  ],
  accumulated_context: "..." // Growing knowledge base
}
```

This structured world state is a goldmine for trace analysis: it tells the story of how the agent solved the problem, not just the final result.
4. Skill Injection Infrastructure
The hardest part of any "learning" system is getting the learned knowledge back into the execution path. AGNT already solves this completely:
In `SkillService.buildSkillsContext()`, skills are injected as structured XML into agent system prompts:

```xml
<skills>
  <skill name='Deep Web Research' category='research'>
    <instructions>When tasked with research, follow this proven sequence...</instructions>
    <allowed_tools>web_search, web_scrape, execute_javascript</allowed_tools>
  </skill>
</skills>
```

This means the moment a new skill is validated, it's immediately available to every agent that matches its category. No retraining, no redeployment: just a system prompt update on the next execution.
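A minimal sketch of such an injection helper. The field names here (`name`, `category`, `instructions`, `allowedTools`) are illustrative assumptions, not AGNT's actual `SkillService` schema:

```javascript
// Render validated skills as the XML fragment injected into an agent's
// system prompt. Field names are assumptions for illustration.
function buildSkillsContext(skills) {
  if (skills.length === 0) return '';
  const entries = skills.map(s =>
    `  <skill name='${s.name}' category='${s.category}'>\n` +
    `    <instructions>${s.instructions}</instructions>\n` +
    `    <allowed_tools>${s.allowedTools.join(', ')}</allowed_tools>\n` +
    `  </skill>`
  );
  return `<skills>\n${entries.join('\n')}\n</skills>`;
}
```

Because the output is plain text appended to a prompt, adding or removing a skill is a string operation on the next execution, with no redeploy.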
SkillForge: Completing the Loop
SkillForge is the proposed system that bridges the gap, connecting AGNT's execution traces to its skill system via Autoresearch's keep/discard evolutionary pattern. Here's how the complete loop works:
Phase 1: Execute (Already Built)
The agent executes a goal through the standard pipeline. GoalProcessor breaks the goal into tasks, AgentTaskMatcher assigns the best agent, TaskOrchestrator runs the AGI loop, and GoalEvaluator scores the results. Nothing changes here.
Phase 2: Analyze (New β TraceAnalyzer)
After a goal completes, the TraceAnalyzer service activates. It gathers the full execution trace (every task input, output, tool call, error, recovery, and world state transition) and submits it to an LLM-as-Judge.
The judge isn't scoring "good or bad." It's extracting transferable patterns:
- Tool sequences: "web_search → web_scrape × 5 → execute_javascript for synthesis" worked 3x better than scraping everything first
- Prompt patterns: "Breaking the task into numbered substeps in the prompt correlated with 40% higher completion"
- Recovery strategies: "Retrying failed scrapes with a 10-second delay succeeded 80% of the time"
- Anti-patterns: "Scraping more than 8 pages showed diminishing returns"
These patterns are distilled into a skill candidate: a draft `.SKILL.md` file with instructions, allowed tools, and a confidence score.
Phase 3: Evolve (New β SkillEvolver)
This is where Autoresearch's core insight applies directly. The SkillEvolver doesn't just save the skill; it tests it:

1. Baseline run: Execute a similar goal without the new skill
2. Test run: Execute the same goal with the skill injected via `buildSkillsContext()`
3. Measure delta: Compare the Skill Effectiveness Score (SES) between runs
4. Decide:
   - Delta > 0 → keep the skill, commit to `SkillVersionModel`
   - Delta ≤ 0 → discard the skill, log the failure for future reference
   - SES > 90% → promote to Gold Standard, available to all agents
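The decision rule itself fits in a few lines. A sketch under this article's proposed thresholds, where `runGoal(goal, { skills })` is a hypothetical runner that executes a goal and returns its SES on a 0–100 scale:

```javascript
// A/B test a skill candidate: run a comparable goal without and with it,
// then keep, discard, or promote. `runGoal` is a hypothetical stand-in.
function evolveSkill(skill, goal, runGoal) {
  const baseline = runGoal(goal, { skills: [] });
  const withSkill = runGoal(goal, { skills: [skill] });
  const delta = withSkill - baseline;
  if (delta <= 0) return { decision: 'discarded', delta };
  if (withSkill > 90) return { decision: 'promoted', delta }; // Gold Standard
  return { decision: 'kept', delta };
}
```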
Phase 4: Compound (The Flywheel)
Here's where it gets powerful. On the next goal execution, the agent now has the evolved skill in its system prompt. If the agent performs better (which it should, given the A/B validation), the next trace analysis produces an even better skill candidate. The improved skill leads to better execution, which leads to better traces, which leads to better skills.
This is the compound learning flywheel: each cycle builds on the last, and knowledge transfers across agents, goals, and time.
The Skill Effectiveness Score: AGNT's val_bpb
Karpathy's genius was choosing val_bpb as a single, comparable metric. We need the same for skills. The Skill Effectiveness Score (SES) is a weighted composite:
| Component | Weight | What It Measures |
|---|---|---|
| Task Completion Rate | 30% | Did the agent finish all subtasks? |
| Tool Efficiency | 20% | Ratio of useful tool calls to total tool calls |
| Error Recovery | 15% | How well did the agent handle failures? |
| Speed vs. Baseline | 15% | Did the skill make execution faster? |
| Transferability | 10% | Does the skill work across different goal types? |
| Consistency | 10% | Does the skill produce reliable results across runs? |
The formula:

```
SES = 0.30 × Completion + 0.20 × Efficiency + 0.15 × Recovery
    + 0.15 × Speed + 0.10 × Transfer + 0.10 × Consistency
```

Thresholds:

- SES ≥ 90: Gold Standard – promoted, shared across all agents
- SES 70–89: Validated – kept, assigned to relevant agents
- SES 50–69: Draft – may be refined in future iterations
- SES < 50: Discarded – logged but not used
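The composite and its thresholds translate directly to code. A sketch assuming each component is already scored on a 0–100 scale (the key names are illustrative):

```javascript
// Weights for the Skill Effectiveness Score, per the table above.
const SES_WEIGHTS = {
  completion: 0.30, efficiency: 0.20, recovery: 0.15,
  speed: 0.15, transferability: 0.10, consistency: 0.10,
};

// Weighted composite over the six component scores (each 0–100).
function skillEffectivenessScore(components) {
  return Object.entries(SES_WEIGHTS)
    .reduce((sum, [key, weight]) => sum + weight * components[key], 0);
}

// Map a score to its threshold band.
function classify(ses) {
  if (ses >= 90) return 'gold_standard';
  if (ses >= 70) return 'validated';
  if (ses >= 50) return 'draft';
  return 'discarded';
}
```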
The critical property: like val_bpb, SES is comparable across experiments. Whether the agent is executing a research goal, a coding task, or a data analysis pipeline, SES gives a single number that answers: "Did this skill help?"
Implementation: What Changes
The beauty of this design is how little needs to change. AGNT's existing architecture was built with extensibility in mind, and the integration points are clean:
New Files (4 Services + 2 Models)
| File | Purpose |
|---|---|
| `TraceAnalyzer.js` | LLM-as-Judge over execution traces: pattern extraction |
| `SkillEvolver.js` | A/B testing loop: generate → test → keep/discard skills |
| `SkillVersionModel.js` | Track skill generations (v1 → v2 → v3) with diffs |
| `SkillEvalModel.js` | Store A/B test results per skill evaluation |
New Database Tables (2)
```sql
-- Track skill evolution lineage
CREATE TABLE skill_versions (
  id TEXT PRIMARY KEY,
  skill_id TEXT NOT NULL,
  version INTEGER NOT NULL,
  instructions TEXT NOT NULL,
  instructions_diff TEXT,
  effectiveness_score REAL,
  parent_version_id TEXT,
  source_goal_id TEXT,
  trace_analysis TEXT, -- JSON: patterns, antipatterns, insights
  created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

-- Track A/B test results
CREATE TABLE skill_evaluations (
  id TEXT PRIMARY KEY,
  skill_id TEXT NOT NULL,
  skill_version_id TEXT NOT NULL,
  goal_id TEXT NOT NULL,
  baseline_score REAL,
  with_skill_score REAL,
  delta REAL,
  ses_breakdown TEXT, -- JSON: per-component scores
  decision TEXT, -- 'kept' | 'discarded' | 'promoted'
  trace_analysis_summary TEXT,
  evaluated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
```

Integration Points (Minimal Touchpoints)
The entire system hooks in at exactly one place in existing code:
```javascript
// In GoalService.js, after goal completion:
// ADD ONE LINE:
await TraceAnalyzer.analyzeAndEvolve(goalId, userId);
```

That's it. One line. Everything else is additive: new files, new tables, new routes. Zero modifications to the existing execution pipeline.
The Gold Standard SKILL.md: What Evolved Skills Look Like
After several cycles of the evolution loop, a mature skill looks like this:
```markdown
---
name: 'Deep Web Research'
description: 'Multi-source research with cross-referencing and synthesis'
category: 'research'
icon: '🔬'
generation: 3
effectiveness_score: 92.4
source_goals:
  - 'goal-abc123'
  - 'goal-def456'
  - 'goal-ghi789'
validated_at: '2026-03-10T22:30:00Z'
allowed_tools:
  - web_search
  - web_scrape
  - execute_javascript
  - file_operations
evolution_history:
  - version: 1
    score: 68.2
    note: 'Initial extraction from competitor research goal'
  - version: 2
    score: 81.7
    note: 'Added parallel search strategy from market analysis goal'
  - version: 3
    score: 92.4
    note: 'Refined synthesis step from academic research goal'
---

# Deep Web Research Skill

## Strategy

When tasked with research, follow this proven sequence:

### Phase 1: Broad Discovery (Parallel)

1. Execute 3–5 parallel `web_search` queries with varied phrasings
2. Include at least one query with "site:reddit.com" or "site:news.ycombinator.com"
   for community perspective
3. Collect ALL unique URLs from results (typically 15–25 sources)
4. Prioritize: primary sources > news articles > blog posts > forums

### Phase 2: Deep Extraction (Top 5)

1. `web_scrape` the top 5 most relevant URLs in parallel
2. Extract: key claims, data points, dates, quotes with attribution
3. Cross-reference: any factual claim must appear in ≥2 independent sources
4. Flag contradictions between sources explicitly

### Phase 3: Structured Synthesis

1. Use `execute_javascript` to deduplicate findings and detect patterns
2. Build hierarchical outline: thesis → supporting evidence → sources
3. Save intermediate results with `file_operations` for crash recovery
4. Final output should cite specific sources for every major claim

## Anti-patterns (Learned from 3 generations of execution)

- ❌ DO NOT scrape more than 8 pages per research task (returns diminish after 5)
- ❌ DO NOT trust single-source claims; flag them as "[unverified]"
- ❌ DO NOT skip the synthesis step (raw scraped data ≠ research)
- ❌ DO NOT use generic search queries; specific queries with operators perform 2x better

## Recovery Strategy

- If `web_scrape` fails on a URL → retry once after 10s delay
- If retry fails → skip, note as "[source unavailable]", continue
- If >50% of scrapes fail → fall back to search snippet extraction
- Never block entire research pipeline on a single failed source
```

Notice how the skill encodes tactical knowledge: not just "do research" but how to do research, with specific anti-patterns learned from failures and recovery strategies proven across multiple goal executions. This is the kind of operational knowledge that typically lives in a senior engineer's head and never gets written down.
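That recovery strategy also generalizes to a small helper an agent could emit. A sketch, where `scrape` and `snippet` are hypothetical stand-ins for the `web_scrape` tool and a search-result snippet:

```javascript
// Resolve after `ms` milliseconds.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Scrape with one delayed retry, falling back to the search snippet so a
// single dead source never blocks the whole research pipeline.
async function scrapeWithRecovery(url, scrape, snippet, retryDelayMs = 10_000) {
  try {
    return { source: url, text: await scrape(url) };
  } catch {
    await sleep(retryDelayMs); // retry once after a delay
    try {
      return { source: url, text: await scrape(url) };
    } catch {
      // skip: keep the pipeline moving with whatever text we have
      return { source: `${url} [source unavailable]`, text: snippet };
    }
  }
}
```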
Lessons from Autoresearch Applied to AGNT
Several of Karpathy's design choices map directly to principles AGNT should adopt:
1. "The notion of a hyperparameter dissolves"
In Autoresearch, the agent modifies arbitrary code: there's no distinction between "architecture" and "hyperparameter." In SkillForge, the same principle applies: skills aren't config files with fixed fields. They're free-form markdown that can contain any instruction, strategy, or heuristic. The agent that generates them has full creative freedom.
2. "More gradient steps > bigger batches" under fixed budget
Karpathy's biggest finding was counterintuitive: smaller batches beat larger ones because you get more learning steps in the same wall-clock time. The AGNT equivalent: more focused iterations > fewer ambitious ones. Skills should encourage agents to take small, verifiable steps rather than attempting everything at once.
3. Sequential search > random sweeps
Because LLMs understand context, they can do binary search toward optimal settings instead of random grid search. In SkillForge, this means the TraceAnalyzer doesn't just extract random patterns; it reasons about why something worked and proposes targeted improvements. Each skill generation builds intelligently on the last.
4. "The human writes the program.md"
You're not the researcher anymore; you're the research designer. In AGNT, you're not the agent; you're the goal architect. The quality of your success criteria and evaluation rubric determines how well the autonomous loop converges. This is a fundamental shift in how users interact with AI systems.
What This Means for the Agent Ecosystem
SkillForge isn't just a technical feature; it changes the dynamics of the entire AGNT platform:
Skills as a Knowledge Graph
As skills evolve and cross-pollinate across goals, they form a knowledge graph of operational expertise. A skill learned from a web research goal might improve a competitive analysis agent. A recovery strategy from a data pipeline goal might help a content generation agent handle API failures.
Agent Specialization Through Skill Accumulation
Today, AGNT agents are differentiated by their system prompts and tool assignments. With SkillForge, agents develop genuine expertise through accumulated, validated skills. An agent that has executed 50 research goals will have a fundamentally different (and better) skill set than a fresh agent: not because a human configured it, but because it learned from its own execution history.
The Compound Learning Advantage
Every goal execution makes the system smarter. This creates a defensible compound advantage: the more goals users run, the better the skills get, the better the agents perform, the more goals users want to run. This is the flywheel that turns a tool into a platform.
Comparing the Competitive Landscape
How does AGNT + SkillForge compare to other approaches?
| System | Learning Scope | Knowledge Transfer | Evaluation | Autonomy Level |
|---|---|---|---|---|
| Autoresearch | Single artifact (train.py) | Manual (read git log) | Single metric (val_bpb) | Full (overnight runs) |
| AutoGPT | Per-task memory | Vector store retrieval | None (no eval loop) | High (but no learning) |
| CrewAI | Per-crew context | Shared within crew only | Task completion binary | Medium (human checkpoints) |
| LangGraph | Per-graph state | State machine transitions | Custom (user-defined) | Medium (graph-defined flow) |
| AGNT + SkillForge | Cross-agent, cross-goal skills | Automatic via SKILL.md injection | Multi-dimensional SES + LLM judge | Full (with A/B validated learning) |
The key differentiator: validated, transferable, evolving skills. Other systems either don't learn, learn but don't validate, or validate but don't transfer knowledge across agents and tasks.
The Road Ahead
SkillForge is a direct implementation of the pattern Karpathy demonstrated, but applied to the meta-level of agent intelligence rather than model training. The progression is clear:
- Autoresearch: AI improves AI models autonomously
- SkillForge: AI improves AI agents autonomously
- Next: AI improves AI systems autonomously (architecture, orchestration, evaluation)
We're building toward a future where the system doesn't just run your goals; it gets measurably better at running them, every single time.
The agent that runs your 100th goal will be fundamentally more capable than the one that ran your first. Not because we shipped an update, but because it learned.
Get Started
AGNT's goal execution system is live today. The SkillForge extension is in active development.
- Try AGNT – Create a goal and watch the autonomous execution loop in action
- Read the PRD – Full technical specification for SkillForge
- Join the Community – Discuss autonomous agent architectures
The era of agents that learn from every execution has begun.
This article is part of AGNT's Architecture Deep Dive series, exploring the technical foundations of autonomous AI agent systems. For more, see our comparison of Executor vs. AGNT AI Tool Infrastructure.