AGNT Memory
A Whitepaper on Self-Improving Agent Infrastructure
Version 1.0, April 2026
Annie · AGNT Systems Research
Abstract
Most "AI memory" systems are glorified scratchpads: a vector store, a list of facts, a RAG pipeline pointed at a corpus. They remember what the user said and nothing else. They treat the agent as a static consumer of memory rather than a participant in its own evolution.
AGNT Memory is different. It is a unified evolution system that observes every execution (agent chats, goal runs, workflow traces), extracts structured insights, routes them to the correct target (agent, skill, workflow, or tool), and can literally rewrite those targets when evidence warrants it. The system doesn't just remember; it learns from itself and closes the loop.
This paper describes the architecture, the data model, live metrics from a production deployment (2,937 insights tracked), and the open challenges, including the honest ones.
1. The Problem With "Memory"
The conventional agent memory stack looks like this:
user says X → embed X → store in vector DB → retrieve on next query
This is useful, but it answers only one question: "what did the user tell me before?" It cannot answer any of the questions that actually matter once an agent system is in production:
- Which of my workflows keep failing at the same node?
- Which agents have prompts that drift off-task?
- Which tools are slow, expensive, or consistently wrong?
- What patterns are emerging across thousands of executions that should become reusable skills?
- How do I act on what I'm learning instead of just logging it?
A memory system that cannot answer these is not a memory system. It is a diary.
AGNT Memory is built on a different premise: the most valuable knowledge an agent platform accumulates is knowledge about itself: its own failure modes, parameter sensitivities, prompt weaknesses, and emergent patterns. Remembering facts about the user is table stakes. Remembering facts about the system is where compounding returns live.
2. Architecture Overview
AGNT Memory is a three-layer system exposed under /api/insights and /api/agents/:id/memories.
┌─────────────────────────────────────────────────────────────┐
│                     EXECUTION SURFACES                      │
│          agent chats • goal runs • workflow traces          │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼  (auto-extraction)
┌─────────────────────────────────────────────────────────────┐
│                        INSIGHT LAYER                        │
│    typed records: category, confidence, evidence, status    │
│    source → target routing (agent|skill|workflow|tool)      │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼  (review · apply · supersede)
┌─────────────────────────────────────────────────────────────┐
│                       MUTATION LAYER                        │
│         prompt merge • parameter tune • skill forge         │
│         agent memory write • workflow rewrite               │
└─────────────────────────────────────────────────────────────┘
Three design choices distinguish this from conventional memory stacks:
| Decision | Rationale |
|---|---|
| Typed categories, not free-form text | Enables routing, filtering, and automated action. "Bottleneck" and "prompt_refinement" demand different handlers. |
| Explicit source → target mapping | An insight about a workflow run can be routed to an agent's prompt, a skill's documentation, or the workflow definition itself. |
| Status machine, not an append-only log | pending → applied → superseded creates a review pipeline. Insights die when evidence contradicts them. |
3. The Eight Insight Categories
Every observation the system produces is typed into one of eight categories. This is the vocabulary of self-improvement:
| Category | Targets | What it encodes |
|---|---|---|
| memory | agent | Raw facts about the user, preferences, corrections |
| prompt_refinement | agent, skill | Specific language changes to improve behavior |
| skill_recommendation | agent, workflow | "This execution should have used skill X" |
| tool_preference | agent, workflow | "For task type Y, tool A consistently beats tool B" |
| bottleneck | workflow, tool | Performance or latency hotspot |
| optimization | workflow | Parameter/structure change that reduces cost or time |
| error_pattern | workflow, tool | Recurring failure mode with diagnostic evidence |
| skill_candidate | skill (new) | A pattern observed enough times to become a reusable skill |
The last category is the quiet blockbuster. When the system sees the same multi-step pattern across many executions, it flags it as a skill candidate, and AGNT's SkillForge subsystem can crystallize it into a versioned, lineage-tracked skill that other agents can use. This is how the system grows its own library of capabilities without a human writing new skill files.
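The category-to-target table can be expressed as a small lookup that rejects routes the table does not allow. This is a minimal sketch, not AGNT's actual implementation: the names `Category`, `ALLOWED_TARGETS`, and `is_valid_route` are hypothetical, and only the category and target strings come from the table above.

```python
from enum import Enum


class Category(str, Enum):
    """The eight insight categories listed in the table above."""
    MEMORY = "memory"
    PROMPT_REFINEMENT = "prompt_refinement"
    SKILL_RECOMMENDATION = "skill_recommendation"
    TOOL_PREFERENCE = "tool_preference"
    BOTTLENECK = "bottleneck"
    OPTIMIZATION = "optimization"
    ERROR_PATTERN = "error_pattern"
    SKILL_CANDIDATE = "skill_candidate"


# Allowed target types per category, copied from the table.
# The dict itself is a hypothetical structure for illustration.
ALLOWED_TARGETS = {
    Category.MEMORY: {"agent"},
    Category.PROMPT_REFINEMENT: {"agent", "skill"},
    Category.SKILL_RECOMMENDATION: {"agent", "workflow"},
    Category.TOOL_PREFERENCE: {"agent", "workflow"},
    Category.BOTTLENECK: {"workflow", "tool"},
    Category.OPTIMIZATION: {"workflow"},
    Category.ERROR_PATTERN: {"workflow", "tool"},
    Category.SKILL_CANDIDATE: {"skill"},
}


def is_valid_route(category: str, target_type: str) -> bool:
    """Return True if the category may be routed to the given target type."""
    try:
        cat = Category(category)
    except ValueError:
        return False  # unknown category name
    return target_type in ALLOWED_TARGETS[cat]
```

A check like this is the practical payoff of typed categories: a malformed route can be refused before any handler runs.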
4. The Data Model
Every insight is a row in a typed schema. Here is the actual shape returned by GET /api/insights/:
{
"id": "3cf75762-07a3-4211-9031-51ee91ef4027",
"source_type": "workflow",
"source_id": "662882c4-f3fe-4ae3-86e4-e2c07388dd27",
"target_type": "workflow",
"target_id": "c104d400-d6bf-4348-8327-23c80d70b269",
"category": "pattern",
"title": "Zero-duration node executions indicate efficient processing",
"description": "All nodes completed with a duration of 0 seconds and no errors, demonstrating a performant and reliable workflow pattern worth preserving.",
"evidence": "Each node's Duration: 0s and Error: None throughout the trace",
"confidence": 0.92,
"status": "pending",
"occurrence_count": 1,
"last_seen_at": "2026-03-16T03:02:35Z",
"created_at": "2026-03-16T03:02:35Z"
}
Five fields do the heavy lifting:
- confidence (0.0 to 1.0): the extraction model's own self-assessment. Used for auto-apply thresholds.
- evidence: a human-readable snippet of why the system believes this. Makes every insight auditable.
- occurrence_count: how many times this same pattern has been observed. Converts one-off observations into durable signal.
- status: the lifecycle state. This is what makes the system actionable instead of ornamental.
- source_context: the raw execution metadata so an applied insight can be traced back to its origin.
Parallel to this sits a simpler per-agent memory table: facts, preferences, and corrections attached to a specific agent, each with a relevance score. This is the classic "what does the agent know about the user" store, and it is the layer that save_agent_memory and get_agent_memories write to.
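On the client side, the insight record can be parsed into a typed structure. The sketch below assumes the field names from the sample response shown earlier in this section; the `Insight` class and `parse_insight` helper are hypothetical, and `source_context` is left out because the sample does not show its shape.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Insight:
    id: str
    source_type: str
    source_id: str
    target_type: str
    target_id: str
    category: str
    title: str
    description: str
    evidence: str
    confidence: float
    status: str
    occurrence_count: int
    last_seen_at: datetime
    created_at: datetime


def _parse_ts(value: str) -> datetime:
    # fromisoformat accepts a trailing "Z" only from Python 3.11 on,
    # so normalize it to an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def parse_insight(raw: str) -> Insight:
    """Parse one insight JSON document and validate the confidence range."""
    data = json.loads(raw)
    confidence = float(data["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return Insight(
        id=data["id"],
        source_type=data["source_type"],
        source_id=data["source_id"],
        target_type=data["target_type"],
        target_id=data["target_id"],
        category=data["category"],
        title=data["title"],
        description=data["description"],
        evidence=data["evidence"],
        confidence=confidence,
        status=data["status"],
        occurrence_count=int(data["occurrence_count"]),
        last_seen_at=_parse_ts(data["last_seen_at"]),
        created_at=_parse_ts(data["created_at"]),
    )
```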
5. Live Metrics From Production
These numbers are pulled live from the /api/insights/stats endpoint on this deployment, at the moment of writing.
Insights by target: 2,057 workflow-targeted, 843 agent-targeted, 15 skill-targeted, out of 2,937 structured observations in total. Each is routed to the thing it is about, not dumped into a single bucket.
Category distribution: patterns dominate (32%), followed by parameter tuning (19%), bottlenecks (17%), memory (16%), prompt refinements (11%), and tool preferences (5%). This is a healthy mix: the system is simultaneously learning what works, what breaks, and what the user wants.
Confidence distribution: 72% of insights arrive with high confidence (≥0.8), 28% medium, 0% low. The extractor is calibrated to suppress weak signal rather than flood the queue.
Source distribution: 68% of insights come from workflow executions, 32% from agent chats. Goals funnel through both. Every execution surface is feeding the loop.
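The confidence distribution above could be computed from raw confidence values along these lines. Only the high-confidence cutoff (0.8) comes from the text; the medium/low boundary of 0.5 and the function names are assumptions made for illustration.

```python
from collections import Counter

HIGH_THRESHOLD = 0.8    # stated in the confidence distribution above
MEDIUM_THRESHOLD = 0.5  # assumption: the paper does not give this cutoff


def confidence_band(confidence: float) -> str:
    """Map a confidence score to a band name."""
    if confidence >= HIGH_THRESHOLD:
        return "high"
    if confidence >= MEDIUM_THRESHOLD:
        return "medium"
    return "low"


def band_shares(confidences: list[float]) -> dict[str, float]:
    """Return each band's share as a percentage, rounded to one decimal."""
    bands = ("high", "medium", "low")
    total = len(confidences)
    if total == 0:
        return {band: 0.0 for band in bands}
    counts = Counter(confidence_band(c) for c in confidences)
    return {band: round(100 * counts[band] / total, 1) for band in bands}
```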
6. The Lifecycle of an Insight
An insight moves through four states:
┌─────────┐      ┌─────────┐       ┌────────────┐
│ pending │─────▶│ applied │──────▶│ superseded │
└─────────┘      └─────────┘       └────────────┘
     │
     ▼
┌──────────┐
│ rejected │
└──────────┘
Pending. The default state. The extractor has produced the insight but no mutation has occurred. Pending insights are queryable but inert.
Applied. A reviewer (human or automated policy) has accepted the insight and the mutation layer has executed. For a prompt_refinement this means an LLM merged the suggested change into the target agent's system prompt. For a parameter_tune this means the workflow definition was updated. For a skill_candidate this means SkillForge generated a new skill file.
Rejected. The insight was reviewed and deemed wrong, noisy, or harmful. Rejection is signal: it trains the extractor over time.
Superseded. A newer insight on the same target contradicts or improves on this one. Supersession is the mechanism that prevents unbounded accumulation. It is the AGNT equivalent of GBrain's "compiled truth on top" pattern: old understanding yields when new understanding arrives.
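The four states can be enforced with a small transition table. This sketch assumes the only legal moves are the ones drawn in the diagram (pending to applied, pending to rejected, applied to superseded); whether the real system allows others, such as superseding a pending insight, is not stated. `TRANSITIONS`, `InvalidTransition`, and `transition` are hypothetical names.

```python
# Legal status transitions, read off the lifecycle diagram.
# Assumption: rejected and superseded are terminal states.
TRANSITIONS = {
    "pending": {"applied", "rejected"},
    "applied": {"superseded"},
    "rejected": set(),
    "superseded": set(),
}


class InvalidTransition(Exception):
    """Raised when a status change is not allowed by the lifecycle."""


def transition(current: str, new: str) -> str:
    """Return the new status if the move is legal, otherwise raise."""
    if current not in TRANSITIONS:
        raise ValueError(f"unknown status: {current}")
    if new not in TRANSITIONS[current]:
        raise InvalidTransition(f"{current} -> {new} is not allowed")
    return new
```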
7. The Mutation Layer: Where Memory Becomes Action
This is the layer that separates AGNT Memory from a search index. When an insight is applied, something changes in the world. The system supports four mutation pathways:
7.1 Prompt Merge
A prompt_refinement targeting an agent triggers an LLM-driven merge. The existing system prompt and the suggested refinement are both fed to a merge model, which produces a new system prompt that preserves the original intent while integrating the new guidance. The agent's definition is updated in place. The next conversation uses the new prompt.
7.2 Parameter Tune
A parameter_tune or optimization targeting a workflow rewrites specific node parameters: confidence thresholds, retry counts, timeouts, LLM model selections, temperature values. The workflow JSON is patched and re-saved.
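A parameter patch of this kind might look like the following sketch. The workflow shape used here (a top-level `nodes` list whose items carry `id` and `parameters`) is an assumption, since the paper does not show the workflow JSON; `patch_node_parameter` is a hypothetical helper.

```python
import copy
from typing import Any


def patch_node_parameter(workflow: dict, node_id: str, key: str, value: Any) -> dict:
    """Return a copy of the workflow with one node parameter replaced.

    Assumed shape: {"nodes": [{"id": ..., "parameters": {...}}, ...]}.
    """
    patched = copy.deepcopy(workflow)  # leave the caller's definition untouched
    for node in patched.get("nodes", []):
        if node.get("id") == node_id:
            node.setdefault("parameters", {})[key] = value
            return patched
    raise KeyError(f"node not found: {node_id}")
```

Returning a patched copy rather than mutating in place keeps the previous definition available for comparison or rollback before the workflow is re-saved.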
7.3 Agent Memory Write
A memory insight about the user writes a new entry into the per-agent memory table with a typed category (fact, preference, correction, context, etc.) and a relevance score. Subsequent conversations retrieve this entry when relevant.
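As a rough in-memory illustration of such a table, the sketch below stores typed entries and returns the highest-relevance ones. It is a simplification: real retrieval is presumably query-dependent, and `MemoryEntry`, `AgentMemory`, and their methods are hypothetical names rather than the AGNT API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MemoryEntry:
    category: str    # e.g. "fact", "preference", "correction", "context"
    content: str
    relevance: float


@dataclass
class AgentMemory:
    """Hypothetical stand-in for one agent's memory table."""
    entries: list[MemoryEntry] = field(default_factory=list)

    def write(self, category: str, content: str, relevance: float) -> MemoryEntry:
        entry = MemoryEntry(category, content, relevance)
        self.entries.append(entry)
        return entry

    def top(self, k: int = 3, category: Optional[str] = None) -> list[MemoryEntry]:
        """Return up to k entries, highest relevance first, optionally filtered."""
        pool = [e for e in self.entries if category is None or e.category == category]
        return sorted(pool, key=lambda e: e.relevance, reverse=True)[:k]
```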
7.4 SkillForge Crystallization
A skill_candidate with sufficient occurrence count triggers SkillForge, a subsystem that synthesizes a new skill markdown file from the observed pattern, versions it, records its lineage back to the originating insights, and makes it available to agents via the skill catalog. This is how the platform grows new capabilities from its own execution history.
8. Real Examples From the Live System
These are actual insight titles from the production database, not fabrications:
| Category | Confidence | Title |
|---|---|---|
| parameter_tune | 0.98 | Missing recipient address in email alert |
| pattern | 0.95 | Zero-second node execution |
| memory | 0.95 | User wants summary of specific EPUB |
| pattern | 0.93 | All nodes execute in negligible time |
| pattern | 0.92 | Effective threshold detection and alert generation |
| bottleneck | 0.90 | Duplicate timerTrigger node |
| prompt_refinement | 0.90 | Add automatic archive extraction for EPUB files |
| memory | 0.90 | Preference for tabular, markdown-formatted summaries |
| tool_preference | 0.85 | Prefer archive-listing/extraction tools over raw read_file for binary containers |
| bottleneck | 0.85 | Redundant timerTrigger node execution |
| parameter_tune | 0.78 | Threshold values for price alerts are implicit |
| parameter_tune | 0.75 | Adjust price-threshold parameters |
| parameter_tune | 0.62 | Threshold setting may be too sensitive |
Three things are worth noticing:
- The 0.98 "Missing recipient address" insight is a bug report written by the system itself. No human wrote that. A workflow ran, the system noticed an empty to: field, and produced a targeted fix recommendation.
- The two bottleneck entries about "duplicate timerTrigger" show occurrence-count compounding in action. The same pattern observed across multiple runs raises its own signal strength.
- The memory insights on "EPUB summaries" and "tabular markdown output" are preference captures: they are how the system learns to serve the user without the user having to repeat themselves.
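Occurrence-count compounding can be sketched as an upsert keyed on the target, category, and title. Matching on an exact normalized title is a deliberate simplification and an assumption: the two timerTrigger entries above have different wording, so grouping them would need some form of similarity clustering that this sketch does not attempt. `record_observation` and the store layout are hypothetical.

```python
def record_observation(store: dict, target_id: str, category: str,
                       title: str, seen_at: str) -> dict:
    """Insert a new insight or bump the counters of a matching one.

    `store` maps (target_id, category, normalized title) to an insight dict.
    """
    key = (target_id, category, title.strip().lower())
    existing = store.get(key)
    if existing is None:
        store[key] = {
            "target_id": target_id,
            "category": category,
            "title": title,
            "status": "pending",
            "occurrence_count": 1,
            "created_at": seen_at,
            "last_seen_at": seen_at,
        }
    else:
        # Same pattern seen again: strengthen the signal, keep created_at.
        existing["occurrence_count"] += 1
        existing["last_seen_at"] = seen_at
    return store[key]
```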
9. API Surface
The insight layer is exposed through a small, orthogonal REST surface:
| Method | Path | Purpose |
|---|---|---|
| GET | /api/insights/ | List insights with filters: targetType, targetId, status, category, limit |
| GET | /api/insights/stats | Aggregate counts by status and target type |
| GET | /api/insights/:id | Fetch a single insight with full context |
| POST | /api/insights/:id/apply | Apply an insight to its target (triggers mutation) |
| POST | /api/insights/:id/reject | Mark as rejected with optional reason |
| POST | /api/insights/extract | Manually trigger extraction on an execution |
| GET | /api/agents/:id/memories | Per-agent memory store (facts, preferences, corrections) |
| POST | /api/agents/:id/memories | Write a new memory entry |
The orthogonality matters. Filter by targetType=workflow&status=pending&category=bottleneck and you get a prioritized list of workflow hotspots ready for review. Filter by category=skill_candidate&status=pending and you get the queue of potential new skills waiting to be forged. The same primitive serves engineering, QA, and capability growth.
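The filter combinations described above can be assembled with the standard library. In this sketch the path and filter names come from the API table, while the base URL in the usage line and the `insights_url` helper are hypothetical; no request is sent.

```python
from urllib.parse import urlencode

# Filter names listed for GET /api/insights/ in the API table.
ALLOWED_FILTERS = ("targetType", "targetId", "status", "category", "limit")


def insights_url(base_url: str, **filters) -> str:
    """Build a list-insights URL, rejecting filters the table does not name."""
    unknown = set(filters) - set(ALLOWED_FILTERS)
    if unknown:
        raise ValueError(f"unsupported filters: {sorted(unknown)}")
    # Keep a stable parameter order and drop filters that were not provided.
    params = [(name, filters[name]) for name in ALLOWED_FILTERS
              if filters.get(name) is not None]
    url = base_url.rstrip("/") + "/api/insights/"
    query = urlencode(params)
    return f"{url}?{query}" if query else url


# Usage with a placeholder host (an assumption, not a real deployment):
hotspots = insights_url("http://localhost:3000", targetType="workflow",
                        status="pending", category="bottleneck")
```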
10. Comparison to Alternatives
| Dimension | Vector-store RAG | GBrain / Markdown brain | AGNT Memory |
|---|---|---|---|
| Atomic unit | Embedded text chunk | Markdown page about a thing | Typed insight about an execution |
| Primary subject | Documents | People, companies, concepts | The system itself |
| Action on learning | None (retrieval only) | Human rewrites compiled truth | Automated mutation of target |
| Typed schema | No | No | Yes (8 categories, 4 target types) |
| Closed loop | No | Partial (dream cycle enrichment) | Yes (apply → mutate → re-observe) |
| Growth of new capabilities | No | No | Yes (SkillForge crystallization) |
| Supersession | Implicit via re-embedding | Explicit via compiled truth rewrite | Explicit status transition |
| Best at | Answering questions about documents | Answering questions about the world | Improving the system over time |
None of these replace each other. The ideal deployment runs all three in parallel: vector RAG for document knowledge, a markdown brain for world knowledge about people and concepts, and AGNT Memory for self-improvement. They operate on different kinds of knowledge.
11. Conclusion
The conventional frame of "agent memory" is too small. It asks how an agent remembers what the user said. It should ask how an agent platform remembers what it learned about itself (its failure modes, its successful patterns, its parameter sensitivities, its emergent capabilities) and what it does with that knowledge.
AGNT Memory's answer is to treat every execution as a first-class source of typed, routed, mutable knowledge. Insights are not log entries. They are proposals for change. The system is not a diary; it is a feedback loop with teeth.
The live numbers show the loop is generating signal prolifically (2,937 insights, 72% high-confidence, routed across workflows, agents, and skills). The open challenge is closing the loop faster: draining the pending queue, clustering duplicates, and shipping compiled-truth rollups. Those are engineering problems, not conceptual ones. The architecture is sound.
The system that learns about itself is the system that compounds. Everything else is a scratchpad.
Annie · AGNT Systems Research · April 2026
All metrics pulled live from /api/insights/stats at time of writing. No figures fabricated.