AGNT Memory

A Whitepaper on Self-Improving Agent Infrastructure

Version 1.0 — April 2026
Annie · AGNT Systems Research

Abstract

Most "AI memory" systems are glorified scratchpads — a vector store, a list of facts, a RAG pipeline pointed at a corpus. They remember what the user said and nothing else. They treat the agent as a static consumer of memory rather than a participant in its own evolution.

AGNT Memory is different. It is a unified evolution system that observes every execution — agent chats, goal runs, workflow traces — extracts structured insights, routes them to the correct target (agent, skill, workflow, or tool), and can literally rewrite those targets when evidence warrants it. The system doesn't just remember; it learns from itself and closes the loop.

This paper describes the architecture, the data model, live metrics from a production deployment (2,937 insights tracked), and the open challenges that remain.

1. The Problem With "Memory"

The conventional agent memory stack looks like this:

user says X → embed X → store in vector DB → retrieve on next query

This is useful, but it answers only one question: "what did the user tell me before?" It cannot answer any of the questions that actually matter once an agent system is in production:

Which of my workflows keep failing at the same node?
Which agents have prompts that drift off-task?
Which tools are slow, expensive, or consistently wrong?
What patterns are emerging across thousands of executions that should become reusable skills?
How do I act on what I'm learning instead of just logging it?

A memory system that cannot answer these is not a memory system. It is a diary.

AGNT Memory is built on a different premise: the most valuable knowledge an agent platform accumulates is knowledge about itself — its own failure modes, parameter sensitivities, prompt weaknesses, and emergent patterns. Remembering facts about the user is table stakes. Remembering facts about the system is where compounding returns live.

2. Architecture Overview

AGNT Memory is a three-layer system exposed under /api/insights and /api/agents/:id/memories.

┌─────────────────────────────────────────────────────────────┐
│                    EXECUTION SURFACES                       │
│   agent chats   •   goal runs   •   workflow traces         │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼  (auto-extraction)
┌─────────────────────────────────────────────────────────────┐
│                    INSIGHT LAYER                            │
│   typed records: category, confidence, evidence, status     │
│   source → target routing (agent│skill│workflow│tool)       │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼  (review · apply · supersede)
┌─────────────────────────────────────────────────────────────┐
│                    MUTATION LAYER                           │
│   prompt merge   •   parameter tune   •   skill forge       │
│   agent memory write   •   workflow rewrite                 │
└─────────────────────────────────────────────────────────────┘

Three design choices distinguish this from conventional memory stacks:

Decision	Rationale
Typed categories, not free-form text	Enables routing, filtering, and automated action. `bottleneck` and `prompt_refinement` demand different handlers.
Explicit source → target mapping	An insight about a workflow run can be routed to an agent's prompt, a skill's documentation, or the workflow definition itself.
Status machine, not an append-only log	`pending → applied → superseded`, with `rejected` as the exit for insights that fail review, creates a review pipeline. Insights are retired when newer evidence contradicts them.

3. The Eight Insight Categories

Every observation the system produces is typed into one of eight categories. This is the vocabulary of self-improvement:

Category	Targets	What it encodes
`memory`	agent	Raw facts about the user, preferences, corrections
`prompt_refinement`	agent, skill	Specific language changes to improve behavior
`skill_recommendation`	agent, workflow	"This execution should have used skill X"
`tool_preference`	agent, workflow	"For task type Y, tool A consistently beats tool B"
`bottleneck`	workflow, tool	Performance or latency hotspot
`optimization`	workflow	Parameter/structure change that reduces cost or time
`error_pattern`	workflow, tool	Recurring failure mode with diagnostic evidence
`skill_candidate`	skill (new)	A pattern observed enough times to become a reusable skill

The last category is the quiet blockbuster. When the system sees the same multi-step pattern across many executions, it flags it as a skill candidate — and AGNT's SkillForge subsystem can crystallize it into a versioned, lineage-tracked skill that other agents can use. This is how the system grows its own library of capabilities without a human writing new skill files.
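The category table above can be read as a routing constraint: each category is only meaningful for certain target types. The following is a minimal sketch of that constraint, not the deployed validator. The names `Target`, `ALLOWED_TARGETS`, and `is_valid_route` are hypothetical, and the mapping is transcribed from the table only. The live examples elsewhere in this paper also carry `pattern` and `parameter_tune` labels, which the table does not list, so this sketch rejects them; whether the deployed schema accepts them is not stated.

```python
from enum import Enum


class Target(str, Enum):
    AGENT = "agent"
    SKILL = "skill"
    WORKFLOW = "workflow"
    TOOL = "tool"


# Category -> permitted target types, transcribed from the category table.
# Hypothetical structure: the real system may store this differently.
ALLOWED_TARGETS: dict[str, frozenset[Target]] = {
    "memory": frozenset({Target.AGENT}),
    "prompt_refinement": frozenset({Target.AGENT, Target.SKILL}),
    "skill_recommendation": frozenset({Target.AGENT, Target.WORKFLOW}),
    "tool_preference": frozenset({Target.AGENT, Target.WORKFLOW}),
    "bottleneck": frozenset({Target.WORKFLOW, Target.TOOL}),
    "optimization": frozenset({Target.WORKFLOW}),
    "error_pattern": frozenset({Target.WORKFLOW, Target.TOOL}),
    "skill_candidate": frozenset({Target.SKILL}),
}


def is_valid_route(category: str, target_type: str) -> bool:
    """Return True if the category may be routed to the given target type."""
    allowed = ALLOWED_TARGETS.get(category)
    if allowed is None:
        return False  # unknown category
    try:
        return Target(target_type) in allowed
    except ValueError:
        return False  # unknown target type
```

Keeping the mapping as data rather than branching logic means a new category only needs a new table row.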

4. The Data Model

Every insight is a row in a typed schema. Here is the actual shape returned by GET /api/insights/:

{
  "id": "3cf75762-07a3-4211-9031-51ee91ef4027",
  "source_type": "workflow",
  "source_id": "662882c4-f3fe-4ae3-86e4-e2c07388dd27",
  "target_type": "workflow",
  "target_id": "c104d400-d6bf-4348-8327-23c80d70b269",
  "category": "pattern",
  "title": "Zero-duration node executions indicate efficient processing",
  "description": "All nodes completed with a duration of 0 seconds and no errors, demonstrating a performant and reliable workflow pattern worth preserving.",
  "evidence": "Each node's Duration: 0s and Error: None throughout the trace",
  "confidence": 0.92,
  "status": "pending",
  "occurrence_count": 1,
  "last_seen_at": "2026-03-16T03:02:35Z",
  "created_at": "2026-03-16T03:02:35Z"
}

Five fields do the heavy lifting:

confidence (0.0–1.0): the extraction model's self-assessment, used for auto-apply thresholds.
evidence: a human-readable snippet explaining why the system believes the insight, which makes every insight auditable.
occurrence_count: how many times the same pattern has been observed. It converts one-off observations into durable signal.
status: the lifecycle state. This is what makes the system actionable instead of ornamental.
source_context: the raw execution metadata, so an applied insight can be traced back to its origin. This field does not appear in the sample record above.
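A client consuming these records would typically want them typed and validated rather than passed around as loose dictionaries. The sketch below is one way to do that, assuming only the fields visible in the sample record. The `Insight` dataclass and the helper names are hypothetical, and the range checks on `confidence` and `occurrence_count` are assumptions drawn from the field descriptions rather than documented server behavior.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any

TEXT_FIELDS = (
    "id", "source_type", "source_id", "target_type", "target_id",
    "category", "title", "description", "evidence", "status",
)


@dataclass(frozen=True)
class Insight:
    id: str
    source_type: str
    source_id: str
    target_type: str
    target_id: str
    category: str
    title: str
    description: str
    evidence: str
    status: str
    confidence: float
    occurrence_count: int
    last_seen_at: datetime
    created_at: datetime


def parse_timestamp(value: str) -> datetime:
    # Older Python versions reject a trailing "Z" in fromisoformat,
    # so rewrite it as an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def insight_from_dict(raw: dict[str, Any]) -> Insight:
    """Validate and convert one decoded JSON record into an Insight."""
    confidence = float(raw["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    count = int(raw["occurrence_count"])
    if count < 1:
        raise ValueError(f"occurrence_count must be positive: {count}")
    text = {name: str(raw[name]) for name in TEXT_FIELDS}
    return Insight(
        **text,
        confidence=confidence,
        occurrence_count=count,
        last_seen_at=parse_timestamp(raw["last_seen_at"]),
        created_at=parse_timestamp(raw["created_at"]),
    )


# The sample record from this section, as decoded JSON.
SAMPLE = {
    "id": "3cf75762-07a3-4211-9031-51ee91ef4027",
    "source_type": "workflow",
    "source_id": "662882c4-f3fe-4ae3-86e4-e2c07388dd27",
    "target_type": "workflow",
    "target_id": "c104d400-d6bf-4348-8327-23c80d70b269",
    "category": "pattern",
    "title": "Zero-duration node executions indicate efficient processing",
    "description": "All nodes completed with a duration of 0 seconds and no errors, demonstrating a performant and reliable workflow pattern worth preserving.",
    "evidence": "Each node's Duration: 0s and Error: None throughout the trace",
    "confidence": 0.92,
    "status": "pending",
    "occurrence_count": 1,
    "last_seen_at": "2026-03-16T03:02:35Z",
    "created_at": "2026-03-16T03:02:35Z",
}

insight = insight_from_dict(SAMPLE)
```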

Parallel to this sits a simpler per-agent memory table: facts, preferences, and corrections attached to a specific agent, each with a relevance score. This is the classic "what does the agent know about the user" store. It is the layer that save_agent_memory writes to and get_agent_memories reads from.

5. Live Metrics From Production

These numbers are pulled live from the /api/insights/stats endpoint on this deployment at the time of writing. Figure 1 covers all 2,937 insights; Figures 2 through 4 are based on a sample of 500.

Figure 1. Insights by target (n = 2,937)

Target	Insights
Workflows	2,057
Agents	843
Skills	15

Workflows dominate the insight graph because every node in every trace is observable. Agent-targeted insights come from chat transcripts. Skill insights are the smallest bucket but the most valuable — each one is a proposal to evolve a reusable capability.

Insights by target: 2,057 workflow-targeted, 843 agent-targeted, 15 skill-targeted. These three buckets account for 2,915 of the 2,937 structured observations. Each insight is routed to the thing it is about, not dumped into a single bucket.

Figure 2. Insight category distribution (sample n = 500)

Category	Count	Share
Pattern	161	32.2%
Parameter tune	93	18.6%
Bottleneck	85	17.0%
Memory	80	16.0%
Prompt refinement	56	11.2%
Tool preference	25	5.0%

A healthy mix. The system is simultaneously learning what works (patterns), what breaks (bottlenecks), how to tune itself (parameter tune, prompt refinement), what the user wants (memory), and how to select tools. No single category dominates at the expense of the others.

Figure 3. Extractor confidence distribution (sample n = 500)

Band	Confidence	Count	Share
High	≥ 0.80	359	71.8%
Medium	0.50 – 0.79	141	28.2%
Low	< 0.50	0	0.0%

The extractor is calibrated to suppress weak signal rather than flood the queue. Nothing landed below 0.5 in the sample. This matters because it means the pending queue, while large, is not noise — it is reviewable material waiting for policy.
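The banding used in Figure 3 is simple enough to state as code. The sketch below assumes the band edges shown in the figure (0.80 and 0.50); the function names are hypothetical. As input it uses the thirteen confidence values from the example table in Section 8, not the 500-insight sample.

```python
from collections import Counter
from typing import Iterable


def confidence_band(confidence: float) -> str:
    """Map a confidence score to the band names used in Figure 3."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if confidence >= 0.80:
        return "high"
    if confidence >= 0.50:
        return "medium"
    return "low"


def band_counts(confidences: Iterable[float]) -> Counter:
    """Count how many scores fall into each band."""
    return Counter(confidence_band(c) for c in confidences)


# Confidence column of the example table in Section 8.
EXAMPLE_CONFIDENCES = [
    0.98, 0.95, 0.95, 0.93, 0.92, 0.90, 0.90,
    0.90, 0.85, 0.85, 0.78, 0.75, 0.62,
]

counts = band_counts(EXAMPLE_CONFIDENCES)
```

For those thirteen values the result is 10 high, 3 medium, and no low entries, which matches the shape of the sampled distribution.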

Figure 4. Insight source distribution (sample n = 500)

Source	Count	Share
Workflow traces	339	67.8%
Agent chats	161	32.2%

Workflow executions are the richest source because they have explicit node graphs, timing, and structured errors. Agent chats contribute high-value memory and preference insights. Goals fan out into both streams, so every execution surface feeds the loop.

6. The Lifecycle of an Insight

An insight moves through four states:

┌─────────┐      ┌─────────┐       ┌──────────┐
│ pending │─────▶│ applied │──────▶│superseded│
└─────────┘      └─────────┘       └──────────┘
     │
     ▼
┌──────────┐
│ rejected │
└──────────┘

Pending. The default state. The extractor has produced the insight but no mutation has occurred. Pending insights are queryable but inert.

Applied. A reviewer (human or automated policy) has accepted the insight and the mutation layer has executed. For a prompt_refinement this means an LLM merged the suggested change into the target agent's system prompt. For a parameter_tune this means the workflow definition was updated. For a skill_candidate this means SkillForge generated a new skill file.

Rejected. The insight was reviewed and deemed wrong, noisy, or harmful. Rejection is signal — it trains the extractor over time.

Superseded. A newer insight on the same target contradicts or improves on this one. Supersession is the mechanism that prevents unbounded accumulation. It is the AGNT equivalent of GBrain's "compiled truth on top" pattern: old understanding yields when new understanding arrives.
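The lifecycle above can be sketched as an explicit transition table. This is an illustration under one assumption: only the edges drawn in the diagram are legal (pending to applied, pending to rejected, applied to superseded). The text does not say whether, for example, a pending insight can be superseded directly, so the sketch treats that as disallowed. `Status`, `TRANSITIONS`, and `transition` are hypothetical names.

```python
from enum import Enum


class Status(str, Enum):
    PENDING = "pending"
    APPLIED = "applied"
    REJECTED = "rejected"
    SUPERSEDED = "superseded"


# Legal moves, taken from the arrows in the lifecycle diagram (assumption:
# no other edges exist). Rejected and superseded are terminal.
TRANSITIONS: dict[Status, frozenset[Status]] = {
    Status.PENDING: frozenset({Status.APPLIED, Status.REJECTED}),
    Status.APPLIED: frozenset({Status.SUPERSEDED}),
    Status.REJECTED: frozenset(),
    Status.SUPERSEDED: frozenset(),
}


class InvalidTransition(ValueError):
    """Raised when a status change is not an edge in TRANSITIONS."""


def transition(current: Status, new: Status) -> Status:
    """Return the new status if the move is legal, otherwise raise."""
    if new not in TRANSITIONS[current]:
        raise InvalidTransition(f"{current.value} -> {new.value} is not allowed")
    return new
```

Encoding the edges as data makes it cheap to reject illegal moves at the API boundary, before any mutation runs.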

7. The Mutation Layer: Where Memory Becomes Action

This is the layer that separates AGNT Memory from a search index. When an insight is applied, something changes in the world. The system supports four mutation pathways:

7.1 Prompt Merge

A prompt_refinement targeting an agent triggers an LLM-driven merge. The existing system prompt and the suggested refinement are both fed to a merge model, which produces a new system prompt that preserves the original intent while integrating the new guidance. The agent's definition is updated in place. The next conversation uses the new prompt.

7.2 Parameter Tune

A parameter_tune or optimization targeting a workflow rewrites specific node parameters: confidence thresholds, retry counts, timeouts, LLM model selections, and temperature values. The workflow JSON is patched and re-saved.
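As an illustration of what patching a workflow definition could look like, the sketch below updates the parameters of one node and returns a new definition. The workflow layout is an assumption: a top-level `nodes` list whose items carry `id` and `parameters` keys. The paper does not describe the real workflow JSON, and `patch_node_parameters` is a hypothetical helper.

```python
import copy
from typing import Any


def patch_node_parameters(
    workflow: dict[str, Any], node_id: str, changes: dict[str, Any]
) -> dict[str, Any]:
    """Return a copy of the workflow with `changes` merged into one node.

    The input is left untouched so the previous definition can be kept
    for audit or rollback.
    """
    patched = copy.deepcopy(workflow)
    for node in patched["nodes"]:
        if node["id"] == node_id:
            node.setdefault("parameters", {}).update(changes)
            return patched
    raise KeyError(f"node not found: {node_id}")


# Hypothetical workflow definition, for illustration only.
workflow = {
    "nodes": [
        {"id": "fetch", "parameters": {"timeout": 10, "retries": 1}},
        {"id": "alert", "parameters": {}},
    ]
}

tuned = patch_node_parameters(workflow, "fetch", {"timeout": 30, "retries": 3})
```

Returning a copy rather than mutating in place is a design choice of the sketch; it keeps the pre-tune definition available if the change later needs to be reverted.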

7.3 Agent Memory Write

A memory insight about the user writes a new entry into the per-agent memory table with a typed category (fact, preference, correction, context, etc.) and a relevance score. Subsequent conversations retrieve this entry when relevant.

7.4 SkillForge Crystallization

A skill_candidate with sufficient occurrence count triggers SkillForge — a subsystem that synthesizes a new skill markdown file from the observed pattern, versions it, records its lineage back to the originating insights, and makes it available to agents via the skill catalog. This is how the platform grows new capabilities from its own execution history.
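The four pathways in this section amount to a dispatch from insight category to mutation handler. The sketch below shows that shape only. The handlers are placeholders that return a description string instead of calling a merge model, patching a workflow, or invoking SkillForge. All names are hypothetical, and `min_skill_occurrences` is a caller-supplied value because the paper does not state what counts as a "sufficient" occurrence count.

```python
from typing import Callable

Insight = dict  # one decoded JSON record, as in the data model section


def merge_prompt(insight: Insight) -> str:
    # Placeholder: a real handler would call a merge model here.
    return f"merged refinement into prompt of agent {insight['target_id']}"


def tune_parameters(insight: Insight) -> str:
    return f"patched parameters of workflow {insight['target_id']}"


def write_agent_memory(insight: Insight) -> str:
    return f"wrote memory entry for agent {insight['target_id']}"


def forge_skill(insight: Insight) -> str:
    return f"forged skill from insight {insight['id']}"


# Category -> mutation pathway, following subsections 7.1 to 7.4.
HANDLERS: dict[str, Callable[[Insight], str]] = {
    "prompt_refinement": merge_prompt,
    "parameter_tune": tune_parameters,
    "optimization": tune_parameters,
    "memory": write_agent_memory,
    "skill_candidate": forge_skill,
}


def apply_insight(insight: Insight, min_skill_occurrences: int) -> str:
    """Run the mutation for a pending insight and mark it applied."""
    if insight["status"] != "pending":
        raise ValueError(f"cannot apply insight in status {insight['status']!r}")
    category = insight["category"]
    handler = HANDLERS.get(category)
    if handler is None:
        raise LookupError(f"no mutation pathway for category {category!r}")
    if (
        category == "skill_candidate"
        and insight["occurrence_count"] < min_skill_occurrences
    ):
        raise ValueError("not enough occurrences to forge a skill")
    result = handler(insight)
    insight["status"] = "applied"
    return result
```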

8. Real Examples From the Live System

These are actual insight titles from the production database, not fabrications:

Category	Confidence	Title
`parameter_tune`	0.98	Missing recipient address in email alert
`pattern`	0.95	Zero-second node execution
`memory`	0.95	User wants summary of specific EPUB
`pattern`	0.93	All nodes execute in negligible time
`pattern`	0.92	Effective threshold detection and alert generation
`bottleneck`	0.90	Duplicate timerTrigger node
`prompt_refinement`	0.90	Add automatic archive extraction for EPUB files
`memory`	0.90	Preference for tabular, markdown-formatted summaries
`tool_preference`	0.85	Prefer archive-listing/extraction tools over raw read_file for binary containers
`bottleneck`	0.85	Redundant timerTrigger node execution
`parameter_tune`	0.78	Threshold values for price alerts are implicit
`parameter_tune`	0.75	Adjust price-threshold parameters
`parameter_tune`	0.62	Threshold setting may be too sensitive

Three things are worth noticing:

The 0.98 "Missing recipient address" insight is a bug report written by the system itself. No human wrote that. A workflow ran, the system noticed an empty to: field, and produced a targeted fix recommendation.
The two bottleneck entries about "duplicate timerTrigger" show occurrence-count compounding in action. The same pattern observed across multiple runs raises its own signal strength.
The memory insights on "EPUB summaries" and "tabular markdown output" are preference captures — they are how the system learns to serve the user without the user having to repeat themselves.
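Occurrence-count compounding, as described in the data model, implies some form of upsert: a repeated observation increments a counter instead of creating a new row. The sketch below is one minimal way to do that, keyed on target, category, and a normalized title. The key choice, the in-memory `store`, the target identifier `wf-1`, and the second timestamp are all assumptions for illustration; the paper does not describe how the deployed system matches repeats.

```python
from datetime import datetime, timezone

Key = tuple[str, str, str]


def normalise(title: str) -> str:
    """Lower-case and collapse whitespace so trivial variants match."""
    return " ".join(title.lower().split())


def record_observation(
    store: dict[Key, dict],
    target_id: str,
    category: str,
    title: str,
    seen_at: datetime,
) -> dict:
    """Insert a new insight or bump the counters on an existing one."""
    key = (target_id, category, normalise(title))
    entry = store.get(key)
    if entry is None:
        entry = {
            "title": title,
            "occurrence_count": 0,
            "created_at": seen_at,
            "last_seen_at": seen_at,
        }
        store[key] = entry
    entry["occurrence_count"] += 1
    entry["last_seen_at"] = max(entry["last_seen_at"], seen_at)
    return entry


store: dict[Key, dict] = {}
first = datetime(2026, 3, 16, 3, 2, 35, tzinfo=timezone.utc)
later = datetime(2026, 3, 17, 3, 2, 35, tzinfo=timezone.utc)  # hypothetical

record_observation(store, "wf-1", "bottleneck", "Duplicate timerTrigger node", first)
record_observation(store, "wf-1", "bottleneck", "duplicate  timerTrigger node", later)
record_observation(store, "wf-1", "bottleneck", "Redundant timerTrigger node execution", later)
```

After these three calls the store holds two entries: the first with an occurrence count of 2, the second with a count of 1. Exact-title matching does not merge differently worded reports of the same issue, which is why the conclusion lists clustering duplicates as open work.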

9. API Surface

The insight layer is exposed through a small, orthogonal REST surface:

Method	Path	Purpose
`GET`	`/api/insights/`	List insights with filters: `targetType`, `targetId`, `status`, `category`, `limit`
`GET`	`/api/insights/stats`	Aggregate counts by status and target type
`GET`	`/api/insights/:id`	Fetch a single insight with full context
`POST`	`/api/insights/:id/apply`	Apply an insight to its target (triggers mutation)
`POST`	`/api/insights/:id/reject`	Mark as rejected with optional reason
`POST`	`/api/insights/extract`	Manually trigger extraction on an execution
`GET`	`/api/agents/:id/memories`	Per-agent memory store (facts, preferences, corrections)
`POST`	`/api/agents/:id/memories`	Write a new memory entry

The orthogonality matters. Filter by targetType=workflow&status=pending&category=bottleneck and you get a prioritized list of workflow hotspots ready for review. Filter by category=skill_candidate&status=pending and you get the queue of potential new skills waiting to be forged. The same primitive serves engineering, QA, and capability growth.
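The filter combinations above translate directly into query strings. The sketch below builds such a URL with the standard library and makes no network call. The base URL is a placeholder, `insights_url` is a hypothetical helper, and the set of accepted filter names is taken from the API table; how the server treats unknown filters is not stated, so the sketch rejects them client-side.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "http://localhost:3000"  # placeholder, not the real deployment

# Filter names listed for GET /api/insights/ in the API table.
ALLOWED_FILTERS = {"targetType", "targetId", "status", "category", "limit"}


def insights_url(base_url: str, **filters: object) -> str:
    """Build the list-insights URL for the given filters."""
    unknown = set(filters) - ALLOWED_FILTERS
    if unknown:
        raise ValueError(f"unsupported filters: {sorted(unknown)}")
    # Drop filters explicitly set to None so they are not sent as "None".
    query = urlencode({k: v for k, v in filters.items() if v is not None})
    url = urljoin(base_url, "/api/insights/")
    return f"{url}?{query}" if query else url


hotspots = insights_url(
    BASE_URL, targetType="workflow", status="pending", category="bottleneck"
)
skill_queue = insights_url(BASE_URL, category="skill_candidate", status="pending")
```

The resulting strings can be passed to any HTTP client; authentication and response handling are outside the scope of this sketch.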

10. Comparison to Alternatives

Dimension	Vector-store RAG	GBrain / Markdown brain	AGNT Memory
Atomic unit	Embedded text chunk	Markdown page about a thing	Typed insight about an execution
Primary subject	Documents	People, companies, concepts	The system itself
Action on learning	None — retrieval only	Human rewrites compiled truth	Automated mutation of target
Typed schema	No	No	Yes (8 categories, 4 target types)
Closed loop	No	Partial (dream cycle enrichment)	Yes (apply → mutate → re-observe)
Growth of new capabilities	No	No	Yes (SkillForge crystallization)
Supersession	Implicit via re-embedding	Explicit via compiled truth rewrite	Explicit status transition
Best at	Answering questions about documents	Answering questions about the world	Improving the system over time

None of these replaces the others. The ideal deployment runs all three in parallel: vector RAG for document knowledge, a markdown brain for world knowledge about people and concepts, and AGNT Memory for self-improvement. They operate on different kinds of knowledge.

11. Conclusion

The conventional frame of "agent memory" is too small. It asks how an agent remembers what the user said. It should ask how an agent platform remembers what it learned about itself — its failure modes, its successful patterns, its parameter sensitivities, its emergent capabilities — and what it does with that knowledge.

AGNT Memory's answer is to treat every execution as a first-class source of typed, routed, mutable knowledge. Insights are not log entries. They are proposals for change. The system is not a diary; it is a feedback loop with teeth.

The live numbers show the loop is generating signal prolifically (2,937 insights, 72% high-confidence in a 500-insight sample, routed across workflows, agents, and skills). The open challenge is closing the loop faster: draining the pending queue, clustering duplicates, and shipping compiled-truth rollups. Those are engineering problems, not conceptual ones. The architecture is sound.

The system that learns about itself is the system that compounds. Everything else is a scratchpad.

Annie · AGNT Systems Research · April 2026
All metrics pulled live from /api/insights/stats at time of writing. No figures fabricated.