AGNT Memory

A Whitepaper on Self-Improving Agent Infrastructure

Version 1.0, April 2026
Annie · AGNT Systems Research


Abstract

Most "AI memory" systems are glorified scratchpads: a vector store, a list of facts, or a RAG pipeline pointed at a corpus. They remember what the user said and nothing else. They treat the agent as a static consumer of memory rather than a participant in its own evolution.

AGNT Memory is different. It is a unified evolution system that observes every execution (agent chats, goal runs, workflow traces), extracts structured insights, routes them to the correct target (agent, skill, workflow, or tool), and can rewrite those targets when evidence warrants it. The system doesn't just remember; it learns from itself and closes the loop.

This paper describes the architecture, the data model, live metrics from a production deployment (2,937 insights tracked), and the open challenges, including the honest ones.


1. The Problem With "Memory"

The conventional agent memory stack looks like this:

user says X → embed X → store in vector DB → retrieve on next query

This is useful, but it answers only one question: "what did the user tell me before?" It cannot answer any of the questions that actually matter once an agent system is in production:

  • Which of my workflows keep failing at the same node?
  • Which agents have prompts that drift off-task?
  • Which tools are slow, expensive, or consistently wrong?
  • What patterns are emerging across thousands of executions that should become reusable skills?
  • How do I act on what I'm learning instead of just logging it?

A memory system that cannot answer these is not a memory system. It is a diary.

AGNT Memory is built on a different premise: the most valuable knowledge an agent platform accumulates is knowledge about itself, meaning its own failure modes, parameter sensitivities, prompt weaknesses, and emergent patterns. Remembering facts about the user is table stakes. Remembering facts about the system is where compounding returns live.


2. Architecture Overview

AGNT Memory is a three-layer system exposed under /api/insights and /api/agents/:id/memories.

┌─────────────────────────────────────────────────────────────┐
│                    EXECUTION SURFACES                       │
│   agent chats   •   goal runs   •   workflow traces         │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼  (auto-extraction)
┌─────────────────────────────────────────────────────────────┐
│                    INSIGHT LAYER                            │
│   typed records: category, confidence, evidence, status     │
│   source → target routing (agent|skill|workflow|tool)       │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼  (review · apply · supersede)
┌─────────────────────────────────────────────────────────────┐
│                    MUTATION LAYER                           │
│   prompt merge   •   parameter tune   •   skill forge       │
│   agent memory write   •   workflow rewrite                 │
└─────────────────────────────────────────────────────────────┘

Three design choices distinguish this from conventional memory stacks:

| Decision | Rationale |
| --- | --- |
| Typed categories, not free-form text | Enables routing, filtering, and automated action. "bottleneck" and "prompt_refinement" demand different handlers. |
| Explicit source → target mapping | An insight about a workflow run can be routed to an agent's prompt, a skill's documentation, or the workflow definition itself. |
| Status machine, not an append-only log | pending → applied → superseded creates a review pipeline. Insights die when evidence contradicts them. |
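
To make the first design choice concrete, here is a minimal sketch of category-based dispatch. The handler functions, the registry, and the strings they return are hypothetical illustrations; the source does not describe how AGNT wires categories to handlers internally.

```python
from typing import Callable, Dict


# Hypothetical handlers: each returns a description of the action it would
# take. A real handler would hand the insight to the mutation layer.
def handle_bottleneck(insight: dict) -> str:
    return f"flag {insight['target_type']} {insight['target_id']} for performance review"


def handle_prompt_refinement(insight: dict) -> str:
    return f"queue prompt merge for {insight['target_type']} {insight['target_id']}"


# Registry keyed by the typed category. Free-form text could not be
# dispatched this way without another classification step.
HANDLERS: Dict[str, Callable[[dict], str]] = {
    "bottleneck": handle_bottleneck,
    "prompt_refinement": handle_prompt_refinement,
}


def route(insight: dict) -> str:
    """Dispatch on category; categories without a handler stay pending."""
    handler = HANDLERS.get(insight["category"])
    if handler is None:
        return "no handler registered; leave pending"
    return handler(insight)
```

Because the category is a closed vocabulary, adding a new behavior means registering one more handler rather than re-parsing text.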

3. The Eight Insight Categories

Every observation the system produces is typed into one of eight categories. This is the vocabulary of self-improvement:

| Category | Targets | What it encodes |
| --- | --- | --- |
| memory | agent | Raw facts about the user, preferences, corrections |
| prompt_refinement | agent, skill | Specific language changes to improve behavior |
| skill_recommendation | agent, workflow | "This execution should have used skill X" |
| tool_preference | agent, workflow | "For task type Y, tool A consistently beats tool B" |
| bottleneck | workflow, tool | Performance or latency hotspot |
| optimization | workflow | Parameter/structure change that reduces cost or time |
| error_pattern | workflow, tool | Recurring failure mode with diagnostic evidence |
| skill_candidate | skill (new) | A pattern observed enough times to become a reusable skill |
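
The category-to-target pairs above lend themselves to a simple validity check before an insight is routed. The following sketch transcribes the table into a lookup; the function name `is_valid_route` is an assumption for illustration, not part of the documented API.

```python
# Allowed target types per category, transcribed from the table above.
ALLOWED_TARGETS = {
    "memory": {"agent"},
    "prompt_refinement": {"agent", "skill"},
    "skill_recommendation": {"agent", "workflow"},
    "tool_preference": {"agent", "workflow"},
    "bottleneck": {"workflow", "tool"},
    "optimization": {"workflow"},
    "error_pattern": {"workflow", "tool"},
    "skill_candidate": {"skill"},
}


def is_valid_route(category: str, target_type: str) -> bool:
    """Return True if the category may be routed to the given target type."""
    return target_type in ALLOWED_TARGETS.get(category, set())
```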

The last category is the quiet blockbuster. When the system sees the same multi-step pattern across many executions, it flags it as a skill candidate, and AGNT's SkillForge subsystem can crystallize it into a versioned, lineage-tracked skill that other agents can use. This is how the system grows its own library of capabilities without a human writing new skill files.


4. The Data Model

Every insight is a row in a typed schema. Here is the actual shape returned by GET /api/insights/:

{
  "id": "3cf75762-07a3-4211-9031-51ee91ef4027",
  "source_type": "workflow",
  "source_id": "662882c4-f3fe-4ae3-86e4-e2c07388dd27",
  "target_type": "workflow",
  "target_id": "c104d400-d6bf-4348-8327-23c80d70b269",
  "category": "pattern",
  "title": "Zero-duration node executions indicate efficient processing",
  "description": "All nodes completed with a duration of 0 seconds and no errors, demonstrating a performant and reliable workflow pattern worth preserving.",
  "evidence": "Each node's Duration: 0s and Error: None throughout the trace",
  "confidence": 0.92,
  "status": "pending",
  "occurrence_count": 1,
  "last_seen_at": "2026-03-16T03:02:35Z",
  "created_at": "2026-03-16T03:02:35Z"
}

Five fields do the heavy lifting:

  • confidence (0.0–1.0): the extraction model's own self-assessment. Used for auto-apply thresholds.
  • evidence: a human-readable snippet of why the system believes this. Makes every insight auditable.
  • occurrence_count: how many times this same pattern has been observed. Converts one-off observations into durable signal.
  • status: the lifecycle state. This is what makes the system actionable instead of ornamental.
  • source_context (not shown in the sample above): the raw execution metadata so an applied insight can be traced back to its origin.
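
For client code, the record can be modeled as a typed structure. This is a sketch only: the field names come from the sample payload, while the `Insight` class, `insight_from_json`, and the range check on `confidence` are illustrative assumptions rather than AGNT's actual implementation.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


def parse_timestamp(value: str) -> datetime:
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11,
    # so rewrite it as an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


@dataclass(frozen=True)
class Insight:
    id: str
    source_type: str
    source_id: str
    target_type: str
    target_id: str
    category: str
    title: str
    description: str
    evidence: str
    confidence: float
    status: str
    occurrence_count: int
    last_seen_at: datetime
    created_at: datetime
    source_context: Optional[dict] = None  # absent from the sample payload


def insight_from_json(raw: str) -> Insight:
    """Parse one insight record and validate the documented confidence range."""
    data = json.loads(raw)
    confidence = float(data["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return Insight(
        id=data["id"],
        source_type=data["source_type"],
        source_id=data["source_id"],
        target_type=data["target_type"],
        target_id=data["target_id"],
        category=data["category"],
        title=data["title"],
        description=data["description"],
        evidence=data["evidence"],
        confidence=confidence,
        status=data["status"],
        occurrence_count=int(data["occurrence_count"]),
        last_seen_at=parse_timestamp(data["last_seen_at"]),
        created_at=parse_timestamp(data["created_at"]),
        source_context=data.get("source_context"),
    )
```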

Parallel to this sits a simpler per-agent memory table: facts, preferences, and corrections attached to a specific agent, each with a relevance score. This is the classic "what does the agent know about the user" store, and it is the layer that save_agent_memory writes to and get_agent_memories reads from.


5. Live Metrics From Production

These numbers are pulled live from the /api/insights/stats endpoint on this deployment, at the moment of writing.

Figure 1. Insights by target (n = 2,937)

| Target | Insights |
| --- | --- |
| Workflows | 2,057 |
| Agents | 843 |
| Skills | 15 |

Workflows dominate the insight graph because every node in every trace is observable. Agent-targeted insights come from chat transcripts. Skill insights are the smallest bucket but the most valuable: each one is a proposal to evolve a reusable capability.

Insights by target: 2,057 workflow-targeted, 843 agent-targeted, and 15 skill-targeted, which together cover 2,915 of the 2,937 structured observations. Each one is routed to the thing it is about rather than dumped into a single bucket.

Figure 2. Insight category distribution (sample n = 500)

| Category | Count | Share |
| --- | --- | --- |
| Pattern | 161 | 32.2% |
| Parameter tune | 93 | 18.6% |
| Bottleneck | 85 | 17.0% |
| Memory | 80 | 16.0% |
| Prompt refinement | 56 | 11.2% |
| Tool preference | 25 | 5.0% |

A healthy mix. The system is simultaneously learning what works (patterns), what breaks (bottlenecks), how to tune itself (parameter tune, prompt refinement), what the user wants (memory), and how to select tools. No single category dominates at the expense of the others.

Category distribution: patterns dominate (32%), followed by parameter tuning (19%), bottlenecks (17%), memory (16%), prompt refinements (11%), and tool preferences (5%). This is a healthy mix: the system is simultaneously learning what works, what breaks, and what the user wants.

Figure 3. Extractor confidence distribution (sample n = 500)

| Band | Confidence | Count | Share |
| --- | --- | --- | --- |
| High | ≥ 0.80 | 359 | 71.8% |
| Medium | 0.50–0.79 | 141 | 28.2% |
| Low | < 0.50 | 0 | 0.0% |

The extractor is calibrated to suppress weak signal rather than flood the queue. Nothing landed below 0.5 in the sample. This matters because it means the pending queue, while large, is not noise; it is reviewable material waiting for policy.

Confidence distribution: 72% of insights arrive with high confidence (≥ 0.8), 28% medium, and 0% low.
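
The banding used in Figure 3 can be reproduced with a few lines of arithmetic. The sketch below assumes the band edges shown in the figure (0.80 and 0.50); the function names are illustrative, and the input in the usage note is synthetic, chosen only to match the reported counts.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def confidence_bucket(confidence: float) -> str:
    """Map a confidence score to the bands used in Figure 3."""
    if confidence >= 0.80:
        return "high"
    if confidence >= 0.50:
        return "medium"
    return "low"


def bucket_shares(confidences: Iterable[float]) -> Dict[str, Tuple[int, float]]:
    """Return (count, percentage share) for each band."""
    counts = Counter(confidence_bucket(c) for c in confidences)
    total = sum(counts.values())
    return {
        band: (counts[band], round(100 * counts[band] / total, 1) if total else 0.0)
        for band in ("high", "medium", "low")
    }
```

Feeding in 359 scores at or above 0.80 and 141 scores between 0.50 and 0.79 yields shares of 71.8% and 28.2%, matching the figure.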

Figure 4. Insight source distribution (sample n = 500)

| Source | Count | Share |
| --- | --- | --- |
| Workflow traces | 339 | 67.8% |
| Agent chats | 161 | 32.2% |

Workflow executions are the richest source because they have explicit node graphs, timing, and structured errors. Agent chats contribute high-value memory and preference insights. Goals fan out into both streams, so every execution surface feeds the loop.

Source distribution: 68% of insights come from workflow executions, 32% from agent chats. Goals funnel through both. Every execution surface is feeding the loop.


6. The Lifecycle of an Insight

An insight moves through four states:

┌─────────┐      ┌─────────┐       ┌──────────┐
│ pending │─────▶│ applied │──────▶│superseded│
└─────────┘      └─────────┘       └──────────┘
     │
     ▼
┌──────────┐
│ rejected │
└──────────┘

Pending. The default state. The extractor has produced the insight but no mutation has occurred. Pending insights are queryable but inert.

Applied. A reviewer (human or automated policy) has accepted the insight and the mutation layer has executed. For a prompt_refinement this means an LLM merged the suggested change into the target agent's system prompt. For a parameter_tune this means the workflow definition was updated. For a skill_candidate this means SkillForge generated a new skill file.

Rejected. The insight was reviewed and deemed wrong, noisy, or harmful. Rejection is signal: it trains the extractor over time.

Superseded. A newer insight on the same target contradicts or improves on this one. Supersession is the mechanism that prevents unbounded accumulation. It is the AGNT equivalent of GBrain's "compiled truth on top" pattern: old understanding yields when new understanding arrives.
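
The lifecycle can be expressed as a small transition table. This sketch encodes only the edges drawn in the diagram (pending to applied, pending to rejected, applied to superseded); whether AGNT permits other transitions, such as superseding a pending insight directly, is not stated in the source, and the names below are illustrative.

```python
# Edges taken from the lifecycle diagram; terminal states map to an empty set.
ALLOWED_TRANSITIONS = {
    "pending": {"applied", "rejected"},
    "applied": {"superseded"},
    "rejected": set(),
    "superseded": set(),
}


class InvalidTransition(ValueError):
    """Raised when a status change is not an edge in the lifecycle diagram."""


def transition(current: str, new: str) -> str:
    """Return the new status if the move is allowed, otherwise raise."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {new} is not allowed")
    return new
```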


7. The Mutation Layer: Where Memory Becomes Action

This is the layer that separates AGNT Memory from a search index. When an insight is applied, something changes in the world. The system supports four mutation pathways:

7.1 Prompt Merge

A prompt_refinement targeting an agent triggers an LLM-driven merge. The existing system prompt and the suggested refinement are both fed to a merge model, which produces a new system prompt that preserves the original intent while integrating the new guidance. The agent's definition is updated in place. The next conversation uses the new prompt.
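
A minimal sketch of the merge step follows. The merge model is injected as a plain callable so the example stays runnable without any LLM client; the instruction text, the `system_prompt` field name, and the retained `previous_system_prompt` are all assumptions made for illustration. The sketch also returns a copy rather than updating in place, purely to keep a rollback value visible.

```python
from typing import Callable

# Hypothetical instruction text; the source does not publish the real one.
MERGE_INSTRUCTIONS = (
    "Rewrite the system prompt so that it integrates the refinement "
    "while preserving the original intent. Return only the new prompt."
)


def build_merge_request(current_prompt: str, refinement: str) -> str:
    """Assemble the text handed to the merge model."""
    return (
        f"{MERGE_INSTRUCTIONS}\n\n"
        f"[CURRENT PROMPT]\n{current_prompt}\n\n"
        f"[REFINEMENT]\n{refinement}"
    )


def apply_prompt_refinement(
    agent: dict, refinement: str, merge_model: Callable[[str], str]
) -> dict:
    """Return an updated agent definition with the merged system prompt."""
    request = build_merge_request(agent["system_prompt"], refinement)
    merged = merge_model(request).strip()
    if not merged:
        raise ValueError("merge model returned an empty prompt; keeping the original")
    # Keep the previous prompt so the change can be rolled back.
    return {
        **agent,
        "system_prompt": merged,
        "previous_system_prompt": agent["system_prompt"],
    }
```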

7.2 Parameter Tune

A parameter_tune or optimization targeting a workflow rewrites specific node parameters: confidence thresholds, retry counts, timeouts, LLM model selections, and temperature values. The workflow JSON is patched and re-saved.
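
As an illustration of what patching node parameters might look like, the sketch below assumes a workflow document with a `nodes` list whose items carry an `id` and an optional `parameters` object. That shape is a guess for the sake of the example; the actual AGNT workflow schema is not shown in the source.

```python
import copy


def patch_node_parameters(workflow: dict, node_id: str, changes: dict) -> dict:
    """Return a copy of the workflow with one node's parameters updated."""
    patched = copy.deepcopy(workflow)  # leave the stored definition untouched
    for node in patched["nodes"]:
        if node["id"] == node_id:
            node.setdefault("parameters", {}).update(changes)
            return patched
    raise KeyError(f"node {node_id!r} not found in workflow")
```

Working on a deep copy means the original definition is still available for comparison or rollback until the patched version is saved.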

7.3 Agent Memory Write

A memory insight about the user writes a new entry into the per-agent memory table with a typed category (fact, preference, correction, context, etc.) and a relevance score. Subsequent conversations retrieve this entry when relevant.
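
The per-agent memory table can be pictured as a keyed list of typed entries. The in-memory store below is a sketch: the class names, the assumed 0.0 to 1.0 relevance range, and ranking purely by the stored relevance score are illustrative choices, not a description of how AGNT retrieves entries.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MemoryEntry:
    category: str     # e.g. "fact", "preference", "correction", "context"
    content: str
    relevance: float  # assumed to lie in [0.0, 1.0]


@dataclass
class AgentMemoryStore:
    entries: Dict[str, List[MemoryEntry]] = field(default_factory=dict)

    def save(self, agent_id: str, category: str, content: str, relevance: float) -> MemoryEntry:
        """Append a typed memory entry for one agent."""
        if not 0.0 <= relevance <= 1.0:
            raise ValueError(f"relevance out of range: {relevance}")
        entry = MemoryEntry(category, content, relevance)
        self.entries.setdefault(agent_id, []).append(entry)
        return entry

    def top(self, agent_id: str, limit: int = 5) -> List[MemoryEntry]:
        """Return the agent's entries, highest relevance first."""
        ranked = sorted(
            self.entries.get(agent_id, []), key=lambda e: e.relevance, reverse=True
        )
        return ranked[:limit]
```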

7.4 SkillForge Crystallization

A skill_candidate with sufficient occurrence count triggers SkillForge, a subsystem that synthesizes a new skill markdown file from the observed pattern, versions it, records its lineage back to the originating insights, and makes it available to agents via the skill catalog. This is how the platform grows new capabilities from its own execution history.
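
The gating and output of this pathway might look roughly like the sketch below. The occurrence threshold of 5, the markdown layout, and both function names are hypothetical; the source says only that a "sufficient" count triggers crystallization and that the resulting file is versioned and lineage-tracked.

```python
from typing import List

MIN_OCCURRENCES = 5  # hypothetical threshold; the source does not state one


def should_forge(insight: dict, min_occurrences: int = MIN_OCCURRENCES) -> bool:
    """Decide whether a pending skill candidate has been seen often enough."""
    return (
        insight["category"] == "skill_candidate"
        and insight["status"] == "pending"
        and insight["occurrence_count"] >= min_occurrences
    )


def render_skill_markdown(
    name: str, version: str, insight: dict, lineage_ids: List[str]
) -> str:
    """Render a skill file that records its version and originating insights."""
    lines = [
        f"# {name}",
        "",
        f"Version: {version}",
        "",
        "## Pattern",
        insight["description"],
        "",
        "## Lineage",
    ]
    lines += [f"- insight {insight_id}" for insight_id in lineage_ids]
    return "\n".join(lines) + "\n"
```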


8. Real Examples From the Live System

These are actual insight titles from the production database, not fabrications:

| Category | Confidence | Title |
| --- | --- | --- |
| parameter_tune | 0.98 | Missing recipient address in email alert |
| pattern | 0.95 | Zero-second node execution |
| memory | 0.95 | User wants summary of specific EPUB |
| pattern | 0.93 | All nodes execute in negligible time |
| pattern | 0.92 | Effective threshold detection and alert generation |
| bottleneck | 0.90 | Duplicate timerTrigger node |
| prompt_refinement | 0.90 | Add automatic archive extraction for EPUB files |
| memory | 0.90 | Preference for tabular, markdown-formatted summaries |
| tool_preference | 0.85 | Prefer archive-listing/extraction tools over raw read_file for binary containers |
| bottleneck | 0.85 | Redundant timerTrigger node execution |
| parameter_tune | 0.78 | Threshold values for price alerts are implicit |
| parameter_tune | 0.75 | Adjust price-threshold parameters |
| parameter_tune | 0.62 | Threshold setting may be too sensitive |

Three things are worth noticing:

  1. The 0.98 "Missing recipient address" insight is a bug report written by the system itself. No human wrote that. A workflow ran, the system noticed an empty to: field, and produced a targeted fix recommendation.
  2. The two bottleneck entries about "duplicate timerTrigger" show occurrence-count compounding in action. The same pattern observed across multiple runs raises its own signal strength.
  3. The memory insights on "EPUB summaries" and "tabular markdown output" are preference captures: they are how the system learns to serve the user without the user having to repeat themselves.
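
One way to picture occurrence-count compounding is an upsert keyed on the target, the category, and a normalized title. The source does not say how AGNT decides that two observations are "the same pattern", so the key used here is an assumption, as are the function names.

```python
from typing import Dict, Tuple

Key = Tuple[str, str, str, str]


def dedupe_key(insight: dict) -> Key:
    """Build a match key from target, category, and a whitespace/case-normalized title."""
    title = " ".join(insight["title"].lower().split())
    return (insight["target_type"], insight["target_id"], insight["category"], title)


def record_observation(store: Dict[Key, dict], insight: dict) -> dict:
    """Insert a new insight or bump the counter on a matching existing one."""
    key = dedupe_key(insight)
    existing = store.get(key)
    if existing is None:
        stored = {**insight, "occurrence_count": 1}
        store[key] = stored
        return stored
    existing["occurrence_count"] += 1
    existing["last_seen_at"] = insight["last_seen_at"]
    return existing
```

Exact-title matching is deliberately naive. Differently worded reports of the same problem, such as the two timerTrigger entries in the table, would land under separate keys and need some form of clustering to be merged.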

9. API Surface

The insight layer is exposed through a small, orthogonal REST surface:

| Method | Path | Purpose |
| --- | --- | --- |
| GET | /api/insights/ | List insights with filters: targetType, targetId, status, category, limit |
| GET | /api/insights/stats | Aggregate counts by status and target type |
| GET | /api/insights/:id | Fetch a single insight with full context |
| POST | /api/insights/:id/apply | Apply an insight to its target (triggers mutation) |
| POST | /api/insights/:id/reject | Mark as rejected with optional reason |
| POST | /api/insights/extract | Manually trigger extraction on an execution |
| GET | /api/agents/:id/memories | Per-agent memory store (facts, preferences, corrections) |
| POST | /api/agents/:id/memories | Write a new memory entry |

The orthogonality matters. Filter by targetType=workflow&status=pending&category=bottleneck and you get a prioritized list of workflow hotspots ready for review. Filter by category=skill_candidate&status=pending and you get the queue of potential new skills waiting to be forged. The same primitive serves engineering, QA, and capability growth.
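
The filter combinations described above can be assembled with a small helper. The filter names come from the endpoint table; the base URL in the usage note and the helper itself are placeholders, since the source does not document a client library.

```python
from urllib.parse import urlencode

# Filter names listed for GET /api/insights/ in the table above.
ALLOWED_FILTERS = ("targetType", "targetId", "status", "category", "limit")


def insights_url(base_url: str, **filters) -> str:
    """Build a list-insights URL, rejecting filters the endpoint does not list."""
    unknown = set(filters) - set(ALLOWED_FILTERS)
    if unknown:
        raise ValueError(f"unsupported filters: {sorted(unknown)}")
    # Emit parameters in a fixed order so the resulting URL is deterministic.
    pairs = [(name, filters[name]) for name in ALLOWED_FILTERS if filters.get(name) is not None]
    query = urlencode(pairs)
    url = f"{base_url.rstrip('/')}/api/insights/"
    return f"{url}?{query}" if query else url
```

For example, `insights_url("http://localhost:3000", targetType="workflow", status="pending", category="bottleneck")` produces the workflow-hotspot query, with `http://localhost:3000` standing in for wherever the deployment is hosted.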


10. Comparison to Alternatives

| Dimension | Vector-store RAG | GBrain / Markdown brain | AGNT Memory |
| --- | --- | --- | --- |
| Atomic unit | Embedded text chunk | Markdown page about a thing | Typed insight about an execution |
| Primary subject | Documents | People, companies, concepts | The system itself |
| Action on learning | None (retrieval only) | Human rewrites compiled truth | Automated mutation of target |
| Typed schema | No | No | Yes (8 categories, 4 target types) |
| Closed loop | No | Partial (dream cycle enrichment) | Yes (apply → mutate → re-observe) |
| Growth of new capabilities | No | No | Yes (SkillForge crystallization) |
| Supersession | Implicit via re-embedding | Explicit via compiled truth rewrite | Explicit status transition |
| Best at | Answering questions about documents | Answering questions about the world | Improving the system over time |

None of these replace each other. The ideal deployment runs all three in parallel: vector RAG for document knowledge, a markdown brain for world knowledge about people and concepts, and AGNT Memory for self-improvement. They operate on different kinds of knowledge.


11. Conclusion

The conventional frame of "agent memory" is too small. It asks how an agent remembers what the user said. It should ask how an agent platform remembers what it learned about itself (its failure modes, its successful patterns, its parameter sensitivities, its emergent capabilities) and what it does with that knowledge.

AGNT Memory's answer is to treat every execution as a first-class source of typed, routed, mutable knowledge. Insights are not log entries. They are proposals for change. The system is not a diary; it is a feedback loop with teeth.

The live numbers show the loop is generating signal prolifically (2,937 insights, 72% high-confidence, routed across workflows, agents, and skills). The open challenge is closing the loop faster: draining the pending queue, clustering duplicates, and shipping compiled-truth rollups. Those are engineering problems, not conceptual ones. The architecture is sound.

The system that learns about itself is the system that compounds. Everything else is a scratchpad.


Annie · AGNT Systems Research · April 2026
All metrics pulled live from /api/insights/stats at time of writing. No figures fabricated.