Building Persistent Memory for AI Agents That Actually Stick
How I built AgentSSOT (hari-hive), a cross-session memory system for AI agents using FastAPI and ChromaDB. Why agents without memory are just demos, and what it takes to make them production-ready tools.
The Problem With Stateless Agents
Every AI agent demo starts the same way: it handles a task, looks impressive, then the session ends and it forgets everything. Next time, you re-explain context. Again. The agent has amnesia by design.
This matters less for one-off chatbots. It matters enormously when you're running agents that manage infrastructure, handle customer conversations, or operate across a portfolio of projects. Those agents need to remember decisions, learn from corrections, and build context over time.
I run about a dozen agents across different systems — monitoring, development assistance, marketing automation, customer-facing chat. They all needed a shared memory layer. So I built one.
What hari-hive (AgentSSOT) Does
AgentSSOT is a single source of truth for agent knowledge. Any agent — regardless of provider (OpenAI, Anthropic, Ollama, OpenRouter) — can read from and write to it through a REST API.
The core operations:
- Recall — semantic search across stored knowledge. "What do I know about the deployment process for project X?" returns relevant facts ranked by embedding similarity.
- Ingest — store a new fact, decision, or piece of context with source attribution and tags.
- Teach — store a skill: "when X happens, do Y." Prescriptive knowledge with trigger conditions and verification hints.
- Feedback — rate a recall result as useful, noted, or wrong. This trains the relevance ranking over time.
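The four operations map naturally onto a thin client. As a sketch only: the endpoint paths and field names below are illustrative assumptions, not the actual AgentSSOT API, and the transport is stubbed so the shape of each request stays visible.

```python
import json

class MemoryClient:
    """Illustrative client for the four core operations.
    Paths and field names are assumptions, not the real AgentSSOT API."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")

    def _request(self, path, body):
        # Transport elided: return the prepared request instead of sending it.
        return (self.base_url + path, json.dumps(body))

    def recall(self, query, namespace="claude-shared", top_k=5):
        return self._request("/recall", {"query": query, "namespace": namespace, "top_k": top_k})

    def ingest(self, text, source, tags=()):
        return self._request("/ingest", {"text": text, "source": source, "tags": list(tags)})

    def teach(self, trigger, action, success_hint):
        return self._request("/teach", {"trigger": trigger, "action": action, "success_hint": success_hint})

    def feedback(self, memory_id, signal, note=None):
        return self._request("/feedback", {"memory_id": memory_id, "signal": signal, "note": note})
```

Any provider's agent loop can expose these four calls as tool functions, which is what keeps the system provider-agnostic.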
The architecture:
Agent (any provider) → REST API (FastAPI) → ChromaDB (embeddings)
                                          → PostgreSQL (metadata, sessions)
                                          → Namespace isolation per agent/project
Why Embeddings Beat Keyword Search for Agent Memory
Agents don't ask precise questions. They ask things like "what's the deal with the nginx config on the portfolio site?" A keyword search for "nginx config portfolio" might miss a memory stored as "website deployment uses cloudflare tunnel through docker compose with nginx reverse proxy."
Embedding similarity catches the semantic connection. The stored memory and the query live close together in vector space even though they share almost no words.
ChromaDB handles this well for the scale I need (tens of thousands of memories across all agents). For larger deployments, pgvector in PostgreSQL works with the same principle.
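A toy illustration of the idea, using hand-made three-dimensional "embeddings" (real embedding vectors have hundreds of dimensions, but cosine similarity works the same way):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hand-made 3-d vectors standing in for real embeddings.
query     = [0.9, 0.1, 0.4]  # "what's the deal with the nginx config...?"
memory    = [0.8, 0.2, 0.5]  # "...cloudflare tunnel ... nginx reverse proxy"
unrelated = [0.1, 0.9, 0.0]  # a marketing-automation note

# The paraphrased memory sits far closer to the query than the unrelated one,
# even when the underlying texts share almost no keywords.
assert cosine(query, memory) > 0.9 > cosine(query, unrelated)
```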
Namespace Isolation
Different agents and projects get different namespaces. My infrastructure monitoring agent shouldn't surface marketing automation knowledge when it's trying to diagnose a disk space issue.
The namespace scheme:
claude-shared      → general cross-agent knowledge
project:takeoffpro → TakeOff Pro-specific context
agent:moni         → Moni (personal assistant) memories
An agent can query its own namespace, the shared namespace, or both. This scoping prevents context pollution while allowing common knowledge to propagate.
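Scoped recall reduces to a filter over allowed namespaces. A minimal sketch, assuming each memory record carries a namespace field (the record shape is illustrative):

```python
def scoped_recall(memories, agent_ns, include_shared=True):
    """Keep memories in the agent's own namespace, optionally plus the
    shared namespace from the scheme above."""
    allowed = {agent_ns} | ({"claude-shared"} if include_shared else set())
    return [m for m in memories if m["namespace"] in allowed]

memories = [
    {"namespace": "claude-shared", "text": "all sites sit behind Cloudflare"},
    {"namespace": "agent:moni", "text": "owner prefers morning summaries"},
    {"namespace": "project:takeoffpro", "text": "staging DB is ephemeral"},
]
```

In a real vector store this filter runs as a metadata condition on the query rather than post-hoc in Python, so unrelated namespaces never enter the similarity search at all.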
Skill Teaching vs. Fact Storage
Facts and skills serve different purposes:
Facts answer "what is true?" — "The portfolio site runs on Docker Compose with nginx, FastAPI, and Cloudflare Tunnel."
Skills answer "what should I do?" — "When deploying mohsenjahanshahi.com, rsync to dockers VM, run docker compose build, then docker compose up -d."
The distinction matters for retrieval. When an agent encounters a deployment task, it should get the skill (step-by-step instructions) rather than a collection of facts it has to reconstruct into a procedure.
Skills have three fields:
- Trigger — when does this activate?
- Action — what to do (specific steps)
- Success hint — how to verify it worked
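A minimal skill record mirrors those three fields. The naive keyword matcher below is a stand-in for the embedding-based trigger matching the real system would use; the example skill is illustrative:

```python
from dataclasses import dataclass

@dataclass
class Skill:
    trigger: str       # when does this activate?
    action: str        # what to do, as concrete steps
    success_hint: str  # how to verify it worked

def matching_skills(skills, task):
    """Return skills whose trigger shares at least one word with the task."""
    task_words = set(task.lower().split())
    return [s for s in skills if task_words & set(s.trigger.lower().split())]

deploy = Skill(
    trigger="deploying the portfolio site",
    action="rsync to the VM, docker compose build, docker compose up -d",
    success_hint="site responds with HTTP 200",
)
```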
The Feedback Loop
The most underrated feature. When an agent uses recall and the result is helpful, it sends feedback(signal="useful"). When a result is wrong or outdated, it sends feedback(signal="wrong", note="reason").
Over time, this builds a relevance signal that adjusts ranking. Frequently useful memories surface faster. Stale memories get deprioritized. Without this loop, the system accumulates noise.
This is the same principle as RLHF in language models, applied at the retrieval layer.
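One simple way to fold feedback into ranking (not necessarily how hari-hive implements it) is to blend embedding similarity with a smoothed net-usefulness score:

```python
def rank(results, feedback_counts, alpha=0.1):
    """results: list of (memory_id, similarity) pairs from the vector store.
    feedback_counts: memory_id -> (useful_votes, wrong_votes)."""
    def score(item):
        mid, sim = item
        useful, wrong = feedback_counts.get(mid, (0, 0))
        # Smoothed net-usefulness in (-1, 1); memories with no feedback score 0.
        adjustment = (useful - wrong) / (useful + wrong + 1)
        return sim + alpha * adjustment
    return sorted(results, key=score, reverse=True)
```

With this scheme, a slightly less similar memory with a strong "useful" record outranks a closer match that agents keep flagging as wrong, which is exactly the deprioritization behavior described above.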
Production Lessons
After running this across 12+ agents for several months:
Memory hygiene matters. Agents generate a lot of low-value observations. Without pruning and deduplication, recall quality degrades. Periodic compaction — summarizing related memories into consolidated entries — keeps the system useful.
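The cheapest hygiene pass is exact-duplicate removal after normalization. A sketch; real compaction would also merge near-duplicates via embedding similarity and summarize related entries:

```python
def compact(memories):
    """Drop exact duplicates after whitespace/case normalization,
    keeping the first occurrence of each memory."""
    seen, kept = set(), []
    for text in memories:
        key = " ".join(text.lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept
```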
Cross-session context saves tokens. Instead of stuffing 3,000 tokens of project context into every prompt, the agent recalls what it needs. Token usage dropped roughly 50% after implementing selective recall.
The cold-start problem is real. A new agent with an empty memory store behaves exactly like a stateless one. Seeding initial knowledge from documentation, READMEs, and existing notes gets agents productive faster.
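Seeding mostly comes down to splitting existing docs into pieces sized for ingest. A paragraph-aligned chunker, as one simple approach (the size limit is an illustrative default):

```python
def chunk_doc(text, max_chars=800):
    """Split a README or doc into paragraph-aligned chunks for seeding,
    starting a new chunk when the next paragraph would exceed max_chars."""
    chunks, buf = [], ""
    for para in text.split("\n\n"):
        if buf and len(buf) + len(para) > max_chars:
            chunks.append(buf.strip())
            buf = ""
        buf += para + "\n\n"
    if buf.strip():
        chunks.append(buf.strip())
    return chunks
```

Each chunk then goes through the same ingest path as any other memory, tagged with its source file for attribution.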
Embedding model choice matters less than you'd think. I've tested OpenAI embeddings, local sentence-transformers, and ChromaDB's default model. The ranking differences are marginal for the types of queries agents generate. Pick whatever runs fastest for your deployment.
Stack
- FastAPI — REST API with async handlers
- ChromaDB — vector storage and similarity search
- PostgreSQL — metadata, session tracking, feedback signals
- Docker — containerized deployment
- Python — everything
The whole system runs on a single Docker container alongside the rest of my infrastructure. Memory footprint is modest — ChromaDB with 15,000 entries uses about 400MB.
This is part of a series on production AI infrastructure. The code is at github.com/maddefientist/agentssot.