I Built a Persistent Memory Server for AI Tools - Here's Everything I Learned.

I Found MCP. I Got Obsessed ❤️.

I have been working as a software developer for more than a year now, and I am interested in writing code and blogs.

A few days ago I came across the Model Context Protocol (MCP) on Anthropic's engineering blog. I kept reading, and the more I understood, the more I wanted to build something with it.

I explored Anthropic's resources, watched videos, and completed their MCP Certification. Then I started asking myself: what's actually worth building?

That's when I found Graphiti by Zep. It's an open-source project that solves persistent memory for LLMs — meaning AI tools that actually remember what they've been told across sessions. They had already cracked it. The architecture was elegant. The implementation was solid.

And here's the honest truth: I could have just used Graphiti. But I wanted hands-on experience. I wanted to understand the problem deeply enough to build my own version, even if that version was simpler. So I studied Graphiti, took inspiration from its ideas, and started designing from scratch.

Before I could build anything, I needed to understand the actual problem I was solving. And the problem turned out to be a lot more interesting than I expected.


The Problem - The AI Context Tax I Was Paying Every Day

Here's what my week looked like at the time. I was using Gemini for research because its context window was huge. Claude for architecture decisions because it reasoned well about system design. Cursor for writing code because it had good editor integration. Three different AI tools. Three completely separate memory systems. Which is to say: no memory at all.

The Daily Frustration: Every morning, I'd open Claude and paste the same paragraph: "I'm building a fintech app, Project 1 uses PostgreSQL for user data, our auth uses signed tokens, we're on Next.js for the frontend, Project 2 is a content platform with a different schema..."
Hit Claude's rate limit? Switch to Cursor. Open Cursor — paste the paragraph again. Come back to Claude next day — paste it again. I was spending more time explaining my context than actually working.

The obvious solution is: give all my AI tools a shared, persistent memory that any of them can read from and write to. One place where I store facts about my projects. Any AI tool I open already knows what I'm working on, what decisions I've made, what the codebase looks like.

That's the vision. Now — how do you actually build it?


First Idea: Just Use a Vector Database

I'd heard of RAG — Retrieval-Augmented Generation. The idea is simple: you have a knowledge base stored in a Vector Database. When the AI needs context, you convert the question into a vector (a list of numbers representing its meaning), search for the most similar vectors in your database, and inject those results into the LLM's context window.

It sounded perfect for my use case. Store facts about my projects. When I open a new AI session, retrieve the most relevant facts. Done.

But first — what exactly is a vector, and how does "searching by meaning" actually work?

What Are Embeddings?

An embedding is a way of converting text into a list of numbers — called a vector — where the numbers capture the meaning of the text, not just the words. Two sentences that mean similar things end up with similar vectors, even if they use completely different words.

The model we used — all-MiniLM-L6-v2 — converts any text into a 384-number vector. Two sentences about authentication will have very similar vectors. A sentence about database schemas will have a more distant vector. This is how "search by meaning" works: you convert your query to a vector and find database entries whose vectors are mathematically close.
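To make "mathematically close" concrete, here is a minimal sketch using cosine similarity. The post doesn't say which distance metric the server uses, so cosine is an assumption, and the three-number vectors are hand-made for illustration (real all-MiniLM-L6-v2 vectors have 384 numbers).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors, NOT real model output.
auth_tokens = [0.9, 0.1, 0.0]    # stands in for "auth uses signed tokens"
auth_sessions = [0.8, 0.2, 0.1]  # stands in for "auth uses session cookies"
db_schema = [0.1, 0.2, 0.9]      # stands in for a fact about the database schema

sim_auth = cosine_similarity(auth_tokens, auth_sessions)
sim_mixed = cosine_similarity(auth_tokens, db_schema)
print(f"auth vs auth:   {sim_auth:.2f}")
print(f"auth vs schema: {sim_mixed:.2f}")
```

The two auth vectors score much higher against each other than against the schema vector. With the sentence-transformers library, real vectors would come from something like `SentenceTransformer("all-MiniLM-L6-v2").encode(text)`.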

The Plan: Store All Facts in a Vector DB

The approach seemed clean. Every time I learn something worth remembering — a decision, a code pattern, an architecture choice — I store it as a fact. The fact gets embedded into a vector. When I start a new AI session, I embed my current question and retrieve the most relevant facts.

So I started storing facts. Auth flow for Project 1. Database schema for Project 1. Then I started on Project 2 — a content platform with its own auth flow and schema. Everything going into the same vector space.

Then I ran a query: "How does authentication work?"

⚠️ The Problem : The search came back with facts from both Project 1 and Project 2. The auth patterns were different — one used signed tokens, the other used session cookies — but the vector search didn't know that. It just saw two facts that both talked about "authentication" and returned both of them as relevant matches.

This wasn't a bug. It was working exactly as designed. Vector search finds semantic similarity — it doesn't know which project a fact belongs to, and it doesn't care. Everything lives in one flat space, and the search returns whatever is most similar to your query.
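The leak is easy to reproduce in a few lines. This is a toy sketch, not the original setup: the store is a plain Python list, the vectors are hand-made, and `flat_search` is a hypothetical helper that ranks by cosine similarity (the metric is an assumption).

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# One flat store: the project label is just metadata that the search ignores.
# Vectors are hand-made for illustration, not real embeddings.
flat_store = [
    {"project": "Project 1", "text": "Auth uses signed tokens", "vec": [0.9, 0.1, 0.0]},
    {"project": "Project 2", "text": "Auth uses session cookies", "vec": [0.8, 0.2, 0.1]},
    {"project": "Project 1", "text": "User data lives in PostgreSQL", "vec": [0.1, 0.2, 0.9]},
]

def flat_search(query_vec, top_k=2):
    """Rank every stored fact by similarity, regardless of project."""
    ranked = sorted(
        flat_store,
        key=lambda fact: cosine_similarity(query_vec, fact["vec"]),
        reverse=True,
    )
    return ranked[:top_k]

query = [0.85, 0.15, 0.05]  # stands in for "How does authentication work?"
hits = flat_search(query)
for hit in hits:
    print(hit["project"], "->", hit["text"])
```

Both auth facts come back, one from each project, because nothing in the ranking looks at the `project` field.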

"Auth flow" in Project 1 and "auth flow" in Project 2 land in very similar positions in vector space. They talk about the same concept. So when I query "authentication", I get both. The vector database has no concept of project boundaries. Everything is just a cloud of points, and the search grabs whatever is nearest.

This might sound manageable for two projects. But think about it at scale — six months of saved facts, five different projects, overlapping tech stacks. The query results become noise. You'd be feeding the LLM a mix of partially-relevant context that could contradict itself, and the model would try to use all of it.

🚨 Why This Hurts the LLM: When you inject conflicting or unrelated context into an LLM's prompt, it doesn't discard the noise — it tries to reconcile it. If Project 1 uses tokens and Project 2 uses sessions, and both facts arrive together, the model might give you a confused hybrid answer that applies to neither. More context isn't always better context.

I needed a way to create hard boundaries between projects — so that when I'm working on Project 2, the memory system only retrieves facts about Project 2. A flat vector space couldn't give me that.


Graph Databases: Finally, Real Boundaries

I started looking into Graph Databases. A graph database stores data as nodes and edges. Nodes are things (a Project, a Category, a Fact). Edges are relationships between them (Project "has" Category, Category "has" Fact).

The key insight: you can make a Profile node the root of an entire sub-graph. Everything about Project 1 hangs off the Project 1 node. Everything about Project 2 hangs off the Project 2 node. When you query, you start from a specific root — and by definition you can never cross into the other project's data.

This is fundamentally different from a vector database. Here, Project 1 and Project 2 are separate root nodes. Their auth facts both exist — but they're attached to different parents. A query scoped to Project 1 never touches Project 2's sub-graph. The boundary is enforced by the graph structure itself.

We modelled it as three levels: Profile (the root — "Project 1") → Category (a topic — "auth", "database", "api") → Fact (the actual piece of knowledge — "Token-based auth, refreshes every 7 days").
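As a rough picture of that three-level shape, here is an in-memory sketch using Python dataclasses instead of Neo4j. The class and method names (`Profile`, `Category`, `add_fact`, `facts_in`) are made up for illustration and are not the project's schema.

```python
from dataclasses import dataclass, field

@dataclass
class Category:
    name: str
    facts: list[str] = field(default_factory=list)

@dataclass
class Profile:
    name: str
    categories: dict[str, Category] = field(default_factory=dict)

    def add_fact(self, category: str, text: str) -> None:
        # Create the Category node on first use, then hang the Fact off it.
        self.categories.setdefault(category, Category(category)).facts.append(text)

    def facts_in(self, category: str) -> list[str]:
        # Traversal starts at this root, so other profiles are unreachable.
        found = self.categories.get(category)
        return list(found.facts) if found else []

project_1 = Profile("Project 1")
project_2 = Profile("Project 2")
project_1.add_fact("auth", "Token-based auth, refreshes every 7 days")
project_2.add_fact("auth", "Auth uses session cookies")

print(project_1.facts_in("auth"))
```

Both profiles have an "auth" category, but a lookup that starts from `project_1` has no path to the facts stored under `project_2`.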

But Graph DBs Have Their Own Problem

Once I had the structure figured out, I hit the next wall: how do I know where to put a new fact?

Say I store a new fact: "We switched from bcrypt to argon2 for password hashing in Project 1." I need to figure out which Category node to attach it to. Is it "auth"? "security"? "dependencies"? What if the category doesn't exist yet — do I create a new one? Do I reuse the closest existing one?

In a graph database with pure graph traversal, finding the "most similar" existing node means you'd have to compare your new fact against every single node in the graph — one by one. That's O(N) for every write operation. It doesn't scale. And it gives you no numerical measure of how similar two facts are.

⚠️ The Missing Piece

A graph database gives you excellent structure and hard boundaries. But graph traversal can't answer the question: "which existing node does this new fact most closely relate to?" That requires some form of semantic similarity search, which traversal on its own doesn't provide.


The Hybrid Approach: Graph Structure + Vector Embeddings

Here's the insight that unlocked everything: what if I attached vector embeddings to graph nodes, instead of storing facts in a flat vector database?

I keep the graph structure — Profiles, Categories, Facts, all connected by edges. The hard boundaries stay. But I also embed each Fact node as a 384-dimensional vector and store that vector directly on the node. The facts live in the graph. The embeddings are just additional attributes on those graph nodes.

When I add a new fact, here's what happens:

  1. The fact is embedded : The text is converted into a 384-dimensional vector using the sentence-transformer model.

  2. Duplicate check : A vector similarity search checks if a nearly identical fact already exists (similarity > 0.95). If so, we skip it — no duplicates.

  3. Profile resolution : If you specified a profile (Project 1), we use it. If not, we search existing fact embeddings across all profiles to find the closest match and auto-assign. If nothing is close enough, a new profile is created automatically.

  4. Fact node created in the graph: A :Fact node is created in Neo4j, attached to its Profile and Category, with the embedding vector stored as an attribute.

  5. RELATED_TO edges wired automatically : We search for other facts in the same profile and category with similarity > 0.70. For each match, we create a bidirectional [:RELATED_TO] edge — pre-computed at write time so queries are fast.
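The write path above can be sketched in memory. Several things here are assumptions rather than the actual implementation: the caller passes in a ready-made vector instead of running the embedding model, a Python list stands in for Neo4j, similarity is cosine, the profile is always given explicitly (step 3 is left out), and the duplicate check is scoped to the profile because the post doesn't say whether it is global.

```python
import math

DUPLICATE_THRESHOLD = 0.95  # above this, the fact counts as already stored
RELATED_THRESHOLD = 0.70    # above this, two facts get a RELATED_TO link

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

facts = []  # each entry: id, profile, category, text, vec, related (set of ids)

def store_fact(profile, category, text, vec):
    """Return the new fact's id, or None if it was skipped as a duplicate."""
    same_profile = [f for f in facts if f["profile"] == profile]
    # Duplicate check: skip the write if a near-identical fact exists.
    if any(cosine_similarity(vec, f["vec"]) > DUPLICATE_THRESHOLD for f in same_profile):
        return None
    new_fact = {"id": len(facts), "profile": profile, "category": category,
                "text": text, "vec": vec, "related": set()}
    # Link to similar facts in the same profile and category, in both directions.
    for other in same_profile:
        if (other["category"] == category
                and cosine_similarity(vec, other["vec"]) > RELATED_THRESHOLD):
            new_fact["related"].add(other["id"])
            other["related"].add(new_fact["id"])
    facts.append(new_fact)
    return new_fact["id"]

# Toy vectors, hand-made for illustration.
first = store_fact("Project 1", "auth", "Token-based auth, refreshes every 7 days", [0.9, 0.1, 0.0])
second = store_fact("Project 1", "auth", "Passwords are hashed with argon2", [0.7, 0.6, 0.1])
dup = store_fact("Project 1", "auth", "Token auth with a 7-day refresh", [0.9, 0.11, 0.0])
print(first, second, dup)
```

The third call is rejected as a near-duplicate of the first, while the first two facts end up linked to each other because their similarity falls between the two thresholds.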

Querying: Two-Phase Retrieval

When an AI tool asks "what do I know about authentication for Project 1?", the retrieval works in two phases:

  1. Phase 1 is the vector search — fast, approximate, scoped to a specific profile. It finds the top-K semantically similar facts.

  2. Phase 2 follows the pre-computed RELATED_TO edges to pull in connected facts. The result is a tight, scoped bundle of context — no cross-project contamination, no noise.
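Here is a small sketch of the two phases, under a few assumptions: Phase 1 is an exact brute-force scan standing in for the approximate vector search, Phase 2 follows links one hop out (the post doesn't state a hop count), and the store is a hand-written dict with toy vectors rather than Neo4j.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# "related" holds the ids that were linked by RELATED_TO at write time.
facts = {
    0: {"profile": "Project 1", "text": "Token-based auth, refreshes every 7 days",
        "vec": [0.9, 0.1, 0.0], "related": {1}},
    1: {"profile": "Project 1", "text": "Passwords are hashed with argon2",
        "vec": [0.7, 0.6, 0.1], "related": {0}},
    2: {"profile": "Project 1", "text": "User data lives in PostgreSQL",
        "vec": [0.1, 0.2, 0.9], "related": set()},
    3: {"profile": "Project 2", "text": "Auth uses session cookies",
        "vec": [0.8, 0.2, 0.1], "related": set()},
}

def query_knowledge(profile, query_vec, top_k=1):
    # Phase 1: similarity search restricted to one profile.
    in_scope = [fid for fid, fact in facts.items() if fact["profile"] == profile]
    ranked = sorted(in_scope,
                    key=lambda fid: cosine_similarity(query_vec, facts[fid]["vec"]),
                    reverse=True)
    seeds = ranked[:top_k]
    # Phase 2: follow the pre-computed links one hop out from the seeds.
    result = list(seeds)
    for fid in seeds:
        for neighbour in sorted(facts[fid]["related"]):
            if neighbour not in result:
                result.append(neighbour)
    return [facts[fid]["text"] for fid in result]

context = query_knowledge("Project 1", [0.85, 0.15, 0.05])
print(context)
```

The session-cookie fact from Project 2 is very close to the query vector, but it never enters the candidate set because the scope filter runs before the ranking.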

✅ Why This Works Better

Vector DB alone: Fast semantic search, but flat — no boundaries, results bleed across projects.

Graph DB alone: Hard boundaries, structured — but can't answer "which node is this new fact most similar to?"

Hybrid: The graph gives you structure and isolation. The embeddings give you semantic intelligence to navigate and connect within that structure.


MCP: The Glue That Connects It All

We now have a memory system that can store scoped facts, find similar ones, and retrieve relevant context efficiently. But how does an AI tool actually talk to it? This is where MCP — the Model Context Protocol — comes in.

What Is MCP?

MCP is an open standard from Anthropic that defines how AI models and the tools they use can communicate. Think of it like USB-C — a universal connector so that any AI application can plug into any set of tools without each pair needing its own custom integration.

There are three things in the MCP world you need to know:

🖥 MCP Host (Your Application) : The application you're building — the thing that holds the conversation and interacts with the user. It contains both the LLM client and the MCP client.

🔌 MCP Client (Inside Your App) : A component that speaks the MCP protocol. It knows how to ask MCP servers what tools they offer, and how to ask them to run those tools.

⚙️ MCP Server (Our Memory System) : The thing we built. It exposes a set of tools — store_fact, query_knowledge, list_profiles, list_categories — and handles the logic of talking to Neo4j.

The Initialisation Handshake

Before any conversation happens, your application needs to tell the LLM what tools are available. Here's how that works — based on a real MCP interaction sequence:

The key moment in Phase 0, the setup step that runs before any user question is answered, is the ListToolsRequest → ListToolsResult handshake. The MCP client asks the MCP server: "what tools do you offer?" The server responds with a list of tool definitions, each with a name, description, and input schema. These definitions are then handed to the LLM along with the conversation, so they become part of its context.

Now when the user asks something, the LLM already knows what tools exist and what each one does. If it decides a tool is needed, it returns a ToolUse response with the tool name and arguments. Your code then sends a CallToolRequest to the MCP server, gets the result, and feeds it back to the LLM as toolResult to generate the final answer.
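On the wire, these exchanges are JSON-RPC messages: the list step uses the `tools/list` method and the call step uses `tools/call`. The sketch below mimics both with a toy in-process handler. It is not the MCP SDK and not the project's server: the `handle` function, the tool description, and the argument names (`profile`, `topic`) are hypothetical, and a real session begins with an `initialize` exchange that is omitted here.

```python
import json

# Tool definition shape: name, description, inputSchema (a JSON Schema object).
# The argument names below are hypothetical.
TOOLS = [
    {
        "name": "query_knowledge",
        "description": "Retrieve relevant facts for a topic within a profile.",
        "inputSchema": {
            "type": "object",
            "properties": {"profile": {"type": "string"}, "topic": {"type": "string"}},
            "required": ["topic"],
        },
    },
]

def error(request, code, message):
    return {"jsonrpc": "2.0", "id": request["id"], "error": {"code": code, "message": message}}

def handle(request):
    """Toy in-process stand-in for a server that supports two methods."""
    if request["method"] == "tools/list":
        result = {"tools": TOOLS}
    elif request["method"] == "tools/call":
        name = request["params"]["name"]
        if name not in {tool["name"] for tool in TOOLS}:
            return error(request, -32602, f"Unknown tool: {name}")
        args = request["params"]["arguments"]
        text = f"(facts about {args['topic']} would be returned here)"
        result = {"content": [{"type": "text", "text": text}], "isError": False}
    else:
        return error(request, -32601, "Method not found")
    return {"jsonrpc": "2.0", "id": request["id"], "result": result}

listed = handle({"jsonrpc": "2.0", "id": 1, "method": "tools/list"})
called = handle({
    "jsonrpc": "2.0", "id": 2, "method": "tools/call",
    "params": {"name": "query_knowledge",
               "arguments": {"profile": "Project 1", "topic": "authentication"}},
})
print(json.dumps(called, indent=2))
```

The host's job is to shuttle between the two sides: it forwards the `tools` list to the LLM, and when the LLM asks for a tool it turns that into a `tools/call` request and feeds the returned `content` back into the conversation.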

Why We Only Expose 4 Tools

Here's something I didn't fully appreciate until I built this: every tool you expose gets injected into the LLM's context window. Each tool definition takes up tokens — and LLMs perform worse when their context is crowded with things they don't need.

| ❌ Bad Practice | ✅ Our Approach |
| --- | --- |
| Expose 30+ tools | Only 4 tools exposed to LLM |
| All internal helpers as tools | All complexity hidden inside server |
| ~4,000 tokens of tool definitions | ~400 tokens of tool definitions |
| LLM has to choose from too many options | LLM has clear, unambiguous choices |
| Tool selection becomes unreliable | Tool selection is accurate and fast |
| Slower, more expensive API calls | Leaner context = better responses |

Our 4 tools are: store_fact (save something to memory), query_knowledge (retrieve relevant facts), list_profiles (see what projects exist), and list_categories (see topic areas). Everything else — the embedding, the graph writes, the RELATED_TO edge creation, the profile resolution — happens inside the server invisibly. The LLM never needs to know how any of it works.
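For a feel of how the definition overhead grows with tool count, here is a back-of-the-envelope sketch. Everything in it is an assumption: `make_tool` builds placeholder definitions, and the estimate uses the crude rule of thumb of about 4 characters per token rather than a real tokenizer, so the absolute numbers won't match the figures in the table.

```python
import json

def make_tool(name):
    """Build a placeholder tool definition with a one-argument schema."""
    return {
        "name": name,
        "description": f"Placeholder description for {name}.",
        "inputSchema": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    }

def estimate_tokens(tools, chars_per_token=4):
    # Crude heuristic: serialized length divided by ~4 characters per token.
    return len(json.dumps(tools)) // chars_per_token

small = [make_tool(n) for n in ("store_fact", "query_knowledge", "list_profiles", "list_categories")]
large = [make_tool(f"helper_{i}") for i in range(30)]

print("4 tools: ", estimate_tokens(small))
print("30 tools:", estimate_tokens(large))
```

The cost scales roughly linearly with the number of definitions, and real tools with longer descriptions and richer schemas cost more per definition than these placeholders.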


The Three Thresholds That Make It Work

One of the most interesting parts of building this was realising how much of the system's behaviour comes down to three similarity thresholds — numerical dials that control how aggressive or conservative the system is about connecting facts together.

When a new fact arrives, the system checks three things in order. Is it a duplicate (similarity above 0.95 to an existing fact)? If yes, skip it. Can it be matched to an existing profile (similarity above 0.75)? If yes, auto-assign it. After it's created, which nearby facts in the same profile and category should it be connected to? Every fact above 0.70 gets an edge.

Dynamic Profile Resolution

One feature I added late in the build that turned out to be surprisingly useful: you don't always have to specify which profile to save to.

If you provide a profile ID explicitly, great, it uses it. If you don't, it does a quick approximate nearest-neighbour (ANN) search over the fact embeddings in all existing profiles to find the closest match. Above 0.75 → auto-assign to that profile. Below 0.75 → create a new one automatically, named after the category and the current date. No manual profile management required.
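A sketch of that decision, with some stand-ins: an exact scan replaces the ANN search, cosine similarity is an assumed metric, and the `category-YYYY-MM-DD` name format is a guess at what "named after the category and the current date" looks like. The function name `resolve_profile` and its parameters are hypothetical.

```python
import math
from datetime import date

PROFILE_THRESHOLD = 0.75

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def resolve_profile(fact_vec, category, existing, explicit_profile=None, today=None):
    """Pick the profile for a new fact.

    existing: list of (profile_name, embedding) pairs for already stored facts.
    """
    if explicit_profile is not None:
        return explicit_profile
    # Find the single closest stored fact across all profiles.
    best_name, best_score = None, -1.0
    for name, vec in existing:
        score = cosine_similarity(fact_vec, vec)
        if score > best_score:
            best_name, best_score = name, score
    if best_score > PROFILE_THRESHOLD:
        return best_name
    # Nothing close enough: make up a new profile name (format is an assumption).
    today = today or date.today()
    return f"{category}-{today.isoformat()}"

# Toy vectors, hand-made for illustration.
existing = [("Project 1", [0.9, 0.1, 0.0]), ("Project 2", [0.1, 0.2, 0.9])]

matched = resolve_profile([0.8, 0.2, 0.1], "auth", existing)
created = resolve_profile([0.0, 1.0, 0.0], "billing", existing, today=date(2025, 1, 1))
print(matched, created)
```

The first fact lands in "Project 1" because it sits close to a fact already stored there. The second is far from everything, so it gets a fresh profile.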


"The goal was never to build the smartest AI.
It was to stop re-explaining myself to it every morning."

What I Actually Ended Up With

The system I built is a persistent, scoped, semantic memory server for AI tools. Any AI tool that supports MCP can connect to it, read from it, and write to it. Facts are stored in a Neo4j graph database. Each fact is embedded as a 384-dimensional vector for semantic search. Profiles act as hard isolation boundaries between projects.

When I open Claude now, it can call query_knowledge with my current topic and get back exactly the right facts — no re-explaining, no pasted paragraphs. When I discover something worth remembering, I call store_fact and it's there for the next session, across every AI tool I use.

The journey from "I just learned what MCP is" to a working hybrid memory server involved:

  1. Learning MCP from Anthropic's docs and getting certified: Understanding the Host/Client/Server/Tools architecture and the initialisation handshake.

  2. Finding Graphiti and understanding the persistent memory problem: Realising this is a real, solved problem — and deciding to build my own version for the learning experience.

  3. Trying a flat vector database and hitting the boundary problem: Learning why semantic similarity alone isn't enough when you have multiple distinct projects.

  4. Moving to a graph database for structure and isolation: Getting hard boundaries — but realising you still need semantic search to navigate the graph intelligently.

  5. The hybrid breakthrough: vector embeddings on graph nodes: Combining the structure of a graph with the semantic intelligence of embeddings — graph for storage and isolation, vector similarity for navigation and connection.

🚀 What You Can Take From This : If you're building something with AI tools, think carefully about what context you're feeding the LLM and where that context comes from. A flat vector search feels easy but creates noise at scale. A graph gives you structure but needs semantic intelligence to populate it. And whatever complexity you build — hide it inside the MCP server. Keep the tool surface area small. Your LLM will thank you.

The code for this project — including all three layers (MCP server, Neo4j graph schema, embedding pipeline) — lives at the link below. Everything described in this post is implemented there, including the RELATED_TO edge creation, the dynamic profile resolution, and the two-phase retrieval. If you're building something similar, start with Graphiti. If you want to understand how it works from the inside, start here.


Resources:

  1. Code execution with MCP: Building more efficient agents

  2. Introducing advanced tool use on the Claude Developer Platform

  3. MCP Token Optimization Strategies

  4. How I Optimize Tokens While Building AI Agents (Without Killing Output Quality)

  5. GitHub - Building Persistant Memory MCP

  6. Anthropic's MCP free certification course

-- Happy Learning 👨‍💻