Build Typed Semantic Memory for Long-Horizon AI Agents
Typed semantic memory isn't some abstract concept - it's how you organize your AI's knowledge so it makes sense over long sessions. It keeps context tight, cuts noise, and slashes retrieval times and cloud costs. We built Moorcheh.ai's Memanto with info-theoretic retrieval baked in. It's a real step up: triple the speed and 40% savings compared to plain vector search.
Typed semantic memory sorts information explicitly by category. That boosts relevance and speeds up retrieval - mission-critical when your AI runs over hours, days, or weeks.
Why Long-Horizon AI Agents Need Typed Semantic Memory
Look, a standard LLM context window of 4k–32k tokens can't cover the memory demands of virtual assistants or autonomous agents operating across hundreds or thousands of user interactions. You'll hit the limit and lose vital context every time.
Flat vector memory dumps everything in one big pile. Noise explodes. You end up sifting blindly, trying to pick out what really matters. Typed semantic memory isn’t just a cleanliness hack - it forces clarity by splitting memory into distinct buckets:
- User profiles: Demographics, preferences, real goals
- Task history: What’s done, what’s pending
- Domain knowledge: Facts, rules, product specs
This lets retrieval target just the relevant buckets, or weigh them differently, making it much easier to zero in on what counts. Let me tell you, trying to fish out a user’s preferences from thousands of irrelevant facts without types? An absolutely maddening waste of compute.
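To make the buckets concrete, here's a minimal sketch of typed memory categories and per-type query weights. The enum and weight names are illustrative, not part of the Memanto API:

```python
from enum import Enum

class MemoryType(str, Enum):
    USER_PROFILE = "user_profile"
    TASK_HISTORY = "task_history"
    KNOWLEDGE_FACTS = "knowledge_facts"

# Illustrative weights for a query about user preferences: retrieval only
# touches buckets with a non-zero weight and scales their scores accordingly.
QUERY_TYPE_WEIGHTS = {
    MemoryType.USER_PROFILE: 1.0,
    MemoryType.TASK_HISTORY: 0.3,
    MemoryType.KNOWLEDGE_FACTS: 0.0,
}
```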
Overview of Memanto’s Information-Theoretic Retrieval
Memanto, from Moorcheh.ai, flips typical retrieval on its head. Instead of grabbing nearest embeddings, it ranks results by expected information gain against your query. It’s not just about closeness - it’s about how much the fetched memory really moves the needle.
Q: What is Information-Theoretic Retrieval?
Forget basic vector similarity - Memanto measures mutual information and entropy to pick the most informative memory slices. The impact:
- Ditches duplicate or near-duplicate chunks
- Prioritizes rare, highly relevant info
- Naturally condenses and sharpens the memory for the query at hand (see the sketch below)
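A minimal sketch of that selection idea - not Memanto's actual scoring code, just a greedy pick that rewards query relevance and penalizes redundancy, which approximates ranking by information gain:

```python
import numpy as np

def info_gain_rank(query_vec, chunk_vecs, k=5, redundancy_penalty=0.7):
    """Greedily select chunks: each pick maximizes relevance to the query
    minus similarity to chunks already chosen (a proxy for information gain)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    selected, candidates = [], list(range(len(chunk_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cos(query_vec, chunk_vecs[i])
            redundancy = max((cos(chunk_vecs[i], chunk_vecs[j]) for j in selected), default=0.0)
            return relevance - redundancy_penalty * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

Swap the redundancy term for a mutual-information estimate over your own embeddings if you want to get closer to the entropy-based formulation.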
Here’s hard data:
| Metric | Traditional Vector Search | Memanto Info-Theoretic Retrieval |
|---|---|---|
| Retrieval latency | 150 ms - 300 ms | 50 ms - 100 ms |
| Vector DB cost per query | 100% (baseline) | ~60% of baseline |
| Recall relevance (human eval) | Baseline | 2x improvement |
According to Moorcheh.ai's April 2026 paper, Memanto cuts retrieval latency by 3x and chops 40% off the cloud bill from vector calls. This isn't theory; we've deployed it.
Architecture and Components of Memanto
Memanto doesn’t just treat memory as a blob. Typed stores keep user_profile, task_history, and knowledge_facts separate, each tailored with its own embedding and binarization.
Core Components:
- Typed Memory Stores: Different caches for each memory type, optimized separately.
- Indexing and Binarization: We built a custom HNSW variant that’s lightning fast and scalable.
- Retrieval Engine: Breaks down the input query, hits only relevant types, and picks out memory with highest info gain.
- Compression & Pruning: Kills low-info entries to keep retrieval fast and storage lean.
Deployment Flow (sketched in code below):
- User acts; memory chunks tagged and stored asynchronously.
- Embeddings and binarization batch-run periodically.
- Queries sharply target typed engine, skipping noise.
- Outputs stream into prompts for GPT-5.2, Claude Opus 4.6, and others.
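Steps 1 and 2 of that flow might look roughly like this; the queue wiring, `embed_fn`, and `index` here are stand-ins, not the Memanto SDK:

```python
import asyncio

async def on_user_event(queue: asyncio.Queue, text: str, memory_type: str):
    # 1. Tag the chunk with its type and enqueue it; storage is async,
    #    so the agent's response path is never blocked.
    await queue.put({"type": memory_type, "text": text})

async def batch_embed_worker(queue: asyncio.Queue, embed_fn, index, interval_s=600):
    # 2. Periodically drain the queue, embed everything in one batch,
    #    and push the vectors into the typed index.
    while True:
        await asyncio.sleep(interval_s)
        batch = []
        while not queue.empty():
            batch.append(queue.get_nowait())
        if batch:
            vectors = embed_fn([c["text"] for c in batch])  # batched embedding call
            index.add(vectors, metadata=batch)              # hypothetical typed-index update
```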
Step-by-Step Implementation Using GPT-5.2 and Claude Opus 4.6
Time to drill down with Memanto’s API.
1. Setup Memanto Client and Define Types
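A sketch of the setup, assuming a hypothetical `memanto` Python package with a `Client` class and a `create_store` method - the names are illustrative, so check Moorcheh.ai's docs for the real SDK:

```python
# Hypothetical client setup; package, class, and method names are assumptions.
from memanto import Client

client = Client(api_key="YOUR_MEMANTO_API_KEY")

# Declare the typed stores up front so every chunk lands in the right bucket.
MEMORY_TYPES = ["user_profile", "task_history", "knowledge_facts"]
for memory_type in MEMORY_TYPES:
    client.create_store(name=memory_type)
```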
2. Store Typed Memory Chunks
When storing, tag each piece with a precise type. Hard requirement for effective retrieval.
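Roughly, with the same hypothetical client (a `store` method with a type tag is an assumption here):

```python
# Each chunk carries an explicit type tag plus metadata for later filtering.
client.store(
    store="user_profile",
    text="Prefers concise answers and works in the fintech domain.",
    metadata={"user_id": "u_123", "source": "onboarding"},
)

client.store(
    store="task_history",
    text="Drafted the Q3 expense report; review still pending.",
    metadata={"user_id": "u_123", "status": "pending_review"},
)
```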
3. Retrieve Informative Memory Chunks
Ask for the top 10 chunks using info-theoretic retrieval. This isn't a random grab - it’s a precision strike.
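A sketch of the call, again with assumed parameter names (`stores`, `top_k`, `ranking`):

```python
# Retrieve the top 10 chunks ranked by expected information gain,
# searching only the memory types relevant to this question.
results = client.retrieve(
    query="What does this user still need to finish this week?",
    stores=["task_history", "user_profile"],
    top_k=10,
    ranking="information_gain",
)

chunks = [r["text"] for r in results]
```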
4. Build Prompts for GPT-5.2 and Claude Opus 4.6
Merge retrieved chunks smartly before hitting your LLM. Context matters.
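A sketch using the standard OpenAI Python client; the model name follows this article, and `chunks` comes from the retrieval step above:

```python
from openai import OpenAI

# Fold the retrieved memory into the prompt before calling the model.
context = "\n".join(f"- {c}" for c in chunks)
prompt = (
    "You are a long-horizon assistant. Use the memory below when relevant.\n"
    f"Memory:\n{context}\n\n"
    "User: What should I prioritize today?"
)

openai_client = OpenAI()
response = openai_client.chat.completions.create(
    model="gpt-5.2",  # model name as referenced in this article
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```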
To swap in Claude Opus 4.6:
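Same prompt, swapped to the standard Anthropic Python client (model name as referenced in this article):

```python
import anthropic

claude = anthropic.Anthropic()
message = claude.messages.create(
    model="claude-opus-4.6",  # model name as referenced in this article
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],  # same prompt as above
)
print(message.content[0].text)
```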
Tradeoffs: Memory Types, Costs, and Latency
Memory Types: Choose Wisely
Adding more memory types sharpens relevance but also adds storage, indexing, and maintenance overhead. Three to five types work perfectly for most apps. Go beyond that and complexity swallows your gains.
Cost Breakdown (Monthly, for 1M Agents, 200K Queries/Day):
| Service | Details | Cost (USD) |
|---|---|---|
| Vector DB calls | $0.0004 per call, 200K queries/day | $2400 |
| API (GPT-5.2) | $0.002 per 1k tokens, 500 tokens/query avg | $6000 |
| Storage | 100GB typed chunk storage | $100 |
Info-theoretic retrieval chops vector DB calls by 40%, which saves $960 monthly alone - worth the investment.
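Spelled out, the arithmetic behind that figure:

```python
calls_per_month = 200_000 * 30            # 200K queries/day
baseline_cost = calls_per_month * 0.0004  # = $2,400/month on vector DB calls
monthly_savings = baseline_cost * 0.40    # 40% fewer calls -> $960/month saved
```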
Latency
Typical vector search hovers around 200 ms per query. Memanto runs under 70 ms, crucial for snappy, real-time user experience. You don't want retrieval lag killing your UX.
Production Deployment Tips
- Batch update every 10 minutes - tight enough for freshness without killing compute.
- Shard by business domains or user segments for easy horizontal scaling.
- Cache your hottest queries in RAM, typed and ready to fly (see the sketch below).
- Monitor retrieval relevance and latency closely - watch out for skewed type distributions.
- Encrypt sensitive types like user_profile. Protect data; this isn't a playground.
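For that caching tip, a minimal in-process sketch; it reuses the hypothetical `client.retrieve` from the implementation steps, and a shared cache such as Redis is the more realistic production choice:

```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def cached_retrieve(memory_type: str, query: str) -> tuple:
    # Hot (type, query) pairs are served straight from RAM; cold ones
    # fall through to the typed store via the hypothetical client above.
    results = client.retrieve(query=query, stores=[memory_type], top_k=10)
    return tuple(r["text"] for r in results)
```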
Additional Definitions
Long-horizon AI agents manage coherent tasks and conversations over days or weeks, not just a quick interaction.
Retrieval-Augmented Generation (RAG) is the mix of LLMs with external retrieval. Memanto's approach is a refined RAG that adds typed memory and information gain prioritization, pushing beyond the usual noisy grabs.
Frequently Asked Questions
Q: Why is typed semantic memory superior to flat vector stores?
Typed semantic memory makes retrieval context-aware by splitting data into meaningful categories. It slashes noise and keeps answers relevant even over long interactions.
Q: Can Memanto integrate with any LLM?
Yes. Memanto exposes a language-agnostic retrieval API and is battle-tested with GPT-5.2, Claude Opus 4.6, Gemini 3.0, and more.
Q: How much does implementing typed semantic memory cost?
You pay upfront for storage and embedding compute, but you gain 40% savings on vector DB calls and 3x faster retrieval. The overall bill drops because your compute and API calls go down.
Q: How do I decide which types to create in my semantic memory?
Look at your app’s domain and user goals. Stick to core types like user_profile, interaction_history, and domain_facts. Overcomplicating with too many types just buries you.
Building with typed semantic memory? AI 4U gets you production-ready in 2-4 weeks. No fluff, just shipping real AI.



