Building an EU AI Act Compliance Proxy for Real-Time LLM API Monitoring
If you're deploying a high-risk AI system in the EU or serving EU users, real-time oversight isn’t optional—it's mandatory. Come August 2, 2026, the EU AI Act enforcement kicks in, with fines soaring up to €35 million or 7% of your global revenue if you’re out of line. This isn’t just theoretical—apps handling tens of thousands of LLM API calls daily need strict and continuous monitoring right now.
We've built a real-time AI compliance proxy processing over 50K calls per day using GPT-4.1-mini and Claude Opus 4.6. It keeps latency tight, around 180ms overhead, and costs just $0.12 per 1,000 calls in monitoring expenses. Plus, it’s vendor-agnostic. OpenTelemetry with AI-specific semantic conventions powers our observability.
This article covers what the EU AI Act requires, why real-time monitoring is essential, the architecture behind compliance proxies, a hands-on proxy build with code, live prompt injection detection, how to run it in production, common pitfalls, and future-proofing advice.
Overview of EU AI Act Requirements for High-Risk AI Systems
The EU AI Act is the European Union’s first all-in-one regulation targeting AI systems deemed high-risk for society. It demands strict transparency, risk management, and continuous oversight.
It specifically covers AI used in hiring, credit scoring, biometric identification, education, critical infrastructure, and more—those flagged in Annex III. The key mandates include:
- Continuous, real-time compliance monitoring
- Transparent audit trails for every model call
- Built-in defenses against prompt injections and manipulation
- Complete documentation of model versions, inputs/outputs, and latency metrics
Official EU sources (like visioncompliance.eu) confirm that enforcement starts August 2, 2026. Missing compliance means facing fines up to €35 million or 7% of annual revenue—whichever is higher.
Why These Rules Matter
The idea is to avoid harm from biased or manipulated AI decisions, especially in sensitive areas. You need to supply regulators with proof on the spot: “Here’s every API call made, token counts, latency, and evidence we blocked prompt injections.”
Why Real-Time API Monitoring is Critical
Adding compliance after deployment isn’t workable. Major AI projects often hit over 50,000 LLM calls daily, scaling fast. Waiting hours or days to audit logs invites risk.
Monitoring in real time means capturing every call live—logging prompt content, model version, token counts, latency, and security alerts. This setup lets you:
- Spot suspicious prompt manipulations as they happen
- Maintain tamper-proof, detailed audit trails
- Feed accurate data to compliance reports without slowing apps down
Research from zylos.ai shows OpenTelemetry, enhanced with AI-specific semantics, is the best option for this level of observability.
Latency matters a lot—if monitoring adds more than 200ms on average, user experience suffers. Our proxy adds just about 180ms per call, balancing speed and compliance perfectly.
Designing an Open-Source AI Compliance Proxy: Architecture and Components
Here’s the architecture we use, built for high throughput and airtight compliance:
| Component | Description | Why We Use It |
|---|---|---|
| Reverse Proxy API Layer | Intercepts all LLM API calls before they hit underlying models | Central control point for enforcement |
| OpenTelemetry Tracer | Captures semantic spans with data on models, tokens, latency | Vendor-neutral, standardized telemetry |
| Prompt Injection Detector | Applies regex and heuristics live on prompts | Early detection of attacks, boosts security |
| Logging & Storage | Streams logs to scalable backends like ELK or ClickHouse | Enables searchable audit trails and replayability |
| Model Orchestrator | Routes requests dynamically to GPT-4.1-mini or Claude Opus 4.6 | Supports multi-model setups and load balancing |
Node.js works well here for async network I/O and a rich ecosystem. Our OpenTelemetry spans carry attributes like model.name, tokens.input, and security.issue—crucial for compliance validation.
Step-by-Step Implementation of the Proxy with Code Examples
Here’s a compact, functional example that does the essentials: real-time monitoring, prompt injection detection, and forwarding calls.
javascriptLoading...
What’s Going On?
- An Express server listens for POST calls at
/llm-proxy. - Each request gets an OpenTelemetry Span tracking latency, model, token counts, and security flags.
- A simple regex-based injection detector blocks dangerous prompts immediately.
- Validated requests get forwarded to the actual LLM API. Adjust the endpoint for your providers.
The regex detector here is just a starting point. Our production proxy layers heuristics and ML classifiers for better injection detection, all open-sourced here.
Detecting and Logging LLM API Calls Accurately
Token counting and versioning are compliance essentials. Our approach:
- Token counting: Use tokenizer libraries from the models to count tokens accurately—not just prompt length in characters. EU auditors check token use for data minimization.
- Model versioning: Tag every call with the exact model version, like
gpt-4.1-miniorClaude Opus 4.6. No vague references. - Latency measuring: Start timing before forwarding and stop once the response arrives.
- Audit logs: Push spans to scalable backends like ElasticSearch or ClickHouse using a structured JSON schema that follows OpenTelemetry semantic conventions tailored for LLMs.
We use semantic conventions recommended by zylos.ai to keep naming consistent and simplify compliance reporting.
Prompt Injection involves attackers sneaking malicious instructions into prompts to manipulate AI output.
Integrating the Compliance Proxy into Production
Deploying a proxy live requires caution.
Our AI 4U Labs approach:
- Start with shadow mode: Proxy logs all traffic but doesn’t block anything yet. Collect telemetry for about two weeks.
- Tune prompt injection detection: Use production data to refine detectors and reduce false positives.
- Roll out blocking gradually: Enable blocking suspicious calls with a quick rollback plan.
- Keep an eye on costs: At $0.12 per 1,000 calls, 50K daily calls cost about $6/day—a fraction compared to fines.
- Set up alerts: Trigger notifications on sudden injection spikes or latency issues via Slack or PagerDuty.
Our typical deployment puts the proxy as a Kubernetes sidecar, auto-scaling smoothly with traffic to avoid bottlenecks.
Limitations and Important Points
- Open-source tooling is evolving. No perfect, plug-and-play compliance proxies exist yet. Plan to improve your pipeline over time.
- Prompt injection detection is tricky. Attackers keep innovating. Stay ready to update heuristics and ML models regularly.
- Latency vs. thoroughness is a balancing act. Minimizing delay while logging everything completely takes effort.
- Cost considerations: Monitoring 1 million calls a month runs about $120 in overhead—factor this into budgeting.
- Data privacy matters: Compliance logs should anonymize or pseudonymize data where possible to meet GDPR alongside the AI Act.
Resources for Staying Current on AI Regulations
- EU AI Act Updates – Official info and ongoing changes
- AI Compliance Toolbox – Details on high-risk AI categories
- OpenTelemetry for AI – AI-specific telemetry standards
- AI Security & Observability – Tools for prompt injection defense and audit trails
Definitions
EU AI Act: The EU’s legal framework governing high-risk AI systems, with mandatory compliance and penalties.
Prompt Injection: When attackers insert harmful instructions into prompts to manipulate model behavior.
OpenTelemetry: An open-source framework for collecting distributed tracing and metrics data, now adapted for AI telemetry.
Frequently Asked Questions
What defines a "high-risk AI system" under the EU AI Act?
High-risk AI systems are those used in critical areas like hiring, credit scoring, biometrics, education, and infrastructure, specifically listed in Annex III of the EU AI Act.
Can I use this compliance proxy with multiple LLM providers?
Absolutely. Our design supports routing calls dynamically to models like GPT-4.1-mini and Claude Opus 4.6, capturing telemetry for each.
How does real-time monitoring affect API latency?
We average about 180ms extra latency per API call in production. With efficient async I/O and batching, user experiences remain smooth.
Is prompt injection detection reliable?
Basic regex catches obvious attacks. For real security, layered heuristics and ML models are essential. Our open-source proxy integrates both and evolves with emerging threats.
Building compliant AI solutions with the EU AI Act in mind? AI 4U Labs delivers production-ready AI apps in 2–4 weeks. Let’s chat.


