GPI Technical Whitepaper
A detailed walkthrough of GPI's security architecture, PII protection and GDPR compliance
1. Architecture overview
GPI is a microservice-based platform consisting of 11 Docker containers that run on-premises or in your own Azure environment. The architecture is designed for single-tenant deployment: one installation per customer, with full data isolation.
Gateway (BFF) serves a React 19 SPA, handles authentication via Microsoft Entra ID and proxies API calls to backend services.
The backend services are DataHub (main API), the Audit service, the Telemetry service, the Indexing worker, the PII service, DocParser and the Embeddings service.
The data and messaging layer consists of PostgreSQL with pgvector for semantic search, Redis for token storage and pub/sub, and RabbitMQ for asynchronous messaging.
Nginx terminates TLS 1.3 at the edge. All internal communication runs over an isolated Docker bridge network.
2. PII detection pipeline
GPI uses a hybrid approach that combines regex-based rules for structured identifiers with a fine-tuned NER model for entities in free text.
NER model
The custom model thomasbeste/danish-xlmr-ner-large is XLM-R large, fine-tuned in two stages on the DANSK and DaNE datasets. It is exported to ONNX with INT8 quantization via optimum-cli and baked into the Docker image for reproducible deployments.
Regex layer
Pattern matching covers CPR numbers (Danish personal identification numbers), CVR numbers (company IDs), phone numbers, email addresses, bank accounts and card numbers. The CPR and CVR rules include checksum validation, which eliminates ~99% of false positives and which generic regex implementations lack.
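As an illustration of what checksum validation on top of a regex can look like, here is a minimal sketch. It uses the publicly documented modulus-11 weights for CPR and CVR numbers. The function names, the regex patterns and the sample numbers are assumptions made for this example and are not taken from GPI's code.

```python
import re

# Candidate patterns: format only, no validation yet.
CPR_RE = re.compile(r"\b\d{6}-?\d{4}\b")
CVR_RE = re.compile(r"\b\d{8}\b")

# Published modulus-11 weights for the two identifier types.
CPR_WEIGHTS = (4, 3, 2, 7, 6, 5, 4, 3, 2, 1)
CVR_WEIGHTS = (2, 7, 6, 5, 4, 3, 2, 1)


def _mod11_ok(digits: str, weights: tuple) -> bool:
    """True when the weighted digit sum is divisible by 11."""
    return sum(int(d) * w for d, w in zip(digits, weights)) % 11 == 0


def is_valid_cpr(candidate: str) -> bool:
    digits = candidate.replace("-", "")
    if len(digits) != 10 or not digits.isdigit():
        return False
    day, month = int(digits[0:2]), int(digits[2:4])
    if not (1 <= day <= 31 and 1 <= month <= 12):
        return False  # the first six digits must look like a DDMMYY date
    return _mod11_ok(digits, CPR_WEIGHTS)


def is_valid_cvr(candidate: str) -> bool:
    return (
        len(candidate) == 8
        and candidate.isdigit()
        and _mod11_ok(candidate, CVR_WEIGHTS)
    )


def find_identifiers(text: str) -> list:
    """Return (category, start, end) for candidates that pass validation."""
    hits = []
    for m in CPR_RE.finditer(text):
        if is_valid_cpr(m.group(0)):
            hits.append(("CPR", m.start(), m.end()))
    for m in CVR_RE.finditer(text):
        if is_valid_cvr(m.group(0)):
            hits.append(("CVR", m.start(), m.end()))
    return hits


if __name__ == "__main__":
    sample = "CPR 070761-4285, order 1234567890, CVR 12345674, ref 12345678"
    print(find_identifiers(sample))
```

In this sketch the order number and the reference number match the digit patterns but fail validation, so only two hits remain. One caveat for any strict CPR check: since 2007, CPR numbers that do not satisfy the modulus-11 rule have been issued, so a production rule would likely need a fallback for those.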
Detection categories
Detections are reported in the categories PERSON, CPR/SSN, ADDRESS, PHONE, EMAIL, CVR/EIN, ACCOUNT and HEALTH, each with a confidence score.
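To make the hybrid idea concrete, the sketch below shows one possible way to merge regex and NER detections into a single non-overlapping list using their confidence scores. The `Detection` type, the `merge_detections` function, the threshold value and the "highest confidence wins" rule are all assumptions for illustration; the source does not describe how GPI resolves overlaps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    category: str      # e.g. "PERSON", "CPR", "PHONE"
    start: int         # character offset, inclusive
    end: int           # character offset, exclusive
    confidence: float  # 0.0 to 1.0
    source: str        # "regex" or "ner" (hypothetical labels)


def merge_detections(detections, min_confidence=0.5):
    """Drop low-confidence hits, then resolve overlaps by confidence."""
    candidates = sorted(
        (d for d in detections if d.confidence >= min_confidence),
        key=lambda d: (-d.confidence, d.start),
    )
    kept = []
    for d in candidates:
        # Keep a detection only if it does not overlap one already kept.
        if all(d.end <= k.start or d.start >= k.end for k in kept):
            kept.append(d)
    return sorted(kept, key=lambda d: d.start)


if __name__ == "__main__":
    found = [
        Detection("PERSON", 0, 11, 0.97, "ner"),
        Detection("CPR", 17, 28, 1.0, "regex"),
        Detection("PHONE", 17, 28, 0.6, "ner"),   # overlaps the CPR hit
        Detection("ADDRESS", 0, 4, 0.3, "ner"),   # below the threshold
    ]
    for d in merge_detections(found):
        print(d.category, d.start, d.end, d.confidence)
```

A checksum-validated regex hit is given confidence 1.0 here so that it wins over a weaker NER guess on the same span; that weighting is a design choice of the sketch.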
3. Tokenization and encryption
Once PII is detected, it is tokenized with reversible tokens and encrypted with AES-256-GCM.
AES-256-GCM provides authenticated symmetric encryption. Each token gets a unique nonce. The ciphertext format is byte-compatible between the .NET and Python services.
Keys are derived from a master key, so that compromising one derived key does not compromise the others.
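The text does not name the derivation function. As a sketch of the general idea, the example below assumes an HKDF-style construction (RFC 5869) with SHA-256, implemented with the Python standard library only. The function name `derive_key` and the purpose labels are hypothetical.

```python
import hashlib
import hmac


def derive_key(master_key: bytes, purpose: bytes, length: int = 32,
               salt: bytes = b"") -> bytes:
    """HKDF-SHA256 (RFC 5869): derive a purpose-specific key.

    Assumption: the purpose label plays the role of HKDF's "info" input.
    """
    hash_len = hashlib.sha256().digest_size
    if length > 255 * hash_len:
        raise ValueError("requested key is too long for HKDF-SHA256")
    if not salt:
        salt = b"\x00" * hash_len
    # Extract: concentrate the master key material into a pseudorandom key.
    prk = hmac.new(salt, master_key, hashlib.sha256).digest()
    # Expand: generate output blocks bound to the purpose label.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + purpose + bytes([counter]),
                         hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


if __name__ == "__main__":
    master = bytes(range(32))  # demo value only, never a real key
    token_key = derive_key(master, b"pii-token-encryption")
    audit_key = derive_key(master, b"audit-log-integrity")
    print(len(token_key), token_key != audit_key)
```

Because each derived key depends on its own label, learning one derived key reveals neither the master key nor the keys derived under other labels.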
Token mappings are stored in Redis with a configurable TTL (default: 120 minutes) and are cleaned up automatically after expiration.
AI responses are de-tokenized by looking up tokens in Redis and decrypting. The user sees the full response with original data.
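The tokenize and de-tokenize round trip described above can be sketched as follows. This is a simplified illustration: an in-memory dictionary stands in for Redis, the token format `[[PII:CATEGORY:hex]]` is invented for the example, and the AES-256-GCM encryption of stored values is deliberately left out so the block runs with the standard library alone.

```python
import re
import secrets
import time

# Hypothetical token format used only in this sketch.
TOKEN_RE = re.compile(r"\[\[PII:[A-Z]+:[0-9a-f]{16}\]\]")


class TokenVault:
    """In-memory stand-in for the Redis token store, with per-entry TTL."""

    def __init__(self, ttl_seconds=120 * 60, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # token -> (expires_at, original value)

    def put(self, category: str, original: str) -> str:
        token = f"[[PII:{category}:{secrets.token_hex(8)}]]"
        self._entries[token] = (self._clock() + self._ttl, original)
        return token

    def get(self, token: str):
        entry = self._entries.get(token)
        if entry is None:
            return None
        expires_at, original = entry
        if self._clock() >= expires_at:
            del self._entries[token]  # expired: clean up and report a miss
            return None
        return original


def tokenize(text: str, detections, vault: TokenVault) -> str:
    """Replace each (category, start, end) span with a reversible token."""
    out = text
    # Work right to left so earlier offsets stay valid after replacement.
    for category, start, end in sorted(detections, key=lambda d: d[1],
                                       reverse=True):
        out = out[:start] + vault.put(category, out[start:end]) + out[end:]
    return out


def detokenize(text: str, vault: TokenVault) -> str:
    """Restore original values; unknown or expired tokens are left as-is."""
    def restore(match):
        original = vault.get(match.group(0))
        return original if original is not None else match.group(0)
    return TOKEN_RE.sub(restore, text)


if __name__ == "__main__":
    vault = TokenVault()
    prompt = "Call Anna Jensen on 20304050"
    masked = tokenize(prompt, [("PERSON", 5, 16), ("PHONE", 20, 28)], vault)
    print(masked)
    print(detokenize(masked, vault))
```

Leaving an expired token in place, rather than raising an error, is a choice made for this sketch; the source does not say how GPI handles a lookup after the TTL has passed.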
4. Access control
GPI uses a two-layer access model with Microsoft Entra ID as identity provider.
Roles
System roles (Administrator, Auditor, Analyst, User) control feature access. Administrators can create and remove users, configure data sources and view audit logs. Auditors have read-only access to compliance data.
Groups and data sources
Users are assigned to groups that grant access to specific data sources. A user only sees data from the sources their group has access to. Zero-trust principle: no access unless explicitly granted.
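The two layers can be pictured as two independent checks that must both pass, with denial as the default. The sketch below is illustrative only: the feature names, group names, data source names and the feature sets given to Analyst and User are assumptions, since the text only spells out what Administrators and Auditors can do.

```python
from dataclasses import dataclass

# Layer 1: system role -> features (feature names are hypothetical).
ROLE_FEATURES = {
    "Administrator": {"manage_users", "configure_sources",
                      "view_audit_logs", "query"},
    "Auditor": {"view_compliance_data"},
    "Analyst": {"query"},
    "User": {"query"},
}

# Layer 2: group -> data sources (group and source names are hypothetical).
GROUP_SOURCES = {
    "finance": {"erp", "invoices"},
    "hr": {"personnel"},
}


@dataclass(frozen=True)
class User:
    name: str
    role: str
    groups: frozenset


def can_use_feature(user: User, feature: str) -> bool:
    # Unknown roles map to an empty set, so the default is "deny".
    return feature in ROLE_FEATURES.get(user.role, set())


def can_read_source(user: User, source: str) -> bool:
    return any(source in GROUP_SOURCES.get(g, set()) for g in user.groups)


def authorize_query(user: User, source: str) -> bool:
    """Both layers must grant access; nothing is allowed implicitly."""
    return can_use_feature(user, "query") and can_read_source(user, source)


if __name__ == "__main__":
    alice = User("alice", "Analyst", frozenset({"finance"}))
    print(authorize_query(alice, "invoices"))   # role and group both allow
    print(authorize_query(alice, "personnel"))  # group does not grant this
```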
Microsoft Entra ID
Authentication is OIDC-based, with Entra ID as the external identity provider. Production environments use certificate-based client authentication; test environments use a client secret.
5. GDPR compliance
GPI implements 9 GDPR articles with documentable compliance.
Storage limitation
Configurable retention policies with daily cleanup and complete audit log.
Right of access
DSAR reports with one click. Export as JSON or CSV.
Right to erasure
Cascade deletion with 7-year immutable deletion log.
Data portability
CSV/JSON export via API. Machine-readable format.
Data protection by design
Hybrid PII detection tokenizes before AI processing.
Records of processing
Meta-audit logging of all access to audit data.
Security of processing
AES-256-GCM, TLS 1.3, RBAC via Entra ID.
Breach notification
Real-time anomaly detection, automatic alerts and a workflow for the 72-hour notification deadline.
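As an example of how one item in the list above might work in practice, the following sketch shows a storage-limitation job: configurable retention periods, a cleanup pass and an audit entry for every deletion. The record kinds, the retention periods and the record layout are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention policy: record kind -> maximum age.
RETENTION = {
    "chat_history": timedelta(days=90),
    "uploaded_document": timedelta(days=365),
}


def select_expired(records, now, policies=RETENTION):
    """Return the ids of records older than their kind's retention period."""
    expired = []
    for record in records:
        max_age = policies.get(record["kind"])
        # Kinds without a policy are never deleted by this job.
        if max_age is not None and now - record["created_at"] > max_age:
            expired.append(record["id"])
    return expired


def run_cleanup(records, now, audit_log):
    """Remove expired records and write one audit entry per deletion."""
    expired = set(select_expired(records, now))
    for record_id in sorted(expired):
        audit_log.append({
            "action": "retention_delete",
            "record_id": record_id,
            "at": now.isoformat(),
        })
    return [r for r in records if r["id"] not in expired]


if __name__ == "__main__":
    utc = timezone.utc
    now = datetime(2025, 6, 1, tzinfo=utc)
    records = [
        {"id": 1, "kind": "chat_history",
         "created_at": datetime(2025, 1, 1, tzinfo=utc)},
        {"id": 2, "kind": "chat_history",
         "created_at": datetime(2025, 5, 1, tzinfo=utc)},
    ]
    log = []
    remaining = run_cleanup(records, now, log)
    print([r["id"] for r in remaining], log)
```

A scheduler would call `run_cleanup` once a day; the scheduling itself is outside the scope of this sketch.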
6. Anomaly detection
GPI monitors all activity with 8+ rules that detect suspicious behavior in real time.
Rules cover: unauthorized access attempts, bulk data extraction, unusual access patterns, access outside normal working hours, repeated failed logins, privilege escalation, and more.
Upon detection, automatic email alerts are sent to administrators. All events are logged in the immutable audit trail with full context: user, IP, timestamp, action and result.
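Two of the rule types listed above, repeated failed logins and access outside normal working hours, can be sketched as small checks over a stream of audit events. The event fields follow the context listed in the text (user, IP, timestamp, action, result). The thresholds, the working-hours window, the action and result values and the class names are assumptions for this example.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    user: str
    ip: str
    timestamp: datetime
    action: str   # e.g. "login", "read" (hypothetical values)
    result: str   # e.g. "success", "failure" (hypothetical values)


class FailedLoginRule:
    """Alert when a user fails to log in `threshold` times within `window`."""

    def __init__(self, threshold=5, window=timedelta(minutes=10)):
        self.threshold = threshold
        self.window = window
        self._failures = defaultdict(deque)  # user -> recent failure times

    def check(self, event: Event):
        if event.action != "login" or event.result != "failure":
            return None
        recent = self._failures[event.user]
        recent.append(event.timestamp)
        # Slide the window: forget failures older than `window`.
        while recent and event.timestamp - recent[0] > self.window:
            recent.popleft()
        if len(recent) >= self.threshold:
            return f"repeated failed logins for {event.user}: {len(recent)}"
        return None


def outside_working_hours(event: Event, start_hour=7, end_hour=18):
    """Alert on weekend activity or activity outside start_hour..end_hour."""
    is_weekend = event.timestamp.weekday() >= 5
    in_hours = start_hour <= event.timestamp.hour < end_hour
    if is_weekend or not in_hours:
        return f"access outside normal working hours by {event.user}"
    return None


if __name__ == "__main__":
    rule = FailedLoginRule(threshold=3)
    base = datetime(2025, 3, 3, 9, 0)
    for i in range(3):
        event = Event("anna", "10.0.0.7", base + timedelta(minutes=i),
                      "login", "failure")
        print(rule.check(event))
```

Each check returns either `None` or a message, so the caller can forward any message to the alerting and audit components.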
7. Document handling
GPI supports 30+ file formats with automatic indexing, OCR and semantic search.
Document formats include PDF, DOCX, XLSX, PPTX, ODT, ODS, RTF, TXT, CSV, HTML, XML, Markdown and more.
Image formats include PNG, JPG, TIFF, BMP, GIF and WebP, with OCR text extraction from scanned documents.
Archive formats include ZIP, TAR and GZ, with automatic extraction and indexing of contents.
Search combines pgvector embeddings with PostgreSQL full-text search (Danish stemming) for hybrid vector and keyword retrieval.
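The text does not say how the vector and keyword result lists are combined. One common option is reciprocal rank fusion (RRF), shown here as a minimal sketch; the function name, the constant `k=60` and the document ids are assumptions, and GPI may use a different fusion method.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked id lists into one.

    Each list contributes 1 / (k + rank) per document, so documents
    ranked well by both retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    vector_hits = ["a", "b", "c"]    # e.g. from a pgvector similarity query
    keyword_hits = ["b", "d", "a"]   # e.g. from PostgreSQL full-text search
    print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
    # prints ['b', 'a', 'd', 'c']
```

RRF only needs the rank positions, so the similarity scores from the two retrievers do not have to be on the same scale.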
8. Technology stack
The backend uses .NET 10, a custom ChatAgent with MCP, Conduit (mediator + messaging) and C# 12 with primary constructors.
The frontend uses React 19, TypeScript, Vite, Tailwind CSS 4 and TanStack React Query.
The data layer uses PostgreSQL + pgvector, Redis, RabbitMQ and ONNX Runtime (INT8-quantized NER).
The infrastructure consists of Docker (11 containers), Nginx (TLS termination) and Microsoft Entra ID (OIDC), deployed on-premises or in Azure.