GPI
GOVERNANCE PRIVACY INTELLIGENCE

GPI Technical Whitepaper

A detailed walkthrough of GPI's security architecture, PII protection and GDPR compliance

1. Architecture overview

GPI is a microservice-based platform consisting of 11 Docker containers that run on-premises or in your own Azure environment. The architecture is designed for single-tenant deployment: one installation per customer, with full data isolation.

Presentation

The Gateway, a backend-for-frontend (BFF), serves a React 19 SPA, handles authentication via Microsoft Entra ID, and proxies API calls to the backend services.

Infrastructure

DataHub (main API), Audit service, Telemetry service, Indexing worker, PII service, DocParser and Embeddings service.

Data

PostgreSQL with pgvector for semantic search, Redis for token storage and pub/sub, RabbitMQ for async messaging.

Edge

Nginx terminates TLS 1.3 at the edge. All internal communication runs over an isolated Docker bridge network.

2. PII detection pipeline

GPI uses a hybrid approach that combines regex-based rules with a fine-tuned named-entity recognition (NER) model.

F1 score: 87.6–93.0 on Danish legal text
~99% of CPR/CVR false positives eliminated via checksum validation
ONNX model, INT8-quantized (~600 MB)

NER model

A custom model, thomasbeste/danish-xlmr-ner-large: XLM-R large, fine-tuned in two stages on the DANSK and DaNE datasets. It is exported to ONNX with INT8 quantization via optimum-cli and baked into the Docker image for reproducible deployments.

Regex layer

Pattern matching for CPR numbers (Danish SSN), CVR numbers (company IDs), phone numbers, email addresses, bank accounts and card numbers. The CPR and CVR rules add checksum validation on top of the pattern match, which eliminates ~99% of the false positives that a pattern-only rule produces.
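The checksum step can be sketched in a few lines of Python. This is an illustration, not GPI's implementation: the patterns and function names are hypothetical, and the weights are the standard modulus-11 weights for CVR and CPR numbers. Note that since 2007 some CPR numbers have been issued that do not satisfy modulus 11, so a strict CPR check can reject valid numbers; how GPI handles that case is not described here.

```python
import re

# Modulus-11 weights: the weighted digit sum must be divisible by 11.
CVR_WEIGHTS = (2, 7, 6, 5, 4, 3, 2, 1)
CPR_WEIGHTS = (4, 3, 2, 7, 6, 5, 4, 3, 2, 1)

# Hypothetical candidate patterns; GPI's real patterns are not published here.
CVR_PATTERN = re.compile(r"\b[0-9]{8}\b")
CPR_PATTERN = re.compile(r"\b([0-9]{6})-?([0-9]{4})\b")


def passes_mod11(digits: str, weights: tuple) -> bool:
    """Return True if the weighted digit sum is divisible by 11."""
    if len(digits) != len(weights) or not (digits.isascii() and digits.isdigit()):
        return False
    return sum(int(d) * w for d, w in zip(digits, weights)) % 11 == 0


def find_cvr(text: str) -> list:
    """Return 8-digit candidates that also pass the CVR checksum."""
    return [m.group() for m in CVR_PATTERN.finditer(text)
            if passes_mod11(m.group(), CVR_WEIGHTS)]


def find_cpr(text: str) -> list:
    """Return 10-digit candidates (hyphen removed) that pass the CPR checksum."""
    hits = []
    for m in CPR_PATTERN.finditer(text):
        digits = m.group(1) + m.group(2)
        if passes_mod11(digits, CPR_WEIGHTS):
            hits.append(digits)
    return hits
```

The pattern finds candidates cheaply; the checksum then discards digit strings such as order numbers or invoice IDs that merely have the right length.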

Detection categories

PERSON, CPR/SSN, ADDRESS, PHONE, EMAIL, CVR/EIN, ACCOUNT, HEALTH. Each detection carries a confidence score.
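One way to combine the two detectors is to let the most confident detection win wherever regex and NER spans overlap. The sketch below assumes that policy; the Detection type, the merge_detections function and the 0.5 threshold are hypothetical and are not taken from GPI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    category: str   # e.g. "PERSON", "CPR/SSN", "EMAIL"
    start: int      # character offset, inclusive
    end: int        # character offset, exclusive
    score: float    # confidence in [0, 1]
    source: str     # "regex" or "ner"


def overlaps(a: Detection, b: Detection) -> bool:
    """True if the two character spans share at least one character."""
    return a.start < b.end and b.start < a.end


def merge_detections(regex_hits, ner_hits, min_score=0.5):
    """Keep the highest-scoring detection in each group of overlapping spans."""
    candidates = [d for d in [*regex_hits, *ner_hits] if d.score >= min_score]
    kept = []
    # Visit the most confident detections first so they win any overlap.
    for det in sorted(candidates, key=lambda d: d.score, reverse=True):
        if not any(overlaps(det, k) for k in kept):
            kept.append(det)
    return sorted(kept, key=lambda d: d.start)
```

Under this policy a checksum-validated CPR match with score 1.0 takes precedence over a lower-scoring NER guess covering the same characters.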

3. Tokenization and encryption

Once PII is detected, it is tokenized with reversible tokens and encrypted with AES-256-GCM.

AES-256-GCM

Authenticated symmetric encryption. Each token gets a unique nonce. The ciphertext format is byte-compatible between the .NET and Python services.
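On the Python side, AES-256-GCM with a fresh nonce per value might look like the sketch below, which uses the third-party cryptography package. The byte layout (nonce, then ciphertext, then the 16-byte tag) and the function names are assumptions for illustration; GPI's actual wire format is not specified in this paper. Because .NET's AesGcm class returns the tag in a separate buffer, both services have to agree on a layout of this kind to stay byte-compatible.

```python
import os

from cryptography.exceptions import InvalidTag
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

NONCE_SIZE = 12  # 96-bit nonce, the size recommended for GCM


def encrypt_value(key: bytes, plaintext: str, aad: bytes = b"") -> bytes:
    """Encrypt with a fresh random nonce; returns nonce + ciphertext + tag."""
    nonce = os.urandom(NONCE_SIZE)
    ciphertext = AESGCM(key).encrypt(nonce, plaintext.encode("utf-8"), aad)
    return nonce + ciphertext  # assumed layout, not confirmed by the source


def decrypt_value(key: bytes, blob: bytes, aad: bytes = b"") -> str:
    """Split off the nonce, then decrypt and verify the authentication tag."""
    nonce, ciphertext = blob[:NONCE_SIZE], blob[NONCE_SIZE:]
    # Raises InvalidTag if the blob or the associated data was modified.
    return AESGCM(key).decrypt(nonce, ciphertext, aad).decode("utf-8")
```

Authenticated encryption means tampering is detected at decryption time: a single flipped bit in the stored blob raises InvalidTag instead of returning corrupted data.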

HKDF-SHA256

Keys are derived from a master key. This ensures that compromising one derived key does not compromise the others.
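HKDF-SHA256 (RFC 5869) is small enough to sketch with the Python standard library. The function name and the purpose labels passed as info are hypothetical; the paper does not say which labels or salt GPI uses.

```python
import hashlib
import hmac


def hkdf_sha256(master_key: bytes, info: bytes, length: int = 32,
                salt: bytes = b"") -> bytes:
    """Derive `length` bytes from `master_key` with HKDF-SHA256 (RFC 5869)."""
    hash_len = hashlib.sha256().digest_size
    if length > 255 * hash_len:
        raise ValueError("requested key is too long for HKDF-SHA256")
    # Extract: condense the master key into a fixed-size pseudorandom key.
    # An empty salt is replaced by hash_len zero bytes, as the RFC specifies.
    prk = hmac.new(salt or bytes(hash_len), master_key, hashlib.sha256).digest()
    # Expand: chain HMAC blocks, binding every block to the `info` label.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]),
                         hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]
```

Calling the function with different info labels, for example b"token-encryption" and b"audit-signing", yields unrelated 256-bit keys from the same master key, which is the isolation property described above.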

Redis token storage

Token mappings are stored in Redis with configurable TTL (default: 120 minutes). Automatic cleanup after expiration.

Reversible de-tokenization

AI responses are de-tokenized by looking up tokens in Redis and decrypting. The user sees the full response with original data.
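The tokenize and de-tokenize round trip can be sketched with an in-memory dictionary standing in for Redis. The token format, class name and method names are hypothetical; only the 120-minute default TTL comes from the text. For brevity the sketch stores the plain value where a real deployment would store the AES-256-GCM ciphertext.

```python
import re
import secrets
import time

# Hypothetical token format, e.g. [[PII_PERSON_1a2b3c4d]].
TOKEN_RE = re.compile(r"\[\[PII_[A-Z]+_[0-9a-f]{8}\]\]")


class TokenVault:
    """In-memory stand-in for the Redis token store (illustration only)."""

    def __init__(self, ttl_seconds: int = 120 * 60, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # token -> (expiry time, original value)

    def tokenize(self, value: str, category: str) -> str:
        """Store `value` and return an opaque token that replaces it."""
        token = f"[[PII_{category}_{secrets.token_hex(4)}]]"
        self._entries[token] = (self._clock() + self._ttl, value)
        return token

    def lookup(self, token: str):
        """Return the original value, or None if unknown or expired."""
        entry = self._entries.get(token)
        if entry is None or entry[0] <= self._clock():
            self._entries.pop(token, None)  # drop expired mappings
            return None
        return entry[1]

    def detokenize(self, text: str) -> str:
        """Replace every known, unexpired token in `text` with its value."""
        def restore(match):
            value = self.lookup(match.group())
            return match.group() if value is None else value
        return TOKEN_RE.sub(restore, text)
```

Once the TTL has passed, the mapping is gone and the token can no longer be reversed, so a response de-tokenized after expiry keeps the placeholder.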

4. Access control

GPI uses a two-layer access model, roles and groups, with Microsoft Entra ID as the identity provider.

Roles

System roles (Administrator, Auditor, Analyst, User) control feature access. Administrators can create and remove users, configure data sources and view audit logs. Auditors have read-only access to compliance data.

Groups and data sources

Users are assigned to groups that grant access to specific data sources. A user only sees data from the sources their group has access to. Zero-trust principle: no access unless explicitly granted.
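The group layer amounts to a deny-by-default lookup. A minimal sketch follows, with a hypothetical AccessPolicy class and made-up group and data source names:

```python
from dataclasses import dataclass, field


@dataclass
class AccessPolicy:
    # group name -> IDs of the data sources that group may read
    group_sources: dict = field(default_factory=dict)
    # user ID -> names of the groups the user belongs to
    user_groups: dict = field(default_factory=dict)

    def allowed_sources(self, user_id: str) -> set:
        """Union of the data sources granted by the user's groups."""
        sources = set()
        for group in self.user_groups.get(user_id, set()):
            sources |= self.group_sources.get(group, set())
        return sources

    def can_read(self, user_id: str, source_id: str) -> bool:
        # Deny by default: only an explicit group grant returns True.
        return source_id in self.allowed_sources(user_id)
```

An unknown user, a user without groups, and a group without grants all resolve to an empty set, so no code path grants access implicitly.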

Microsoft Entra ID

OIDC-based authentication with Entra ID as the external identity provider. Production environments use certificate-based authentication; test environments use a client secret.

5. GDPR compliance

GPI addresses 9 GDPR articles with documentable compliance, including the following.

Art. 5(1)(e): Storage limitation

Configurable retention policies with daily cleanup and a complete audit log.

Art. 15: Right of access

One-click DSAR (data subject access request) reports, exportable as JSON or CSV.

Art. 17: Right to erasure

Cascade deletion with an immutable deletion log retained for 7 years.

Art. 20: Data portability

CSV/JSON export via API in a machine-readable format.

Art. 25: Data protection by design

Hybrid PII detection tokenizes personal data before AI processing.

Art. 30: Records of processing

Meta-audit logging of all access to audit data.

Art. 32: Security of processing

AES-256-GCM, TLS 1.3, RBAC via Entra ID.

Art. 33: Breach notification

Real-time anomaly detection, automatic alerts, and a 72-hour notification workflow.
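The JSON and CSV exports mentioned under Art. 15 and Art. 20 could be produced along these lines. The record fields and the export_records function are hypothetical; GPI's actual export schema is not described in this paper.

```python
import csv
import io
import json


def export_records(records: list, fmt: str = "json") -> str:
    """Serialise a data subject's records as JSON or CSV text."""
    if fmt == "json":
        return json.dumps(records, ensure_ascii=False, indent=2)
    if fmt == "csv":
        # Use the union of all keys so that no field is silently dropped.
        fieldnames = sorted({key for record in records for key in record})
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=fieldnames,
                                lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)  # missing fields become empty cells
        return buffer.getvalue()
    raise ValueError(f"unsupported format: {fmt}")
```

Both outputs are plain text in widely supported formats, which is what makes the export machine-readable in the sense of Art. 20.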

6. Anomaly detection

GPI monitors all activity with at least eight rules that detect suspicious behavior in real time.

Rules cover: unauthorized access attempts, bulk data extraction, unusual access patterns, access outside normal working hours, repeated failed logins, privilege escalation, and more.

Upon detection, automatic email alerts are sent to administrators. All events are logged in the immutable audit trail with full context: user, IP, timestamp, action and result.
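As an illustration of one such rule, repeated failed logins can be detected with a sliding time window per user. The class name, threshold and window length below are hypothetical, and the sketch assumes events arrive in timestamp order.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


class FailedLoginRule:
    """Flags a user after `threshold` failed logins within `window`."""

    def __init__(self, threshold: int = 5,
                 window: timedelta = timedelta(minutes=10)):
        self.threshold = threshold
        self.window = window
        self._failures = defaultdict(deque)  # user -> recent failure times

    def observe(self, user: str, timestamp: datetime, success: bool) -> bool:
        """Record a login attempt; return True if an alert should fire."""
        if success:
            self._failures.pop(user, None)  # a successful login resets the count
            return False
        recent = self._failures[user]
        recent.append(timestamp)
        # Drop failures that have fallen out of the sliding window.
        while recent and timestamp - recent[0] > self.window:
            recent.popleft()
        return len(recent) >= self.threshold
```

A True result would be the point at which the alert email is sent and the event is written to the audit trail.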

7. Document handling

GPI supports 30+ file formats with automatic indexing, OCR and semantic search.

Documents

PDF, DOCX, XLSX, PPTX, ODT, ODS, RTF, TXT, CSV, HTML, XML, Markdown and more.

Images & OCR

PNG, JPG, TIFF, BMP, GIF, WebP — with OCR text extraction from scanned documents.

Archives

ZIP, TAR, GZ — automatic extraction and indexing of contents.

Semantic search

pgvector embeddings combined with PostgreSQL full-text search (Danish stemming). Hybrid vector + keyword retrieval.
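The paper does not say how the vector and keyword rankings are combined. One common technique is reciprocal rank fusion (RRF), assumed here purely for illustration; the function name and the constant k = 60 are conventional choices, not GPI's confirmed method.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for the documents it returns.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Documents that rank well in both the pgvector result and the full-text result rise to the top, while a document found by only one method still appears further down.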

8. Technology stack

Backend

.NET 10, custom ChatAgent with MCP (Model Context Protocol), Conduit (mediator + messaging), C# 12 with primary constructors.

Frontend

React 19, TypeScript, Vite, Tailwind CSS 4, TanStack React Query.

Data & AI

PostgreSQL + pgvector, Redis, RabbitMQ, ONNX Runtime (INT8 quantized NER).

Infrastructure

Docker (11 containers), Nginx (TLS termination), Microsoft Entra ID (OIDC), on-premise or Azure.

Ready to protect your data?

Book a free demo and see GPI in action.
