Data Sovereignty
Data sovereignty on Alien Intelligence means your documents, embeddings, vector indexes, and processed files are isolated in dedicated per-tenant data clusters — never shared with other tenants and never accessible from the platform's orchestration layer.
By default, Alien hosts and manages your data clusters on Alien's infrastructure. For enterprise clients in regulated industries who require full physical control, data clusters can be deployed on your own infrastructure — on-premises or in your cloud account. This on-premise option provides the strongest form of data sovereignty: your data physically resides on infrastructure you control.
Two Levels of Data Sovereignty
Level 1: Tenant Isolation (All Deployments)
Every tenant on Alien Intelligence — whether Alien-hosted or on-premise — gets complete data isolation enforced at the architecture level:
- Each tenant has dedicated databases, storage buckets, vector collections, and search indexes
- Credentials are scoped per tenant — one tenant's services cannot access another tenant's resources
- The platform orchestration layer stores only metadata pointers, never document content
- All data access is authenticated, per-entry, and logged
This isolation is not a configuration option or a contractual promise. It is enforced by the system's architecture: separate databases, separate credentials, separate namespaces.
Level 2: Physical Data Sovereignty (On-Premise)
For organizations that require data to physically reside on their own infrastructure — due to GDPR, HIPAA, or organizational policy — data clusters can be deployed on-premise. This adds:
- Physical data residency on infrastructure you control
- Outbound-only network connections via encrypted mTLS tunnels
- No inbound firewall rules required
- Full independence during network partitions
On-premise deployment is recommended only for teams with the capacity to operate Kubernetes infrastructure. Alien-hosted clusters are maintained by Alien, with faster incident response and automatic updates.
Platform vs Data Cluster Split
The system is split into two architectural layers:
Platform (Orchestration — managed by Alien)
The platform is the orchestration layer, operated by Alien Intelligence as a managed service. It handles:
- User authentication and authorization
- Dataset catalog (metadata pointers only — never content)
- Job scheduling and workflow orchestration
- AI agent access management (MCP servers)
- Billing and subscription management
- Cross-cluster search coordination
What the platform never holds: document files, processed text, embeddings, vector indexes, full-text search indexes, figures, or any customer content.
Data Clusters (Alien Hosted or On-Premise)
Data clusters hold all customer data. Each tenant's data cluster contains:
- Original uploaded documents (PDF, DOCX, images, XML)
- Processed content (extracted text, figures, metadata)
- Embedding vectors for semantic search
- Full-text search indexes for keyword search
- Object storage (MinIO) with all file data
- Relational databases (PostgreSQL) with entry metadata and manifests
What leaves the data cluster: only metadata summaries — dataset names, entry counts, processing status, and sync timestamps. Never content.
Five Enforcement Mechanisms
Data isolation is enforced through five independent mechanisms. Each provides meaningful protection on its own; together, they make unauthorized data movement architecturally impossible. These mechanisms apply to all deployments — Alien-hosted and on-premise alike.
1. Network Topology
For on-premise deployments, data clusters initiate outbound-only connections to the platform through encrypted mTLS tunnels (Skupper). There are no inbound ports opened on your infrastructure, no firewall rules to configure, and no way for the platform to initiate a connection to your cluster.
This means that even if the platform were compromised, an attacker could not reach into your infrastructure. The tunnel is client-initiated, encrypted end-to-end, and authenticated with mutual TLS certificates.
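The client-initiated side of such a tunnel can be sketched with Python's standard `ssl` module. This is an illustrative sketch under assumptions, not Skupper's actual implementation; the function name and parameters are invented for the example:

```python
import ssl

def make_tunnel_context(ca_file=None, cert_file=None, key_file=None):
    """Build a client-side mutual-TLS context: the data cluster dials out
    to the platform and never opens a listening port of its own."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)  # client role: outbound only
    ctx.verify_mode = ssl.CERT_REQUIRED            # the platform must present a valid cert
    ctx.check_hostname = True
    if ca_file:                                    # trust only the tunnel's private CA
        ctx.load_verify_locations(ca_file)
    if cert_file and key_file:                     # present the cluster's client certificate
        ctx.load_cert_chain(cert_file, key_file)
    return ctx
```

Because only the cluster side can initiate the handshake, a compromised platform cannot turn the tunnel into an inbound connection.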
For Alien-hosted deployments, the connectivity between the platform and data clusters is managed internally by Alien on secured infrastructure.
2. Proxy Architecture
Platform services (workers, AI agents, MCP tools) never connect to data clusters directly. All data access goes through an authenticated proxy endpoint on the platform backend. The proxy:
- Authenticates every request (user token or service credential)
- Checks authorization (organization membership, cluster status)
- Forwards the request to the data cluster using per-cluster service credentials
- Streams the response back without caching or storing any content
The platform backend acts as a relay — it forwards request bytes and streams response bytes, but it never persists response content. There is no cache, no temporary storage, and no log that captures document content.
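The relay pattern described above can be sketched as a generator that yields upstream chunks without buffering them. The helper names (`authenticate`, `authorize`, `fetch_from_cluster`) are hypothetical stand-ins for the platform's real services:

```python
def proxy_entry_request(token, entry_id, *, authenticate, authorize, fetch_from_cluster):
    """Relay a single entry read: authenticate, authorize, forward, stream.
    Chunks pass straight through; nothing is cached or persisted."""
    user = authenticate(token)           # step 1: reject invalid tokens
    cluster = authorize(user, entry_id)  # step 2: org membership / cluster status check
    for chunk in fetch_from_cluster(cluster, entry_id):  # step 3: per-cluster credential
        yield chunk                      # step 4: stream back without storing content
```

Because the function is a generator, each response chunk exists on the platform only for the moment it is being forwarded.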
3. Metadata-Only Sync
Data clusters periodically push metadata to the platform to keep the catalog up to date. This sync includes:
| What syncs | Example | Purpose |
|---|---|---|
| Dataset names and descriptions | "BioRxiv 2024 Archive" | Catalog discovery |
| Entry counts and sizes | 12,450 entries, 8.2 GB | Dashboard statistics |
| Processing status | 98% processed | Progress tracking |
| Sync timestamps | Last sync: 30 seconds ago | Health monitoring |
What never syncs: file content, processed text, embedding vectors, search index data, figure images, or any document payload.
The platform's dataset catalog stores pointers — enough information to route a request to the right cluster, but never enough to reconstruct the data.
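A metadata-only sync payload amounts to a projection that keeps catalog fields and drops everything else. The sketch below is illustrative; the field names mirror the table above, but the function itself is an assumption, not the real sync service:

```python
def build_sync_payload(entries):
    """Project each entry down to catalog metadata; content never appears."""
    return [
        {
            "name": e["name"],
            "status": e["status"],
            "file_size_bytes": len(e["content"]),  # size only, never the bytes themselves
        }
        for e in entries
    ]
```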
4. Namespace Isolation
Each tenant on a data cluster gets a dedicated Kubernetes namespace with completely separate infrastructure resources:
| Resource | Isolation | Credential Scope |
|---|---|---|
| PostgreSQL database | Separate database per tenant | Per-tenant database role |
| MinIO storage | Separate bucket per tenant | Per-tenant IAM credentials |
| Qdrant collection | Separate collection per tenant | Per-tenant JWT token |
| Meilisearch indexes | Separate indexes per tenant | Per-tenant API key |
| Data API deployment | Separate deployment per tenant | Namespace-scoped |
| Network connector | Separate tunnel endpoint per tenant | Per-tenant routing key |
There is no cross-tenant namespace access. Credentials are scoped so that one tenant's Data API deployment can only reach its own database, storage bucket, vector collection, and search indexes.
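The naming scheme and scope check can be sketched as follows. The resource names are illustrative, not the platform's actual identifiers; the point is that a scope check derived from the tenant ID can never resolve another tenant's resources:

```python
def tenant_resources(tenant_id):
    """Every tenant gets its own namespace and its own named resources."""
    return {
        "namespace": f"tenant-{tenant_id}",
        "postgres_db": f"db_{tenant_id}",
        "minio_bucket": f"files-{tenant_id}",
        "qdrant_collection": f"vectors_{tenant_id}",
        "meilisearch_index": f"search_{tenant_id}",
    }

def check_scope(tenant_id, resource_name):
    """A tenant's Data API may only touch resources in its own namespace."""
    if resource_name not in tenant_resources(tenant_id).values():
        raise PermissionError(f"{resource_name!r} is outside tenant {tenant_id!r}")
    return True
```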
5. No Data Egress Paths
The Data API — the service that manages all customer data on the cluster — has no endpoints designed for bulk data export to the platform. Every data access operation is:
- Authenticated — requires a valid service credential
- Scoped to a single entry — no "export all" endpoints exist
- Logged — every proxy call is recorded with the requestor identity, timestamp, and access path
There is no mechanism in the API to stream an entire dataset, export a collection of embeddings, or transfer vector indexes back to the platform.
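In code, the per-entry constraint amounts to an API surface that simply has no bulk operation. A toy sketch of the shape (the real Data API is far more involved, and the credential check here is deliberately simplistic):

```python
class DataAPI:
    """Toy Data API: per-entry reads only, always authenticated, always logged.
    There is deliberately no list-content or export-all method."""
    def __init__(self, entries):
        self._entries = entries
        self.audit_log = []

    def get_entry(self, credential, entry_id):
        if credential != "service-credential":          # authenticated
            raise PermissionError("invalid credential")
        self.audit_log.append(("get_entry", entry_id))  # logged with access path
        return self._entries[entry_id]                  # scoped to a single entry
```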
What Flows Where
This table summarizes exactly what data crosses the boundary between data clusters and the platform:
| Direction | What | How | Frequency |
|---|---|---|---|
| Data Cluster to Platform | Dataset/entry metadata (names, counts, status) | Batch sync API call | Every 30 seconds |
| Data Cluster to Platform | Cluster health metrics | Heartbeat API call | Every 30 seconds |
| Data Cluster to Platform | Operator infrastructure status | Operator heartbeat | Every 60 seconds |
| Platform to Data Cluster | Authenticated data requests (proxied) | Encrypted tunnel, per-request | On demand |
| Platform to Data Cluster | Pipeline trigger commands | Kubernetes API (in-cluster) | On file upload |
| Neither direction | Document content, embeddings, vectors, figures | Never crosses boundary | Never |
How the Platform Catalog Works
The platform maintains a catalog of all datasets across all clusters. This catalog enables users to browse, search, and manage their data from a single dashboard — but it contains only metadata.
When a document is processed on a data cluster, the batch sync service sends a summary to the platform:
```json
{
  "datasets": [{
    "name": "Research Papers 2024",
    "entry_count": 5420,
    "total_size_bytes": 3200000000,
    "last_updated": "2026-03-26T10:30:00Z"
  }],
  "entries": [{
    "name": "paper-001.pdf",
    "status": "processed",
    "mime_type": "application/pdf",
    "file_size_bytes": 2400000
  }]
}
```
The platform stores this metadata for catalog display and search routing. When a user wants to actually read the document, the platform proxies the request to the data cluster in real time — it does not serve content from its own storage.
Compliance Implications
The architectural separation has direct implications for regulatory compliance:
GDPR
- Customer data is isolated in dedicated per-tenant clusters. For on-premise deployments, data is physically stored in the customer's chosen jurisdiction
- The platform cannot access data without the data cluster being operational and the connection active
- Right to deletion cascades through all storage systems (database, object storage, vector database, search indexes)
- Audit trail tracks all data access with requestor identity and access path
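The deletion cascade above can be sketched as one operation fanned out across every store that might hold a copy. The dict-based stores are stand-ins for PostgreSQL, MinIO, Qdrant, and Meilisearch; the function name is an assumption for illustration:

```python
def delete_entry_everywhere(entry_id, *stores):
    """Right-to-deletion sketch: remove the entry from every storage system.
    pop(..., None) keeps the cascade idempotent if a store never held the entry."""
    for store in stores:
        store.pop(entry_id, None)
    return all(entry_id not in store for store in stores)
```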
HIPAA
- Encryption in transit via mTLS (between platform and cluster) and within the cluster (Istio service mesh)
- Encryption at rest via managed storage providers
- Role-based access control enforced at every layer (organization roles, API token scopes, per-service authentication)
- Access logging covers all proxy calls and direct API access
ISO 27001
- All infrastructure changes tracked through GitOps (every change is a Git commit)
- Documented development workflow with audit trail
- Automated compliance scanning across repositories
- Security coding patterns enforced in CI pipelines
The platform architecture is designed to be compatible with these frameworks. Achieving formal certification requires additional organizational controls (policies, procedures, training) beyond the technical architecture.
Network Partition Behavior
If the connection between a data cluster and the platform is interrupted:
- Data processing continues. Pipelines running on the cluster complete normally. New uploads are processed. Search works.
- Metadata changes accumulate. The batch sync service queues changes locally and flushes them when connectivity resumes.
- The platform marks the cluster as offline. Proxy requests return an error until the tunnel is re-established.
- No data is lost. The cluster is fully self-contained and operates independently during disconnection.
This design means that data clusters are not dependent on platform availability for their core data operations.
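The store-and-forward behavior during a partition can be sketched as a local queue that survives failed flushes. `send` stands in for the real sync API call; the class is illustrative, not the actual batch sync service:

```python
class MetadataSyncQueue:
    """Queue metadata changes locally; flush them when connectivity returns.
    A failed flush leaves the queue intact, so no change is ever dropped."""
    def __init__(self, send):
        self._send = send     # callable that pushes one batch to the platform
        self.pending = []

    def record(self, change):
        self.pending.append(change)

    def flush(self):
        if not self.pending:
            return 0
        try:
            self._send(list(self.pending))
        except ConnectionError:
            return 0          # tunnel still down: keep everything queued
        sent = len(self.pending)
        self.pending.clear()
        return sent
```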
Next Steps
- Architecture Overview — The full platform and data cluster architecture
- Data Clusters — How per-tenant isolation works in practice
- Security Model — Authentication, authorization, and network security details for both hosting modes
- Create a Data Plane — Deploy a data plane on your own infrastructure for full physical sovereignty
- Compliance — GDPR, HIPAA, and ISO 27001 compliance posture