Data Sovereignty
Data sovereignty on Alien Intelligence means your documents, embeddings, vector indexes, and processed files are isolated in dedicated per-tenant data clusters — never shared with other tenants and never accessible from the platform's orchestration layer.
By default, Alien hosts and manages your data clusters on Alien's infrastructure. For enterprise clients in regulated industries who require full physical control, data clusters can be deployed on your own infrastructure — on-premises or in your cloud account. This on-premise option provides the strongest form of data sovereignty: your data physically resides on infrastructure you control.
Two Levels of Data Sovereignty
Level 1: Tenant Isolation (All Deployments)
Every tenant on Alien Intelligence — whether Alien-hosted or on-premise — gets complete data isolation enforced at the architecture level:
- Each tenant has dedicated databases, storage buckets, vector collections, and search indexes
- Credentials are scoped per tenant — one tenant's services cannot access another tenant's resources
- The platform orchestration layer stores only metadata pointers, never document content
- All data access is authenticated, per-entry, and logged
This isolation is not a configuration option or a contractual promise. It is enforced by the system's architecture: separate databases, separate credentials, separate namespaces.
Level 2: Physical Data Sovereignty (On-Premise)
For organizations that require data to physically reside on their own infrastructure — due to GDPR, HIPAA, or organizational policy — data clusters can be deployed on-premise. This adds:
- Physical data residency on infrastructure you control
- Outbound-only network connections via encrypted mTLS tunnels
- No inbound firewall rules required
- Full independence during network partitions
On-premise deployment is recommended only for teams with the capacity to operate Kubernetes infrastructure. Alien-hosted clusters are maintained by Alien, with faster incident response and automatic updates.
Platform vs Data Cluster Split
The system is split into two architectural layers:
Platform (Orchestration — managed by Alien)
The platform is the orchestration layer, operated by Alien Intelligence as a managed service. It handles:
- User authentication and authorization
- Dataset catalog (metadata pointers only — never content)
- Job scheduling and workflow orchestration
- AI agent access management (MCP servers)
- Billing and subscription management
- Cross-cluster search coordination
What the platform never holds: document files, processed text, embeddings, vector indexes, full-text search indexes, figures, or any customer content.
Data Clusters (Alien Hosted or On-Premise)
Data clusters hold all customer data. Each tenant's data cluster contains:
- Original uploaded documents (PDF, DOCX, images, XML)
- Processed content (extracted text, figures, metadata)
- Embedding vectors for semantic search
- Full-text search indexes for keyword search
- Object storage (MinIO) with all file data
- Relational databases (PostgreSQL) with entry metadata and manifests
What leaves the data cluster: only metadata summaries — dataset names, entry counts, processing status, and sync timestamps. Never content.
Five Enforcement Mechanisms
Data isolation is enforced through five independent mechanisms. Each provides meaningful protection on its own; together, they make unauthorized data movement architecturally impossible. These mechanisms apply to all deployments — Alien-hosted and on-premise alike.
1. Network Topology
For on-premise deployments, data clusters initiate outbound-only connections to the platform through encrypted mTLS tunnels (Skupper). There are no inbound ports opened on your infrastructure, no firewall rules to configure, and no way for the platform to initiate a connection to your cluster.
This means that even if the platform were compromised, an attacker could not reach into your infrastructure. The tunnel is client-initiated, encrypted end-to-end, and authenticated with mutual TLS certificates.
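The client-initiated side of such a tunnel can be sketched with Python's standard `ssl` module. This is an illustrative sketch under assumptions, not Skupper's actual implementation; the function name and parameters are invented for the example:

```python
import ssl

def make_tunnel_context(ca_file=None, cert_file=None, key_file=None):
    """Build a client-side mutual-TLS context: the data cluster dials out
    to the platform and never opens a listening port of its own."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)  # client role: outbound only
    ctx.verify_mode = ssl.CERT_REQUIRED            # the platform must present a valid cert
    ctx.check_hostname = True
    if ca_file:                                    # trust only the tunnel's private CA
        ctx.load_verify_locations(ca_file)
    if cert_file and key_file:                     # present the cluster's client certificate
        ctx.load_cert_chain(cert_file, key_file)
    return ctx
```

Because only the cluster side can initiate the handshake, a compromised platform cannot turn the tunnel into an inbound connection.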
For Alien-hosted deployments, the connectivity between the platform and data clusters is managed internally by Alien on secured infrastructure.
2. Proxy Architecture
Platform services (workers, AI agents, MCP tools) never connect to data clusters directly. All data access goes through an authenticated proxy endpoint on the platform backend. The proxy:
- Authenticates every request (user token or service credential)
- Checks authorization (organization membership, cluster status)
- Forwards the request to the data cluster using per-cluster service credentials
- Streams the response back without caching or storing any content
The platform backend acts as a relay — it forwards request bytes and streams response bytes, but it never persists response content. There is no cache, no temporary storage, and no log that captures document content.
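The relay pattern described above can be sketched as a generator that yields upstream chunks without buffering them. The helper names (`authenticate`, `authorize`, `fetch_from_cluster`) are hypothetical stand-ins for the platform's real services:

```python
def proxy_entry_request(token, entry_id, *, authenticate, authorize, fetch_from_cluster):
    """Relay a single entry read: authenticate, authorize, forward, stream.
    Chunks pass straight through; nothing is cached or persisted."""
    user = authenticate(token)           # step 1: reject invalid tokens
    cluster = authorize(user, entry_id)  # step 2: org membership / cluster status check
    for chunk in fetch_from_cluster(cluster, entry_id):  # step 3: per-cluster credential
        yield chunk                      # step 4: stream back without storing content
```

Because the function is a generator, each response chunk exists on the platform only for the moment it is being forwarded.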
3. Metadata-Only Sync
Data clusters periodically push metadata to the platform to keep the catalog up to date. This sync includes:
| What syncs | Example | Purpose |
|---|---|---|
| Dataset names and descriptions | "BioRxiv 2024 Archive" | Catalog discovery |
| Entry counts and sizes | 12,450 entries, 8.2 GB | Dashboard statistics |
| Processing status | 98% processed | Progress tracking |
| Sync timestamps | Last sync: 30 seconds ago | Health monitoring |
What never syncs: file content, processed text, embedding vectors, search index data, figure images, or any document payload.
The platform's dataset catalog stores pointers — enough information to route a request to the right cluster, but never enough to reconstruct the data.
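A metadata-only sync payload amounts to a projection that keeps catalog fields and drops everything else. The sketch below is illustrative; the field names mirror the table above, but the function itself is an assumption, not the real sync service:

```python
def build_sync_payload(entries):
    """Project each entry down to catalog metadata; content never appears."""
    return [
        {
            "name": e["name"],
            "status": e["status"],
            "file_size_bytes": len(e["content"]),  # size only, never the bytes themselves
        }
        for e in entries
    ]
```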
4. Namespace Isolation
Each tenant on a data cluster gets a dedicated Kubernetes namespace with completely separate infrastructure resources:
| Resource | Isolation | Credential Scope |
|---|---|---|
| PostgreSQL database | Separate database per tenant | Per-tenant database role |
| MinIO storage | Separate bucket per tenant | Per-tenant IAM credentials |
| Qdrant collection | Separate collection per tenant | Per-tenant JWT token |
| Meilisearch indexes | Separate indexes per tenant | Per-tenant API key |
| Data API deployment | Separate deployment per tenant | Namespace-scoped |
| Network connector | Separate tunnel endpoint per tenant | Per-tenant routing key |
There is no cross-tenant namespace access. Credentials are scoped so that one tenant's Data API deployment can only reach its own database, storage bucket, vector collection, and search indexes.
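The naming scheme and scope check can be sketched as follows. The resource names are illustrative, not the platform's actual identifiers; the point is that a scope check derived from the tenant ID can never resolve another tenant's resources:

```python
def tenant_resources(tenant_id):
    """Every tenant gets its own namespace and its own named resources."""
    return {
        "namespace": f"tenant-{tenant_id}",
        "postgres_db": f"db_{tenant_id}",
        "minio_bucket": f"files-{tenant_id}",
        "qdrant_collection": f"vectors_{tenant_id}",
        "meilisearch_index": f"search_{tenant_id}",
    }

def check_scope(tenant_id, resource_name):
    """A tenant's Data API may only touch resources in its own namespace."""
    if resource_name not in tenant_resources(tenant_id).values():
        raise PermissionError(f"{resource_name!r} is outside tenant {tenant_id!r}")
    return True
```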
5. No Data Egress Paths
The Data API — the service that manages all customer data on the cluster — has no endpoints designed for bulk data export to the platform. Every data access operation is:
- Authenticated — requires a valid service credential
- Scoped to a single entry — no "export all" endpoints exist
- Logged — every proxy call is recorded with the requestor identity, timestamp, and access path
There is no mechanism in the API to stream an entire dataset, export a collection of embeddings, or transfer vector indexes back to the platform.
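In code, the per-entry constraint amounts to an API surface that simply has no bulk operation. A toy sketch of the shape (the real Data API is far more involved, and the credential check here is deliberately simplistic):

```python
class DataAPI:
    """Toy Data API: per-entry reads only, always authenticated, always logged.
    There is deliberately no list-content or export-all method."""
    def __init__(self, entries):
        self._entries = entries
        self.audit_log = []

    def get_entry(self, credential, entry_id):
        if credential != "service-credential":          # authenticated
            raise PermissionError("invalid credential")
        self.audit_log.append(("get_entry", entry_id))  # logged with access path
        return self._entries[entry_id]                  # scoped to a single entry
```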
What Flows Where
This table summarizes exactly what data crosses the boundary between data clusters and the platform:
| Direction | What | How | Frequency |
|---|---|---|---|
| Data Cluster to Platform | Dataset/entry metadata (names, counts, status) | Batch sync API call | Every 30 seconds |
| Data Cluster to Platform | Cluster health metrics | Heartbeat API call | Every 30 seconds |
| Data Cluster to Platform | Operator infrastructure status | Operator heartbeat | Every 60 seconds |
| Platform to Data Cluster | Authenticated data requests (proxied) | Encrypted tunnel, per-request | On demand |
| Platform to Data Cluster | Pipeline trigger commands | Kubernetes API (in-cluster) | On file upload |
| Neither direction | Document content, embeddings, vectors, figures | Never crosses boundary | Never |
How the Platform Catalog Works
The platform maintains a catalog of all datasets across all clusters. This catalog enables users to browse, search, and manage their data from a single dashboard — but it contains only metadata.
When a document is processed on a data cluster, the batch sync service sends a summary to the platform:
```json
{
  "datasets": [{
    "name": "Research Papers 2024",
    "entry_count": 5420,
    "total_size_bytes": 3200000000,
    "last_updated": "2026-03-26T10:30:00Z"
  }],
  "entries": [{
    "name": "paper-001.pdf",
    "status": "processed",
    "mime_type": "application/pdf",
    "file_size_bytes": 2400000
  }]
}
```
The platform stores this metadata for catalog display and search routing. When a user wants to actually read the document, the platform proxies the request to the data cluster in real time — it does not serve content from its own storage.
Compliance Implications
The architectural separation has direct implications for regulatory compliance:
GDPR
- Customer data is isolated in dedicated per-tenant clusters. For on-premise deployments, data is physically stored in the customer's chosen jurisdiction
- The platform cannot access data without the data cluster being operational and the connection active
- Right to deletion cascades through all storage systems (database, object storage, vector database, search indexes)
- Audit trail tracks all data access with requestor identity and access path
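The deletion cascade above can be sketched as one operation fanned out across every store that might hold a copy. The dict-based stores are stand-ins for PostgreSQL, MinIO, Qdrant, and Meilisearch; the function name is an assumption for illustration:

```python
def delete_entry_everywhere(entry_id, *stores):
    """Right-to-deletion sketch: remove the entry from every storage system.
    pop(..., None) keeps the cascade idempotent if a store never held the entry."""
    for store in stores:
        store.pop(entry_id, None)
    return all(entry_id not in store for store in stores)
```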
HIPAA
- Encryption in transit via mTLS (between platform and cluster) and within the cluster (Istio service mesh)
- Encryption at rest via managed storage providers
- Role-based access control enforced at every layer (organization roles, API token scopes, per-service authentication)
- Access logging covers all proxy calls and direct API access
ISO 27001
- All infrastructure changes tracked through GitOps (every change is a Git commit)
- Documented development workflow with audit trail
- Automated compliance scanning across repositories
- Security coding patterns enforced in CI pipelines
The platform architecture is designed to be compatible with these frameworks. Achieving formal certification requires additional organizational controls (policies, procedures, training) beyond the technical architecture.
Network Partition Behavior
If the connection between a data cluster and the platform is interrupted:
- Data processing continues. Pipelines running on the cluster complete normally. New uploads are processed. Search works.
- Metadata changes accumulate. The batch sync service queues changes locally and flushes them when connectivity resumes.
- The platform marks the cluster as offline. Proxy requests return an error until the tunnel is re-established.
- No data is lost. The cluster is fully self-contained and operates independently during disconnection.
This design means that data clusters are not dependent on platform availability for their core data operations.
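The store-and-forward behavior during a partition can be sketched as a local queue that survives failed flushes. `send` stands in for the real sync API call; the class is illustrative, not the actual batch sync service:

```python
class MetadataSyncQueue:
    """Queue metadata changes locally; flush them when connectivity returns.
    A failed flush leaves the queue intact, so no change is ever dropped."""
    def __init__(self, send):
        self._send = send     # callable that pushes one batch to the platform
        self.pending = []

    def record(self, change):
        self.pending.append(change)

    def flush(self):
        if not self.pending:
            return 0
        try:
            self._send(list(self.pending))
        except ConnectionError:
            return 0          # tunnel still down: keep everything queued
        sent = len(self.pending)
        self.pending.clear()
        return sent
```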
Next Steps
- Architecture Overview — The full platform and data cluster architecture
- Data Clusters — How per-tenant isolation works in practice
- Security Model — Authentication, authorization, and network security details for both hosting modes
- Create a Data Plane — Deploy a data plane on your own infrastructure for full physical sovereignty
- Compliance — GDPR, HIPAA, and ISO 27001 compliance posture