Data Sovereignty

Data sovereignty on Alien Intelligence means your documents, embeddings, vector indexes, and processed files are isolated in dedicated per-tenant data clusters — never shared with other tenants and never accessible from the platform's orchestration layer.

By default, Alien hosts and manages your data clusters on Alien's infrastructure. For enterprise clients in regulated industries who require full physical control, data clusters can be deployed on your own infrastructure — on-premises or in your cloud account. This on-premise option provides the strongest form of data sovereignty: your data physically resides on infrastructure you control.

Two Levels of Data Sovereignty

Level 1: Tenant Isolation (All Deployments)

Every tenant on Alien Intelligence — whether Alien-hosted or on-premise — gets complete data isolation enforced at the architecture level:

  • Each tenant has dedicated databases, storage buckets, vector collections, and search indexes
  • Credentials are scoped per tenant — one tenant's services cannot access another tenant's resources
  • The platform orchestration layer stores only metadata pointers, never document content
  • All data access is authenticated, per-entry, and logged

This isolation is not a configuration option or a contractual promise. It is enforced by the system's architecture: separate databases, separate credentials, separate namespaces.

Level 2: Physical Data Sovereignty (On-Premise)

For organizations that require data to physically reside on their own infrastructure — due to GDPR, HIPAA, or organizational policy — data clusters can be deployed on-premise. This adds:

  • Physical data residency on infrastructure you control
  • Outbound-only network connections via encrypted mTLS tunnels
  • No inbound firewall rules required
  • Full independence during network partitions

On-premise deployment is recommended only for teams with the capacity to manage Kubernetes infrastructure. Alien-hosted clusters are maintained by Alien, with faster incident response and automatic updates.

Platform vs Data Cluster Split

The platform is split into two architectural layers:

Platform (Orchestration — managed by Alien)

The platform is the orchestration layer, operated by Alien Intelligence as a managed service. It handles:

  • User authentication and authorization
  • Dataset catalog (metadata pointers only — never content)
  • Job scheduling and workflow orchestration
  • AI agent access management (MCP servers)
  • Billing and subscription management
  • Cross-cluster search coordination

What the platform never holds: document files, processed text, embeddings, vector indexes, full-text search indexes, figures, or any customer content.

Data Clusters (Alien Hosted or On-Premise)

Data clusters hold all customer data. Each tenant's data cluster contains:

  • Original uploaded documents (PDF, DOCX, images, XML)
  • Processed content (extracted text, figures, metadata)
  • Embedding vectors for semantic search
  • Full-text search indexes for keyword search
  • Object storage (MinIO) with all file data
  • Relational databases (PostgreSQL) with entry metadata and manifests

What leaves the data cluster: only metadata summaries — dataset names, entry counts, processing status, and sync timestamps. Never content.

Five Enforcement Mechanisms

Data isolation is enforced through five independent mechanisms. Each provides meaningful protection on its own; together, they make unauthorized data movement architecturally impossible. These mechanisms apply to all deployments — Alien-hosted and on-premise alike.

1. Network Topology

For on-premise deployments, data clusters initiate outbound-only connections to the platform through encrypted mTLS tunnels (Skupper). There are no inbound ports opened on your infrastructure, no firewall rules to configure, and no way for the platform to initiate a connection to your cluster.

This means that even if the platform were compromised, an attacker could not reach into your infrastructure. The tunnel is client-initiated, encrypted end-to-end, and authenticated with mutual TLS certificates.

For Alien-hosted deployments, the connectivity between the platform and data clusters is managed internally by Alien on secured infrastructure.
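The outbound, mutually-authenticated connection described above can be illustrated with a short sketch. This is not Skupper's implementation — just a generic illustration, using Python's `ssl` module, of a client that always verifies the server and can present its own certificate; the certificate paths are hypothetical.

```python
import ssl

def outbound_mtls_context(client_cert=None, client_key=None):
    """Build a TLS context for a client-initiated tunnel: the data
    cluster dials out, verifies the platform's certificate against
    trusted CAs, and (for mutual TLS) presents its own certificate."""
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    if client_cert:  # hypothetical PEM paths, e.g. issued per cluster
        ctx.load_cert_chain(certfile=client_cert, keyfile=client_key)
    return ctx

ctx = outbound_mtls_context()
# The server certificate is always verified; no inbound listener exists.
assert ctx.verify_mode == ssl.CERT_REQUIRED and ctx.check_hostname
```

Because the cluster is always the TLS client, there is nothing listening on your side for the platform (or an attacker) to connect to.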

2. Proxy Architecture

Platform services (workers, AI agents, MCP tools) never connect to data clusters directly. All data access goes through an authenticated proxy endpoint on the platform backend. The proxy:

  • Authenticates every request (user token or service credential)
  • Checks authorization (organization membership, cluster status)
  • Forwards the request to the data cluster using per-cluster service credentials
  • Streams the response back without caching or storing any content

The platform backend acts as a relay — it forwards request bytes and streams response bytes, but it never persists response content. There is no cache, no temporary storage, and no log that captures document content.
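A minimal sketch of this relay pattern (all names hypothetical; the real proxy's interfaces are not documented here): authenticate, authorize, then stream chunks through a generator so no complete response body ever materializes on the platform.

```python
def proxy_request(request, authenticate, authorize, open_stream):
    """Relay a data request: auth first, then yield response chunks
    without buffering or persisting any content."""
    principal = authenticate(request)            # user token or service credential
    authorize(principal, request["cluster_id"])  # org membership, cluster status
    for chunk in open_stream(request):           # per-cluster service credentials
        yield chunk                              # streamed through, never stored

# Stub collaborators to show the flow (all hypothetical):
chunks = list(proxy_request(
    {"cluster_id": "c1", "entry": "paper-001.pdf"},
    authenticate=lambda req: "user-42",
    authorize=lambda who, cluster: None,
    open_stream=lambda req: iter([b"%PDF-", b"...bytes..."]),
))
assert chunks == [b"%PDF-", b"...bytes..."]
```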

3. Metadata-Only Sync

Data clusters periodically push metadata to the platform to keep the catalog up to date. This sync includes:

| What syncs | Example | Purpose |
| --- | --- | --- |
| Dataset names and descriptions | "BioRxiv 2024 Archive" | Catalog discovery |
| Entry counts and sizes | 12,450 entries, 8.2 GB | Dashboard statistics |
| Processing status | 98% processed | Progress tracking |
| Sync timestamps | Last sync: 30 seconds ago | Health monitoring |

What never syncs: file content, processed text, embedding vectors, search index data, figure images, or any document payload.

The platform's dataset catalog stores pointers — enough information to route a request to the right cluster, but never enough to reconstruct the data.
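One way to make "metadata-only" mechanical is an allow-list: anything not explicitly named never leaves the cluster. A sketch, with field names assumed from the sync table above:

```python
# Allow-listed metadata fields (assumed from the sync table; illustrative)
SYNC_FIELDS = {"name", "entry_count", "total_size_bytes",
               "status", "mime_type", "file_size_bytes", "last_updated"}

def to_sync_payload(record):
    """Keep only allow-listed metadata; content fields cannot leak
    because they are dropped by construction."""
    return {k: v for k, v in record.items() if k in SYNC_FIELDS}

entry = {"name": "paper-001.pdf", "status": "processed",
         "text": "...extracted content...", "embedding": [0.1, 0.2]}
assert to_sync_payload(entry) == {"name": "paper-001.pdf", "status": "processed"}
```

The inverse design (a deny-list of forbidden fields) would fail open when a new content field is added; an allow-list fails closed.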

4. Namespace Isolation

Each tenant on a data cluster gets a dedicated Kubernetes namespace with completely separate infrastructure resources:

| Resource | Isolation | Credential Scope |
| --- | --- | --- |
| PostgreSQL database | Separate database per tenant | Per-tenant database role |
| MinIO storage | Separate bucket per tenant | Per-tenant IAM credentials |
| Qdrant collection | Separate collection per tenant | Per-tenant JWT token |
| Meilisearch indexes | Separate indexes per tenant | Per-tenant API key |
| Data API deployment | Separate deployment per tenant | Namespace-scoped |
| Network connector | Separate tunnel endpoint per tenant | Per-tenant routing key |

There is no cross-tenant namespace access. Credentials are scoped so that one tenant's Data API deployment can only reach its own database, storage bucket, vector collection, and search indexes.
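The table above can be read as a provisioning rule: every resource name and credential is derived from the tenant and never shared. A sketch (the naming scheme is hypothetical, not Alien's actual convention):

```python
def tenant_resources(tenant_id):
    """Derive per-tenant resource names; every tenant gets its own set."""
    return {
        "namespace":         f"tenant-{tenant_id}",
        "postgres_db":       f"tenant_{tenant_id}",
        "minio_bucket":      f"tenant-{tenant_id}-files",
        "qdrant_collection": f"tenant-{tenant_id}-vectors",
        "meili_index":       f"tenant-{tenant_id}-fulltext",
    }

a, b = tenant_resources("acme"), tenant_resources("globex")
# No resource name is shared between any two tenants:
assert not set(a.values()) & set(b.values())
```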

5. No Data Egress Paths

The Data API — the service that manages all customer data on the cluster — has no endpoints designed for bulk data export to the platform. Every data access operation is:

  • Authenticated — requires a valid service credential
  • Scoped to a single entry — no "export all" endpoints exist
  • Logged — every proxy call is recorded with the requestor identity, timestamp, and access path

There is no mechanism in the API to stream an entire dataset, export a collection of embeddings, or transfer vector indexes back to the platform.
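A sketch of what "scoped to a single entry" means at the API surface (a hypothetical handler — the real Data API routes are not listed in this doc): every operation takes exactly one entry ID, and every call appends an audit record before any data is returned.

```python
import datetime

ACCESS_LOG = []  # in a real system: durable, append-only audit storage

def get_entry(entry_id, credential, store):
    """Fetch one entry. There is no companion 'get_all' —
    bulk export simply has no route."""
    if not credential:          # every request must be authenticated
        raise PermissionError("missing service credential")
    ACCESS_LOG.append({         # requestor identity, timestamp, access path
        "who": credential,
        "when": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "path": f"/entries/{entry_id}",
    })
    return store[entry_id]

store = {"e1": b"doc bytes"}
assert get_entry("e1", "svc-cred", store) == b"doc bytes"
assert ACCESS_LOG[0]["path"] == "/entries/e1"
```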

What Flows Where

This table summarizes exactly what data crosses the boundary between data clusters and the platform:

| Direction | What | How | Frequency |
| --- | --- | --- | --- |
| Data cluster to platform | Dataset/entry metadata (names, counts, status) | Batch sync API call | Every 30 seconds |
| Data cluster to platform | Cluster health metrics | Heartbeat API call | Every 30 seconds |
| Data cluster to platform | Operator infrastructure status | Operator heartbeat | Every 60 seconds |
| Platform to data cluster | Authenticated data requests (proxied) | Encrypted tunnel, per-request | On demand |
| Platform to data cluster | Pipeline trigger commands | Kubernetes API (in-cluster) | On file upload |
| Neither direction | Document content, embeddings, vectors, figures | Never crosses boundary | Never |

(Diagram: data flow between the platform and data clusters.)

How the Platform Catalog Works

The platform maintains a catalog of all datasets across all clusters. This catalog enables users to browse, search, and manage their data from a single dashboard — but it contains only metadata.

When a document is processed on a data cluster, the batch sync service sends a summary to the platform:

```json
{
  "datasets": [{
    "name": "Research Papers 2024",
    "entry_count": 5420,
    "total_size_bytes": 3200000000,
    "last_updated": "2026-03-26T10:30:00Z"
  }],
  "entries": [{
    "name": "paper-001.pdf",
    "status": "processed",
    "mime_type": "application/pdf",
    "file_size_bytes": 2400000
  }]
}
```

The platform stores this metadata for catalog display and search routing. When a user wants to actually read the document, the platform proxies the request to the data cluster in real time — it does not serve content from its own storage.
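In other words, the catalog is a routing table. A toy sketch of proxy-on-read (all names hypothetical):

```python
CATALOG = {  # metadata pointers only: dataset -> owning cluster
    "Research Papers 2024": {"cluster": "cluster-eu-1", "entry_count": 5420},
}

def route_read(dataset, entry_name, fetch_from_cluster):
    """Resolve the owning cluster from the catalog pointer, then proxy
    the read in real time; nothing is served from platform storage."""
    pointer = CATALOG[dataset]
    return fetch_from_cluster(pointer["cluster"], dataset, entry_name)

content = route_read("Research Papers 2024", "paper-001.pdf",
                     lambda cluster, ds, name: f"<streamed from {cluster}>")
assert content == "<streamed from cluster-eu-1>"
```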

Compliance Implications

The architectural separation has direct implications for regulatory compliance:

GDPR

  • Customer data is isolated in dedicated per-tenant clusters. For on-premise deployments, data is physically stored in the customer's chosen jurisdiction
  • The platform cannot access data without the data cluster being operational and the connection active
  • Right to deletion cascades through all storage systems (database, object storage, vector database, search indexes)
  • Audit trail tracks all data access with requestor identity and access path
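The deletion cascade can be sketched as one operation fanning out to every store that might hold a trace of the entry (the store interfaces here are hypothetical stand-ins for PostgreSQL, MinIO, Qdrant, and Meilisearch):

```python
def delete_entry(entry_id, db, objects, vectors, fulltext):
    """Right to deletion: remove the entry from every storage system
    so no copy survives in any layer."""
    db.pop(entry_id, None)        # relational metadata (PostgreSQL)
    objects.pop(entry_id, None)   # original file (MinIO)
    vectors.pop(entry_id, None)   # embeddings (Qdrant)
    fulltext.pop(entry_id, None)  # search index docs (Meilisearch)

stores = [{"e1": "x"} for _ in range(4)]
delete_entry("e1", *stores)
assert all(s == {} for s in stores)
```

The key property is that deletion is a single entry point, so no storage system can be forgotten when a GDPR erasure request arrives.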

HIPAA

  • Encryption in transit via mTLS (between platform and cluster) and within the cluster (Istio service mesh)
  • Encryption at rest via managed storage providers
  • Role-based access control enforced at every layer (organization roles, API token scopes, per-service authentication)
  • Access logging covers all proxy calls and direct API access

ISO 27001

  • All infrastructure changes tracked through GitOps (every change is a Git commit)
  • Documented development workflow with audit trail
  • Automated compliance scanning across repositories
  • Security coding patterns enforced in CI pipelines
Info: The platform architecture is designed to be compatible with these frameworks. Achieving formal certification requires additional organizational controls (policies, procedures, training) beyond the technical architecture.

Network Partition Behavior

If the connection between a data cluster and the platform is interrupted:

  • Data processing continues. Pipelines running on the cluster complete normally. New uploads are processed. Search works.
  • Metadata changes accumulate. The batch sync service queues changes locally and flushes them when connectivity resumes.
  • The platform marks the cluster as offline. Proxy requests return an error until the tunnel is re-established.
  • No data is lost. The cluster is fully self-contained and operates independently during disconnection.

This design means that data clusters are not dependent on platform availability for their core data operations.
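The "metadata changes accumulate" behavior amounts to a local queue that drains once connectivity resumes. A minimal in-memory sketch (the real sync service's internals are not described in this doc, and would persist the queue durably):

```python
class SyncQueue:
    """Queue metadata changes locally; flush when the tunnel is back."""
    def __init__(self):
        self.pending = []

    def record(self, change):
        self.pending.append(change)   # always succeeds, even while offline

    def flush(self, send):
        while self.pending:
            send(self.pending[0])     # raises if the platform is unreachable
            self.pending.pop(0)       # drop a change only after delivery

q = SyncQueue()
q.record({"dataset": "A", "entry_count": 10})  # accumulated during partition
sent = []
q.flush(sent.append)                           # connectivity restored
assert sent == [{"dataset": "A", "entry_count": 10}] and q.pending == []
```

Removing an item only after a successful send means a failed flush leaves it queued for the next attempt, which is why no metadata is lost across partitions.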

Next Steps

  • Architecture Overview — The full platform and data cluster architecture
  • Data Clusters — How per-tenant isolation works in practice
  • Security Model — Authentication, authorization, and network security details for both hosting modes
  • Create a Data Plane — Deploy a data plane on your own infrastructure for full physical sovereignty
  • Compliance — GDPR, HIPAA, and ISO 27001 compliance posture