Data Clusters
A data cluster is an isolated tenant environment running on a data plane. Each data cluster has its own database, object storage, vector search collection, full-text search indexes, API deployment, and network connector. This is the unit of data isolation — one organization's data is completely separated from another's at every layer of the stack.
What Is a Data Cluster?
When an organization needs to store and process documents on the Alien Intelligence platform, a data cluster is created for them on a data plane. The data cluster provides:
- A dedicated Data API — the REST service that handles all data operations
- A separate database — for entry metadata, dataset configuration, and manifests
- A separate storage bucket — for uploaded files, processed content, and figures
- A separate vector collection — for embedding vectors used in semantic search
- Separate search indexes — for keyword search over document content
- A dedicated network tunnel — connecting the cluster to the platform
From the tenant's perspective, their data cluster is a self-contained data service. They interact with it through the platform (which proxies requests) or through AI agents (which access it via MCP tools). The tenant's data stays in their isolated cluster — whether it is hosted by Alien or deployed on-premise.
Per-Tenant Isolation
Every data cluster gets complete logical isolation within the shared infrastructure of its data plane. Here is exactly what is separated:
| Resource | Isolation Mechanism | Naming Pattern |
|---|---|---|
| Kubernetes namespace | Separate namespace | tenant-{slug} |
| PostgreSQL database | Separate database + role | {slug} database |
| MinIO bucket | Separate bucket + IAM user | {slug}-data |
| Qdrant collection | Separate collection + JWT | {slug}_entries |
| Meilisearch indexes | Separate indexes + API key | {slug}_entries, {slug}_datasets |
| Kubernetes secrets | Separate secret per namespace | data-api-secrets |
| Data API deployment | Separate pod(s) per tenant | In tenant-{slug} namespace |
| Skupper connector | Separate routing key | data-api-{slug} |
Credentials are scoped per tenant. The Data API for Tenant A cannot access Tenant B's database, storage bucket, or vector collection — the credentials simply do not grant access to any other tenant's resources.
The isolation is logical, not physical. Multiple tenants share the same PostgreSQL cluster, MinIO deployment, and Qdrant replicas. But each tenant's data is in separate databases, buckets, and collections with independently scoped credentials. This provides strong isolation while keeping infrastructure costs proportional to actual usage.
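As an illustration, the naming patterns in the isolation table can be derived from a tenant slug. This is a minimal sketch; the `TenantResources` type and `resource_names` function are hypothetical names, not part of the platform's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantResources:
    """Per-tenant resource names (hypothetical container type)."""
    namespace: str
    database: str
    bucket: str
    qdrant_collection: str
    meili_indexes: tuple
    secret_name: str
    routing_key: str


def resource_names(slug: str) -> TenantResources:
    """Apply the naming patterns from the isolation table to one slug."""
    return TenantResources(
        namespace=f"tenant-{slug}",
        database=slug,
        bucket=f"{slug}-data",
        qdrant_collection=f"{slug}_entries",
        meili_indexes=(f"{slug}_entries", f"{slug}_datasets"),
        secret_name="data-api-secrets",  # same name, one per namespace
        routing_key=f"data-api-{slug}",
    )
```

Because every name is a pure function of the slug, two tenants with different slugs never share a namespace, bucket, collection, or routing key.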
Automated Provisioning
When a new data cluster is created, the data plane operator runs an automated provisioning sequence. No manual infrastructure work is required — the operator handles everything from namespace creation to network tunnel setup.
Provisioning Sequence
The operator provisions resources in dependency order:
- Create namespace: tenant-{slug} with appropriate labels
- Prepare secrets: Empty secret placeholder for credential injection
- Discover infrastructure: Locate shared PostgreSQL, MinIO, Qdrant, and Meilisearch endpoints
- Provision database: Create PostgreSQL database, role, and connection pool entry
- Provision storage: Create MinIO bucket and IAM user with per-tenant credentials
- Provision vector database: Create Qdrant collection with configured vector dimensions
- Provision search indexes: Create Meilisearch indexes and tenant-scoped API key
- Deploy Data API: Create ArgoCD application for the tenant's Data API Helm chart
- Configure networking: Create Skupper connector to expose the Data API via mTLS tunnel
- Mark ready: Update cluster status to Ready
The entire sequence typically completes in minutes. If any step fails, the operator records the failure and can retry from the failed step during reconciliation.
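The retry-from-failed-step behavior can be sketched as an ordered list of steps plus a record of how many have completed. This is an illustrative sketch only: the `run_provisioning` function and the shape of the `status` dictionary are assumptions, not the operator's actual implementation.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_provisioning(steps: List[Step], status: Dict) -> Dict:
    """Run steps in order, resuming after the last completed step.

    `status` is a hypothetical record: "completed" counts finished
    steps, "failed_step" names the step that raised, if any.
    """
    for index in range(status.get("completed", 0), len(steps)):
        name, action = steps[index]
        try:
            action()
        except Exception as exc:
            # Record the failure and stop; the next reconciliation
            # calls this function again and resumes at this step.
            status.update(phase="Provisioning", failed_step=name, error=str(exc))
            return status
        status["completed"] = index + 1
    status.update(phase="Ready", failed_step=None, error=None)
    return status
```

Calling the function again with the returned `status` skips the steps that already succeeded, so each step should still be safe to run more than once in case a failure happens after its side effects were applied.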
First Heartbeat
Once the Data API deployment is running, it sends its first heartbeat to the platform. This heartbeat transitions the cluster status from provisioning to active and begins the regular 30-second sync cycle.
Cluster Lifecycle
A data cluster moves through these states during its lifetime:
| State | Description |
|---|---|
| Provisioning | Operator is creating infrastructure resources |
| Active | Cluster is healthy, heartbeat is current, data operations work |
| Degraded | One or more infrastructure components are unhealthy |
| Offline | Heartbeat has not been received within the timeout period |
| Terminating | Cluster is being deleted, resources are being cleaned up |
State Transitions
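The transitions implied by the state descriptions and the heartbeat behavior on this page can be sketched as a small function. The rules below are inferred from those descriptions rather than taken from the platform's code, and `ClusterState` and `next_state` are hypothetical names.

```python
from enum import Enum


class ClusterState(Enum):
    PROVISIONING = "provisioning"
    ACTIVE = "active"
    DEGRADED = "degraded"
    OFFLINE = "offline"
    TERMINATING = "terminating"


def next_state(current, *, heartbeat_current, all_components_healthy,
               deletion_requested=False):
    """Return the next state given the latest observations (inferred rules)."""
    if current is ClusterState.TERMINATING or deletion_requested:
        return ClusterState.TERMINATING
    if not heartbeat_current:
        # Before the first heartbeat the cluster is still provisioning;
        # after that, a missing heartbeat means offline.
        if current is ClusterState.PROVISIONING:
            return ClusterState.PROVISIONING
        return ClusterState.OFFLINE
    if all_components_healthy:
        return ClusterState.ACTIVE
    return ClusterState.DEGRADED
```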
Health Monitoring
Each data cluster's Data API sends a heartbeat to the platform every 30 seconds. The heartbeat includes:
- Health status — Per-component checks for PostgreSQL, MinIO, Qdrant, and Meilisearch, with latency measurements
- Sync statistics — Number of pending metadata changes, dataset count, total entries, total storage size
- Version information — API version and configuration
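To make the three parts of the heartbeat concrete, the sketch below models them as dataclasses and serializes them to JSON. All field names and the JSON encoding are assumptions for illustration; the actual wire format is not described here.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass
class ComponentHealth:
    healthy: bool
    latency_ms: float


@dataclass
class SyncStats:
    pending_changes: int
    dataset_count: int
    total_entries: int
    total_storage_bytes: int


@dataclass
class Heartbeat:
    components: Dict[str, ComponentHealth]  # keyed by component name
    sync: SyncStats
    api_version: str

    def overall_healthy(self) -> bool:
        """True only if every component check passed."""
        return all(c.healthy for c in self.components.values())

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses and dict values.
        return json.dumps(asdict(self))
```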
The platform uses heartbeats to:
- Track cluster health on the dashboard
- Route requests only to active clusters (offline/suspended clusters reject proxy requests)
- Detect connectivity issues and alert operators
If a heartbeat fails, the Data API retries with exponential backoff up to a maximum interval of 5 minutes. During a network partition, the cluster continues operating normally — data processing, search, and all local operations work independently.
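A capped exponential backoff of this kind might look like the sketch below. The 5-minute cap comes from the text above; starting from the 30-second heartbeat interval and doubling on each failure are assumptions, and `retry_delay` is a hypothetical name.

```python
def retry_delay(failures: int, base: float = 30.0, cap: float = 300.0) -> float:
    """Seconds to wait before the next attempt after `failures` consecutive failures."""
    if failures <= 0:
        return base
    # Clamp the exponent so very long outages cannot overflow the float.
    return min(cap, base * 2 ** min(failures, 16))
```

Under these assumptions the wait grows as 60, 120, and 240 seconds after the first three failures and then stays at 300 seconds until a heartbeat succeeds.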
Cluster Connectivity
Data clusters connect to the Alien Intelligence platform through Skupper mTLS tunnels. This connectivity model has several important properties:
Outbound-Only
For on-premise deployments, the data cluster initiates the connection. No inbound firewall rules are required on your infrastructure. This makes on-premise data clusters deployable behind corporate firewalls, in air-gapped environments (through an outbound proxy), and on restrictive networks. For Alien-hosted clusters, connectivity is managed automatically.
Encrypted End-to-End
All traffic through the tunnel is encrypted with mutual TLS. Both sides authenticate with certificates — the platform cannot impersonate a data cluster, and vice versa.
Per-Tenant Routing
Each data cluster has a unique routing key that maps to its Data API endpoint. The platform uses this key to route proxy requests to the correct cluster. Routing is deterministic — the same cluster always receives the same routing key.
Resilient to Disconnection
If the tunnel drops, the data cluster continues operating independently. Metadata changes queue locally and sync when connectivity resumes. The platform marks the cluster as offline, and proxy requests return errors until the tunnel is re-established.
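The queue-and-sync behavior can be illustrated with an in-memory queue that keeps changes until a send succeeds. This is a simplified sketch: `ChangeQueue`, its `send` callback, and the batch size are hypothetical, and a real queue would need to be persisted rather than held in memory.

```python
from collections import deque
from itertools import islice
from typing import Callable, Dict, List


class ChangeQueue:
    """Holds metadata changes until the platform is reachable."""

    def __init__(self, send: Callable[[List[Dict]], None], batch_size: int = 100):
        self._pending = deque()
        self._send = send
        self._batch_size = batch_size

    def record(self, change: Dict) -> None:
        self._pending.append(change)

    def flush(self) -> int:
        """Send pending changes in order; return how many were delivered."""
        sent = 0
        while self._pending:
            batch = list(islice(self._pending, self._batch_size))
            try:
                self._send(batch)
            except ConnectionError:
                break  # tunnel is down; keep the changes for the next flush
            for _ in batch:
                self._pending.popleft()
            sent += len(batch)
        return sent

    def __len__(self) -> int:
        return len(self._pending)
```

Changes are removed from the queue only after the send call returns, so a dropped tunnel never loses a queued change.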
Infrastructure Per Cluster
Each data cluster's Data API provides access to the following per-tenant infrastructure:
PostgreSQL
Stores entry metadata, dataset configuration, manifest data (JSONB), the resource change log for sync, and webhook records. The database tracks the full state of every document — its files, processing status, versions, and relationships.
Key tables:
| Table | Purpose |
|---|---|
| datasets | Dataset definitions with schema, pipeline config, and storage paths |
| entries | Individual document records with manifests, status, and version tracking |
| weights | AI model weights and training artifacts |
| resource_change_log | Queue for batch sync to platform |
| webhook_records | Outbound webhook delivery tracking |
MinIO (S3-Compatible Storage)
Stores all file data in a structured path hierarchy:
    {slug}-data/
      datasets/{dataset_id}/
        entries/{entry_id}/
          original/{timestamp}_{filename}    # Uploaded files
          processed/
            content.json                     # Extracted text content
            figures/
              fig_001.png                    # Extracted figures
              fig_002.png
          processing/                        # Intermediate pipeline artifacts
      weights/{type}/{model_id}/
        {timestamp}_{filename}               # AI model weights
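Object keys that follow this hierarchy could be built as shown below. This is a sketch under stated assumptions: the timestamp format is not specified in this document, so a compact UTC format is used for illustration, and placing `figures/` under `processed/` is likewise assumed. The function names are hypothetical, and keys are relative to the tenant's `{slug}-data` bucket.

```python
from datetime import datetime, timezone


def original_key(dataset_id: str, entry_id: str, filename: str,
                 uploaded_at: datetime) -> str:
    """Key for an uploaded file (timestamp format is an assumption)."""
    stamp = uploaded_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"datasets/{dataset_id}/entries/{entry_id}/original/{stamp}_{filename}"


def figure_key(dataset_id: str, entry_id: str, index: int) -> str:
    """Key for an extracted figure, numbered fig_001.png, fig_002.png, ..."""
    return (f"datasets/{dataset_id}/entries/{entry_id}"
            f"/processed/figures/fig_{index:03d}.png")
```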
Qdrant (Vector Database)
Stores embedding vectors for semantic search. Each entry's text is split into chunks during pipeline processing, and each chunk gets an embedding vector. A Qdrant point stores:
- The embedding vector
- The chunk text
- Metadata: dataset ID, entry ID, chunk index, figure references
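A point with these contents could be assembled as a plain dictionary in Qdrant's id, vector, and payload shape. The payload field names and the deterministic UUID scheme below are assumptions for illustration, and `build_point` is a hypothetical helper rather than part of the Data API.

```python
import uuid
from typing import Dict, Iterable, Sequence


def build_point(dataset_id: str, entry_id: str, chunk_index: int, text: str,
                vector: Sequence[float], figure_refs: Iterable[str] = ()) -> Dict:
    """Build one point for a text chunk (payload field names are assumed)."""
    # Deriving the ID from entry and chunk index means re-processing the
    # same chunk overwrites its point instead of creating a duplicate.
    point_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{entry_id}/{chunk_index}"))
    return {
        "id": point_id,
        "vector": list(vector),
        "payload": {
            "text": text,
            "dataset_id": dataset_id,
            "entry_id": entry_id,
            "chunk_index": chunk_index,
            "figure_refs": list(figure_refs),
        },
    }
```

Keeping the dataset ID and entry ID in the payload lets a search be filtered to one dataset or traced back to its source document.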
Meilisearch (Full-Text Search)
Maintains two indexes per tenant:
- {slug}_entries: Searchable document content with faceted filtering by dataset, status, MIME type, and tags
- {slug}_datasets: Dataset names and descriptions for internal discovery
Reconciliation
The data plane operator periodically checks each data cluster's actual infrastructure against its desired state. Every 300 seconds, the reconciliation loop:
- Verifies all provisioned resources exist (database, bucket, collection, indexes)
- Recreates missing resources if needed
- Regenerates missing credentials (does not rotate healthy credentials)
- Updates the cluster's status based on current infrastructure health
This self-healing behavior means that transient infrastructure issues (a temporarily unavailable component, a deleted secret) are corrected automatically without operator intervention.
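One pass of such a loop can be sketched as a comparison between desired and existing resources. The `reconcile` function and its `exists` and `recreate` callbacks are hypothetical stand-ins for the operator's real checks against PostgreSQL, MinIO, Qdrant, and Meilisearch.

```python
from typing import Callable, Dict, Set


def reconcile(desired: Set[str], exists: Callable[[str], bool],
              recreate: Callable[[str], None]) -> Dict:
    """Recreate any desired resource that is missing; leave the rest alone."""
    recreated, failed = [], []
    for resource in sorted(desired):
        if exists(resource):
            continue  # healthy resources are not touched or rotated
        try:
            recreate(resource)
            recreated.append(resource)
        except Exception:
            failed.append(resource)  # picked up again on the next pass
    return {"recreated": recreated, "failed": failed, "healthy": not failed}
```

An operator would call a function like this once per cluster on every 300-second cycle and use the result to update the cluster's status.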
Cluster Deletion
When a data cluster is deleted, the operator performs a reverse-order cleanup:
- Remove the ArgoCD application (stops the Data API)
- Remove the Skupper connector (disconnects the tunnel)
- Delete Meilisearch indexes and credentials
- Delete the Qdrant collection and credentials
- Delete MinIO IAM user and bucket data
- Delete the PostgreSQL database and credentials
- Notify the platform that deletion is complete
- Delete the Kubernetes namespace
The deletion sequence continues even if individual steps fail, ensuring idempotent cleanup. Once complete, all tenant data is permanently removed from the data plane.
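The continue-on-failure behavior can be sketched as a loop that collects errors instead of stopping at the first one. `delete_cluster` and the step names in the usage note are hypothetical; the real operator's error handling is not described in this document.

```python
from typing import Callable, Dict, List, Tuple


def delete_cluster(steps: List[Tuple[str, Callable[[], None]]]) -> Dict[str, str]:
    """Run every cleanup step in order; return the errors keyed by step name."""
    errors = {}
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            # Keep going so one unreachable component cannot block
            # cleanup of the remaining resources.
            errors[name] = str(exc)
    return errors
```

For example, if the step that deletes the Qdrant collection raises, the later steps for MinIO, PostgreSQL, and the namespace still run, and the returned dictionary shows which step needs to be retried.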
Next Steps
- Data Planes — The infrastructure layer that hosts data clusters
- Datasets and Entries — The data model inside a data cluster
- Data Sovereignty — How data isolation and sovereignty work
- Pipelines — How documents are processed into searchable knowledge
- Create a Data Cluster — Step-by-step guide to creating a data cluster