Data Clusters

A data cluster is an isolated tenant environment running on a data plane. Each data cluster has its own database, object storage, vector search collection, full-text search indexes, API deployment, and network connector. This is the unit of data isolation: one organization's data is logically separated from another's at every layer of the stack, with independently scoped credentials for each resource.

What Is a Data Cluster?

When an organization needs to store and process documents on the Alien Intelligence platform, a data cluster is created for them on a data plane. The data cluster provides:

A dedicated Data API — the REST service that handles all data operations
A separate database — for entry metadata, dataset configuration, and manifests
A separate storage bucket — for uploaded files, processed content, and figures
A separate vector collection — for embedding vectors used in semantic search
Separate search indexes — for keyword search over document content
A dedicated network tunnel — connecting the cluster to the platform

From the tenant's perspective, their data cluster is a self-contained data service. They interact with it through the platform (which proxies requests) or through AI agents (which access it via MCP tools). The tenant's data stays in their isolated cluster — whether it is hosted by Alien or deployed on-premise.

Per-Tenant Isolation

Every data cluster gets complete logical isolation within the shared infrastructure of its data plane. Here is exactly what is separated:

Resource	Isolation Mechanism	Naming Pattern
Kubernetes namespace	Separate namespace	`tenant-{slug}`
PostgreSQL database	Separate database + role	`{slug}` database
MinIO bucket	Separate bucket + IAM user	`{slug}-data`
Qdrant collection	Separate collection + JWT	`{slug}_entries`
Meilisearch indexes	Separate indexes + API key	`{slug}_entries`, `{slug}_datasets`
Kubernetes secrets	Separate secret per namespace	`data-api-secrets`
Data API deployment	Separate pod(s) per tenant	In `tenant-{slug}` namespace
Skupper connector	Separate routing key	`data-api-{slug}`

Credentials are scoped per tenant. The Data API for Tenant A cannot access Tenant B's database, storage bucket, or vector collection — the credentials simply do not grant access to any other tenant's resources.
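
The naming patterns in the table can be derived from a single tenant slug. The following is a minimal sketch of that mapping; the function name `tenant_resource_names` and the dictionary keys are illustrative assumptions, not part of the platform's API.

```python
def tenant_resource_names(slug: str) -> dict[str, object]:
    """Derive per-tenant resource names from the naming patterns in the table."""
    return {
        "namespace": f"tenant-{slug}",
        "database": slug,
        "bucket": f"{slug}-data",
        "qdrant_collection": f"{slug}_entries",
        "meilisearch_indexes": [f"{slug}_entries", f"{slug}_datasets"],
        # The secret name is the same for every tenant; the namespace scopes it.
        "secret": "data-api-secrets",
        "routing_key": f"data-api-{slug}",
    }


names = tenant_resource_names("acme")
print(names["namespace"], names["bucket"], names["routing_key"])
```

Because every name is a pure function of the slug, two tenants with different slugs never collide on a namespace, bucket, collection, or routing key.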

info

The isolation is logical, not physical. Multiple tenants share the same PostgreSQL cluster, MinIO deployment, and Qdrant replicas. But each tenant's data is in separate databases, buckets, and collections with independently scoped credentials. This provides strong isolation while keeping infrastructure costs proportional to actual usage.

Automated Provisioning

When a new data cluster is created, the data plane operator runs an automated provisioning sequence. No manual infrastructure work is required — the operator handles everything from namespace creation to network tunnel setup.

Provisioning Sequence

The operator provisions resources in dependency order:

Create namespace — tenant-{slug} with appropriate labels
Prepare secrets — Empty secret placeholder for credential injection
Discover infrastructure — Locate shared PostgreSQL, MinIO, Qdrant, and Meilisearch endpoints
Provision database — Create PostgreSQL database, role, and connection pool entry
Provision storage — Create MinIO bucket and IAM user with per-tenant credentials
Provision vector database — Create Qdrant collection with configured vector dimensions
Provision search indexes — Create Meilisearch indexes and tenant-scoped API key
Deploy Data API — Create ArgoCD application for the tenant's Data API Helm chart
Configure networking — Create Skupper connector to expose the Data API via mTLS tunnel
Mark ready — Update cluster status to Ready

The entire sequence typically completes in minutes. If any step fails, the operator records the failure and can retry from the failed step during reconciliation.
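
The retry-from-failed-step behavior can be illustrated with a small sketch. It assumes the operator records completed steps in a status object and skips them on the next pass; the step identifiers, the `status` dictionary layout, and `run_provisioning` are hypothetical names chosen for this example, and the real operator may track progress differently.

```python
from typing import Callable

# Step identifiers follow the provisioning sequence; the names are illustrative.
STEP_NAMES = [
    "create_namespace", "prepare_secrets", "discover_infrastructure",
    "provision_database", "provision_storage", "provision_vector_database",
    "provision_search_indexes", "deploy_data_api", "configure_networking",
    "mark_ready",
]


def run_provisioning(steps: dict[str, Callable[[], None]], status: dict) -> dict:
    """Run steps in dependency order, skipping steps already recorded as done."""
    completed = status.setdefault("completed", [])
    status["failed_step"] = None
    status["error"] = None
    for name in STEP_NAMES:
        if name in completed:
            continue  # finished in an earlier pass
        try:
            steps[name]()
        except Exception as exc:
            # Record the failure and stop; a later pass resumes from this step.
            status["failed_step"] = name
            status["error"] = str(exc)
            return status
        completed.append(name)
    return status
```

Calling `run_provisioning` again with the same `status` object resumes at the recorded failure instead of repeating earlier steps.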

First Heartbeat

Once the Data API deployment is running, it sends its first heartbeat to the platform. This heartbeat transitions the cluster status from provisioning to active and begins the regular 30-second sync cycle.

Cluster Lifecycle

A data cluster moves through these states during its lifetime:

State	Description
Provisioning	Operator is creating infrastructure resources
Active	Cluster is healthy, heartbeat is current, data operations work
Degraded	One or more infrastructure components are unhealthy
Offline	Heartbeat has not been received within the timeout period
Terminating	Cluster is being deleted, resources are being cleaned up
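
One way to read the table is as a function from observed signals to a state. The sketch below is an interpretation under stated assumptions: the precedence of the checks, the `derive_state` name, and the 90-second timeout are all hypothetical, since the timeout period and exact evaluation order are not specified here.

```python
from enum import Enum
from typing import Mapping, Optional


class ClusterState(Enum):
    PROVISIONING = "Provisioning"
    ACTIVE = "Active"
    DEGRADED = "Degraded"
    OFFLINE = "Offline"
    TERMINATING = "Terminating"


def derive_state(
    seconds_since_heartbeat: Optional[float],
    component_healthy: Mapping[str, bool],
    terminating: bool = False,
    heartbeat_timeout: float = 90.0,  # assumed value, not from the documentation
) -> ClusterState:
    """Map observed signals to a lifecycle state (assumed precedence)."""
    if terminating:
        return ClusterState.TERMINATING
    if seconds_since_heartbeat is None:
        # No heartbeat has been received yet, so provisioning is still in progress.
        return ClusterState.PROVISIONING
    if seconds_since_heartbeat > heartbeat_timeout:
        return ClusterState.OFFLINE
    if not all(component_healthy.values()):
        return ClusterState.DEGRADED
    return ClusterState.ACTIVE
```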

State Transitions

Health Monitoring

Each data cluster's Data API sends a heartbeat to the platform every 30 seconds. The heartbeat includes:

Health status — Per-component checks for PostgreSQL, MinIO, Qdrant, and Meilisearch, with latency measurements
Sync statistics — Number of pending metadata changes, dataset count, total entries, total storage size
Version information — API version and configuration
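
To make the heartbeat contents concrete, here is a sketch of a payload shaped after the three groups listed above. The class and field names (`Heartbeat`, `latency_ms`, `pending_changes`, and so on) and the sample values are assumptions for illustration; the actual wire format is not documented here.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ComponentHealth:
    healthy: bool
    latency_ms: float


@dataclass
class SyncStats:
    pending_changes: int
    dataset_count: int
    total_entries: int
    total_storage_bytes: int


@dataclass
class Heartbeat:
    api_version: str
    components: dict[str, ComponentHealth]
    sync: SyncStats
    config: dict[str, str] = field(default_factory=dict)


hb = Heartbeat(
    api_version="1.0.0",  # placeholder version string
    components={
        "postgresql": ComponentHealth(healthy=True, latency_ms=2.1),
        "minio": ComponentHealth(healthy=True, latency_ms=4.8),
        "qdrant": ComponentHealth(healthy=True, latency_ms=3.3),
        "meilisearch": ComponentHealth(healthy=False, latency_ms=250.0),
    },
    sync=SyncStats(pending_changes=3, dataset_count=2,
                   total_entries=120, total_storage_bytes=5_000_000),
)
payload = json.dumps(asdict(hb))  # asdict recurses into nested dataclasses
```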

The platform uses heartbeats to:

Track cluster health on the dashboard
Route requests only to active clusters (offline/suspended clusters reject proxy requests)
Detect connectivity issues and alert operators

tip

If a heartbeat fails, the Data API retries with exponential backoff up to a maximum interval of 5 minutes. During a network partition, the cluster continues operating normally — data processing, search, and all local operations work independently.
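
The retry schedule can be sketched as a capped exponential sequence. The 5-minute cap comes from the description above; the 30-second starting delay and the doubling factor are assumptions made for this example.

```python
from itertools import islice
from typing import Iterator


def heartbeat_retry_delays(
    base: float = 30.0,       # assumed initial delay in seconds
    factor: float = 2.0,      # assumed growth factor
    max_delay: float = 300.0  # 5-minute cap
) -> Iterator[float]:
    """Yield retry delays in seconds: exponential growth, capped at max_delay."""
    delay = base
    while True:
        yield min(delay, max_delay)
        delay = min(delay * factor, max_delay)


first_six = list(islice(heartbeat_retry_delays(), 6))
print(first_six)  # [30.0, 60.0, 120.0, 240.0, 300.0, 300.0]
```

Production implementations often add random jitter to each delay so that many clusters recovering from the same outage do not retry in lockstep; that is omitted here for clarity.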

Cluster Connectivity

Data clusters connect to the Alien Intelligence platform through Skupper mTLS tunnels. This connectivity model has several important properties:

Outbound-Only

For on-premise deployments, the data cluster initiates the connection. No inbound firewall rules are required on your infrastructure. This makes on-premise data clusters deployable behind corporate firewalls, in air-gapped environments (with proxy), and on restrictive networks. For Alien-hosted clusters, connectivity is managed automatically.

Encrypted End-to-End

All traffic through the tunnel is encrypted with mutual TLS. Both sides authenticate with certificates — the platform cannot impersonate a data cluster, and vice versa.

Per-Tenant Routing

Each data cluster has a unique routing key that maps to its Data API endpoint. The platform uses this key to route proxy requests to the correct cluster. Routing is deterministic — the same cluster always receives the same routing key.

Resilient to Disconnection

If the tunnel drops, the data cluster continues operating independently. Metadata changes queue locally and sync when connectivity resumes. The platform marks the cluster as offline, and proxy requests return errors until the tunnel is re-established.
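
The queue-then-sync behavior can be sketched as follows. This is an in-memory illustration only: the `ChangeQueue` class and the `send` callable are hypothetical, and the real Data API would persist pending changes durably rather than in process memory.

```python
from collections import deque
from typing import Callable


class ChangeQueue:
    """Hold metadata changes locally and deliver them in order once possible."""

    def __init__(self, send: Callable[[dict], None]):
        # `send` is a hypothetical delivery function that raises
        # ConnectionError while the tunnel is down.
        self._send = send
        self._pending: deque = deque()

    def record(self, change: dict) -> None:
        self._pending.append(change)

    def flush(self) -> int:
        """Try to deliver pending changes; stop at the first connection error."""
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # keep the change queued for the next attempt
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._pending)
```

A change is removed from the queue only after `send` returns, so a drop in the middle of a flush loses nothing and preserves ordering.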

Infrastructure Per Cluster

Each data cluster's Data API provides access to the following per-tenant infrastructure:

PostgreSQL

Stores entry metadata, dataset configuration, manifest data (JSONB), the resource change log for sync, and webhook records. The database tracks the full state of every document — its files, processing status, versions, and relationships.

Key tables:

Table	Purpose
`datasets`	Dataset definitions with schema, pipeline config, and storage paths
`entries`	Individual document records with manifests, status, and version tracking
`weights`	AI model weights and training artifacts
`resource_change_log`	Queue for batch sync to platform
`webhook_records`	Outbound webhook delivery tracking

MinIO (S3-Compatible Storage)

Stores all file data in a structured path hierarchy:

{slug}-data/
  datasets/{dataset_id}/
    entries/{entry_id}/
      original/{timestamp}_{filename}     # Uploaded files
      processed/
        content.json                       # Extracted text content
        figures/
          fig_001.png                      # Extracted figures
          fig_002.png
      processing/                          # Intermediate pipeline artifacts
  weights/{type}/{model_id}/
    {timestamp}_{filename}                 # AI model weights
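
Object keys in this hierarchy can be built with small helper functions. The sketch below follows the layout shown above; the helper names are hypothetical, and the timestamp format (compact UTC) and three-digit figure numbering are assumptions inferred from the example file names.

```python
from datetime import datetime, timezone


def original_key(dataset_id: str, entry_id: str, filename: str,
                 uploaded_at: datetime) -> str:
    """Key for an uploaded file inside the tenant's bucket."""
    ts = uploaded_at.strftime("%Y%m%dT%H%M%SZ")  # assumed timestamp format
    return f"datasets/{dataset_id}/entries/{entry_id}/original/{ts}_{filename}"


def content_key(dataset_id: str, entry_id: str) -> str:
    """Key for the extracted text content of an entry."""
    return f"datasets/{dataset_id}/entries/{entry_id}/processed/content.json"


def figure_key(dataset_id: str, entry_id: str, index: int) -> str:
    """Key for an extracted figure, numbered like fig_001.png."""
    return (f"datasets/{dataset_id}/entries/{entry_id}"
            f"/processed/figures/fig_{index:03d}.png")


when = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(original_key("ds1", "e1", "report.pdf", when))
```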

Qdrant (Vector Database)

Stores embedding vectors for semantic search. Each entry's text is split into chunks during pipeline processing, and each chunk gets an embedding vector. A Qdrant point stores:

The embedding vector
The chunk text
Metadata: dataset ID, entry ID, chunk index, figure references
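
A point with these three parts might look like the plain dictionary below, using the `id`, `vector`, and `payload` keys that Qdrant points carry. The payload field names (`text`, `dataset_id`, `entry_id`, `chunk_index`, `figure_refs`) and the `build_point` helper are assumptions for illustration; the actual payload schema is not documented here.

```python
from typing import Sequence


def build_point(point_id: str, vector: Sequence[float], chunk_text: str,
                dataset_id: str, entry_id: str, chunk_index: int,
                figure_refs: Sequence[str] = ()) -> dict:
    """Assemble one chunk's point: embedding vector plus text and metadata."""
    return {
        "id": point_id,
        "vector": list(vector),
        "payload": {
            "text": chunk_text,
            "dataset_id": dataset_id,
            "entry_id": entry_id,
            "chunk_index": chunk_index,
            "figure_refs": list(figure_refs),
        },
    }


point = build_point(
    point_id="00000000-0000-0000-0000-000000000001",  # placeholder ID
    vector=[0.1, 0.2, 0.3],  # real vectors match the collection's dimensions
    chunk_text="First chunk of the document.",
    dataset_id="ds1", entry_id="e1", chunk_index=0,
    figure_refs=["fig_001.png"],
)
```

Storing the dataset and entry IDs in the payload lets a search be filtered to a single dataset or traced back to the source entry.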

Meilisearch (Full-Text Search)

Maintains two indexes per tenant:

{slug}_entries — Searchable document content with faceted filtering by dataset, status, MIME type, and tags
{slug}_datasets — Dataset names and descriptions for internal discovery

Reconciliation

The data plane operator periodically checks each data cluster's actual infrastructure against its desired state. Every 300 seconds, the reconciliation loop:

Verifies all provisioned resources exist (database, bucket, collection, indexes)
Recreates missing resources if needed
Regenerates missing credentials (does not rotate healthy credentials)
Updates the cluster's status based on current infrastructure health

This self-healing behavior means that transient infrastructure issues (a temporarily unavailable component, a deleted secret) are corrected automatically without operator intervention.
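
A single reconciliation pass can be sketched as a comparison of desired and actual state. The function and parameter names below are hypothetical, and the callbacks stand in for whatever the operator does to create a resource or issue a credential; the point of the sketch is that existing credentials are left untouched.

```python
from typing import Callable


def reconcile_once(
    desired: list[str],
    existing: set[str],
    credentials: dict[str, str],
    create_resource: Callable[[str], None],
    issue_credential: Callable[[str], str],
) -> dict[str, list[str]]:
    """Recreate missing resources and fill in missing credentials only."""
    recreated, regenerated = [], []
    for name in desired:
        if name not in existing:
            create_resource(name)
            existing.add(name)
            recreated.append(name)
        if name not in credentials:
            # Only missing credentials are issued; healthy ones are not rotated.
            credentials[name] = issue_credential(name)
            regenerated.append(name)
    return {"recreated": recreated, "regenerated": regenerated}
```

Running the same pass twice in a row changes nothing the second time, which is what makes it safe to repeat on a fixed interval.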

Cluster Deletion

When a data cluster is deleted, the operator tears down its resources in roughly the reverse of the provisioning order:

Remove the ArgoCD application (stops the Data API)
Remove the Skupper connector (disconnects the tunnel)
Delete Meilisearch indexes and credentials
Delete the Qdrant collection and credentials
Delete MinIO IAM user and bucket data
Delete the PostgreSQL database and credentials
Notify the platform that deletion is complete
Delete the Kubernetes namespace

The deletion sequence continues even if individual steps fail, so one failing component does not block the rest of the cleanup, and the steps are idempotent. Once complete, all tenant data is permanently removed from the data plane.
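
The continue-on-failure behavior can be sketched as a loop that records errors instead of aborting. The `run_cleanup` function and the step names are hypothetical; each action is assumed to be safe to call again if a later pass retries it.

```python
from typing import Callable


def run_cleanup(steps: list[tuple[str, Callable[[], None]]]) -> dict[str, str]:
    """Run every cleanup step in order and collect failures without stopping."""
    failures: dict[str, str] = {}
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            # Record the error and move on so later resources still get removed.
            failures[name] = str(exc)
    return failures
```

An empty result means every step succeeded; a non-empty result names the steps that still need another attempt.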

Next Steps

Data Planes — The infrastructure layer that hosts data clusters
Datasets and Entries — The data model inside a data cluster
Data Sovereignty — How data isolation and sovereignty work
Pipelines — How documents are processed into searchable knowledge
Create a Data Cluster — Step-by-step guide to creating a data cluster
