Data Clusters

A data cluster is an isolated tenant environment running on a data plane. Each data cluster has its own database, object storage, vector search collection, full-text search indexes, API deployment, and network connector. This is the unit of data isolation — one organization's data is completely separated from another's at every layer of the stack.

What Is a Data Cluster?

When an organization needs to store and process documents on the Alien Intelligence platform, a data cluster is created for them on a data plane. The data cluster provides:

  • A dedicated Data API — the REST service that handles all data operations
  • A separate database — for entry metadata, dataset configuration, and manifests
  • A separate storage bucket — for uploaded files, processed content, and figures
  • A separate vector collection — for embedding vectors used in semantic search
  • Separate search indexes — for keyword search over document content
  • A dedicated network tunnel — connecting the cluster to the platform

From the tenant's perspective, their data cluster is a self-contained data service. They interact with it through the platform (which proxies requests) or through AI agents (which access it via MCP tools). The tenant's data stays in their isolated cluster — whether it is hosted by Alien or deployed on-premise.

Per-Tenant Isolation

Every data cluster gets complete logical isolation within the shared infrastructure of its data plane. Here is exactly what is separated:

| Resource | Isolation Mechanism | Naming Pattern |
|---|---|---|
| Kubernetes namespace | Separate namespace | tenant-{slug} |
| PostgreSQL database | Separate database + role | {slug} database |
| MinIO bucket | Separate bucket + IAM user | {slug}-data |
| Qdrant collection | Separate collection + JWT | {slug}_entries |
| Meilisearch indexes | Separate indexes + API key | {slug}_entries, {slug}_datasets |
| Kubernetes secrets | Separate secret per namespace | data-api-secrets |
| Data API deployment | Separate pod(s) per tenant | In tenant-{slug} namespace |
| Skupper connector | Separate routing key | data-api-{slug} |

Credentials are scoped per tenant. The Data API for Tenant A cannot access Tenant B's database, storage bucket, or vector collection — the credentials simply do not grant access to any other tenant's resources.

Info: The isolation is logical, not physical. Multiple tenants share the same PostgreSQL cluster, MinIO deployment, and Qdrant replicas, but each tenant's data lives in separate databases, buckets, and collections with independently scoped credentials. This provides strong isolation while keeping infrastructure costs proportional to actual usage.
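As a sketch of how the naming patterns in the table fit together, the following Python helper derives each per-tenant resource name from a slug. The names `TenantResources` and `resource_names` are hypothetical, and the slug validation rule is an assumption; only the naming patterns themselves come from the table.

```python
import re
from dataclasses import dataclass

# Assumed slug rule (not stated on this page): lowercase letters, digits, hyphens.
_SLUG_RE = re.compile(r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?$")


@dataclass(frozen=True)
class TenantResources:
    namespace: str
    database: str
    bucket: str
    qdrant_collection: str
    meilisearch_indexes: tuple
    secret_name: str
    skupper_routing_key: str


def resource_names(slug: str) -> TenantResources:
    """Derive per-tenant resource names from the patterns in the table."""
    if not _SLUG_RE.match(slug):
        raise ValueError(f"invalid slug: {slug!r}")
    return TenantResources(
        namespace=f"tenant-{slug}",
        database=slug,
        bucket=f"{slug}-data",
        qdrant_collection=f"{slug}_entries",
        meilisearch_indexes=(f"{slug}_entries", f"{slug}_datasets"),
        secret_name="data-api-secrets",  # same name in every tenant namespace
        skupper_routing_key=f"data-api-{slug}",
    )
```

Because every name is a pure function of the slug, the same slug always yields the same set of resources, which is what makes per-tenant credential scoping straightforward to reason about.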

Automated Provisioning

When a new data cluster is created, the data plane operator runs an automated provisioning sequence. No manual infrastructure work is required — the operator handles everything from namespace creation to network tunnel setup.

Provisioning Sequence

The operator provisions resources in dependency order:

  1. Create namespace — tenant-{slug} with appropriate labels
  2. Prepare secrets — Empty secret placeholder for credential injection
  3. Discover infrastructure — Locate shared PostgreSQL, MinIO, Qdrant, and Meilisearch endpoints
  4. Provision database — Create PostgreSQL database, role, and connection pool entry
  5. Provision storage — Create MinIO bucket and IAM user with per-tenant credentials
  6. Provision vector database — Create Qdrant collection with configured vector dimensions
  7. Provision search indexes — Create Meilisearch indexes and tenant-scoped API key
  8. Deploy Data API — Create ArgoCD application for the tenant's Data API Helm chart
  9. Configure networking — Create Skupper connector to expose the Data API via mTLS tunnel
  10. Mark ready — Update cluster status to Ready

The entire sequence typically completes in minutes. If any step fails, the operator records the failure and can retry from the failed step during reconciliation.
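The retry-from-failed-step behavior can be illustrated with a small Python sketch. It assumes the operator tracks which steps have completed; the step identifiers and the `run_provisioning` function are hypothetical, and the real operator's internals are not described on this page.

```python
from typing import Callable, Dict, Optional, Set

# Step order mirrors the provisioning sequence; identifiers are illustrative.
STEP_NAMES = [
    "create_namespace",
    "prepare_secrets",
    "discover_infrastructure",
    "provision_database",
    "provision_storage",
    "provision_vector_db",
    "provision_search",
    "deploy_data_api",
    "configure_networking",
    "mark_ready",
]


def run_provisioning(
    steps: Dict[str, Callable[[], None]], completed: Set[str]
) -> Optional[str]:
    """Run steps in dependency order, skipping those already completed.

    Returns the name of the step that failed, or None if all steps finished.
    `completed` is updated in place so a later call resumes at the failure.
    """
    for name in STEP_NAMES:
        if name in completed:
            continue
        try:
            steps[name]()
        except Exception:
            return name  # record the failure; a later pass retries from here
        completed.add(name)
    return None
```

Calling `run_provisioning` again with the same `completed` set skips the steps that already succeeded, so a retry does not repeat earlier work.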

First Heartbeat

Once the Data API deployment is running, it sends its first heartbeat to the platform. This heartbeat transitions the cluster status from provisioning to active and begins the regular 30-second sync cycle.

Cluster Lifecycle

A data cluster moves through these states during its lifetime:

| State | Description |
|---|---|
| Provisioning | Operator is creating infrastructure resources |
| Active | Cluster is healthy, heartbeat is current, data operations work |
| Degraded | One or more infrastructure components are unhealthy |
| Offline | Heartbeat has not been received within the timeout period |
| Terminating | Cluster is being deleted, resources are being cleaned up |

State Transitions
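One way to read the states is as a function of what the platform currently observes, so that a transition happens whenever those observations change. The Python sketch below is a hedged illustration only: the precedence of the checks and the 90-second timeout are assumptions, not values documented here, and `derive_state` is a hypothetical helper.

```python
from enum import Enum
from typing import Mapping, Optional


class ClusterState(Enum):
    PROVISIONING = "Provisioning"
    ACTIVE = "Active"
    DEGRADED = "Degraded"
    OFFLINE = "Offline"
    TERMINATING = "Terminating"


# Assumed value; the actual heartbeat timeout is not stated on this page.
HEARTBEAT_TIMEOUT_S = 90.0


def derive_state(
    seconds_since_heartbeat: Optional[float],
    component_healthy: Mapping[str, bool],
    deleting: bool = False,
    timeout_s: float = HEARTBEAT_TIMEOUT_S,
) -> ClusterState:
    """Map current observations to a lifecycle state (assumed precedence)."""
    if deleting:
        return ClusterState.TERMINATING
    if seconds_since_heartbeat is None:
        # No heartbeat received yet: still being provisioned.
        return ClusterState.PROVISIONING
    if seconds_since_heartbeat > timeout_s:
        return ClusterState.OFFLINE
    if not all(component_healthy.values()):
        return ClusterState.DEGRADED
    return ClusterState.ACTIVE
```

Under these assumptions, the first heartbeat moves a cluster from Provisioning to Active, a missed heartbeat window moves it to Offline, and a resumed heartbeat moves it back.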

Health Monitoring

Each data cluster's Data API sends a heartbeat to the platform every 30 seconds. The heartbeat includes:

  • Health status — Per-component checks for PostgreSQL, MinIO, Qdrant, and Meilisearch, with latency measurements
  • Sync statistics — Number of pending metadata changes, dataset count, total entries, total storage size
  • Version information — API version and configuration
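To make the heartbeat contents concrete, here is a minimal Python sketch of a payload carrying the three groups of fields listed above. The class names, field names, and JSON shape are assumptions for illustration; the actual wire format is not specified on this page.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class ComponentHealth:
    healthy: bool
    latency_ms: float


@dataclass
class Heartbeat:
    # Health status: one entry per infrastructure component.
    components: Dict[str, ComponentHealth]
    # Sync statistics.
    pending_changes: int
    dataset_count: int
    total_entries: int
    total_storage_bytes: int
    # Version information.
    api_version: str
    config: Dict[str, str] = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize to JSON; nested dataclasses become plain objects."""
        return json.dumps(asdict(self), sort_keys=True)


hb = Heartbeat(
    components={
        "postgresql": ComponentHealth(True, 2.1),
        "minio": ComponentHealth(True, 4.8),
        "qdrant": ComponentHealth(True, 3.3),
        "meilisearch": ComponentHealth(False, 0.0),
    },
    pending_changes=12,
    dataset_count=3,
    total_entries=1500,
    total_storage_bytes=2_000_000_000,
    api_version="1.0.0",  # placeholder version string
)
payload = hb.to_json()
```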

The platform uses heartbeats to:

  • Track cluster health on the dashboard
  • Route requests only to active clusters (offline/suspended clusters reject proxy requests)
  • Detect connectivity issues and alert operators

Tip: If a heartbeat fails, the Data API retries with exponential backoff up to a maximum interval of 5 minutes. During a network partition, the cluster continues operating normally: data processing, search, and all local operations work independently.
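The backoff schedule can be sketched as a simple generator. The 5-minute cap comes from the text; starting at the normal 30-second interval and doubling on each failure are assumptions, and the function name is hypothetical. Jitter is omitted for clarity.

```python
from itertools import islice
from typing import Iterator


def heartbeat_retry_delays(
    base_s: float = 30.0,   # assumed starting delay (the normal interval)
    cap_s: float = 300.0,   # maximum interval of 5 minutes, from the text
    factor: float = 2.0,    # assumed growth factor
) -> Iterator[float]:
    """Yield successive retry delays, growing geometrically up to cap_s."""
    delay = base_s
    while True:
        yield min(delay, cap_s)
        delay = min(delay * factor, cap_s)


first_six = list(islice(heartbeat_retry_delays(), 6))
# first_six == [30.0, 60.0, 120.0, 240.0, 300.0, 300.0]
```

With these assumed parameters the delay reaches the cap on the fifth attempt and stays there until a heartbeat succeeds.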

Cluster Connectivity

Data clusters connect to the Alien Intelligence platform through Skupper mTLS tunnels. This connectivity model has several important properties:

Outbound-Only

For on-premise deployments, the data cluster initiates the connection. No inbound firewall rules are required on your infrastructure. This makes on-premise data clusters deployable behind corporate firewalls, in air-gapped environments (with proxy), and on restrictive networks. For Alien-hosted clusters, connectivity is managed automatically.

Encrypted End-to-End

All traffic through the tunnel is encrypted with mutual TLS. Both sides authenticate with certificates — the platform cannot impersonate a data cluster, and vice versa.

Per-Tenant Routing

Each data cluster has a unique routing key that maps to its Data API endpoint. The platform uses this key to route proxy requests to the correct cluster. Routing is deterministic — the same cluster always receives the same routing key.

Resilient to Disconnection

If the tunnel drops, the data cluster continues operating independently. Metadata changes queue locally and sync when connectivity resumes. The platform marks the cluster as offline, and proxy requests return errors until the tunnel is re-established.
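The queue-then-sync behavior can be sketched as follows. This is an in-memory illustration under stated assumptions: the `ChangeQueue` class, the batch size, and the use of `ConnectionError` to signal a dropped tunnel are all hypothetical, and a real implementation would persist pending changes durably.

```python
from collections import deque
from typing import Callable, Dict, List


class ChangeQueue:
    """Buffers metadata changes locally and sends them in order when possible."""

    def __init__(self, send: Callable[[List[Dict]], None], batch_size: int = 100):
        self._send = send
        self._batch_size = batch_size
        self._pending = deque()

    def record(self, change: Dict) -> None:
        """Queue a change; this never touches the network."""
        self._pending.append(change)

    def pending_count(self) -> int:
        return len(self._pending)

    def flush(self) -> int:
        """Send pending changes in batches; stop at the first connection error.

        Changes are removed only after their batch was sent successfully,
        so nothing is lost while the tunnel is down.
        """
        sent = 0
        while self._pending:
            n = min(self._batch_size, len(self._pending))
            batch = [self._pending[i] for i in range(n)]
            try:
                self._send(batch)
            except ConnectionError:
                break
            for _ in range(n):
                self._pending.popleft()
            sent += n
        return sent
```

The pending count in this sketch corresponds to the "number of pending metadata changes" that the heartbeat reports.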

Infrastructure Per Cluster

Each data cluster's Data API provides access to the following per-tenant infrastructure:

PostgreSQL

Stores entry metadata, dataset configuration, manifest data (JSONB), the resource change log for sync, and webhook records. The database tracks the full state of every document — its files, processing status, versions, and relationships.

Key tables:

| Table | Purpose |
|---|---|
| datasets | Dataset definitions with schema, pipeline config, and storage paths |
| entries | Individual document records with manifests, status, and version tracking |
| weights | AI model weights and training artifacts |
| resource_change_log | Queue for batch sync to platform |
| webhook_records | Outbound webhook delivery tracking |

MinIO (S3-Compatible Storage)

Stores all file data in a structured path hierarchy:

{slug}-data/
    datasets/{dataset_id}/
        entries/{entry_id}/
            original/{timestamp}_{filename}    # Uploaded files
            processed/
                content.json                   # Extracted text content
                figures/
                    fig_001.png                # Extracted figures
                    fig_002.png
            processing/                        # Intermediate pipeline artifacts
    weights/{type}/{model_id}/
        {timestamp}_{filename}                 # AI model weights
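A short Python sketch of building object keys for this hierarchy follows. It assumes that `entries/` nests under `datasets/{dataset_id}/` and that `figures/` sits under `processed/`; the timestamp format is also an assumption, and the function names are hypothetical. Keys are relative to the tenant's bucket.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def original_key(
    dataset_id: str, entry_id: str, filename: str, uploaded_at: datetime
) -> str:
    """Key for an uploaded file: .../original/{timestamp}_{filename}."""
    ts = uploaded_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")  # assumed format
    safe_name = PurePosixPath(filename).name  # drop any directory components
    return f"datasets/{dataset_id}/entries/{entry_id}/original/{ts}_{safe_name}"


def content_key(dataset_id: str, entry_id: str) -> str:
    """Key for the extracted text content of an entry."""
    return f"datasets/{dataset_id}/entries/{entry_id}/processed/content.json"


def figure_key(dataset_id: str, entry_id: str, index: int) -> str:
    """Key for an extracted figure, numbered fig_001.png, fig_002.png, ..."""
    return (
        f"datasets/{dataset_id}/entries/{entry_id}"
        f"/processed/figures/fig_{index:03d}.png"
    )
```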

Qdrant (Vector Database)

Stores embedding vectors for semantic search. Each entry's text is split into chunks during pipeline processing, and each chunk gets an embedding vector. A Qdrant point stores:

  • The embedding vector
  • The chunk text
  • Metadata: dataset ID, entry ID, chunk index, figure references
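The shape of a stored point can be sketched in plain Python without a client library. The `id`/`vector`/`payload` structure follows Qdrant's point model; the payload field names, the fixed-size character chunking, and the deterministic UUID scheme are assumptions for illustration, and `embed` stands in for whatever embedding model the pipeline uses.

```python
import uuid
from typing import Callable, Dict, List, Optional, Sequence


def chunk_text(text: str, size: int = 500) -> List[str]:
    """Split text into fixed-size character chunks (a simplifying assumption)."""
    return [text[i : i + size] for i in range(0, len(text), size)]


def build_points(
    dataset_id: str,
    entry_id: str,
    chunks: Sequence[str],
    embed: Callable[[str], List[float]],
    figure_refs: Optional[Dict[int, List[str]]] = None,
) -> List[dict]:
    """Build one point per chunk: vector plus chunk text and metadata."""
    figure_refs = figure_refs or {}
    points = []
    for index, chunk in enumerate(chunks):
        # Deterministic ID so re-processing an entry overwrites its old points.
        point_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{dataset_id}/{entry_id}/{index}"))
        points.append(
            {
                "id": point_id,
                "vector": embed(chunk),
                "payload": {
                    "text": chunk,
                    "dataset_id": dataset_id,
                    "entry_id": entry_id,
                    "chunk_index": index,
                    "figure_refs": figure_refs.get(index, []),
                },
            }
        )
    return points
```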

Meilisearch (Full-Text Search)

Provides keyword search over document content. Maintains two indexes per tenant:

  • {slug}_entries — Searchable document content with faceted filtering by dataset, status, MIME type, and tags
  • {slug}_datasets — Dataset names and descriptions for internal discovery
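Faceted filtering on the entries index can be illustrated by building a Meilisearch filter expression. The sketch below only constructs the filter string, using Meilisearch's `=`, `IN`, and `AND` operators; the attribute names (`dataset_id`, `status`) are assumptions about how the facets are named, and `build_filter` is a hypothetical helper.

```python
from typing import List, Sequence, Union

FacetValue = Union[str, Sequence[str], None]


def _quote(value: str) -> str:
    """Wrap a value in double quotes; reject embedded quotes to keep this simple."""
    if '"' in value:
        raise ValueError(f"unsupported character in filter value: {value!r}")
    return f'"{value}"'


def build_filter(**facets: FacetValue) -> str:
    """Combine facet constraints into a single AND-joined filter expression."""
    clauses: List[str] = []
    for attribute, value in facets.items():
        if value is None:
            continue
        values = [value] if isinstance(value, str) else list(value)
        quoted = [_quote(v) for v in values]
        if len(quoted) == 1:
            clauses.append(f"{attribute} = {quoted[0]}")
        else:
            clauses.append(f"{attribute} IN [{', '.join(quoted)}]")
    return " AND ".join(clauses)


expr = build_filter(dataset_id="ds1", status=["ready", "failed"])
# expr == 'dataset_id = "ds1" AND status IN ["ready", "failed"]'
```

The resulting string would be passed as the `filter` parameter of a search request against the tenant's `{slug}_entries` index, using the tenant-scoped API key.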

Reconciliation

The data plane operator periodically checks each data cluster's actual infrastructure against its desired state. Every 300 seconds, the reconciliation loop:

  1. Verifies all provisioned resources exist (database, bucket, collection, indexes)
  2. Recreates missing resources if needed
  3. Regenerates missing credentials (does not rotate healthy credentials)
  4. Updates the cluster's status based on current infrastructure health

This self-healing behavior means that transient infrastructure issues (a temporarily unavailable component, a deleted secret) are corrected automatically without operator intervention.
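A single reconciliation pass can be sketched as a comparison of desired and actual state. In this Python illustration, sets and dicts stand in for real infrastructure calls, and `reconcile_once` is a hypothetical function; only the 300-second interval and the rule of regenerating missing credentials without rotating healthy ones come from the text.

```python
import secrets
from typing import Dict, List, Sequence, Set, Tuple

RECONCILE_INTERVAL_S = 300  # from the text above


def reconcile_once(
    expected: Sequence[str],
    existing: Set[str],
    credentials: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Bring actual state in line with desired state; return actions taken.

    `existing` and `credentials` are updated in place as stand-ins for
    recreating a resource or writing a regenerated credential.
    """
    actions: List[Tuple[str, str]] = []
    for name in expected:
        if name not in existing:
            existing.add(name)
            actions.append(("recreate", name))
        if name not in credentials:
            # Only missing credentials are regenerated; present ones are kept.
            credentials[name] = secrets.token_urlsafe(32)
            actions.append(("regenerate_credential", name))
    return actions
```

A second pass over an already-consistent cluster returns no actions, which is the property that makes running the loop on a fixed interval safe.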

Cluster Deletion

When a data cluster is deleted, the operator tears down its resources in roughly the reverse of the provisioning order:

  1. Remove the ArgoCD application (stops the Data API)
  2. Remove the Skupper connector (disconnects the tunnel)
  3. Delete Meilisearch indexes and credentials
  4. Delete the Qdrant collection and credentials
  5. Delete MinIO IAM user and bucket data
  6. Delete the PostgreSQL database and credentials
  7. Notify the platform that deletion is complete
  8. Delete the Kubernetes namespace

The deletion sequence continues even if individual steps fail, ensuring idempotent cleanup. Once complete, all tenant data is permanently removed from the data plane.
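The continue-on-failure behavior can be sketched as a loop that records errors instead of aborting. The `run_cleanup` function and the step names in the usage example are hypothetical; in a real operator each step would also need to tolerate resources that are already gone so the sequence can be re-run.

```python
from typing import Callable, Dict, List, Tuple


def run_cleanup(steps: List[Tuple[str, Callable[[], None]]]) -> Dict[str, str]:
    """Run every cleanup step in order, even if earlier ones fail.

    Returns a mapping of failed step name to error description, which a
    later pass could use to retry only what is left.
    """
    failures: Dict[str, str] = {}
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            failures[name] = repr(exc)
    return failures
```

For example, if deleting the Qdrant collection raises, the MinIO, PostgreSQL, and namespace steps still run, and the returned mapping contains only the Qdrant step.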
