
Infrastructure

The platform is built on Kubernetes and uses GitOps principles for all infrastructure management. Every configuration change is tracked in version control, deployed declaratively, and auditable. This page covers the infrastructure components, how they are orchestrated, and how the platform scales.

Kubernetes-Native Architecture

Both the platform and data clusters run on Kubernetes. This is not an implementation detail — it is a core architectural decision that enables:

  • Declarative infrastructure — the desired state of every service is defined in YAML and Helm charts
  • Automated recovery — Kubernetes restarts failed pods, reschedules workloads, and manages rolling updates
  • Namespace isolation — each tenant gets a dedicated namespace with separate resources and credentials
  • Consistent environments — the same Helm charts work across cloud providers, on-premise data centers, and local development clusters

GitOps with ArgoCD

All deployments use ArgoCD, a Kubernetes-native GitOps continuous delivery tool. ArgoCD continuously reconciles the live cluster state with the desired state defined in Git repositories.

How It Works

  1. Infrastructure changes are made in Git — Helm values, chart versions, configuration
  2. ArgoCD detects the change and compares it to the live cluster state
  3. ArgoCD applies the difference — creating, updating, or removing resources
  4. If manual changes are made directly to the cluster, ArgoCD reverts them to match Git (self-healing)
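
The reconciliation loop above is configured per ArgoCD Application. A minimal sketch, with placeholder repository URL, path, and names:

```yaml
# Sketch of an ArgoCD Application with automated sync and self-healing.
# Repository URL, chart path, and names are illustrative placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: data-api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/infrastructure.git
    targetRevision: main
    path: charts/data-api
    helm:
      valueFiles:
        - values-production.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: data-api
  syncPolicy:
    automated:
      prune: true      # remove resources that were deleted from Git
      selfHeal: true   # revert manual cluster changes to match Git (step 4)
```

With `selfHeal: true`, ArgoCD performs the self-healing described in step 4; with `prune: true`, resources removed from Git are also removed from the cluster.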

App of Apps Pattern

The platform uses ArgoCD's "App of Apps" pattern: a root ArgoCD Application generates child Applications for each service. This creates a dependency tree that deploys services in the correct order using sync waves — infrastructure operators are deployed before the databases they manage, databases before the APIs that connect to them, and so on.
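
Ordering within the App of Apps tree comes from sync-wave annotations on the child Applications. A sketch, where the names and wave numbers are illustrative:

```yaml
# Child Application annotated with a sync wave. Lower waves sync first:
# the CloudNativePG operator (wave 0) before the database cluster (wave 1),
# the database before the APIs that connect to it (wave 2).
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: postgresql-cluster
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "1"  # deployed after the operator in wave 0
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/infrastructure.git
    path: charts/postgresql
  destination:
    server: https://kubernetes.default.svc
    namespace: databases
```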

info

For Alien Hosted deployments, ArgoCD manages everything automatically. For On-Premise deployments, ArgoCD runs on your data cluster and manages the data cluster infrastructure. You control when to update chart versions.

Deployment Benefits

| Benefit | How ArgoCD Delivers It |
| --- | --- |
| Audit trail | Every change is a Git commit — who changed what, when, and why |
| Rollback | Revert to any previous state by reverting the Git commit |
| Drift detection | ArgoCD detects and corrects manual changes to the cluster |
| Multi-environment | Same charts, different values files per environment |
| Declarative | No imperative scripts — the desired state is always defined, not a sequence of commands |

Per-Tenant Infrastructure Isolation

When a new tenant is created, the data cluster operator automatically provisions a complete, isolated infrastructure set. This happens in minutes — no manual intervention required.

What Each Tenant Gets

| Resource | Isolation Level | Details |
| --- | --- | --- |
| Kubernetes namespace | Dedicated | `tenant-{slug}` — network policies enforce boundaries |
| PostgreSQL database | Dedicated database | Separate DB within a shared PostgreSQL cluster, with scoped credentials |
| MinIO bucket | Dedicated bucket | Bucket-level IAM policies, per-tenant access credentials |
| Qdrant collection | Dedicated collection | JWT-scoped access, separate vector space |
| Meilisearch indexes | Dedicated indexes | API key scoped to tenant-specific indexes |
| Data API | Dedicated deployment | Separate pod running in the tenant namespace |
| Network connector | Dedicated | Skupper connector for platform-to-tenant communication |
| Secrets | Dedicated | All credentials stored as Kubernetes Secrets in the tenant namespace |
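
The namespace boundary is typically enforced with a Kubernetes NetworkPolicy. One possible sketch, where the namespace name is a placeholder:

```yaml
# Default-deny ingress for a tenant namespace: pods accept traffic only
# from other pods in the same namespace. The namespace name is illustrative.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tenant-isolation
  namespace: tenant-acme
spec:
  podSelector: {}            # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}    # allow traffic from this namespace only
```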

Shared Infrastructure

The infrastructure components themselves are shared across tenants for efficiency, with logical isolation enforced at the application level: a single PostgreSQL cluster hosts every tenant's dedicated database, one MinIO deployment serves the per-tenant buckets, one Qdrant instance holds the per-tenant collections, and one Meilisearch instance hosts the per-tenant indexes.

This model gives each tenant complete data isolation while avoiding the operational overhead of running separate database clusters per tenant.

Storage Systems

PostgreSQL (Relational Database)

PostgreSQL serves two roles in the platform:

  • Platform database — stores user accounts, organizations, dataset catalog (metadata pointers), billing records, job history, and audit logs
  • Tenant databases — each tenant gets a dedicated PostgreSQL database storing entry metadata, dataset configurations, manifests, and change logs

PostgreSQL runs on CloudNativePG, a Kubernetes operator that provides:

  • High availability with automatic failover (multi-replica configurations)
  • Connection pooling via PgBouncer (transaction mode)
  • Automated configuration management
  • Point-in-time recovery support
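
A sketch of a CloudNativePG `Cluster` with a PgBouncer `Pooler` in transaction mode; names and storage size are illustrative:

```yaml
# CloudNativePG Cluster with three instances (one primary, two replicas
# with automatic failover) plus a PgBouncer pooler in transaction mode.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: platform-db
spec:
  instances: 3               # multi-replica HA with automatic failover
  storage:
    size: 100Gi
---
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: platform-db-pooler
spec:
  cluster:
    name: platform-db
  instances: 2
  type: rw                   # pool connections to the read-write primary
  pgbouncer:
    poolMode: transaction    # transaction-mode pooling, as noted above
```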

MinIO (S3-Compatible Object Storage)

MinIO provides S3-compatible object storage for all document files:

  • Original uploaded documents (PDFs, images, etc.)
  • Processed content (extracted text, figures)
  • Pipeline artifacts (intermediate processing outputs)

MinIO runs with erasure coding for data durability — it can survive disk failures without data loss. Each tenant gets a dedicated bucket with IAM policies that prevent cross-tenant access.

Qdrant (Vector Database)

Qdrant stores document embeddings for semantic search:

  • Each tenant gets a dedicated collection with JWT-scoped access
  • Replicated StatefulSet with anti-affinity (replicas on different nodes) for availability
  • Supports cosine similarity search with payload filtering
  • Sub-100ms search latency with pre-computed vectors
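
Spreading replicas across nodes is done with pod anti-affinity on the StatefulSet. A fragment of the pod spec, with assumed label values:

```yaml
# Pod anti-affinity for a replicated Qdrant StatefulSet: the scheduler
# refuses to place two replicas on the same node. Labels are assumptions.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: qdrant
        topologyKey: kubernetes.io/hostname  # at most one replica per node
```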

Meilisearch (Keyword Search)

Meilisearch provides keyword search with:

  • Typo tolerance and fuzzy matching
  • Faceted filtering (by dataset, status, MIME type, tags)
  • Sub-50ms search latency
  • Per-tenant indexes with API key scoping

Autoscaling

The platform scales automatically based on workload:

| Component | Scaling Mechanism | Trigger |
| --- | --- | --- |
| Workers (AI workflows) | KEDA (event-driven) | SQS queue depth — scales based on pending job count |
| Data API (per tenant) | HPA (horizontal pod autoscaler) | CPU utilization |
| MCP Servers | HPA | CPU utilization |
| Connection pooling (PgBouncer) | HPA | CPU utilization |
| Argo Workflows | Controller concurrency limit | Fixed maximum concurrent workflows per cluster |

How Worker Scaling Works

Workers consume jobs from an SQS queue. KEDA monitors the queue depth and scales worker replicas accordingly:

```
Queue has 0 pending jobs  → minimum replicas running
Queue has 10 pending jobs → additional replicas created
Queue has 50 pending jobs → maximum replicas running
Queue drains to 0         → scale back to minimum
```
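
This behaviour could be expressed as a KEDA `ScaledObject` roughly like the following; the queue URL, replica counts, and threshold are illustrative:

```yaml
# Sketch of a KEDA ScaledObject scaling workers on SQS queue depth.
# Queue URL, names, and thresholds are placeholders, not actual values.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker            # the worker Deployment
  minReplicaCount: 1        # idle baseline
  maxReplicaCount: 20       # burst ceiling
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/jobs
        queueLength: "5"    # target pending jobs per replica
        awsRegion: us-east-1
```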

This ensures that compute resources are allocated proportionally to demand — idle periods use minimal resources, and burst processing scales up within seconds.

Infrastructure Provisioning: The Operator

The data cluster operator is a Kubernetes operator that automates tenant lifecycle management. When a DataClusterTenant custom resource is created, the operator executes a multi-step provisioning sequence:

  1. Create the tenant's Kubernetes namespace
  2. Provision a PostgreSQL database and user
  3. Create a MinIO bucket with IAM policies
  4. Create a Qdrant collection
  5. Create Meilisearch indexes and API key
  6. Generate all credentials and store them as Kubernetes Secrets
  7. Deploy the Data API as an ArgoCD Application
  8. Create a Skupper connector to expose the Data API to the platform

This entire sequence runs automatically. The operator also handles updates (changing configuration, resizing resources), reconciliation (detecting and fixing drift), and deletion (cleanly removing all tenant resources).
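
The sequence is triggered by creating the custom resource. Since the `DataClusterTenant` CRD schema is internal, the shape below is hypothetical; every field name is an illustration, not the actual API:

```yaml
# Hypothetical DataClusterTenant resource. The apiVersion and spec fields
# are invented for illustration and do not reflect the real CRD schema.
apiVersion: platform.example.com/v1
kind: DataClusterTenant
metadata:
  name: acme
spec:
  slug: acme                # would yield the tenant-acme namespace
```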

tip

The operator sends heartbeats to the platform every 60 seconds, reporting infrastructure status, tenant health, and resource metrics. The platform uses these heartbeats to detect degraded or offline clusters.

Monitoring and Health

Heartbeat System

Data clusters report their health to the platform through two heartbeat channels:

| Heartbeat | Frequency | Reporter | Content |
| --- | --- | --- | --- |
| Cluster heartbeat | Every 30 seconds | Data API | Cluster status, connectivity, sync state |
| Operator heartbeat | Every 60 seconds | Operator | Infrastructure status, tenant list, resource metrics, chart versions |

Based on heartbeats, the platform tracks cluster health:

| Status | Meaning |
| --- | --- |
| Online | Heartbeats received on schedule, all services healthy |
| Degraded | Some services reporting issues but cluster is reachable |
| Offline | No heartbeats received within the expected window |

Infrastructure Metrics

The operator collects metrics from each infrastructure component:

  • PostgreSQL — database sizes, connection counts, replication lag
  • MinIO — storage usage per bucket, object counts
  • Qdrant — collection sizes, vector counts, query latency
  • Meilisearch — index sizes, document counts

These metrics are reported via the operator heartbeat and displayed in the platform dashboard.

Disaster Recovery

Alien Hosted

Alien manages backups and disaster recovery for hosted deployments:

  • PostgreSQL backups with point-in-time recovery capability
  • MinIO erasure coding for object storage durability
  • Qdrant replication across multiple nodes
  • Infrastructure-as-code (GitOps) enables full cluster reconstruction from Git

On-Premise

For On-Premise deployments, disaster recovery is your responsibility. The architecture supports it through:

  • CloudNativePG supports backup configuration to S3-compatible storage
  • MinIO erasure coding protects against disk failures
  • Qdrant replication provides vector database redundancy
  • GitOps ensures all configuration is in Git — a destroyed cluster can be rebuilt by reapplying the ArgoCD configuration
  • Operator reconciliation — once infrastructure is restored, the operator detects missing resources and re-provisions them
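
The recommended PostgreSQL backup path uses CloudNativePG's object-store backups. A sketch, with placeholder bucket path, endpoint, and Secret names:

```yaml
# Sketch of CloudNativePG backup configuration to S3-compatible storage
# (e.g. MinIO), enabling point-in-time recovery. Paths, endpoint, and
# secret names are placeholders.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: tenant-db
spec:
  instances: 3
  backup:
    barmanObjectStore:
      destinationPath: s3://backups/tenant-db
      endpointURL: https://minio.example.com
      s3Credentials:
        accessKeyId:
          name: backup-credentials   # Kubernetes Secret holding the keys
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: backup-credentials
          key: SECRET_ACCESS_KEY
    retentionPolicy: "30d"           # keep 30 days of backups
```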

note

We recommend configuring PostgreSQL backups for On-Premise deployments. Contact us for guidance on backup configuration and disaster recovery planning.

Next Steps