
Infrastructure

The platform is built on Kubernetes and uses GitOps principles for all infrastructure management. Every configuration change is tracked in version control, deployed declaratively, and auditable. This page covers the infrastructure components, how they are orchestrated, and how the platform scales.

Kubernetes-Native Architecture

Both the platform and data clusters run on Kubernetes. This is not an implementation detail — it is a core architectural decision that enables:

  • Declarative infrastructure — the desired state of every service is defined in YAML and Helm charts
  • Automated recovery — Kubernetes restarts failed pods, reschedules workloads, and manages rolling updates
  • Namespace isolation — each tenant gets a dedicated namespace with separate resources and credentials
  • Consistent environments — the same Helm charts work across cloud providers, on-premise data centers, and local development clusters

GitOps with ArgoCD

All deployments use ArgoCD, a Kubernetes-native GitOps continuous delivery tool. ArgoCD continuously reconciles the live cluster state with the desired state defined in Git repositories.

How It Works

  1. Infrastructure changes are made in Git — Helm values, chart versions, configuration
  2. ArgoCD detects the change and compares it to the live cluster state
  3. ArgoCD applies the difference — creating, updating, or removing resources
  4. If manual changes are made directly to the cluster, ArgoCD reverts them to match Git (self-healing)
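
The reconciliation loop above is configured per ArgoCD Application. A minimal sketch, with placeholder repository URL, path, and names:

```yaml
# Sketch of an ArgoCD Application with automated sync and self-healing.
# Repository URL, chart path, and names are illustrative placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: data-api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/infrastructure.git
    targetRevision: main
    path: charts/data-api
    helm:
      valueFiles:
        - values-production.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: data-api
  syncPolicy:
    automated:
      prune: true      # remove resources that were deleted from Git
      selfHeal: true   # revert manual cluster changes to match Git (step 4)
```

With `selfHeal: true`, ArgoCD performs the self-healing described in step 4; with `prune: true`, resources removed from Git are also removed from the cluster.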

App of Apps Pattern

The platform uses ArgoCD's "App of Apps" pattern: a root ArgoCD Application generates child Applications for each service. This creates a dependency tree that deploys services in the correct order using sync waves — infrastructure operators are deployed before the databases they manage, databases before the APIs that connect to them, and so on.
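
Ordering within the App of Apps tree comes from sync-wave annotations on the child Applications. A sketch, where the names and wave numbers are illustrative:

```yaml
# Child Application annotated with a sync wave. Lower waves sync first:
# the CloudNativePG operator (wave 0) before the database cluster (wave 1),
# the database before the APIs that connect to it (wave 2).
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: postgresql-cluster
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "1"  # deployed after the operator in wave 0
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/infrastructure.git
    path: charts/postgresql
  destination:
    server: https://kubernetes.default.svc
    namespace: databases
```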

info

For Alien Hosted deployments, ArgoCD manages everything automatically. For On-Premise deployments, ArgoCD runs on your data cluster and manages the data cluster infrastructure. You control when to update chart versions.

Deployment Benefits

| Benefit | How ArgoCD Delivers It |
| --- | --- |
| Audit trail | Every change is a Git commit — who changed what, when, and why |
| Rollback | Revert to any previous state by reverting the Git commit |
| Drift detection | ArgoCD detects and corrects manual changes to the cluster |
| Multi-environment | Same charts, different values files per environment |
| Declarative | No imperative scripts — the desired state is always defined, not a sequence of commands |

Per-Tenant Infrastructure Isolation

When a new tenant is created, the data cluster operator automatically provisions a complete, isolated infrastructure set. This happens in minutes — no manual intervention required.

What Each Tenant Gets

| Resource | Isolation Level | Details |
| --- | --- | --- |
| Kubernetes namespace | Dedicated | `tenant-{slug}` — network policies enforce boundaries |
| PostgreSQL database | Dedicated database | Separate DB within a shared PostgreSQL cluster, with scoped credentials |
| MinIO bucket | Dedicated bucket | Bucket-level IAM policies, per-tenant access credentials |
| Qdrant collection | Dedicated collection | JWT-scoped access, separate vector space |
| Meilisearch indexes | Dedicated indexes | API key scoped to tenant-specific indexes |
| Data API | Dedicated deployment | Separate pod running in the tenant namespace |
| Network connector | Dedicated | Skupper connector for platform-to-tenant communication |
| Secrets | Dedicated | All credentials stored as Kubernetes Secrets in the tenant namespace |
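
The namespace boundary is typically enforced with a Kubernetes NetworkPolicy. One possible sketch, where the namespace name is a placeholder:

```yaml
# Default-deny ingress for a tenant namespace: pods accept traffic only
# from other pods in the same namespace. The namespace name is illustrative.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tenant-isolation
  namespace: tenant-acme
spec:
  podSelector: {}            # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}    # allow traffic from this namespace only
```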

Shared Infrastructure

The infrastructure components themselves are shared across tenants for efficiency, with logical isolation enforced at the application level: a single PostgreSQL cluster hosts every tenant's dedicated database, one MinIO deployment serves the per-tenant buckets, one Qdrant instance holds the per-tenant collections, and one Meilisearch instance hosts the per-tenant indexes.

This model gives each tenant complete data isolation while avoiding the operational overhead of running separate database clusters per tenant.

Storage Systems

PostgreSQL (Relational Database)

PostgreSQL serves two roles in the platform:

  • Platform database — stores user accounts, organizations, dataset catalog (metadata pointers), billing records, job history, and audit logs
  • Tenant databases — each tenant gets a dedicated PostgreSQL database storing entry metadata, dataset configurations, manifests, and change logs

PostgreSQL runs on CloudNativePG, a Kubernetes operator that provides:

  • High availability with automatic failover (multi-replica configurations)
  • Connection pooling via PgBouncer (transaction mode)
  • Automated configuration management
  • Point-in-time recovery support
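
A sketch of a CloudNativePG `Cluster` with a PgBouncer `Pooler` in transaction mode; names and storage size are illustrative:

```yaml
# CloudNativePG Cluster with three instances (one primary, two replicas
# with automatic failover) plus a PgBouncer pooler in transaction mode.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: platform-db
spec:
  instances: 3               # multi-replica HA with automatic failover
  storage:
    size: 100Gi
---
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: platform-db-pooler
spec:
  cluster:
    name: platform-db
  instances: 2
  type: rw                   # pool connections to the read-write primary
  pgbouncer:
    poolMode: transaction    # transaction-mode pooling, as noted above
```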

MinIO (S3-Compatible Object Storage)

MinIO provides S3-compatible object storage for all document files:

  • Original uploaded documents (PDFs, images, etc.)
  • Processed content (extracted text, figures)
  • Pipeline artifacts (intermediate processing outputs)

MinIO runs with erasure coding for data durability — it can survive disk failures without data loss. Each tenant gets a dedicated bucket with IAM policies that prevent cross-tenant access.

Qdrant (Vector Database)

Qdrant stores document embeddings for semantic search:

  • Each tenant gets a dedicated collection with JWT-scoped access
  • Replicated StatefulSet with anti-affinity (replicas on different nodes) for availability
  • Supports cosine similarity search with payload filtering
  • Sub-100ms search latency with pre-computed vectors
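
Spreading replicas across nodes is done with pod anti-affinity on the StatefulSet. A fragment of the pod spec, with assumed label values:

```yaml
# Pod anti-affinity for a replicated Qdrant StatefulSet: the scheduler
# refuses to place two replicas on the same node. Labels are assumptions.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: qdrant
        topologyKey: kubernetes.io/hostname  # at most one replica per node
```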

Meilisearch (Keyword Search)

Meilisearch provides keyword search with:

  • Typo tolerance and fuzzy matching
  • Faceted filtering (by dataset, status, MIME type, tags)
  • Sub-50ms search latency
  • Per-tenant indexes with API key scoping

Autoscaling

The platform scales automatically based on workload:

| Component | Scaling Mechanism | Trigger |
| --- | --- | --- |
| Workers (AI workflows) | KEDA (event-driven) | SQS queue depth — scales based on pending job count |
| Data API (per tenant) | HPA (horizontal pod autoscaler) | CPU utilization |
| MCP Servers | HPA | CPU utilization |
| Connection pooling (PgBouncer) | HPA | CPU utilization |
| Argo Workflows | Controller concurrency limit | Fixed maximum concurrent workflows per cluster |

How Worker Scaling Works

Workers consume jobs from an SQS queue. KEDA monitors the queue depth and scales worker replicas accordingly:

```
Queue has 0 pending jobs  → minimum replicas running
Queue has 10 pending jobs → additional replicas created
Queue has 50 pending jobs → maximum replicas running
Queue drains to 0         → scale back to minimum
```
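
This behaviour could be expressed as a KEDA `ScaledObject` roughly like the following; the queue URL, replica counts, and threshold are illustrative:

```yaml
# Sketch of a KEDA ScaledObject scaling workers on SQS queue depth.
# Queue URL, names, and thresholds are placeholders, not actual values.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker            # the worker Deployment
  minReplicaCount: 1        # idle baseline
  maxReplicaCount: 20       # burst ceiling
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/jobs
        queueLength: "5"    # target pending jobs per replica
        awsRegion: us-east-1
```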

This ensures that compute resources are allocated proportionally to demand — idle periods use minimal resources, and burst processing scales up within seconds.

Infrastructure Provisioning: The Operator

The data cluster operator is a Kubernetes operator that automates tenant lifecycle management. When a DataClusterTenant custom resource is created, the operator executes a multi-step provisioning sequence:

  1. Create the tenant's Kubernetes namespace
  2. Provision a PostgreSQL database and user
  3. Create a MinIO bucket with IAM policies
  4. Create a Qdrant collection
  5. Create Meilisearch indexes and API key
  6. Generate all credentials and store them as Kubernetes Secrets
  7. Deploy the Data API as an ArgoCD Application
  8. Create a Skupper connector to expose the Data API to the platform

This entire sequence runs automatically. The operator also handles updates (changing configuration, resizing resources), reconciliation (detecting and fixing drift), and deletion (cleanly removing all tenant resources).
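
The sequence is triggered by creating the custom resource. Since the `DataClusterTenant` CRD schema is internal, the shape below is hypothetical; every field name is an illustration, not the actual API:

```yaml
# Hypothetical DataClusterTenant resource. The apiVersion and spec fields
# are invented for illustration and do not reflect the real CRD schema.
apiVersion: platform.example.com/v1
kind: DataClusterTenant
metadata:
  name: acme
spec:
  slug: acme                # would yield the tenant-acme namespace
```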

tip

The operator sends heartbeats to the platform every 60 seconds, reporting infrastructure status, tenant health, and resource metrics. The platform uses these heartbeats to detect degraded or offline clusters.

Monitoring and Health

Heartbeat System

Data clusters report their health to the platform through two heartbeat channels:

| Heartbeat | Frequency | Reporter | Content |
| --- | --- | --- | --- |
| Cluster heartbeat | Every 30 seconds | Data API | Cluster status, connectivity, sync state |
| Operator heartbeat | Every 60 seconds | Operator | Infrastructure status, tenant list, resource metrics, chart versions |

Based on heartbeats, the platform tracks cluster health:

| Status | Meaning |
| --- | --- |
| Online | Heartbeats received on schedule, all services healthy |
| Degraded | Some services reporting issues but cluster is reachable |
| Offline | No heartbeats received within the expected window |

Infrastructure Metrics

The operator collects metrics from each infrastructure component:

  • PostgreSQL — database sizes, connection counts, replication lag
  • MinIO — storage usage per bucket, object counts
  • Qdrant — collection sizes, vector counts, query latency
  • Meilisearch — index sizes, document counts

These metrics are reported via the operator heartbeat and displayed in the platform dashboard.

Disaster Recovery

Alien Hosted

Alien manages backups and disaster recovery for hosted deployments:

  • PostgreSQL backups with point-in-time recovery capability
  • MinIO erasure coding for object storage durability
  • Qdrant replication across multiple nodes
  • Infrastructure-as-code (GitOps) enables full cluster reconstruction from Git

On-Premise

For On-Premise deployments, disaster recovery is your responsibility. The architecture supports it through:

  • CloudNativePG supports backup configuration to S3-compatible storage
  • MinIO erasure coding protects against disk failures
  • Qdrant replication provides vector database redundancy
  • GitOps ensures all configuration is in Git — a destroyed cluster can be rebuilt by reapplying the ArgoCD configuration
  • Operator reconciliation — once infrastructure is restored, the operator detects missing resources and re-provisions them
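
The recommended PostgreSQL backup path uses CloudNativePG's object-store backups. A sketch, with placeholder bucket path, endpoint, and Secret names:

```yaml
# Sketch of CloudNativePG backup configuration to S3-compatible storage
# (e.g. MinIO), enabling point-in-time recovery. Paths, endpoint, and
# secret names are placeholders.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: tenant-db
spec:
  instances: 3
  backup:
    barmanObjectStore:
      destinationPath: s3://backups/tenant-db
      endpointURL: https://minio.example.com
      s3Credentials:
        accessKeyId:
          name: backup-credentials   # Kubernetes Secret holding the keys
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: backup-credentials
          key: SECRET_ACCESS_KEY
    retentionPolicy: "30d"           # keep 30 days of backups
```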

note

We recommend configuring PostgreSQL backups for On-Premise deployments. Contact us for guidance on backup configuration and disaster recovery planning.

Next Steps