Infrastructure
The platform is built on Kubernetes and uses GitOps principles for all infrastructure management. Every configuration change is tracked in version control, applied declaratively, and auditable through Git history. This page covers the infrastructure components, how they are orchestrated, and how the platform scales.
Kubernetes-Native Architecture
Both the platform and data clusters run on Kubernetes. This is not an implementation detail — it is a core architectural decision that enables:
- Declarative infrastructure — the desired state of every service is defined in YAML and Helm charts
- Automated recovery — Kubernetes restarts failed pods, reschedules workloads, and manages rolling updates
- Namespace isolation — each tenant gets a dedicated namespace with separate resources and credentials
- Consistent environments — the same Helm charts work across cloud providers, on-premise data centers, and local development clusters
GitOps with ArgoCD
All deployments use ArgoCD, a Kubernetes-native GitOps continuous delivery tool. ArgoCD continuously reconciles the live cluster state with the desired state defined in Git repositories.
How It Works
- Infrastructure changes are made in Git — Helm values, chart versions, configuration
- ArgoCD detects the change and compares it to the live cluster state
- ArgoCD applies the difference — creating, updating, or removing resources
- If manual changes are made directly to the cluster, ArgoCD reverts them to match Git (self-healing)
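The reconcile loop described above can be sketched in a few lines of Python. This is a simplified model, not ArgoCD's actual implementation: desired state would come from Git, live state from the cluster API.

```python
def reconcile(desired: dict, live: dict) -> dict:
    """One GitOps reconciliation pass: diff desired state (Git) against live state (cluster)."""
    actions = {"create": [], "update": [], "delete": []}
    for name, spec in desired.items():
        if name not in live:
            actions["create"].append(name)    # resource missing from the cluster
        elif live[name] != spec:
            actions["update"].append(name)    # drift (e.g. a manual change) -> revert to Git
    for name in live:
        if name not in desired:
            actions["delete"].append(name)    # resource removed from Git
    return actions

# A manual edit to the live replica count is detected as drift and reverted:
desired = {"data-api": {"replicas": 2}, "worker": {"replicas": 1}}
live = {"data-api": {"replicas": 5}, "legacy-svc": {"replicas": 1}}
print(reconcile(desired, live))
# {'create': ['worker'], 'update': ['data-api'], 'delete': ['legacy-svc']}
```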
App of Apps Pattern
The platform uses ArgoCD's "App of Apps" pattern: a root ArgoCD Application generates child Applications for each service. This creates a dependency tree that deploys services in the correct order using sync waves — infrastructure operators are deployed before the databases they manage, databases before the APIs that connect to them, and so on.
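The sync-wave ordering can be illustrated with a small sketch. The application names and wave numbers here are hypothetical; in ArgoCD the wave comes from the argocd.argoproj.io/sync-wave annotation, and lower waves deploy first.

```python
# Hypothetical child Applications with their sync waves (lower waves deploy first).
apps = [
    {"name": "data-api", "wave": 2},           # APIs depend on the databases
    {"name": "postgres-cluster", "wave": 1},   # databases depend on their operator
    {"name": "cnpg-operator", "wave": 0},      # infrastructure operators go first
]

deploy_order = [a["name"] for a in sorted(apps, key=lambda a: a["wave"])]
print(deploy_order)
# ['cnpg-operator', 'postgres-cluster', 'data-api']
```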
For Alien Hosted deployments, ArgoCD manages everything automatically. For On-Premise deployments, ArgoCD runs on your data cluster and manages the data cluster infrastructure. You control when to update chart versions.
Deployment Benefits
| Benefit | How ArgoCD Delivers It |
|---|---|
| Audit trail | Every change is a Git commit — who changed what, when, and why |
| Rollback | Revert to any previous state by reverting the Git commit |
| Drift detection | ArgoCD detects and corrects manual changes to the cluster |
| Multi-environment | Same charts, different values files per environment |
| Declarative | No imperative scripts — the desired state is always defined, not a sequence of commands |
Per-Tenant Infrastructure Isolation
When a new tenant is created, the data cluster operator automatically provisions a complete, isolated infrastructure set. This happens in minutes — no manual intervention required.
What Each Tenant Gets
| Resource | Isolation Level | Details |
|---|---|---|
| Kubernetes namespace | Dedicated | tenant-{slug} — network policies enforce boundaries |
| PostgreSQL database | Dedicated database | Separate DB within a shared PostgreSQL cluster, with scoped credentials |
| MinIO bucket | Dedicated bucket | Bucket-level IAM policies, per-tenant access credentials |
| Qdrant collection | Dedicated collection | JWT-scoped access, separate vector space |
| Meilisearch indexes | Dedicated indexes | API key scoped to tenant-specific indexes |
| Data API | Dedicated deployment | Separate pod running in the tenant namespace |
| Network connector | Dedicated | Skupper connector for platform-to-tenant communication |
| Secrets | Dedicated | All credentials stored as Kubernetes Secrets in the tenant namespace |
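As an illustration, the per-tenant resource set can be derived from the tenant slug. Only the tenant-{slug} namespace naming is documented above; the other resource names in this sketch are hypothetical and the actual scheme may differ.

```python
def tenant_resources(slug: str) -> dict:
    """Sketch of the isolated resource set provisioned for one tenant."""
    return {
        "namespace": f"tenant-{slug}",             # documented naming scheme
        "postgres_database": f"tenant_{slug}",     # hypothetical name
        "minio_bucket": f"tenant-{slug}",          # hypothetical name
        "qdrant_collection": f"tenant-{slug}",     # hypothetical name
        "meilisearch_index_prefix": f"{slug}-",    # hypothetical name
    }

print(tenant_resources("acme")["namespace"])
# tenant-acme
```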
Shared Infrastructure
The infrastructure components themselves are shared across tenants for efficiency, with logical isolation enforced at the application level: tenants share the PostgreSQL cluster, the MinIO deployment, and the Qdrant and Meilisearch instances, but each tenant's database, bucket, collection, and indexes are reachable only with that tenant's scoped credentials. This model gives each tenant complete data isolation while avoiding the operational overhead of running a separate database cluster per tenant.
Storage Systems
PostgreSQL (Relational Database)
PostgreSQL serves two roles in the platform:
- Platform database — stores user accounts, organizations, dataset catalog (metadata pointers), billing records, job history, and audit logs
- Tenant databases — each tenant gets a dedicated PostgreSQL database storing entry metadata, dataset configurations, manifests, and change logs
PostgreSQL runs on CloudNativePG, a Kubernetes operator that provides:
- High availability with automatic failover (multi-replica configurations)
- Connection pooling via PgBouncer (transaction mode)
- Automated configuration management
- Point-in-time recovery support
MinIO (S3-Compatible Object Storage)
MinIO provides S3-compatible object storage for all document files:
- Original uploaded documents (PDFs, images, etc.)
- Processed content (extracted text, figures)
- Pipeline artifacts (intermediate processing outputs)
MinIO runs with erasure coding for data durability — it can survive disk failures without data loss. Each tenant gets a dedicated bucket with IAM policies that prevent cross-tenant access.
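Erasure coding's durability trade-off can be shown with a quick calculation. The shard counts below are illustrative; MinIO chooses its own data/parity split per erasure set.

```python
def erasure_profile(data_shards: int, parity_shards: int) -> dict:
    """An object split into data+parity shards survives up to `parity_shards` disk losses."""
    total = data_shards + parity_shards
    return {
        "total_shards": total,
        "tolerated_disk_failures": parity_shards,
        "storage_overhead": total / data_shards,  # raw bytes stored per logical byte
    }

profile = erasure_profile(data_shards=8, parity_shards=4)
print(profile)
# 12 shards total, survives 4 disk failures, 1.5x storage overhead
```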
Qdrant (Vector Database)
Qdrant stores document embeddings for semantic search:
- Each tenant gets a dedicated collection with JWT-scoped access
- Replicated StatefulSet with anti-affinity (replicas on different nodes) for availability
- Supports cosine similarity search with payload filtering
- Sub-100ms search latency with pre-computed vectors
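The cosine similarity used for ranking can be sketched directly. Qdrant computes this over stored embeddings; the two-dimensional vectors here are toy examples.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(a, b) = a·b / (|a||b|): 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

query = [1.0, 0.0]
print(cosine_similarity(query, [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity(query, [0.0, 1.0]))  # 0.0 (orthogonal)
```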
Meilisearch (Full-Text Search)
Meilisearch provides keyword search with:
- Typo tolerance and fuzzy matching
- Faceted filtering (by dataset, status, MIME type, tags)
- Sub-50ms search latency
- Per-tenant indexes with API key scoping
Autoscaling
The platform scales automatically based on workload:
| Component | Scaling Mechanism | Trigger |
|---|---|---|
| Workers (AI workflows) | KEDA (event-driven) | SQS queue depth — scales based on pending job count |
| Data API (per tenant) | HPA (horizontal pod autoscaler) | CPU utilization |
| MCP Servers | HPA | CPU utilization |
| Connection pooling (PgBouncer) | HPA | CPU utilization |
| Argo Workflows | Controller concurrency limit | Fixed maximum concurrent workflows per cluster |
How Worker Scaling Works
Workers consume jobs from an SQS queue. KEDA monitors the queue depth and scales worker replicas accordingly:
- Queue has 0 pending jobs → minimum replicas running
- Queue has 10 pending jobs → additional replicas created
- Queue has 50 pending jobs → maximum replicas running
- Queue drains to 0 → scale back to minimum
This ensures that compute resources are allocated proportionally to demand — idle periods use minimal resources, and burst processing scales up within seconds.
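The scaling behavior above can be sketched as a queue-depth calculation: divide pending jobs by a per-replica target and clamp to the configured bounds. The target, minimum, and maximum values here are illustrative, not the platform's actual settings.

```python
import math

def desired_replicas(queue_depth: int, target_per_replica: int = 5,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Replicas proportional to pending jobs, clamped to [min_replicas, max_replicas]."""
    wanted = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))

for depth in (0, 10, 50):
    print(depth, "->", desired_replicas(depth))
# 0 -> 1, 10 -> 2, 50 -> 10
```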
Infrastructure Provisioning: The Operator
The data cluster operator is a Kubernetes operator that automates tenant lifecycle management. When a DataClusterTenant custom resource is created, the operator executes a multi-step provisioning sequence:
- Create the tenant's Kubernetes namespace
- Provision a PostgreSQL database and user
- Create a MinIO bucket with IAM policies
- Create a Qdrant collection
- Create Meilisearch indexes and API key
- Generate all credentials and store them as Kubernetes Secrets
- Deploy the Data API as an ArgoCD Application
- Create a Skupper connector to expose the Data API to the platform
This entire sequence runs automatically. The operator also handles updates (changing configuration, resizing resources), reconciliation (detecting and fixing drift), and deletion (cleanly removing all tenant resources).
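The provisioning sequence can be modeled as an ordered pipeline in which each step must succeed before the next runs. This is a simplified sketch; the real operator reconciles Kubernetes custom resources rather than running a linear script.

```python
PROVISIONING_STEPS = [
    "create_namespace",
    "provision_postgres_database",
    "create_minio_bucket",
    "create_qdrant_collection",
    "create_meilisearch_indexes",
    "store_credentials_as_secrets",
    "deploy_data_api",
    "create_skupper_connector",
]

def provision_tenant(slug: str, run_step=lambda step, slug: True) -> list[str]:
    """Run each step in order; stop at the first failure so reconciliation can retry."""
    completed = []
    for step in PROVISIONING_STEPS:
        if not run_step(step, slug):
            break                 # the operator retries from here on the next reconcile
        completed.append(step)
    return completed

print(len(provision_tenant("acme")))
# 8 — all steps completed
```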
The operator sends heartbeats to the platform every 60 seconds, reporting infrastructure status, tenant health, and resource metrics. The platform uses these heartbeats to detect degraded or offline clusters.
Monitoring and Health
Heartbeat System
Data clusters report their health to the platform through two heartbeat channels:
| Heartbeat | Frequency | Reporter | Content |
|---|---|---|---|
| Cluster heartbeat | Every 30 seconds | Data API | Cluster status, connectivity, sync state |
| Operator heartbeat | Every 60 seconds | Operator | Infrastructure status, tenant list, resource metrics, chart versions |
Based on heartbeats, the platform tracks cluster health:
| Status | Meaning |
|---|---|
| Online | Heartbeats received on schedule, all services healthy |
| Degraded | Some services reporting issues but cluster is reachable |
| Offline | No heartbeats received within the expected window |
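Status derivation from heartbeats can be sketched as follows. The 90-second offline window and the service-issue count are illustrative assumptions, not documented thresholds.

```python
def cluster_status(seconds_since_heartbeat: float, services_with_issues: int,
                   offline_window: float = 90.0) -> str:
    """Classify a cluster from its last heartbeat age and reported service health."""
    if seconds_since_heartbeat > offline_window:
        return "Offline"     # no heartbeat within the expected window
    if services_with_issues > 0:
        return "Degraded"    # reachable, but some services report problems
    return "Online"

print(cluster_status(30, 0))    # Online
print(cluster_status(45, 2))    # Degraded
print(cluster_status(200, 0))   # Offline
```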
Infrastructure Metrics
The operator collects metrics from each infrastructure component:
- PostgreSQL — database sizes, connection counts, replication lag
- MinIO — storage usage per bucket, object counts
- Qdrant — collection sizes, vector counts, query latency
- Meilisearch — index sizes, document counts
These metrics are reported via the operator heartbeat and displayed in the platform dashboard.
Disaster Recovery
Alien Hosted
Alien manages backups and disaster recovery for hosted deployments:
- PostgreSQL backups with point-in-time recovery capability
- MinIO erasure coding for object storage durability
- Qdrant replication across multiple nodes
- Infrastructure-as-code (GitOps) enables full cluster reconstruction from Git
On-Premise
For On-Premise deployments, disaster recovery is your responsibility. The architecture supports it through:
- CloudNativePG backup configuration to S3-compatible storage, with point-in-time recovery
- MinIO erasure coding protects against disk failures
- Qdrant replication provides vector database redundancy
- GitOps ensures all configuration is in Git — a destroyed cluster can be rebuilt by reapplying the ArgoCD configuration
- Operator reconciliation — once infrastructure is restored, the operator detects missing resources and re-provisions them
We recommend configuring PostgreSQL backups for On-Premise deployments. Contact us for guidance on backup configuration and disaster recovery planning.
Next Steps
- Processing Engine — How document processing and AI workflows execute
- Networking — Cross-cluster tunnels and the proxy architecture
- Deployment Model — Alien Hosted vs On-Premise comparison
- Data Clusters — Per-tenant isolation in practice