Search

The Alien Intelligence platform provides three complementary search paths, each optimized for a different type of query. Keyword search finds exact and near-exact matches in document text. Semantic search finds conceptually similar content using embedding vectors. Global discovery searches across all clusters for datasets by name and description.

All three search paths operate on data that stays in your isolated data cluster — the platform coordinates search requests but does not store or index document content centrally.

Search Architecture Overview

All search requests are routed through the platform's authenticated backend proxy. No data cluster is directly exposed to the internet — every request passes through the backend, which enforces authentication, authorization, and audit logging before forwarding to the target data cluster.

The three search paths available through this proxy:

| Search Path | Engine | Runs On | Purpose |
| --- | --- | --- | --- |
| Keyword Search | Meilisearch | Data cluster | Full-text matching with typo tolerance |
| Semantic Search | Qdrant | Data cluster | Meaning-based similarity search |
| Global Discovery | Algolia | Platform (metadata only) | Dataset catalog browsing |

Keyword Search

Keyword search uses Meilisearch, a full-text search engine running on each data cluster. It is optimized for fast, typo-tolerant text matching with faceted filtering.

Capabilities

| Feature | Description |
| --- | --- |
| Typo tolerance | Finds results even with spelling mistakes (configurable edit distance) |
| Faceted filtering | Filter by dataset, status, MIME type, tags, and other metadata fields |
| Highlighted snippets | Returns matching text with highlights showing where the query matched |
| Sub-50ms latency | Typical query response time under 50 milliseconds |
| Schema-aware indexing | Field mappings differ per dataset type — text fields are searchable, metadata fields are filterable |

How It Works

When a document is processed by a pipeline, the registration step indexes its content into Meilisearch. The indexed document includes:

  • Document name and title
  • Description and abstract (if present)
  • Extracted text content
  • Metadata fields defined by the dataset schema

The search request is routed from the user through the platform proxy to the data cluster's Data API, which queries Meilisearch and returns results.

Search Request

// POST /api/v1/search
{
  "query": "machine learning protein folding",
  "dataset_ids": ["uuid-1", "uuid-2"],
  "filters": {
    "status": "processed",
    "mime_types": ["application/pdf"],
    "tags": ["biology"]
  },
  "limit": 20,
  "offset": 0
}
When to Use Keyword Search

  • Finding documents by title, author, or known terms
  • Filtering by metadata attributes (date, type, tags)
  • Quick lookups where you know approximately what you are looking for
  • Building filtered views of document collections
tip

Keyword search is non-blocking for the platform — if Meilisearch is temporarily unavailable, the data cluster continues to function normally. Vector search and data operations are unaffected.
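As a sketch, a keyword search request can be assembled and sent through the platform proxy like this. The base URL, bearer-token header, and helper names are assumptions for illustration, not a documented client library; only the request body shape comes from the example above.

```python
import json
import urllib.request


def build_keyword_search_request(query, dataset_ids, filters=None, limit=20, offset=0):
    """Assemble the JSON body for POST /api/v1/search (shape from the example above)."""
    body = {"query": query, "dataset_ids": dataset_ids, "limit": limit, "offset": offset}
    if filters:
        body["filters"] = filters
    return body


def post_search(base_url, token, body):
    """Send the request through the platform proxy (base_url and token are assumptions)."""
    req = urllib.request.Request(
        f"{base_url}/api/v1/search",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {token}"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


body = build_keyword_search_request(
    "machine learning protein folding",
    ["uuid-1", "uuid-2"],
    filters={"status": "processed", "mime_types": ["application/pdf"], "tags": ["biology"]},
)
```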

Semantic Search

Semantic search uses Qdrant, a vector database running on each data cluster. Instead of matching keywords, it finds documents whose meaning is similar to the query, even if they use completely different words.

How It Works

During pipeline processing, each document is split into chunks and each chunk is converted into an embedding vector — a numerical representation of its meaning. These vectors are stored in Qdrant alongside the chunk text and metadata.

When a search query arrives, it is converted into a vector using the same embedding model, and Qdrant finds the stored vectors most similar to it using approximate nearest neighbor search.
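The matching step can be illustrated with a toy example: both chunks and the query become vectors, and stored vectors are ranked by cosine similarity. The three-dimensional vectors below are made up for illustration; a real deployment uses the cluster's embedding model and Qdrant's approximate nearest neighbor index rather than this exact scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


# Toy "embeddings" keyed by chunk id (illustrative only).
stored = {
    "chunk-a": [0.9, 0.1, 0.0],   # about protein folding
    "chunk-b": [0.1, 0.9, 0.1],   # about climate policy
    "chunk-c": [0.7, 0.3, 0.1],   # about structure prediction
}
query_vec = [0.85, 0.15, 0.05]    # embedding of the user's query

# Exact nearest-neighbour ranking; Qdrant approximates this at scale.
ranked = sorted(stored, key=lambda cid: cosine(stored[cid], query_vec), reverse=True)
```

Even though "chunk-a" and the query might share no keywords, their vectors point in similar directions, so semantic search surfaces the match.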

Two Search Modes

The platform offers two vector search endpoints:

Chunk Search (Fast)

Returns individual chunks with their similarity scores. Does not fetch the full document — just the matching text segments.

// POST /api/v1/vector/chunks
{
  "query": "novel approaches to protein structure prediction",
  "dataset_ids": ["uuid-1"],
  "score_threshold": 0.7,
  "limit": 10
}

Response includes: chunk text, similarity score, entry ID, dataset ID, chunk index, figure references.

Latency:

  • Pre-computed query vector: under 100ms
  • Text query (auto-embedding): 500ms to 2 seconds (includes embedding generation)

Entry Search (Full Documents)

Performs a chunk search, groups results by entry, and fetches the full processed content from MinIO for each matching entry. Slower but returns complete documents.

// POST /api/v1/vector/entries
{
  "query": "protein folding mechanisms",
  "dataset_ids": ["uuid-1"],
  "max_chunks_per_entry": 3,
  "limit": 5
}

Response includes: full document content, matching chunks with scores, entry metadata, figures.
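The grouping step of entry search can be sketched as follows. Here `chunk_hits` stands in for a chunk-search result set, and the field names are assumptions based on the response fields listed above, not the platform's internal data structures.

```python
from collections import defaultdict


def group_chunks_by_entry(chunk_hits, max_chunks_per_entry=3):
    """Group chunk-level hits by entry, keeping each entry's best-scoring chunks.

    chunk_hits: list of dicts with "entry_id", "score", "text" (illustrative shape).
    Returns entries ordered by their best chunk score, mirroring what entry
    search does before fetching full content from MinIO.
    """
    by_entry = defaultdict(list)
    for hit in chunk_hits:
        by_entry[hit["entry_id"]].append(hit)

    entries = []
    for entry_id, hits in by_entry.items():
        hits.sort(key=lambda h: h["score"], reverse=True)
        entries.append({"entry_id": entry_id, "chunks": hits[:max_chunks_per_entry]})

    entries.sort(key=lambda e: e["chunks"][0]["score"], reverse=True)
    return entries


hits = [
    {"entry_id": "e1", "score": 0.91, "text": "folding pathway overview"},
    {"entry_id": "e2", "score": 0.88, "text": "chaperone mechanisms"},
    {"entry_id": "e1", "score": 0.75, "text": "misfolding and disease"},
]
grouped = group_chunks_by_entry(hits, max_chunks_per_entry=3)
```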

When to Use Semantic Search

  • Finding conceptually related documents ("papers about climate change effects on agriculture")
  • Exploratory research where you do not know exact terminology
  • Cross-language discovery (embedding models support multilingual content)
  • AI agent workflows that need to find relevant context

Embedding Providers

The platform supports multiple embedding providers, configurable per data cluster:

| Provider | Notes |
| --- | --- |
| OpenAI | Via OpenAI API |
| Mistral | Via Mistral API |
| Google | Via Google AI API |
info

All entries in a dataset should use the same embedding model. Mixing models within a single Qdrant collection produces inconsistent search results because vectors from different models are not comparable.
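One cheap guard against accidental model mixing is to check each vector's dimension against the collection's configured dimension before upserting. This is a sketch of that idea, not a built-in platform feature: different providers often use different dimensions, though a matching dimension alone does not prove the same model was used.

```python
def check_embedding_dim(vector, collection_dim):
    """Reject vectors whose dimension does not match the collection's.

    Catches the common case where two providers use different dimensions;
    it cannot detect two different models that happen to share a dimension.
    """
    if len(vector) != collection_dim:
        raise ValueError(
            f"vector has dim {len(vector)}, collection expects {collection_dim}; "
            "was a different embedding model used?"
        )
    return True
```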

Multi-Cluster Fan-Out

When datasets are distributed across multiple data clusters, a single search query can span all of them simultaneously. This is called multi-cluster fan-out.

How It Works

  1. Route resolution — The platform resolves the target dataset IDs to their hosting cluster IDs
  2. Group by cluster — Datasets are grouped by the cluster they reside on
  3. Parallel search — Async search requests are fired to all relevant clusters simultaneously
  4. Merge results — Results from all clusters are merged and sorted by similarity score
  5. Return top-k — The top results across all clusters are returned to the caller
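The five steps above can be sketched with asyncio. The routing table and the `search_cluster` stub are assumptions standing in for the platform's route resolution and its authenticated per-cluster proxy calls.

```python
import asyncio
from collections import defaultdict

# Steps 1-2: resolve dataset -> hosting cluster (routing table is illustrative).
ROUTES = {"ds-1": "cluster-a", "ds-2": "cluster-a", "ds-3": "cluster-b"}


async def search_cluster(cluster_id, dataset_ids, query):
    """Stand-in for an authenticated proxy call to one cluster's vector search."""
    fake_scores = {"cluster-a": 0.92, "cluster-b": 0.97}
    await asyncio.sleep(0)  # real code awaits an HTTP request here
    return [{"cluster": cluster_id, "dataset_ids": dataset_ids,
             "score": fake_scores[cluster_id]}]


async def fan_out(query, dataset_ids, top_k=10):
    groups = defaultdict(list)
    for ds in dataset_ids:
        groups[ROUTES[ds]].append(ds)                        # step 2: group by cluster
    tasks = [search_cluster(c, ds, query) for c, ds in groups.items()]
    results = await asyncio.gather(*tasks)                   # step 3: parallel search
    merged = [hit for batch in results for hit in batch]
    merged.sort(key=lambda h: h["score"], reverse=True)      # step 4: merge by score
    return merged[:top_k]                                    # step 5: return top-k


hits = asyncio.run(fan_out("protein folding", ["ds-1", "ds-2", "ds-3"]))
```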

When Multi-Cluster Fan-Out Is Used

Multi-cluster fan-out is primarily used by the platform's AI workflow engine (Workers). When an AI workflow runs a vector search node, it automatically resolves datasets to clusters and fans out the search. This is transparent to the user — a single search query in a workflow searches across all relevant clusters regardless of where the data is physically stored.

note

Multi-cluster fan-out goes through the platform proxy for each cluster. Each request is independently authenticated and logged. The platform merges results in memory and does not persist the search results.

Global Dataset Discovery

Global discovery uses Algolia, a hosted search service that indexes dataset metadata on the platform side. This enables users to find datasets across all clusters without knowing which cluster hosts them.

What Is Indexed

Algolia indexes only dataset-level metadata:

  • Dataset name
  • Description
  • Tags

Algolia never receives: document content, processed text, embedding vectors, or entry-level data.

How It Works

A batch sync runs every 30 seconds and pushes updated dataset metadata to Algolia. The platform frontend uses Algolia's search widgets to provide instant dataset discovery on the catalog and marketplace pages.
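The metadata-only contract can be sketched as a whitelist applied before each sync push. The field names and the `sync_record` helper are assumptions for illustration, not the platform's actual sync code.

```python
# Dataset-level metadata only; everything else never leaves the platform.
ALGOLIA_FIELDS = {"name", "description", "tags"}


def sync_record(dataset):
    """Project a dataset record down to the fields Algolia is allowed to see.

    Document content, processed text, vectors, and entry-level data are
    dropped before the record is pushed.
    """
    return {k: v for k, v in dataset.items() if k in ALGOLIA_FIELDS}


record = sync_record({
    "name": "Genomics Papers",
    "description": "Curated genomics literature",
    "tags": ["biology"],
    "entries": [{"content": "full document text"}],  # never synced
})
```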

When to Use Global Discovery

  • Browsing available datasets across the organization
  • Searching for datasets by topic, provider, or tag
  • Marketplace-style discovery of shared and public datasets

Search Configuration

Per-Dataset Configuration

Each dataset's type and schema affect how its entries are indexed in Meilisearch. Field mappings determine which fields are searchable, filterable, and sortable:

| Field Type | Searchable | Filterable | Sortable |
| --- | --- | --- | --- |
| Text content | Yes | No | No |
| Title / name | Yes | No | Yes |
| Tags | Yes | Yes | No |
| Status | No | Yes | Yes |
| MIME type | No | Yes | No |
| Dataset ID | No | Yes | No |
| Date fields | No | Yes | Yes |
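A field-mapping table like the one above translates directly into Meilisearch index settings. `searchableAttributes`, `filterableAttributes`, and `sortableAttributes` are real Meilisearch settings keys; the mapping input format itself is an assumption for this sketch.

```python
def build_index_settings(field_mappings):
    """Turn per-field flags into a Meilisearch settings payload.

    field_mappings: {field_name: {"searchable": bool, "filterable": bool,
                                  "sortable": bool}} (missing flags default to False).
    """
    return {
        "searchableAttributes": [f for f, m in field_mappings.items() if m.get("searchable")],
        "filterableAttributes": [f for f, m in field_mappings.items() if m.get("filterable")],
        "sortableAttributes": [f for f, m in field_mappings.items() if m.get("sortable")],
    }


settings = build_index_settings({
    "text_content": {"searchable": True},
    "title": {"searchable": True, "sortable": True},
    "tags": {"searchable": True, "filterable": True},
    "status": {"filterable": True, "sortable": True},
})
```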

Per-Cluster Configuration

Each data cluster's embedding configuration determines which model and provider are used for vector search. This is set at the data cluster level and applies to all datasets on that cluster.

Choosing the Right Search Path

| Need | Search Path | Why |
| --- | --- | --- |
| Find a specific document by title | Keyword search | Exact matching with typo tolerance |
| Find documents about a concept | Semantic search | Meaning-based similarity |
| Filter by metadata attributes | Keyword search | Faceted filtering support |
| Search across multiple clusters | Multi-cluster fan-out | Transparent cross-cluster coordination |
| Browse available datasets | Global discovery | Platform-wide catalog search |
| AI agent document retrieval | Semantic search | Best relevance for automated analysis |

Next Steps