Search

The Alien Intelligence platform provides three complementary search paths, each optimized for a different type of query. Keyword search finds exact and near-exact matches in document text. Semantic search finds conceptually similar content using embedding vectors. Global discovery searches across all clusters for datasets by name and description.

All three search paths operate on data that stays in your isolated data cluster — the platform coordinates search requests but does not store or index document content centrally.

Search Architecture Overview

All search requests are routed through the platform's authenticated backend proxy. No data cluster is directly exposed to the internet — every request passes through the backend, which enforces authentication, authorization, and audit logging before forwarding to the target data cluster.

The three search paths available through this proxy:

| Search Path | Engine | Runs On | Purpose |
| --- | --- | --- | --- |
| Keyword Search | Meilisearch | Data cluster | Full-text matching with typo tolerance |
| Semantic Search | Qdrant | Data cluster | Meaning-based similarity search |
| Global Discovery | Algolia | Platform (metadata only) | Dataset catalog browsing |

Keyword Search

Keyword search uses Meilisearch, a full-text search engine running on each data cluster. It is optimized for fast, typo-tolerant text matching with faceted filtering.

Capabilities

| Feature | Description |
| --- | --- |
| Typo tolerance | Finds results even with spelling mistakes (configurable edit distance) |
| Faceted filtering | Filter by dataset, status, MIME type, tags, and other metadata fields |
| Highlighted snippets | Returns matching text with highlights showing where the query matched |
| Sub-50ms latency | Typical query response time under 50 milliseconds |
| Schema-aware indexing | Field mappings differ per dataset type — text fields are searchable, metadata fields are filterable |

How It Works

When a document is processed by a pipeline, the registration step indexes its content into Meilisearch. The indexed document includes:

  • Document name and title
  • Description and abstract (if present)
  • Extracted text content
  • Metadata fields defined by the dataset schema

The search request is routed from the user through the platform proxy to the data cluster's Data API, which queries Meilisearch and returns results.

Search Request

// POST /api/v1/search
{
  "query": "machine learning protein folding",
  "dataset_ids": ["uuid-1", "uuid-2"],
  "filters": {
    "status": "processed",
    "mime_types": ["application/pdf"],
    "tags": ["biology"]
  },
  "limit": 20,
  "offset": 0
}
When to Use Keyword Search

  • Finding documents by title, author, or known terms
  • Filtering by metadata attributes (date, type, tags)
  • Quick lookups where you know approximately what you are looking for
  • Building filtered views of document collections
tip

Keyword search is non-blocking for the platform — if Meilisearch is temporarily unavailable, the data cluster continues to function normally. Vector search and data operations are unaffected.
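As a sketch, a keyword search request can be assembled and sent through the platform proxy like this. The base URL, bearer-token header, and helper names are assumptions for illustration, not a documented client library; only the request body shape comes from the example above.

```python
import json
import urllib.request


def build_keyword_search_request(query, dataset_ids, filters=None, limit=20, offset=0):
    """Assemble the JSON body for POST /api/v1/search (shape from the example above)."""
    body = {"query": query, "dataset_ids": dataset_ids, "limit": limit, "offset": offset}
    if filters:
        body["filters"] = filters
    return body


def post_search(base_url, token, body):
    """Send the request through the platform proxy (base_url and token are assumptions)."""
    req = urllib.request.Request(
        f"{base_url}/api/v1/search",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {token}"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


body = build_keyword_search_request(
    "machine learning protein folding",
    ["uuid-1", "uuid-2"],
    filters={"status": "processed", "mime_types": ["application/pdf"], "tags": ["biology"]},
)
```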

Semantic Search

Semantic search uses Qdrant, a vector database running on each data cluster. Instead of matching keywords, it finds documents whose meaning is similar to the query, even if they use completely different words.

How It Works

During pipeline processing, each document is split into chunks and each chunk is converted into an embedding vector — a numerical representation of its meaning. These vectors are stored in Qdrant alongside the chunk text and metadata.

When a search query arrives, it is converted into a vector using the same embedding model, and Qdrant finds the stored vectors most similar to it using approximate nearest neighbor search.
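The matching step can be illustrated with a toy example: both chunks and the query become vectors, and stored vectors are ranked by cosine similarity. The three-dimensional vectors below are made up for illustration; a real deployment uses the cluster's embedding model and Qdrant's approximate nearest neighbor index rather than this exact scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


# Toy "embeddings" keyed by chunk id (illustrative only).
stored = {
    "chunk-a": [0.9, 0.1, 0.0],   # about protein folding
    "chunk-b": [0.1, 0.9, 0.1],   # about climate policy
    "chunk-c": [0.7, 0.3, 0.1],   # about structure prediction
}
query_vec = [0.85, 0.15, 0.05]    # embedding of the user's query

# Exact nearest-neighbour ranking; Qdrant approximates this at scale.
ranked = sorted(stored, key=lambda cid: cosine(stored[cid], query_vec), reverse=True)
```

Even though "chunk-a" and the query might share no keywords, their vectors point in similar directions, so semantic search surfaces the match.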

Two Search Modes

The platform offers two vector search endpoints:

Chunk Search (Fast)

Returns individual chunks with their similarity scores. Does not fetch the full document — just the matching text segments.

// POST /api/v1/vector/chunks
{
  "query": "novel approaches to protein structure prediction",
  "dataset_ids": ["uuid-1"],
  "score_threshold": 0.7,
  "limit": 10
}

Response includes: chunk text, similarity score, entry ID, dataset ID, chunk index, figure references.

Latency:

  • Pre-computed query vector: under 100ms
  • Text query (auto-embedding): 500ms to 2 seconds (includes embedding generation)

Entry Search (Full Documents)

Performs a chunk search, groups results by entry, and fetches the full processed content from MinIO for each matching entry. Slower but returns complete documents.

// POST /api/v1/vector/entries
{
  "query": "protein folding mechanisms",
  "dataset_ids": ["uuid-1"],
  "max_chunks_per_entry": 3,
  "limit": 5
}

Response includes: full document content, matching chunks with scores, entry metadata, figures.
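The grouping step of entry search can be sketched as follows. Here `chunk_hits` stands in for a chunk-search result set, and the field names are assumptions based on the response fields listed above, not the platform's internal data structures.

```python
from collections import defaultdict


def group_chunks_by_entry(chunk_hits, max_chunks_per_entry=3):
    """Group chunk-level hits by entry, keeping each entry's best-scoring chunks.

    chunk_hits: list of dicts with "entry_id", "score", "text" (illustrative shape).
    Returns entries ordered by their best chunk score, mirroring what entry
    search does before fetching full content from MinIO.
    """
    by_entry = defaultdict(list)
    for hit in chunk_hits:
        by_entry[hit["entry_id"]].append(hit)

    entries = []
    for entry_id, hits in by_entry.items():
        hits.sort(key=lambda h: h["score"], reverse=True)
        entries.append({"entry_id": entry_id, "chunks": hits[:max_chunks_per_entry]})

    entries.sort(key=lambda e: e["chunks"][0]["score"], reverse=True)
    return entries


hits = [
    {"entry_id": "e1", "score": 0.91, "text": "folding pathway overview"},
    {"entry_id": "e2", "score": 0.88, "text": "chaperone mechanisms"},
    {"entry_id": "e1", "score": 0.75, "text": "misfolding and disease"},
]
grouped = group_chunks_by_entry(hits, max_chunks_per_entry=3)
```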

When to Use Semantic Search

  • Finding conceptually related documents ("papers about climate change effects on agriculture")
  • Exploratory research where you do not know exact terminology
  • Cross-language discovery (embedding models support multilingual content)
  • AI agent workflows that need to find relevant context

Embedding Providers

The platform supports multiple embedding providers, configurable per data cluster:

| Provider | Notes |
| --- | --- |
| OpenAI | Via OpenAI API |
| Mistral | Via Mistral API |
| Google | Via Google AI API |
info

All entries in a dataset should use the same embedding model. Mixing models within a single Qdrant collection produces inconsistent search results because vectors from different models are not comparable.
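One cheap guard against accidental model mixing is to check each vector's dimension against the collection's configured dimension before upserting. This is a sketch of that idea, not a built-in platform feature: different providers often use different dimensions, though a matching dimension alone does not prove the same model was used.

```python
def check_embedding_dim(vector, collection_dim):
    """Reject vectors whose dimension does not match the collection's.

    Catches the common case where two providers use different dimensions;
    it cannot detect two different models that happen to share a dimension.
    """
    if len(vector) != collection_dim:
        raise ValueError(
            f"vector has dim {len(vector)}, collection expects {collection_dim}; "
            "was a different embedding model used?"
        )
    return True
```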

Multi-Cluster Fan-Out

When datasets are distributed across multiple data clusters, a single search query can span all of them simultaneously. This is called multi-cluster fan-out.

How It Works

  1. Route resolution — The platform resolves the target dataset IDs to their hosting cluster IDs
  2. Group by cluster — Datasets are grouped by the cluster they reside on
  3. Parallel search — Async search requests are fired to all relevant clusters simultaneously
  4. Merge results — Results from all clusters are merged and sorted by similarity score
  5. Return top-k — The top results across all clusters are returned to the caller
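The five steps above can be sketched with asyncio. The routing table and the `search_cluster` stub are assumptions standing in for the platform's route resolution and its authenticated per-cluster proxy calls.

```python
import asyncio
from collections import defaultdict

# Steps 1-2: resolve dataset -> hosting cluster (routing table is illustrative).
ROUTES = {"ds-1": "cluster-a", "ds-2": "cluster-a", "ds-3": "cluster-b"}


async def search_cluster(cluster_id, dataset_ids, query):
    """Stand-in for an authenticated proxy call to one cluster's vector search."""
    fake_scores = {"cluster-a": 0.92, "cluster-b": 0.97}
    await asyncio.sleep(0)  # real code awaits an HTTP request here
    return [{"cluster": cluster_id, "dataset_ids": dataset_ids,
             "score": fake_scores[cluster_id]}]


async def fan_out(query, dataset_ids, top_k=10):
    groups = defaultdict(list)
    for ds in dataset_ids:
        groups[ROUTES[ds]].append(ds)                        # step 2: group by cluster
    tasks = [search_cluster(c, ds, query) for c, ds in groups.items()]
    results = await asyncio.gather(*tasks)                   # step 3: parallel search
    merged = [hit for batch in results for hit in batch]
    merged.sort(key=lambda h: h["score"], reverse=True)      # step 4: merge by score
    return merged[:top_k]                                    # step 5: return top-k


hits = asyncio.run(fan_out("protein folding", ["ds-1", "ds-2", "ds-3"]))
```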

When Multi-Cluster Fan-Out Is Used

Multi-cluster fan-out is primarily used by the platform's AI workflow engine (Workers). When an AI workflow runs a vector search node, it automatically resolves datasets to clusters and fans out the search. This is transparent to the user — a single search query in a workflow searches across all relevant clusters regardless of where the data is physically stored.

note

Multi-cluster fan-out goes through the platform proxy for each cluster. Each request is independently authenticated and logged. The platform merges results in memory and does not persist the search results.

Global Dataset Discovery

Global discovery uses Algolia, a hosted search service that indexes dataset metadata on the platform side. This enables users to find datasets across all clusters without knowing which cluster hosts them.

What Is Indexed

Algolia indexes only dataset-level metadata:

  • Dataset name
  • Description
  • Tags

Algolia never receives: document content, processed text, embedding vectors, or entry-level data.

How It Works

A batch sync runs every 30 seconds and pushes updated dataset metadata to Algolia. The platform frontend uses Algolia's search widgets to provide instant dataset discovery on the catalog and marketplace pages.
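The metadata-only contract can be sketched as a whitelist applied before each sync push. The field names and the `sync_record` helper are assumptions for illustration, not the platform's actual sync code.

```python
# Dataset-level metadata only; everything else never leaves the platform.
ALGOLIA_FIELDS = {"name", "description", "tags"}


def sync_record(dataset):
    """Project a dataset record down to the fields Algolia is allowed to see.

    Document content, processed text, vectors, and entry-level data are
    dropped before the record is pushed.
    """
    return {k: v for k, v in dataset.items() if k in ALGOLIA_FIELDS}


record = sync_record({
    "name": "Genomics Papers",
    "description": "Curated genomics literature",
    "tags": ["biology"],
    "entries": [{"content": "full document text"}],  # never synced
})
```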

When to Use Global Discovery

  • Browsing available datasets across the organization
  • Searching for datasets by topic, provider, or tag
  • Marketplace-style discovery of shared and public datasets

Search Configuration

Per-Dataset Configuration

Each dataset's type and schema affect how its entries are indexed in Meilisearch. Field mappings determine which fields are searchable, filterable, and sortable:

| Field Type | Searchable | Filterable | Sortable |
| --- | --- | --- | --- |
| Text content | Yes | No | No |
| Title / name | Yes | No | Yes |
| Tags | Yes | Yes | No |
| Status | No | Yes | Yes |
| MIME type | No | Yes | No |
| Dataset ID | No | Yes | No |
| Date fields | No | Yes | Yes |
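A field-mapping table like the one above translates directly into Meilisearch index settings. `searchableAttributes`, `filterableAttributes`, and `sortableAttributes` are real Meilisearch settings keys; the mapping input format itself is an assumption for this sketch.

```python
def build_index_settings(field_mappings):
    """Turn per-field flags into a Meilisearch settings payload.

    field_mappings: {field_name: {"searchable": bool, "filterable": bool,
                                  "sortable": bool}} (missing flags default to False).
    """
    return {
        "searchableAttributes": [f for f, m in field_mappings.items() if m.get("searchable")],
        "filterableAttributes": [f for f, m in field_mappings.items() if m.get("filterable")],
        "sortableAttributes": [f for f, m in field_mappings.items() if m.get("sortable")],
    }


settings = build_index_settings({
    "text_content": {"searchable": True},
    "title": {"searchable": True, "sortable": True},
    "tags": {"searchable": True, "filterable": True},
    "status": {"filterable": True, "sortable": True},
})
```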

Per-Cluster Configuration

Each data cluster's embedding configuration determines which model and provider are used for vector search. This is set at the data cluster level and applies to all datasets on that cluster.

Choosing the Right Search Path

| Need | Search Path | Why |
| --- | --- | --- |
| Find a specific document by title | Keyword search | Exact matching with typo tolerance |
| Find documents about a concept | Semantic search | Meaning-based similarity |
| Filter by metadata attributes | Keyword search | Faceted filtering support |
| Search across multiple clusters | Multi-cluster fan-out | Transparent cross-cluster coordination |
| Browse available datasets | Global discovery | Platform-wide catalog search |
| AI agent document retrieval | Semantic search | Best relevance for automated analysis |

Next Steps