Search
The Alien Intelligence platform provides three complementary search paths, each optimized for a different type of query. Keyword search finds exact and near-exact matches in document text. Semantic search finds conceptually similar content using embedding vectors. Global discovery searches across all clusters for datasets by name and description.
All three search paths operate on data that stays in your isolated data cluster — the platform coordinates search requests but does not store or index document content centrally.
Search Architecture Overview
All search requests are routed through the platform's authenticated backend proxy. No data cluster is directly exposed to the internet — every request passes through the backend, which enforces authentication, authorization, and audit logging before forwarding to the target data cluster.
The three search paths available through this proxy:
| Search Path | Engine | Runs On | Purpose |
|---|---|---|---|
| Keyword Search | Meilisearch | Data cluster | Full-text matching with typo tolerance |
| Semantic Search | Qdrant | Data cluster | Meaning-based similarity search |
| Global Discovery | Algolia | Platform (metadata only) | Dataset catalog browsing |
Keyword Search
Keyword search uses Meilisearch, a full-text search engine running on each data cluster. It is optimized for fast, typo-tolerant text matching with faceted filtering.
Capabilities
| Feature | Description |
|---|---|
| Typo tolerance | Finds results even with spelling mistakes (configurable edit distance) |
| Faceted filtering | Filter by dataset, status, MIME type, tags, and other metadata fields |
| Highlighted snippets | Returns matching text with highlights showing where the query matched |
| Sub-50ms latency | Typical query response time under 50 milliseconds |
| Schema-aware indexing | Field mappings differ per dataset type — text fields are searchable, metadata fields are filterable |
How It Works
When a document is processed by a pipeline, the registration step indexes its content into Meilisearch. The indexed document includes:
- Document name and title
- Description and abstract (if present)
- Extracted text content
- Metadata fields defined by the dataset schema
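Such an index document might be assembled roughly like this sketch (field names such as `extracted_text` and `metadata` are illustrative, not the platform's actual schema):

```python
def build_index_document(entry: dict, schema_fields: list[str]) -> dict:
    """Assemble the document the registration step would push to Meilisearch.

    Only the fields listed above are included: name/title, optional
    description and abstract, extracted text, and schema-declared metadata.
    """
    doc = {
        "id": entry["id"],
        "name": entry.get("name"),
        "title": entry.get("title"),
        "text": entry.get("extracted_text", ""),
    }
    # Description and abstract are indexed only if present.
    for optional in ("description", "abstract"):
        if entry.get(optional):
            doc[optional] = entry[optional]
    # Copy only the metadata fields declared by the dataset schema.
    metadata = entry.get("metadata", {})
    doc.update({f: metadata[f] for f in schema_fields if f in metadata})
    return doc
```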
The search request is routed from the user through the platform proxy to the data cluster's Data API, which queries Meilisearch and returns results.
Search Request
`POST /api/v1/search`

```json
{
  "query": "machine learning protein folding",
  "dataset_ids": ["uuid-1", "uuid-2"],
  "filters": {
    "status": "processed",
    "mime_types": ["application/pdf"],
    "tags": ["biology"]
  },
  "limit": 20,
  "offset": 0
}
```
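A helper that assembles this request body client-side might look like the following sketch (the endpoint and field names come from the example above; the helper itself is hypothetical):

```python
def build_search_request(query, dataset_ids, status=None, mime_types=None,
                         tags=None, limit=20, offset=0):
    """Build the body for POST /api/v1/search.

    Filter keys mirror the example above; the filters object is only
    included when at least one filter is set.
    """
    filters = {k: v for k, v in {
        "status": status,
        "mime_types": mime_types,
        "tags": tags,
    }.items() if v}
    body = {"query": query, "dataset_ids": dataset_ids,
            "limit": limit, "offset": offset}
    if filters:
        body["filters"] = filters
    return body
```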
When to Use Keyword Search
- Finding documents by title, author, or known terms
- Filtering by metadata attributes (date, type, tags)
- Quick lookups where you know approximately what you are looking for
- Building filtered views of document collections
Keyword search is non-blocking for the platform — if Meilisearch is temporarily unavailable, the data cluster continues to function normally. Vector search and data operations are unaffected.
Semantic Search (Vector Search)
Semantic search uses Qdrant, a vector database running on each data cluster. Instead of matching keywords, it finds documents whose meaning is similar to the query, even if they use completely different words.
How It Works
During pipeline processing, each document is split into chunks and each chunk is converted into an embedding vector — a numerical representation of its meaning. These vectors are stored in Qdrant alongside the chunk text and metadata.
When a search query arrives, it is converted into a vector using the same embedding model, and Qdrant finds the stored vectors most similar to it using approximate nearest neighbor search.
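The ranking step can be illustrated with an exact cosine-similarity scan in plain Python. Qdrant produces the same kind of ranking but uses approximate nearest neighbor indexing to stay fast at scale; this is only a toy-scale sketch:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, stored, k=2):
    """stored: list of (chunk_id, vector) pairs.

    Exact scan over all vectors; an ANN index gives the same top results
    without comparing the query against every stored vector.
    """
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in stored]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]
```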
Two Search Modes
The platform offers two vector search endpoints:
Chunk Search (Fast)
Returns individual chunks with their similarity scores. Does not fetch the full document — just the matching text segments.
`POST /api/v1/vector/chunks`

```json
{
  "query": "novel approaches to protein structure prediction",
  "dataset_ids": ["uuid-1"],
  "score_threshold": 0.7,
  "limit": 10
}
```
Response includes: chunk text, similarity score, entry ID, dataset ID, chunk index, figure references.
Latency:
- Pre-computed query vector: under 100ms
- Text query (auto-embedding): 500ms to 2 seconds (includes embedding generation)
Entry Search (Full Documents)
Performs a chunk search, groups results by entry, and fetches the full processed content from MinIO for each matching entry. Slower but returns complete documents.
`POST /api/v1/vector/entries`

```json
{
  "query": "protein folding mechanisms",
  "dataset_ids": ["uuid-1"],
  "max_chunks_per_entry": 3,
  "limit": 5
}
```
Response includes: full document content, matching chunks with scores, entry metadata, figures.
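The group-and-rank step of entry search can be sketched as follows (the MinIO content fetch is omitted, and field names are illustrative):

```python
from collections import defaultdict

def group_chunks_into_entries(chunk_hits, max_chunks_per_entry=3, limit=5):
    """chunk_hits: list of dicts with entry_id, score, and text.

    Mirrors the entry-search behavior described above: group chunk hits by
    entry, keep the best chunks per entry, and rank entries by their best
    chunk's score. Fetching full content per entry would follow this step.
    """
    by_entry = defaultdict(list)
    for hit in chunk_hits:
        by_entry[hit["entry_id"]].append(hit)
    entries = []
    for entry_id, hits in by_entry.items():
        hits.sort(key=lambda h: h["score"], reverse=True)
        entries.append({
            "entry_id": entry_id,
            "best_score": hits[0]["score"],
            "chunks": hits[:max_chunks_per_entry],
        })
    entries.sort(key=lambda e: e["best_score"], reverse=True)
    return entries[:limit]
```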
When to Use Semantic Search
- Finding conceptually related documents ("papers about climate change effects on agriculture")
- Exploratory research where you do not know exact terminology
- Cross-language discovery (embedding models support multilingual content)
- AI agent workflows that need to find relevant context
Embedding Providers
The platform supports multiple embedding providers, configurable per data cluster:
| Provider | Notes |
|---|---|
| OpenAI | Via OpenAI API |
| Mistral | Via Mistral API |
| Google | Via Google AI API |
All entries in a dataset should use the same embedding model. Mixing models within a single Qdrant collection produces inconsistent search results because vectors from different models are not comparable.
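A registration step could guard against accidental model mixing with a check along these lines (a hypothetical helper, not the platform's actual code):

```python
def assert_consistent_embedding_model(collection_model: str,
                                      entry_model: str) -> None:
    """Refuse to store a vector produced by a different embedding model
    than the one the collection was created with, since vectors from
    different models are not comparable."""
    if entry_model != collection_model:
        raise ValueError(
            f"Embedding model mismatch: collection uses {collection_model!r}, "
            f"entry was embedded with {entry_model!r}"
        )
```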
Multi-Cluster Fan-Out
When datasets are distributed across multiple data clusters, a single search query can span all of them simultaneously. This is called multi-cluster fan-out.
How It Works
1. Route resolution — The platform resolves the target dataset IDs to their hosting cluster IDs
2. Group by cluster — Datasets are grouped by the cluster they reside on
3. Parallel search — Async search requests are fired to all relevant clusters simultaneously
4. Merge results — Results from all clusters are merged and sorted by similarity score
5. Return top-k — The top results across all clusters are returned to the caller
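The steps above can be sketched with asyncio; the resolver and per-cluster search callables here stand in for platform internals and are purely illustrative:

```python
import asyncio
from collections import defaultdict

async def fan_out_search(query, dataset_ids, resolve_cluster,
                         search_cluster, k=10):
    """resolve_cluster(dataset_id) -> cluster_id;
    search_cluster(cluster_id, query, dataset_ids) -> list of
    (score, result) pairs. Both are stand-ins for platform internals."""
    # Steps 1-2: route resolution, then group datasets by cluster.
    by_cluster = defaultdict(list)
    for ds in dataset_ids:
        by_cluster[resolve_cluster(ds)].append(ds)
    # Step 3: fire searches at all relevant clusters in parallel.
    per_cluster = await asyncio.gather(*(
        search_cluster(cid, query, dsets)
        for cid, dsets in by_cluster.items()
    ))
    # Steps 4-5: merge, sort by similarity score, return top-k.
    merged = [hit for cluster_hits in per_cluster for hit in cluster_hits]
    merged.sort(key=lambda hit: hit[0], reverse=True)
    return merged[:k]
```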
When Multi-Cluster Fan-Out Is Used
Multi-cluster fan-out is primarily used by the platform's AI workflow engine (Workers). When an AI workflow runs a vector search node, it automatically resolves datasets to clusters and fans out the search. This is transparent to the user — a single search query in a workflow searches across all relevant clusters regardless of where the data is physically stored.
Multi-cluster fan-out goes through the platform proxy for each cluster. Each request is independently authenticated and logged. The platform merges results in memory and does not persist the search results.
Global Dataset Discovery
Global discovery uses Algolia, a hosted search service that indexes dataset metadata on the platform side. This enables users to find datasets across all clusters without knowing which cluster hosts them.
What Is Indexed
Algolia indexes only dataset-level metadata:
- Dataset name
- Description
- Tags
Algolia never receives: document content, processed text, embedding vectors, or entry-level data.
How It Works
Updated dataset metadata is pushed to Algolia as a side effect of the platform's batch sync, which runs every 30 seconds. The platform frontend uses Algolia's search widgets to provide instant dataset discovery on the catalog and marketplace pages.
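The metadata-only projection might look like this sketch (field names are illustrative; `objectID` is Algolia's required record key):

```python
def build_discovery_record(dataset: dict) -> dict:
    """Project a dataset down to the metadata-only record pushed to
    Algolia. Anything not listed here (document content, processed text,
    vectors, entry-level data) never leaves the platform side."""
    return {
        "objectID": dataset["id"],  # Algolia's required primary key
        "name": dataset["name"],
        "description": dataset.get("description", ""),
        "tags": dataset.get("tags", []),
    }
```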
When to Use Global Discovery
- Browsing available datasets across the organization
- Searching for datasets by topic, provider, or tag
- Marketplace-style discovery of shared and public datasets
Search Configuration
Per-Dataset Configuration
Each dataset's type and schema affect how its entries are indexed in Meilisearch. Field mappings determine which fields are searchable, filterable, and sortable:
| Field Type | Searchable | Filterable | Sortable |
|---|---|---|---|
| Text content | Yes | No | No |
| Title / name | Yes | No | Yes |
| Tags | Yes | Yes | No |
| Status | No | Yes | Yes |
| MIME type | No | Yes | No |
| Dataset ID | No | Yes | No |
| Date fields | No | Yes | Yes |
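The mapping above corresponds to Meilisearch index settings along these lines. The setting keys (`searchableAttributes`, `filterableAttributes`, `sortableAttributes`) are standard Meilisearch configuration; the attribute names themselves are illustrative:

```python
# Index settings derived from the field-mapping table above.
INDEX_SETTINGS = {
    "searchableAttributes": ["text", "title", "tags"],
    "filterableAttributes": ["tags", "status", "mime_type",
                             "dataset_id", "created_at"],
    "sortableAttributes": ["title", "status", "created_at"],
}
```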
Per-Cluster Configuration
Each data cluster's embedding configuration determines which model and provider are used for vector search. This is set at the data cluster level and applies to all datasets on that cluster.
Choosing the Right Search Path
| Need | Search Path | Why |
|---|---|---|
| Find a specific document by title | Keyword search | Exact matching with typo tolerance |
| Find documents about a concept | Semantic search | Meaning-based similarity |
| Filter by metadata attributes | Keyword search | Faceted filtering support |
| Search across multiple clusters | Multi-cluster fan-out | Transparent cross-cluster coordination |
| Browse available datasets | Global discovery | Platform-wide catalog search |
| AI agent document retrieval | Semantic search | Best relevance for automated analysis |
Next Steps
- Pipelines — How documents are processed to become searchable
- Datasets and Entries — The data model that search operates on
- Data Clusters — The infrastructure where search engines run
- Search and Query — Step-by-step guide to searching your documents
- AI Agent Integration — How AI agents use search to access your data