
Search and Query

After your documents are uploaded and processed, they are indexed for two types of search: keyword search (exact and near-exact text matching) and semantic search (meaning-based similarity).

The primary way your documents reach end users and clients is through AI workflows and MCP servers. Search is the engine that powers these integrations — AI agents use keyword and semantic search behind the scenes to find, read, and analyze your documents on behalf of users.

For the full architecture behind search, see Search (Concepts).

How Search Powers Your Workflows

When you deploy an MCP server on your data cluster, AI assistants like Claude can search and read your documents directly. The MCP server wraps the search capabilities described on this page as structured tools that AI agents invoke autonomously.

For example, when a user asks an AI assistant "Find papers about CRISPR in agriculture", the assistant:

  1. Calls datacluster_vector_search_chunks — a semantic search over your documents.
  2. Reads the top results using datacluster_get_entry_content.
  3. Synthesizes an answer grounded in your actual data.
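The three steps above can be sketched as a small search, read, synthesize loop. This is only an illustration: the two tool names come from this page, but the argument names (`query`, `limit`, `entry_id`), the shape of the returned hits, and the stub tools are assumptions made so the example runs without a deployed MCP server.

```python
from typing import Callable, Dict, List


def build_grounded_prompt(
    question: str,
    tools: Dict[str, Callable[..., object]],
    top_k: int = 3,
) -> str:
    """Run search -> read -> assemble context using injected tool callables."""
    # Step 1: semantic search over chunks (argument names are assumed).
    hits = tools["datacluster_vector_search_chunks"](query=question, limit=top_k)

    # Step 2: read each distinct entry once, keeping the best-hit order.
    entry_ids: List[int] = []
    for hit in hits:
        if hit["entry_id"] not in entry_ids:
            entry_ids.append(hit["entry_id"])
    documents = [
        tools["datacluster_get_entry_content"](entry_id=entry_id)
        for entry_id in entry_ids
    ]

    # Step 3: a real assistant would hand this context to the model;
    # here we only assemble the grounded prompt.
    context = "\n---\n".join(documents)
    return f"Question: {question}\nContext:\n{context}"


# Hypothetical stand-ins for the MCP tools, for demonstration only.
def fake_search(query: str, limit: int) -> List[dict]:
    hits = [
        {"entry_id": 7, "score": 0.91},
        {"entry_id": 7, "score": 0.85},
        {"entry_id": 2, "score": 0.80},
    ]
    return hits[:limit]


def fake_get_content(entry_id: int) -> str:
    return f"content of entry {entry_id}"


TOOLS = {
    "datacluster_vector_search_chunks": fake_search,
    "datacluster_get_entry_content": fake_get_content,
}

prompt = build_grounded_prompt("Find papers about CRISPR in agriculture", TOOLS)
print(prompt)
```

Because the answer is assembled only from what the search step returns, weak or irrelevant hits in step 1 carry straight through to the final answer.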

This means that the quality of your search results directly determines the quality of AI-generated answers. The search configuration, embedding model, and chunking strategy all affect how well AI agents can serve your users.

See AI Agent Integration for details on connecting AI agents and Deploy MCP for deploying an MCP server.

Prerequisites

  • At least one dataset with entries in the Processed status.
  • Read access to the organization that owns the data cluster.

Search via the Web Interface

Note: The web interface search is intended for debugging and testing only. Use it to verify that documents are indexed correctly and that search results match your expectations. For production use, access search through MCP servers, AI workflows, or the API.

Keyword search finds documents that contain your search terms, with typo tolerance and highlighted snippets.

  1. Navigate to your dataset in the Clusters section.
  2. Use the Search bar at the top of the entries list.
  3. Type your query. Results appear as you type, with matching text highlighted.

You can filter results by:

  • Status — Show only processed entries, or entries in a specific state.
  • Tags — Filter by tags assigned to entries.
  • File type — Filter by MIME type (PDF, DOCX, etc.).

Global Dataset Discovery

The platform-wide search bar in the top navigation searches across dataset names, descriptions, and tags from all clusters in your organization. This is useful for finding datasets when you do not know which cluster hosts them.

Choosing the Right Search Type

| I want to... | Use |
| --- | --- |
| Find a document by title or author | Keyword search |
| Find documents about a concept (even with different wording) | Semantic search (chunks) |
| Get full documents related to a topic | Semantic search (entries) |
| Filter by metadata (tags, status, file type) | Keyword search |
| Browse all available datasets | Global discovery (platform search bar) |

Search API Reference

For programmatic access, use the platform backend proxy to send search requests to your data cluster. All requests are authenticated and routed through https://api.alien.club/clusters/{cluster_id}/proxy.
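If you prefer not to use the generated client, the same requests can be built with the Python standard library. The sketch below only constructs the request and does not send it. The helper name `build_search_request` is hypothetical; the base URL, the `/api/v1/search` path, and the headers mirror the cURL examples on this page.

```python
import json
import urllib.request

# Proxy base URL as documented above; {cluster_id} is filled in per request.
BASE_URL = "https://api.alien.club/clusters/{cluster_id}/proxy"


def build_search_request(
    cluster_id: str, token: str, path: str, payload: dict
) -> urllib.request.Request:
    """Build (but do not send) an authenticated POST request to the proxy."""
    url = BASE_URL.format(cluster_id=cluster_id) + path
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_search_request(
    cluster_id="YOUR_CLUSTER_ID",
    token="oat_YOUR_API_TOKEN",
    path="/api/v1/search",
    payload={"query": "machine learning protein folding", "limit": 20},
)
print(request.get_method(), request.full_url)

# To actually send it:
#   with urllib.request.urlopen(request) as response:
#       body = json.load(response)
```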

Keyword Search

Send a search query with optional filters:

Python

from data_api_client import ApiClient, Configuration
from data_api_client.api.search_api import SearchApi

# Point the client at the platform proxy for your cluster.
config = Configuration(
    host="https://api.alien.club/clusters/YOUR_CLUSTER_ID/proxy"
)
config.api_key["Authorization"] = "Bearer oat_YOUR_API_TOKEN"
config.api_key_prefix["Authorization"] = "Bearer"

with ApiClient(config) as client:
    search_api = SearchApi(client)

    results = search_api.keyword_search(
        query="machine learning protein folding",
        dataset_ids=[1, 2],
        limit=20
    )

    # Print each hit with its relevance score.
    for result in results.hits:
        print(f"{result.name} (score: {result.score})")

cURL

curl -X POST "https://api.alien.club/clusters/YOUR_CLUSTER_ID/proxy/api/v1/search" \
  -H "Authorization: Bearer oat_YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "machine learning protein folding",
    "dataset_ids": [1, 2],
    "limit": 20
  }'

Semantic Search

Semantic search finds documents whose meaning is similar to your query, even if they use different words. The query text is automatically converted to an embedding vector and compared against stored document vectors.
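To make the comparison step concrete, here is a minimal sketch of ranking stored vectors against a query vector. It assumes cosine similarity as the metric, which this page does not state, and uses toy two-dimensional vectors in place of real embeddings. The `score_threshold` and `limit` parameters mirror the request fields used in the chunk search example; how the platform applies them internally is not documented here.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(
    query_vec: List[float],
    chunk_vecs: Dict[str, List[float]],
    score_threshold: float = 0.0,
    limit: int = 10,
) -> List[Tuple[str, float]]:
    """Score every chunk, drop those below the threshold, return the top `limit`."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    kept = [(chunk_id, score) for chunk_id, score in scored if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit]


# Toy data: "a" points the same way as the query, "c" is orthogonal to it.
stored = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0]}
ranked = rank_chunks([1.0, 0.0], stored, score_threshold=0.7, limit=10)
print(ranked)
```

A production system would typically use an approximate nearest-neighbor index rather than scoring every stored vector, but the ranking idea is the same.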

Two modes are available:

Chunk Search — Fast, Targeted Results

Returns individual text chunks with similarity scores. Use this when you need specific passages rather than full documents.

Python

# Reuses the search_api client created in the keyword search example.
results = search_api.vector_search_chunks(
    query="novel approaches to protein structure prediction",
    dataset_ids=[1],
    score_threshold=0.7,
    limit=10
)

for chunk in results.results:
    print(f"Score: {chunk.score:.2f}")
    print(f"Text: {chunk.chunk_text[:200]}...")
    print(f"Entry: {chunk.entry_id}")
    print()

cURL

curl -X POST "https://api.alien.club/clusters/YOUR_CLUSTER_ID/proxy/api/v1/vector/chunks" \
  -H "Authorization: Bearer oat_YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "novel approaches to protein structure prediction",
    "dataset_ids": [1],
    "score_threshold": 0.7,
    "limit": 10
  }'

Entry Search — Full Documents

Performs a chunk search, groups results by entry, and returns full processed content for each matching document. Slower but gives you complete documents.

Python

# Reuses the search_api client created in the keyword search example.
results = search_api.vector_search_entries(
    query="protein folding mechanisms",
    dataset_ids=[1],
    max_chunks_per_entry=3,
    limit=5
)

for entry in results.results:
    print(f"Entry: {entry.name}")
    print(f"Best chunk score: {entry.chunks[0].score:.2f}")
    print(f"Content length: {len(entry.content)} characters")
    print()

cURL

curl -X POST "https://api.alien.club/clusters/YOUR_CLUSTER_ID/proxy/api/v1/vector/entries" \
  -H "Authorization: Bearer oat_YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "protein folding mechanisms",
    "dataset_ids": [1],
    "max_chunks_per_entry": 3,
    "limit": 5
  }'
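The grouping step that entry search performs can be pictured with the sketch below. It is not the platform's implementation: it assumes chunk hits arrive as dictionaries with `entry_id` and `score` keys, that each entry keeps its best `max_chunks_per_entry` chunks, and that entries are ordered by their best chunk score. The function name `group_chunks_by_entry` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def group_chunks_by_entry(
    chunks: List[dict], max_chunks_per_entry: int = 3, limit: int = 5
) -> List[dict]:
    """Group chunk hits by entry and keep the best chunks and best entries."""
    by_entry: Dict[int, List[dict]] = defaultdict(list)
    for chunk in chunks:
        by_entry[chunk["entry_id"]].append(chunk)

    grouped = []
    for entry_id, entry_chunks in by_entry.items():
        # Best-scoring chunks first, truncated to the per-entry cap.
        best = sorted(entry_chunks, key=lambda c: c["score"], reverse=True)
        grouped.append({"entry_id": entry_id, "chunks": best[:max_chunks_per_entry]})

    # Rank entries by their single best chunk.
    grouped.sort(key=lambda entry: entry["chunks"][0]["score"], reverse=True)
    return grouped[:limit]


hits = [
    {"entry_id": 1, "score": 0.72},
    {"entry_id": 2, "score": 0.91},
    {"entry_id": 1, "score": 0.88},
    {"entry_id": 1, "score": 0.60},
    {"entry_id": 1, "score": 0.75},
]
entries = group_chunks_by_entry(hits, max_chunks_per_entry=2, limit=5)
for entry in entries:
    print(entry["entry_id"], [c["score"] for c in entry["chunks"]])
```

Under these assumptions the first chunk of each entry is its best match, which is consistent with the Python example above reading `entry.chunks[0].score` as the best chunk score.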

Search Latency

| Search Type | Typical Latency |
| --- | --- |
| Keyword search | Under 50 ms |
| Chunk search (text query) | 500 ms to 2 seconds (includes embedding generation) |
| Chunk search (pre-computed vector) | Under 100 ms |
| Entry search | 1 to 3 seconds (includes full content retrieval) |
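To compare these figures with what you observe against your own cluster, you can time any search call from the client side. The helper below is a hypothetical sketch that takes the median of several runs. The `fake_search` function is a stand-in so the example runs on its own; in practice you would pass a function that performs a real search request. Client-side timings also include network round-trip time, so expect them to be somewhat higher than the values in the table.

```python
import statistics
import time
from typing import Callable, List


def measure_latency_ms(run_search: Callable[[], object], repeats: int = 5) -> float:
    """Return the median wall-clock latency of `run_search` in milliseconds."""
    samples: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_search()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in for a real search call, used only to make the example runnable.
calls: List[int] = []


def fake_search() -> None:
    calls.append(1)
    time.sleep(0.001)


median_ms = measure_latency_ms(fake_search, repeats=3)
print(f"median latency: {median_ms:.1f} ms")
```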

Next Steps