Configure and manage the Search AI index, including content extraction, transformation, workbench processing, and content browsing.
Overview
The Index Configuration in Search AI encompasses the complete data processing pipeline that transforms raw content into searchable, high-quality chunks optimized for answer generation. This pipeline consists of three main phases:
- Extraction - Breaking down source content into manageable chunks using various extraction strategies
- Transformation - Enriching and refining extracted content through configurable processing stages
- Indexing - Storing processed chunks with vector embeddings for efficient retrieval
Content Extraction
Content extraction segments ingested data into smaller chunks to organize and efficiently retrieve relevant information for answer generation. The application supports various extraction strategies that can be customized based on content format, structure, and answer requirements.
Text Extraction
Combines NLP and machine learning techniques based on tokenization, where text is segmented into smaller units.
Configuration Options:
- Chunk Size (Pages): Treats every page as a single chunk.
- Chunk Size (Tokens): Prepares chunks using:
  - Tokens: Maximum tokens per chunk (up to 5000). Smaller chunks suit granular tasks; larger chunks better preserve context.
  - Chunk Overlap: Number of tokens overlapping between consecutive chunks.
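The token-based settings above amount to a sliding window over the token stream. The sketch below is an illustrative model only: the platform's actual tokenizer is internal, so plain whitespace splitting stands in for it here.

```python
def chunk_by_tokens(text, max_tokens=300, overlap=50):
    """Split text into chunks of at most max_tokens tokens, repeating
    `overlap` tokens at the start of each following chunk.
    Whitespace splitting stands in for the real tokenizer."""
    if max_tokens <= overlap:
        raise ValueError("overlap must be smaller than max_tokens")
    tokens = text.split()
    chunks, start = [], 0
    while start < len(tokens):
        window = tokens[start:start + max_tokens]
        chunks.append(" ".join(window))
        if start + max_tokens >= len(tokens):
            break
        # Step forward by (max_tokens - overlap) so consecutive
        # chunks share `overlap` tokens of context.
        start += max_tokens - overlap
    return chunks
```

Smaller `max_tokens` values favor pinpoint retrieval; larger values keep more surrounding context inside each chunk, which helps multi-sentence reasoning.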
Layout Aware Extraction
Extracts data by considering content layout and structure, improving precision for documents with tables, graphs, and charts. It identifies objects in documents by combining OCR technology, layout detection models, and layout awareness rules.
Configuration:
- General Template: Extracts content from complex PDFs and DOCX files including tables and images
Advanced HTML Extraction
Designed specifically for extracting data from HTML files, including tables, images, and textual content. Videos are included in chunks, but their transcripts are not extracted.
Configuration Templates:
- General: Identifies different components and classes within HTML documents. Tables and images present in the content are also extracted and stored as chunks.
- Token-Based: Generates chunks based on token size. Images within the content are extracted and stored in the chunk.
| Field | Description | Range | Default |
|---|---|---|---|
| Tokens | Max tokens per chunk | 100 - 1000 | 300 |
| Chunk Overlap | Overlapping tokens between adjacent chunks | 10 - 100 | — |
Documents with content shorter than 60 characters are skipped when using the General template.
Markdown Extraction
Transforms each page of a source document into structured Markdown format before processing. Effective for preserving semantic structure.
Supported formats: PDF files (uploaded directly or via connectors).
Image-based Document Extraction
Handles complex PDFs with non-textual layouts such as forms, tables, or visually rich content. Each page is converted to an image and processed using VDR embedding models.
How it works:
- Each page in the PDF document is converted to an image, preserving visual structure and layout.
- Pages are processed using a VDR embedding model that captures both textual and visual semantics.
- Page contents are also extracted in standard chunk format alongside the visual embedding.
- Each extracted chunk includes a page_image_url field referencing its corresponding page image.
Supported formats: PDF files
Requires an image-based embedding model selection in Vector Configuration. Answer generation with this strategy is not supported by XO GPT, and the strategy is available only for a limited set of languages.
Custom Extraction
Enables integration with third-party services for custom content processing.
How it works:
- Ingested content is sent to an external service for processing.
- The service structures the data into chunks and returns them to Search AI via a callback API.
- Search AI indexes the returned chunks for search and retrieval.
Configuration:
- Endpoint: URL for sending content (POST endpoint).
- Concurrency: Maximum API calls per second.
- Headers: Additional request headers. Default headers will be added automatically and can’t be edited.
- Request Body: Content fields to send to the service.
Click Test to send a sample request using the configured headers and body. Once invoked successfully:
- Generated Response — displays the response returned via the callback URL.
- Response Path — provide the JSON path to locate extracted chunks within the response.
- Response Comparison — validates the actual response against the expected structure. An error is shown if they don’t match.
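The round trip above can be sketched from the external service's side: receive content, split it into chunks, and shape a payload for the callback API. Everything here is illustrative; the chunking rule and the documentId/chunkText/chunkIndex field names are assumptions, not the actual Search AI callback schema.

```python
import json

def extract_chunks(document_text, max_chars=1000):
    """Hypothetical external extraction logic: pack paragraphs into
    chunks of at most max_chars characters."""
    paragraphs = [p.strip() for p in document_text.split("\n\n") if p.strip()]
    chunks, buffer = [], ""
    for p in paragraphs:
        if buffer and len(buffer) + len(p) > max_chars:
            chunks.append(buffer)
            buffer = p
        else:
            buffer = f"{buffer}\n\n{p}" if buffer else p
    if buffer:
        chunks.append(buffer)
    return chunks

def build_callback_payload(doc_id, chunks):
    """Shape the JSON body the service would POST back to the callback
    API. Field names are illustrative placeholders."""
    return json.dumps({
        "documentId": doc_id,
        "chunks": [{"chunkText": c, "chunkIndex": i} for i, c in enumerate(chunks)],
    })
```

The Response Path setting would then point Search AI at the location of the chunk array inside whatever structure the real service returns.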
Default Strategy
By default, every application has a Default Strategy that uses different extraction methods for different types of content.
| Content Source | Content Type | Extraction Strategy |
|---|---|---|
| WebPages | html | Advanced HTML Extraction |
| Documents | pdf, doc, docx | Markdown Extraction |
| Documents | pptx, txt | Text Extraction |
| Connectors | pdf, doc, docx, html, aspx | Markdown Extraction |
| Connectors | pptx, txt | Text Extraction |
| Connectors | json | JSON Extraction |
The default extraction strategy applies only to applications created after November 19, 2025.
Adding a Strategy
- Navigate to Index > Extract.
- Click +Add Strategy.
- Configure:
  - Strategy Name: Unique identifier
  - Define Source: Select data sources and content types using filters
  - Extraction Model: Choose the extraction method
  - Model-specific settings
A new strategy is enabled automatically upon creation, but extraction does not begin until you trigger it manually using the Train option.
Multiple Strategies:
When multiple strategies exist, they apply in their listed order, with top-down priority.
Example: With a web pages strategy and a default (all content) strategy:
- Default on top → applies to all content; the web pages strategy is never triggered.
- Web pages strategy on top → processes HTML first; the default handles the rest.
To reorder, drag strategies above or below as needed.
Managing Strategies:
- Enable/Disable: Toggle from the strategy page. Use this to evaluate alternative extraction strategies during testing, or to turn off a strategy that is no longer needed.
- Delete: Use the Delete button (doesn’t affect existing chunks).
- Reorder: Drag to change sequence.
Content Transformation
Content transformation refines extracted text into high-quality data after the extraction phase. This enrichment process addresses incomplete metadata, formatting inconsistencies, and missing context to improve search and retrieval effectiveness. The transformed output feeds directly into vectorization for AI processing.
Example use case: A crawled blog page may contain ads, recent post references, and embedded author info alongside the core content. Transformation stages can strip the irrelevant sections and remap author info to a dedicated metadata field — ensuring only clean, structured content is indexed.
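That blog-page cleanup could look like the following sketch, where the ads, recent_posts, author, and meta field names are hypothetical stand-ins for whatever the crawler actually captures.

```python
def transform_blog_page(doc):
    """Strip crawl noise and remap embedded author info to a dedicated
    metadata field. All field names here are illustrative."""
    doc = dict(doc)  # work on a copy
    # Drop irrelevant sections captured during crawling.
    for noise in ("ads", "recent_posts"):
        doc.pop(noise, None)
    # Move author info into a dedicated metadata field.
    if "author" in doc:
        doc.setdefault("meta", {})["author"] = doc.pop("author")
    return doc
```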
Benefits
- Improved Data Quality: Fix errors, standardize fields, add contextual information.
- Enhanced Control: Adapt enrichment to specific business needs.
- Bulk Transformation: Apply rules to all applicable content simultaneously.
Field Mapping Stage
Adds, updates, or deletes specific fields from input content. Useful for ensuring uniformity across pages, for example, adding a title to pages where it’s missing.
Configuration:
- Name: Unique stage identifier
- Condition: Rules for selecting content (Field Name, Operator, Value)
- Outcome: Actions to perform:
  - Set: Sets a value for the target field
  - Delete: Removes the target field
  - Copy: Copies a value between fields
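A minimal model of the stage's condition-and-outcome logic, with assumed operator names (equals, contains) and outcome actions; the platform's real evaluation rules may differ.

```python
import operator

# Illustrative operator table; the actual operator set is platform-defined.
OPERATORS = {"equals": operator.eq, "contains": lambda a, b: b in (a or "")}

def apply_field_mapping(doc, condition, outcome):
    """If the (field, operator, value) condition holds, apply a
    set / delete / copy outcome to a copy of the document."""
    field, op, value = condition
    if not OPERATORS[op](doc.get(field), value):
        return doc
    doc = dict(doc)
    action, target, source_or_value = outcome
    if action == "set":
        doc[target] = source_or_value
    elif action == "delete":
        doc.pop(target, None)
    elif action == "copy":
        doc[target] = doc.get(source_or_value)
    return doc
```

For the uniformity example above, a copy outcome could backfill a missing title from another field on pages where the condition detects it is absent.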
Custom Script Stage
Implements custom JavaScript transformations for specific business needs. For example, prepending a source title to all extracted content.
For this stage, sub-conditions and outcomes are defined together within the same script.
Configuration:
- Outcome: JavaScript defining transformations.
- View Fields: Lists all the fields of the schema.
Example - Count pages:
int temp_total_pages = 0;
if (ctx.file_content_obj != null) {
    for (def item : ctx.file_content_obj) {
        if (item != "") {
            temp_total_pages = temp_total_pages + 1;
        }
    }
}
ctx.total_pages = temp_total_pages;
Exclude Documents Stage
Filters out unnecessary or irrelevant content before ingestion.
Configuration:
- Field: Document field for condition (for example, creation date and file type).
- Operator: Comparison operator (greater than, less than, equals).
- Value: Comparison value.
Example: To exclude documents older than a specific date, set Field = Created On, Operator = less than, Value = [cutoff date].
API Stage
Invokes external APIs to modify, enrich, or analyze content during transformation. Useful for metadata extraction, summarization, or translation via custom models.
Configuration:
- Endpoint: POST endpoint URL (supports dynamic fields with {{field_name}}). Example: https://api.example.com/meta/{{chunkTitle}}?type={{cfs1}}
- Headers: Key-value pairs for authentication
- Request Body: Content to send (use {{field_name}} for dynamic values). Example: "content": {{content}}
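The {{field_name}} substitution can be modeled with a simple regex pass over the endpoint or body template. A sketch only; the platform's own substitution (escaping, missing-field handling) may behave differently.

```python
import re

def render_template(template, doc):
    """Resolve {{field_name}} placeholders against a document's fields.
    Unknown fields render as empty strings here; that fallback is an
    assumption, not documented platform behavior."""
    def repl(match):
        return str(doc.get(match.group(1), ""))
    return re.sub(r"\{\{(\w+)\}\}", repl, template)
```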
Testing: Click Test to send a sample request and validate configuration. Use the Response tab to view returned data and map API response fields to Search AI schema fields.
Note: Only POST APIs and Sync APIs are supported. You can map one or more API response fields to the Search AI schema fields corresponding to the content.
LLM Stage
Leverages external LLMs to refine, update, or enrich content (summarization, readability improvements, context addition).
Prerequisites:
- Set up required LLM in Models Library.
- Create custom prompt for ‘Transform Documents with LLM’.
- Enable the feature in Gen AI features page.
Configuration:
- LLM: Select the language model.
- Prompt: Select the custom prompt.
- Target Field: Field for storing enriched content.
The following table shows which transformation stages each extraction strategy supports:

| Strategy | Field Mapping | Custom Script | Exclude Documents | API Stage | LLM Stage |
|---|---|---|---|---|---|
| Text Extraction | ✓ | ✓ | ✓ | ✓ | ✓ |
| Advanced HTML Extraction | ✓ | ✓ | ✓ | ✓ | ✓ |
| Layout Aware Extraction | NA | NA | NA | NA | NA |
| Markdown Extraction | NA | NA | NA | NA | NA |
| Image-based Document Extraction | NA | NA | NA | NA | NA |
Adding a Stage
- Navigate to Index > Enrich (or Transform page)
- Click +New Stage
- Select stage type and configure
- Click Save
Stage Operations
- Testing a Stage: Click Simulate on the Transform page, then select the stage and the number of documents to test with. Results appear in the Viewer as a JSON object. Any transformation errors are listed under simulate_errors.
- Enable/Disable: Use ellipsis menu or status toggle. Disabling a stage doesn’t delete or remove it permanently. When a stage is disabled, the output of the previous stage in the sequence is directly sent as input to the next stage.
- Delete: Permanently removes the stage.
- Sequencing: Stages execute top-down. The output of each stage becomes the input of the next — so order directly affects outcomes. Use the drag handle to reorder stages.
Content Enrichment Using Workbench
The Workbench is a tool for processing and enhancing ingested content through a series of configurable stages. Each stage performs specific data transformations before passing content to the next stage.
Example: To prevent confidential information from appearing in answers, set up an Exclude Document stage that filters out any chunk containing the word “Confidential” in its title or body.
Key Features
- Pipeline Processing: Sequential stage execution; the output of each stage feeds into the next.
- Custom Transformation: Customizable stages per business needs.
- Simulation Capabilities: Built-in simulator to test individual stages or the cumulative effect of multiple stages before deployment.
- Flexible Sequencing: Design efficient processing workflows.
Supported Stages
| Stage Type | Purpose |
|---|---|
| Field Mapping | Map document fields to target fields based on conditions |
| Custom Script | Run custom scripts on input data |
| Exclude Document | Exclude documents from indexing based on conditions |
| API Stage | Connect with third-party services for dynamic updates |
| LLM Stage | Leverage external LLMs for content enrichment |
Content Transformation vs Enrichment
| Aspect | Transform | Enrich |
|---|---|---|
| Where in pipeline | Operates on raw documents, immediately after extraction and before chunking | Operates on extracted chunks, after chunking is complete |
| Input | Full source documents or pages | Individual chunks |
| Purpose | Clean and enrich document-level content before it becomes chunks | Refine and enhance chunk-level content before indexing |
| Accessibility | Index > Enrich (Transform page) | Index > Enrich (Workbench page) |
Adding Workbench Stages
- Navigate to Index > Enrich.
- Click +New Stage.
- Select stage type.
- Configure:
  - Unique stage name
  - Conditions for content selection
  - Outcomes defining transformations
- Click Save.
Stage Management
Ordering: Data processes through stages sequentially. The sequence directly affects the cumulative output. Drag to reorder.
Enable/Disable: Toggle stages on/off without deleting. When disabled, the stage is skipped and the previous stage’s output passes directly to the next active stage. Example: With three stages where stage 2 is disabled — data flows through stage 1, then directly to stage 3, skipping stage 2 entirely.
Deleting: Permanently removes the stage from the Workbench.
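The skip behavior above can be modeled as a simple loop over stage records (the enabled and fn keys are illustrative):

```python
def run_pipeline(doc, stages):
    """Stages run top-down; a disabled stage is skipped, so the previous
    stage's output flows directly into the next active stage."""
    for stage in stages:
        if stage.get("enabled", True):
            doc = stage["fn"](doc)
    return doc
```

With three stages where stage 2 is disabled, stage 1's output passes straight to stage 3, exactly as described above.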
Workbench Simulator
The Workbench includes a built-in simulator for testing stage behavior before deployment.
Features
- Test and verify individual stage outputs
- Test cumulative effects of multiple stages
- Works with any data source type
Running a Simulation
- Click the Simulate option on the Workbench page.
- Select the stages to test.
- Choose number of documents for testing.
- View results in JSON format in the Chunk Viewer.
Key Points:
- Simulator shows changes from all stages in sequence order
- Testing from a specific stage shows cumulative transformations up to that point
- Temporarily disable other stages to test individual stage behavior
- Errors during transformation appear in the simulate_errors object
Use the Workbench Simulator frequently during configuration to catch errors early.
Content Browser
The Content Browser provides tools to observe, verify, and edit extracted chunks from source data.
- When new content is added, the application automatically trains on it and generates chunks.
- When existing content is updated, all related chunks are deleted and recreated — meaning any manual edits to those chunks will be lost on the next sync.
Key Capabilities
- Observation and Verification: Inspect and verify extracted chunks for accuracy
- Editing of Chunks: Modify chunk information directly within the browser
Viewing Chunks
Navigate to Index > Browse to view all chunks. Each chunk displays:
- Preview of any images or tables present in the chunk. Click a preview to enlarge it to full size. If a chunk contains multiple images, a carousel allows navigation through them.
- Summary information
Click Details to view:
- Document Title: Source document/page title
- Chunk Title: Title assigned to the chunk
- Chunk Text: Content of the chunk
- Chunk ID: Unique identifier
- Page Number: Source page (if applicable)
- Source Title: Name of the source in the application
- Source: Content type (web pages, files, connectors)
- Extraction Strategy: Strategy used for extraction
- Edited on: Last update date (if applicable)
- Source URL: Source URL (if applicable)
View JSON: Displays chunk contents in JSON format with all properties.
Editing Chunks
- Click the Details icon on a chunk
- Modify title and/or text
- Click Save
Alternatively, use the Edit option directly from the browser home page.
When documents are updated, all related chunks are deleted and recreated, losing manual edits.
Search and Filter
Search: Use the search bar to find chunks by properties (chunkTitle, chunkText, source, etc.)
Filter: Advanced search using multiple chunk properties:
- Source types
- Extraction strategy
- Content keywords
- And more
Vector Configuration
Indexing converts extracted chunks into vector embeddings and stores them in a knowledge index used for answer generation. Vectors are multidimensional numerical representations of chunks that carry their semantic meaning.
Navigate to Index > Index Configuration to configure this.
The platform supports embedding models including BGE-M3, VDR, and custom models.
By default, new applications use the BGE-M3 model for vector generation.
Key Features
- Select from out-of-the-box XO GPT embedding models or bring your own custom model.
- Choose which chunk fields are used to generate embeddings.
- Generate up to three vectors per chunk, each capturing different semantic aspects via distinct field combinations (multi-vector search).
Glossary
| Term | Description |
|---|---|
| Vector / Embedding | A numerical representation of data that captures its semantic meaning. |
| Embedding Model | A machine learning model that converts text or images into vectors. |
| Multi-Vector | Generating multiple vectors per chunk, each using different fields to capture distinct semantic aspects. |
| Field Combination | The specific set of fields (for example, Chunk Title, Chunk Text, Record Title) selected to generate a vector. |
| Vector Column | The data column that stores embeddings for a specific vector. In a 3-vector setup, there are three vector columns. |
| Vector Weight | A percentage value assigned to each vector to control its influence on retrieval scoring. |
| Vision Models | AI models designed to interpret and analyze visual data such as images. |
| Rebalancing | When a vector is unavailable for a chunk, its weight is proportionally redistributed among the remaining available vectors. |
Prerequisites
The XO GPT model is configured for vector generation by default. To verify or change the embedding model:
- Navigate to Generative AI Tools > Model Library.
- Confirm that vector generation is enabled for the desired model.
All configured embedding models appear in the Vector Configuration page drop-down.
Vector Configuration for Textual Data
Configure the following fields:
| Field | Description |
|---|---|
| Vector Model | Embedding model for vector generation. Options: BGE-M3 or a custom model |
| Prompt | Prompt used by the model. For custom models, create a new prompt using +New Prompt |
The model and prompt selected on the Gen AI page are applied here by default. Changes made here are also reflected in the Vector Generation - Text feature on the Gen AI page.
Multi-Vector Search
Search AI supports generating up to three vectors per chunk, each configured to capture different semantic aspects using distinct field combinations. This improves retrieval accuracy by evaluating each chunk from multiple perspectives.
How it works:
- Configure vectors — Define up to three vectors, each targeting specific sources, file types, and fields.
- Assign weights — Set a percentage weight per vector to control its influence on the final relevance score.
- Retrieval — At query time, the user query is converted to a vector and matched against all configured vectors, calculating a similarity score corresponding to each vector. Each similarity score is multiplied by its weight, and scores are summed to produce a final relevance ranking. Chunks are ranked based on these final scores, and the most relevant results are returned.
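The retrieval math above reduces to a weighted sum of per-vector similarities. A sketch assuming cosine similarity and weights expressed as fractions; the platform's actual similarity function is internal.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def relevance_score(query_vec, chunk_vectors, weights):
    """Multiply each available vector's similarity by its weight and
    sum the products. Chunks missing a vector would first go through
    weight rebalancing (described later in this document)."""
    return sum(
        weights[name] * cosine(query_vec, vec)
        for name, vec in chunk_vectors.items()
    )
```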
Configuring Vectors
Generate and assign up to three different vectors per chunk.
Vector 1 is configured by default, applies to all content from all sources, and cannot be deleted. Its default fields are: Chunk Title, Chunk Text, Source Name, Record Title. You can change the fields but not the content source and file types.
Vectors 2 and 3 must be manually configured. For each additional vector, provide:
| Field | Description |
|---|---|
| Name | A meaningful identifier for the vector configuration |
| Field Combination | Defines which content the vector applies to and which fields are used |
Each Field Combination has three components:
- Source — Select the content source (e.g., uploaded files, connectors). Use the AND operator to narrow the scope further.
- File Types — Select supported file types (e.g., PDF, HTML).
- Fields — Select one or more chunk fields (e.g., Chunk Text, Chunk Title, Record Title) for embedding generation.
The order of selected fields influences the generated embedding. Semantically rich fields improve embedding quality and search accuracy.
Example: To use a custom summary field (cfs1) for uploaded PDFs:
- Source: Files
- File Type: PDF
- Fields: cfs1
Use the Add button to define multiple field combinations within a single vector — useful when different content types need different field selections.
Field Combinations Precedence
When multiple field combinations are defined within a vector, they are evaluated top to bottom. The first matching combination is applied. Place more specific combinations above generic ones.
Example: If the first combination targets the Default Directory and the second targets all sources — the Default Directory uses the first combination; everything else uses the second.
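First-match evaluation can be sketched as follows; the source, file_type, and fields keys and the "all" wildcard are assumptions for illustration.

```python
def matching_combination(doc, combinations):
    """Combinations are evaluated top to bottom; the first one whose
    source and file-type filters match the document wins."""
    for combo in combinations:
        if (combo["source"] in ("all", doc["source"])
                and combo["file_type"] in ("all", doc["file_type"])):
            return combo["fields"]
    return None
```

Because evaluation stops at the first match, a catch-all combination placed on top would shadow every specific one below it, which is why specific combinations belong first.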
Vector Coverage
If a vector’s field combinations don’t cover certain content types, no embeddings are generated for those types in that vector column — meaning those chunks won’t contribute to semantic matching for that vector. The system automatically rebalances weights to compensate.
For instance, if Vector 2 is configured with fields that apply only to web files, then embeddings are generated only for web content in Vector 2. Other content types, such as PDFs or connector-based files, won’t have embeddings in Vector 2 and therefore won’t contribute to semantic matching for that vector column.
Learn about Weight Rebalancing.
Assigning Weights to Vectors
Weights control how much each vector contributes to the final relevance score. All weights must total 100%.
To assign weights, click Manage Weights in the Vector Configuration section and set a percentage for each vector.
Example configuration:
Use case: A company wants to improve semantic search by capturing multiple aspects of the same document — prioritizing technical summaries while supporting broad and product-specific queries.
| Vector | Fields Used | Weight | Rationale |
|---|---|---|---|
| Vector 1 (Title-Focused Vector) | Chunk Text, Chunk Title | 30% | Broad general coverage |
| Vector 2 (Summary-Focused Vector) | Summary (cfs1) | 50% | High semantic significance |
| Vector 3 (Metadata-Driven Vector) | Product Name (cfs2), Version (cfs3) | 20% | Specific but limited scope |
Automatic Weight Rebalancing
When a vector has no embedding for a particular chunk (e.g., Vector 2 is only configured for web pages, so PDF chunks have no Vector 2 embedding), the system automatically redistributes that vector’s weight proportionally among the remaining available vectors, referred to as Automatic Weight Rebalancing.
Example: Three vectors configured as:
- Vector 1: all sources, fields a, b, weight 50%
- Vector 2: web pages only, fields c, d, weight 20%
- Vector 3: all sources, fields e, f, weight 30%
Vector 2 embeddings are only generated for web page chunks. For all other content types (uploaded files, connectors), Vector 2 is unavailable:
| Chunk | Vector 1 | Vector 2 | Vector 3 |
|---|---|---|---|
| Chunk 1 (web page) | ✓ | ✓ | ✓ |
| Chunk 2 (uploaded file) | ✓ | Not available | ✓ |
| Chunk 3 (connector) | ✓ | Not available | ✓ |
For Chunks 2 and 3, Vector 2’s 20% weight is redistributed using:
Adjusted Weight = Original Weight + ((Unavailable Weight × Original Weight) / Sum of Available Weights)
This gives the following adjusted weights for Chunks 2 and 3:
- Vector 1 → 62.5%
- Vector 3 → 37.5%
This ensures fair and accurate scoring without manual adjustment.
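The rebalancing formula translates directly to code; running it on the example above reproduces the 62.5% / 37.5% split.

```python
def rebalance(weights, unavailable):
    """Redistribute the weight of unavailable vectors proportionally
    among the available ones, per the formula:
    adjusted = original + (unavailable_weight * original) / sum_available
    """
    missing = sum(weights[v] for v in unavailable)
    available = {v: w for v, w in weights.items() if v not in unavailable}
    total = sum(available.values())
    return {v: w + (missing * w) / total for v, w in available.items()}
```

For a chunk with Vector 2 unavailable: `rebalance({"v1": 50, "v2": 20, "v3": 30}, {"v2"})` yields 62.5 for Vector 1 and 37.5 for Vector 3.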
Points to Note
- The default Vector 1 cannot be disabled or deleted.
- Avoid overlapping conditions across vector definitions.
- After changing the embedding model, updating vector configurations, or reassigning weights — always Train the application to regenerate embeddings with the new settings.
Vector Configuration for Image-Based Data
Required only when using the Image-Based Document Extraction strategy.
| Field | Description |
|---|---|
| Vector Model | VDR embedding model (selected by default) |
| Prompt | Select or create a prompt for image-based vector generation |
Changes to the prompt here are automatically reflected in the Vector Generation - Image feature on the Gen AI page.
Batch Processing of Vector Generation
Search AI supports batch processing, which groups multiple chunks into a single API request rather than sending them individually for vector generation. This reduces API overhead and improves throughput. The system dynamically packs chunks up to a configured token limit per request and throttles batches according to token-per-minute and request-per-minute rate limits.
For configuration details, see Configure Batch Processing.
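The packing step described above can be sketched as a greedy loop under a per-request token limit. The limit value and batching policy here are illustrative; actual packing and rate-limit throttling are platform-internal.

```python
def pack_batches(chunk_token_counts, max_tokens_per_request):
    """Greedy packing: fill each request with chunk indices until adding
    the next chunk would exceed the per-request token limit. Downstream,
    tokens/min and requests/min limits would throttle these batches."""
    batches, current, used = [], [], 0
    for i, tokens in enumerate(chunk_token_counts):
        if current and used + tokens > max_tokens_per_request:
            batches.append(current)
            current, used = [], 0
        current.append(i)
        used += tokens
    if current:
        batches.append(current)
    return batches
```

Batching amortizes per-request overhead: five chunks that would otherwise need five API calls can often ship in two or three requests.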
Best Practices
- Match strategy to content type and structure
- Consider expected query complexity when setting chunk sizes
- Use Layout Aware for documents with tables and charts
- Order stages logically (e.g., exclude before enrichment)
- Use simulation frequently during development
- Keep transformations focused and modular
Content Quality
- Review chunks in Content Browser after processing
- Edit chunks only when necessary (edits lost on content updates)
- Use filters to identify problematic chunks
Testing and Validation
- Simulate all stages before training
- Test with representative sample documents
- Verify cumulative effects of multiple stages
Processing Workflow Summary
1. Content Ingestion
└── Sources: Websites, Documents, Connectors
2. Extraction Phase
└── Apply extraction strategy based on content type
└── Generate initial chunks
3. Transformation Phase (Document Workbench)
└── Stage 1 → Stage 2 → Stage N
└── Each stage processes and passes to next
4. Enrichment Phase (Workbench)
└── Chunks processing stages
└── Field mapping, scripts, API calls, LLM enrichment
5. Indexing Phase
└── Vector embedding generation
└── Storage in vector database
6. Verification
└── Content Browser review
└── Manual edits if needed