Configure and manage the Search AI index, including content extraction, transformation, workbench processing, and content browsing.

Overview

The Index Configuration in Search AI encompasses the complete data processing pipeline that transforms raw content into searchable, high-quality chunks optimized for answer generation. This pipeline consists of three main phases:
  1. Extraction - Breaking down source content into manageable chunks using various extraction strategies
  2. Transformation - Enriching and refining extracted content through configurable processing stages
  3. Indexing - Storing processed chunks with vector embeddings for efficient retrieval

Content Extraction

Content extraction segments ingested data into smaller chunks to organize and efficiently retrieve relevant information for answer generation. The application supports various extraction strategies that can be customized based on content format, structure, and answer requirements.

Extraction Strategies

Text Extraction Model

Combines NLP and machine-learning techniques based on tokenization, segmenting text into smaller units. Configuration Options:
  • Chunk Size (Pages): Treats every page as a single chunk
  • Chunk Size (Tokens): Prepares chunks using:
    • Tokens: Maximum tokens per chunk (up to 5000). Smaller chunks suit granular tasks; larger chunks better preserve context.
    • Chunk Overlap: Number of tokens overlapping between consecutive chunks
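To illustrate how chunk size and overlap interact, here is a minimal Python sketch. It is an assumption about the general technique, not Search AI's internal tokenizer; integers stand in for real model tokens.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split a token list into chunks of at most `chunk_size` tokens,
    with `overlap` tokens shared between consecutive chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

tokens = list(range(10))  # stand-in for tokenized text
chunks = chunk_tokens(tokens, chunk_size=4, overlap=1)
# Each chunk starts with the last token of the previous chunk.
```

Smaller chunk sizes yield more, finer-grained chunks; a larger overlap repeats more context across chunk boundaries.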

Layout Aware Extraction

Extracts data by considering content layout and structure, improving precision for documents with tables, graphs, and charts. It identifies objects in documents by combining OCR technology, layout detection models, and layout awareness rules. Configuration:
  • General Template: Extracts content from complex PDFs and DOCX files including tables and images

Advanced HTML Extraction

Designed specifically for extracting data from HTML files, including tables, images, and textual content. Videos are included in chunks but transcripts aren’t extracted. Configuration Templates:
  • General: Identifies different components and classes within HTML documents. Tables and images present between the content are also extracted and stored as chunks.
  • Token-Based: Generates chunks based on token size. Images within the content are extracted and stored in the chunk.
    Field | Description | Range | Default
    Tokens | Max tokens per chunk | 100 - 1000 | 300
    Chunk Overlap | Overlapping tokens between adjacent chunks | 10 - 100 |
Documents with content shorter than 60 characters are skipped when using the General template.

Markdown Extraction

Transforms each page of a source document into structured Markdown format before processing. Effective for preserving semantic structure. Supported formats: PDF files (uploaded directly or via connectors).

Image-Based Document Extraction

Handles complex PDFs with non-textual layouts such as forms, tables, or visually rich content. Each page is converted to an image and processed using VDR embedding models. How it works:
  • Each page in the PDF document is converted to an image, preserving visual structure and layout.
  • Pages are processed using a VDR embedding model that captures both textual and visual semantics.
  • Page contents are also extracted in standard chunk format alongside the visual embedding.
  • Each extracted chunk includes a page_image_url field referencing its corresponding page image.
Supported formats: PDF files
Requires image-based embedding model selection in Vector Configuration. Answer generation not supported by XO GPT with this strategy. This strategy is supported for a limited set of languages. Learn More.

Custom Extraction

Enables integration with third-party services for custom content processing. How it works:
  • Ingested content is sent to an external service for processing.
  • The service structures the data into chunks and returns them to Search AI via a callback API.
  • Search AI indexes the returned chunks for search and retrieval.
Configuration:
  • Endpoint: URL for sending content (POST endpoint).
  • Concurrency: Maximum API calls per second.
  • Headers: Additional request headers. Default headers will be added automatically and can’t be edited.
  • Request Body: Content fields to send to the service.
Click Test to send a sample request using the configured headers and body. Once invoked successfully:
  • Generated Response — displays the response returned via the callback URL.
  • Response Path — provide the JSON path to locate extracted chunks within the response.
  • Response Comparison — validates the actual response against the expected structure. An error is shown if they don’t match.
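To make the callback contract concrete, here is a minimal Python sketch of the round trip. The payload shapes and field names (documentId, result.chunks, chunkTitle) are illustrative assumptions, not the documented API; only the dotted Response Path lookup mirrors the configuration described above.

```python
# Hypothetical request Search AI might POST to the custom endpoint
# (field names here are assumptions, not the actual contract).
request_body = {"documentId": "doc-123", "content": "raw document text ..."}

# Hypothetical response returned by the external service via the callback API.
callback_response = {"result": {"chunks": [
    {"chunkTitle": "Intro", "chunkText": "First section ..."},
    {"chunkTitle": "Details", "chunkText": "Second section ..."},
]}}

def resolve_path(obj, path):
    """Walk a dotted Response Path (e.g. 'result.chunks') to locate chunks."""
    for key in path.split("."):
        obj = obj[key]
    return obj

chunks = resolve_path(callback_response, "result.chunks")
```

The Response Path you configure plays the role of the dotted path here: it tells Search AI where in the service's response the extracted chunks live.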

Default Extraction Strategies

By default, every application has a Default Strategy that uses different extraction methods for different types of content.
Content Source | Content Type | Extraction Strategy
Web Pages | html | Advanced HTML Extraction
Documents | pdf, doc, docx | Markdown Extraction
Documents | pptx, txt | Text Extraction
Connectors | pdf, doc, docx, html, aspx | Markdown Extraction
Connectors | pptx, txt | Text Extraction
Connectors | json | JSON Extraction
The default extraction strategy applies only to applications created after November 19, 2025.

Managing Extraction Strategies

Adding a Strategy

  1. Navigate to Index > Extract.
  2. Click +Add Strategy.
  3. Configure:
    • Strategy Name: Unique identifier
    • Define Source: Select data sources and content types using filters
    • Extraction Model: Choose the extraction method
    • Configure model-specific settings
A new strategy is enabled automatically upon creation, but extraction does not begin until you trigger it manually using the Train option.

Multiple Strategies:

When multiple strategies exist, they apply in top-down order of their position in the list. Example: with a web pages strategy and a default (all content) strategy:
  • Default on top → the default applies to all content, and the web pages strategy is never triggered.
  • Web pages strategy on top → HTML content is processed first, and the default handles the rest.
To reorder, drag strategies above or below as needed.

Managing Strategies:

  • Enable/Disable: Toggle from the strategy page. Use this to evaluate alternative extraction strategies during testing, or to switch off a strategy that is no longer needed.
  • Delete: Use the Delete button (doesn’t affect existing chunks).
  • Reorder: Drag to change sequence.

Content Transformation

Content transformation refines extracted text into high-quality data after the extraction phase. This enrichment process addresses incomplete metadata, formatting inconsistencies, and missing context to improve search and retrieval effectiveness. The transformed output feeds directly into vectorization for AI processing. Example use case: A crawled blog page may contain ads, recent post references, and embedded author info alongside the core content. Transformation stages can strip the irrelevant sections and remap author info to a dedicated metadata field — ensuring only clean, structured content is indexed.

Benefits

  • Improved Data Quality: Fix errors, standardize fields, add contextual information.
  • Enhanced Control: Adapt enrichment to specific business needs.
  • Bulk Transformation: Apply rules to all applicable content simultaneously.

Transformation Stages

Field Mapping Stage

Adds, updates, or deletes specific fields from input content. Useful for ensuring uniformity across pages, for example, adding a title to pages where it’s missing. Configuration:
  • Name: Unique stage identifier
  • Condition: Rules for selecting content
    • Field Name, Operator, Value
  • Outcome: Actions to perform:
    • Set: Sets value for target field
    • Delete: Removes target field
    • Copy: Copies value between fields
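A Field Mapping stage can be pictured as a condition plus an outcome applied to each document. The following Python sketch is illustrative only; the operator and action names are simplified assumptions, not the product's exact option set.

```python
def apply_field_mapping(doc, condition, outcome):
    """Apply one Field Mapping stage: if the condition matches,
    perform the outcome action on the document."""
    field, op, value = condition
    matched = {
        "equals": doc.get(field) == value,
        "missing": field not in doc,
    }[op]
    if not matched:
        return doc
    action, target, payload = outcome
    if action == "set":
        doc[target] = payload       # Set: assign a value to the target field
    elif action == "delete":
        doc.pop(target, None)       # Delete: remove the target field
    elif action == "copy":
        doc[target] = doc.get(payload)  # Copy: duplicate another field's value
    return doc

# Add a title to a page where it's missing (the use case described above).
page = {"url": "https://example.com/post"}
apply_field_mapping(page, ("title", "missing", None), ("set", "title", "Untitled"))
```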

Custom Script Stage

Implements custom JavaScript transformations for specific business needs. For example, prepending a source title to all extracted content.
For this stage, sub-conditions and outcomes are defined together within the same script.
Configuration:
  • Outcome: JavaScript defining transformations.
  • View Fields: Lists all the fields of the schema.
Example - Count pages:
int temp_total_pages = 0;
if (ctx.file_content_obj != null) {
    // Count the non-empty page entries extracted from the source file.
    for (def item : ctx.file_content_obj) {
        if (item != "") {
            temp_total_pages = temp_total_pages + 1;
        }
    }
}
ctx.total_pages = temp_total_pages;

Exclude Documents Stage

Filters out unnecessary or irrelevant content before ingestion. Configuration:
  • Field: Document field for condition (for example, creation date and file type).
  • Operator: Comparison operator (greater than, less than, equals).
  • Value: Comparison value.
Example: To exclude documents older than a specific date, set Field = Created On, Operator = less than, Value = [cutoff date].
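The example above can be sketched as a simple predicate. This is illustrative Python, with assumed field names, not product code.

```python
from datetime import date

# Comparison operators named as in the stage configuration.
OPERATORS = {
    "less than": lambda a, b: a < b,
    "greater than": lambda a, b: a > b,
    "equals": lambda a, b: a == b,
}

def is_excluded(doc, field, operator, value):
    """Return True if the document matches the exclusion condition."""
    return OPERATORS[operator](doc[field], value)

# Exclude documents created before the cutoff date.
doc = {"createdOn": date(2020, 1, 15), "fileType": "pdf"}
excluded = is_excluded(doc, "createdOn", "less than", date(2023, 1, 1))
```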

API Stage

Invokes external APIs to modify, enrich, or analyze content during transformation. Useful for metadata extraction, summarization, or translation via custom models. Configuration:
  • Endpoint: POST endpoint URL (supports dynamic fields with {{field_name}}). Example: https://api.example.com/meta/{{chunkTitle}}?type={{cfs1}}
  • Headers: Key-value pairs for authentication
  • Request Body: Content to send (use {{field_name}} for dynamic values). Example: "content": {{content}}.
Testing: Click Test to send a sample request and validate configuration. Use the Response tab to view returned data and map API response fields to Search AI schema fields. Note: Only POST APIs and Sync APIs are supported. You can map one or more API response fields to the Search AI schema fields corresponding to the content.
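The {{field_name}} substitution can be illustrated with a short Python sketch. The placeholder syntax matches the examples above; the rendering logic itself is an assumption about the behavior, not the product's implementation.

```python
import re

def render_template(template, fields):
    """Replace {{field_name}} placeholders with document field values."""
    return re.sub(r"\{\{(\w+)\}\}",
                  lambda m: str(fields.get(m.group(1), "")), template)

fields = {"chunkTitle": "setup-guide", "cfs1": "manual", "content": "chunk text"}
url = render_template("https://api.example.com/meta/{{chunkTitle}}?type={{cfs1}}", fields)
body = render_template('{"content": "{{content}}"}', fields)
```

Every placeholder is resolved from the chunk's fields before the POST request is sent, for both the endpoint URL and the request body.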

LLM Stage

Leverages external LLMs to refine, update, or enrich content (summarization, readability improvements, context addition). Prerequisites:
  1. Set up required LLM in Models Library.
  2. Create custom prompt for ‘Transform Documents with LLM’.
  3. Enable the feature in Gen AI features page.
Configuration:
  • LLM: Select the language model.
  • Prompt: Select the custom prompt.
  • Target Field: Field for storing enriched content.

Stage Availability by Extraction Strategy

Strategy | Field Mapping | Custom Script | Exclude Documents | API Stage | LLM Stage
Text Extraction | | | | |
Advanced HTML Extraction | Yes | Yes | Yes | Yes |
Layout Aware Extraction | NA | NA | NA | NA | NA
Markdown Extraction | NA | NA | NA | NA | NA
Image-based Document Extraction | NA | NA | NA | NA | NA

Managing Transformation Stages

Adding a Stage

  1. Navigate to Index > Enrich (or Transform page)
  2. Click +New Stage
  3. Select stage type and configure
  4. Click Save

Stage Operations

  • Testing a Stage: Click Simulate on the Transform page, select the stage and number of documents to test with. Results appear in the Viewer as a JSON object. Any transformation errors are listed under simulate_errors.
  • Enable/Disable: Use ellipsis menu or status toggle. Disabling a stage doesn’t delete or remove it permanently. When a stage is disabled, the output of the previous stage in the sequence is directly sent as input to the next stage.
  • Delete: Permanently removes the stage.
  • Sequencing: Stages execute top-down. The output of each stage becomes the input of the next — so order directly affects outcomes. Use the drag handle to reorder stages.

Content Enrichment Using Workbench

The Workbench is a tool for processing and enhancing ingested content through a series of configurable stages. Each stage performs specific data transformations before passing content to the next stage. Example: To prevent confidential information from appearing in answers, set up an Exclude Document stage that filters out any chunk containing the word “Confidential” in its title or body.

Key Features

  • Pipeline Processing: Sequential stage execution; the output of each stage feeds into the next.
  • Custom Transformation: Customizable stages per business needs.
  • Simulation Capabilities: Built-in simulator to test individual stages or the cumulative effect of multiple stages before deployment.
  • Flexible Sequencing: Design efficient processing workflows.

Supported Stages

Stage Type | Purpose
Field Mapping | Map document fields to target fields based on conditions
Custom Script | Run custom scripts on input data
Exclude Document | Exclude documents from indexing based on conditions
API Stage | Connect with third-party services for dynamic updates
LLM Stage | Leverage external LLMs for content enrichment

Content Transformation vs Enrichment

 | Transform | Enrich
Where in pipeline | Operates on raw documents — immediately after extraction, before chunking | Operates on extracted chunks — after chunking is complete
Input | Full source documents or pages | Individual chunks
Purpose | Clean and enrich the document-level content before it becomes chunks | Refine and enhance chunk-level content before indexing
Accessibility | Index > Enrich (Transform page) | Index > Enrich (Workbench page)

Adding Workbench Stages

  1. Navigate to Index > Enrich.
  2. Click +New Stage.
  3. Select stage type.
  4. Configure:
    • Unique stage name
    • Conditions for content selection
    • Outcomes defining transformations
  5. Click Save.

Stage Management

Ordering: Data processes through stages sequentially. The sequence directly affects the cumulative output. Drag to reorder.
Enable/Disable: Toggle stages on/off without deleting. When disabled, the stage is skipped and the previous stage's output passes directly to the next active stage. Example: With three stages where stage 2 is disabled, data flows through stage 1, then directly to stage 3, skipping stage 2 entirely.
Deleting: Permanently removes the stage from the Workbench.
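The skip-when-disabled behavior can be sketched as follows (illustrative Python, not product code):

```python
def run_pipeline(data, stages):
    """Run enabled stages in order; a disabled stage is skipped, so the
    previous stage's output flows directly to the next enabled stage."""
    for stage in stages:
        if stage["enabled"]:
            data = stage["fn"](data)
    return data

stages = [
    {"enabled": True,  "fn": lambda d: d + ["stage1"]},
    {"enabled": False, "fn": lambda d: d + ["stage2"]},  # disabled: skipped
    {"enabled": True,  "fn": lambda d: d + ["stage3"]},
]
result = run_pipeline([], stages)
```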

Workbench Simulator

The Workbench includes a built-in simulator for testing stage behavior before deployment.

Features
  • Test and verify individual stage outputs
  • Test cumulative effects of multiple stages
  • Works with any data source type
Running a Simulation
  1. Click Simulate option on the Workbench page.
  2. Select the stages to test.
  3. Choose number of documents for testing.
  4. View results in JSON format in the Chunk Viewer.
Key Points:
  • Simulator shows changes from all stages in sequence order
  • Testing from a specific stage shows cumulative transformations up to that point
  • Temporarily disable other stages to test individual stage behavior
  • Errors during transformation appear in the simulate_errors object
Use the Workbench Simulator frequently during configuration to catch errors early.

Content Browser

The Content Browser provides tools to observe, verify, and edit extracted chunks from source data.
  • When new content is added, the application automatically trains on it and generates chunks.
  • When existing content is updated, all related chunks are deleted and recreated — meaning any manual edits to those chunks will be lost on the next sync.

Key Capabilities

  • Observation and Verification: Inspect and verify extracted chunks for accuracy
  • Editing of Chunks: Modify chunk information directly within the browser

Viewing Chunks

Navigate to Index > Browse to view all chunks. Each chunk displays:
  • Preview of any images or tables present in the chunk. Click a preview to enlarge it to full size. If a chunk contains multiple images, a carousel allows navigation through them.
  • Summary information
Click Details to view:
  • Document Title: Source document/page title
  • Chunk Title: Title assigned to the chunk
  • Chunk Text: Content of the chunk
  • Chunk ID: Unique identifier
  • Page Number: Source page (if applicable)
  • Source Title: Name of the source in the application
  • Source: Content type (web pages, files, connectors)
  • Extraction Strategy: Strategy used for extraction
  • Edited on: Last update date (if applicable)
  • Source URL: Source URL (if applicable)
View JSON: Displays chunk contents in JSON format with all properties.

Editing Chunks

  1. Click the Details icon on a chunk
  2. Modify title and/or text
  3. Click Save
Alternatively, use the Edit option directly from the browser home page.
When documents are updated, all related chunks are deleted and recreated, losing manual edits.

Search and Filter

Search: Use the search bar to find chunks by properties (chunkTitle, chunkText, source, etc.).
Filter: Advanced search using multiple chunk properties:
  • Source types
  • Extraction strategy
  • Content keywords
  • And more

Vector Configuration

Indexing converts extracted chunks into vector embeddings and stores them in a knowledge index used for answer generation. Vectors are multidimensional numerical representations of chunks that carry their semantic meaning. Navigate to Index > Index Configuration to configure this. The platform supports embedding models including BGE-M3, VDR, and custom models. By default, new applications use the BGE-M3 model for vector generation.

Key Features
  • Select from out-of-the-box XO GPT embedding models or bring your own custom model.
  • Choose which chunk fields are used to generate embeddings.
  • Generate up to three vectors per chunk, each capturing different semantic aspects via distinct field combinations (multi-vector search).

Glossary

Term | Description
Vector / Embedding | A numerical representation of data that captures its semantic meaning.
Embedding Model | A machine learning model that converts text or images into vectors.
Multi-Vector | Generating multiple vectors per chunk, each using different fields to capture distinct semantic aspects.
Field Combination | The specific set of fields (for example, Chunk Title, Chunk Text, Record Title) selected to generate a vector.
Vector Column | The data column that stores embeddings for a specific vector. In a 3-vector setup, there are three vector columns.
Vector Weight | A percentage value assigned to each vector to control its influence on retrieval scoring.
Vision Models | AI models designed to interpret and analyze visual data such as images.
Rebalancing | When a vector is unavailable for a chunk, its weight is proportionally redistributed among the remaining available vectors.

Prerequisites

The XO GPT model is configured for vector generation by default. To verify or change the embedding model:
  1. Navigate to Generative AI Tools > Model Library.
  2. Confirm that vector generation is enabled for the desired model.
All configured embedding models appear in the Vector Configuration page drop-down.

Vector Configuration for Textual Data

Configure the following fields:
Field | Description
Vector Model | Embedding model for vector generation. Options: BGE-M3 or a custom model
Prompt | Prompt used by the model. For custom models, create a new prompt using +New Prompt
The model and prompt selected on the Gen AI page are applied here by default. Changes made here are also reflected in the Vector Generation - Text feature on the Gen AI page.
Search AI supports generating up to three vectors per chunk, each configured to capture different semantic aspects using distinct field combinations. This improves retrieval accuracy by evaluating each chunk from multiple perspectives. How it works:
  1. Configure vectors — Define up to three vectors, each targeting specific sources, file types, and fields.
  2. Assign weights — Set a percentage weight per vector to control its influence on the final relevance score.
  3. Retrieval — At query time, the user query is converted to a vector and matched against all configured vectors, calculating a similarity score corresponding to each vector. Each similarity score is multiplied by its weight, and scores are summed to produce a final relevance ranking. Chunks are ranked based on these final scores, and the most relevant results are returned.
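The scoring step above can be expressed compactly. This is an illustrative Python sketch; the similarity values are made up, and the weights are shown as fractions of 100%.

```python
def final_score(similarities, weights):
    """Combine per-vector similarity scores into one relevance score:
    each similarity is multiplied by its vector's weight, then summed."""
    return sum(s * w for s, w in zip(similarities, weights))

# Similarities of one chunk's three vectors to the query vector,
# weighted 30% / 50% / 20%.
score = final_score([0.82, 0.64, 0.91], [0.30, 0.50, 0.20])
```

Chunks are then ranked by this combined score, so a vector with a higher weight pulls the ranking more strongly toward its notion of similarity.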

Configuring Vectors

Generate and assign up to three different vectors per chunk. Vector 1 is configured by default, applies to all content from all sources, and cannot be deleted. Its default fields are: Chunk Title, Chunk Text, Source Name, Record Title. You can change the fields but not the content source and file types. Vectors 2 and 3 must be manually configured. For each additional vector, provide:
Field | Description
Name | A meaningful identifier for the vector configuration
Field Combination | Defines which content the vector applies to and which fields are used
Each Field Combination has three components:
  • Source — Select the content source (e.g., uploaded files, connectors). Use the AND operator to narrow the scope further.
  • File Types — Select supported file types (e.g., PDF, HTML).
  • Fields — Select one or more chunk fields (e.g., Chunk Text, Chunk Title, Record Title) for embedding generation.
The order of selected fields influences the generated embedding. Semantically rich fields improve embedding quality and search accuracy.
Example: To use a custom summary field (cfs1) for uploaded PDFs:
  • Source: Files
  • File Type: PDF
  • Fields: cfs1
Use the Add button to define multiple field combinations within a single vector — useful when different content types need different field selections.

Field Combinations Precedence

When multiple field combinations are defined within a vector, they are evaluated top to bottom. The first matching combination is applied. Place more specific combinations above generic ones. Example: If the first combination targets the Default Directory and the second targets all sources — the Default Directory uses the first combination; everything else uses the second.
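First-match evaluation can be sketched as follows. This is illustrative Python; the condition functions stand in for the Source and File Types filters configured in the product.

```python
def select_combination(chunk, combinations):
    """Evaluate field combinations top to bottom; the first whose
    condition matches the chunk is applied."""
    for condition, fields in combinations:
        if condition(chunk):
            return fields
    return None

combinations = [
    # Specific combination first: Default Directory content uses cfs1.
    (lambda c: c["source"] == "Default Directory", ["cfs1"]),
    # Generic catch-all last: everything else uses title and text.
    (lambda c: True, ["chunkTitle", "chunkText"]),
]
fields_a = select_combination({"source": "Default Directory"}, combinations)
fields_b = select_combination({"source": "Web"}, combinations)
```

Reversing the order would make the catch-all swallow every chunk, so the specific combination would never apply.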

Vector Coverage

If a vector’s field combinations don’t cover certain content types, no embeddings are generated for those types in that vector column — meaning those chunks won’t contribute to semantic matching for that vector. The system automatically rebalances weights to compensate. For instance, if Vector 2 is configured with fields that apply only to web files, then embeddings are generated only for web content in Vector 2. Other content types, such as PDFs or connector-based files, won’t have embeddings in Vector 2 and therefore won’t contribute to semantic matching for that vector column. Learn about Weight Rebalancing.

Assigning Weights to Vectors

Weights control how much each vector contributes to the final relevance score. All weights must total 100%. To assign weights, click Manage Weights in the Vector Configuration section and set a percentage for each vector.
Example use case: A company wants to improve semantic search by capturing multiple aspects of the same document, prioritizing technical summaries while supporting broad and product-specific queries.
Vector | Fields Used | Weight | Rationale
Vector 1 (Title-Focused Vector) | Chunk Text, Chunk Title | 30% | Broad general coverage
Vector 2 (Summary-Focused Vector) | Summary (cfs1) | 50% | High semantic significance
Vector 3 (Metadata-Driven Vector) | Product Name (cfs2), Version (cfs3) | 20% | Specific but limited scope

Automatic Weight Rebalancing

When a vector has no embedding for a particular chunk (e.g., Vector 2 is only configured for web pages, so PDF chunks have no Vector 2 embedding), the system automatically redistributes that vector’s weight proportionally among the remaining available vectors, referred to as Automatic Weight Rebalancing. Example: Three vectors configured as:
  • Vector 1 — all sources, fields a, b, weight 50%
  • Vector 2 — web pages only, fields c, d, weight 20%
  • Vector 3 — all sources, fields e, f, weight 30%
Vector 2 embeddings are only generated for web page chunks. For all other content types (uploaded files, connectors), Vector 2 is unavailable:
Chunk | Vector 1 | Vector 2 | Vector 3
Chunk 1 (web page) | Available | Available | Available
Chunk 2 (uploaded file) | Available | Not available | Available
Chunk 3 (connector) | Available | Not available | Available
For Chunks 2 and 3, Vector 2’s 20% weight is redistributed using:
Adjusted Weight = Original Weight + ((Unavailable Weight × Original Weight) / Sum of Available Weights)
This gives the following adjusted weights for Chunks 2 and 3:
  • Vector 1 → 62.5%
  • Vector 3 → 37.5%
This ensures fair and accurate scoring without manual adjustment.
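The rebalancing formula can be checked with a short Python sketch (illustrative, using the weights from the example above):

```python
def rebalance(weights, available):
    """Redistribute the weight of unavailable vectors proportionally
    among the available ones:
    adjusted = original + (unavailable_total * original / available_total)."""
    unavailable_total = sum(w for w, ok in zip(weights, available) if not ok)
    available_total = sum(w for w, ok in zip(weights, available) if ok)
    return [
        w + (unavailable_total * w / available_total) if ok else 0.0
        for w, ok in zip(weights, available)
    ]

# Vector 2 (20%) has no embedding for this chunk, so its weight is
# split between Vector 1 (50%) and Vector 3 (30%) in proportion.
adjusted = rebalance([50.0, 20.0, 30.0], [True, False, True])
```

For the example, this yields 62.5% for Vector 1 and 37.5% for Vector 3, matching the adjusted weights stated above.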

Points to Note

  • The default Vector 1 cannot be disabled or deleted.
  • Avoid overlapping conditions across vector definitions.
  • After changing the embedding model, updating vector configurations, or reassigning weights — always Train the application to regenerate embeddings with the new settings.

Vector Configuration for Image-Based Data

Required only when using the Image-Based Document Extraction strategy.
Field | Description
Vector Model | VDR embedding model (selected by default)
Prompt | Select or create a prompt for image-based vector generation
Changes to the prompt here are automatically reflected in the Vector Generation - Image feature on the Gen AI page.

Batch Processing of Vector Generation

Search AI supports batch processing, which groups multiple chunks into a single API request rather than sending them individually for vector generation. This reduces API overhead and improves throughput. The system dynamically packs chunks up to a configured token limit per request and throttles batches according to token-per-minute and request-per-minute rate limits. For configuration details, see Configure Batch Processing.

Best Practices

Extraction Strategy Selection

  • Match strategy to content type and structure
  • Consider expected query complexity when setting chunk sizes
  • Use Layout Aware for documents with tables and charts

Transformation Pipeline Design

  • Order stages logically (e.g., exclude before enrichment)
  • Use simulation frequently during development
  • Keep transformations focused and modular

Content Quality

  • Review chunks in Content Browser after processing
  • Edit chunks only when necessary (edits lost on content updates)
  • Use filters to identify problematic chunks

Testing and Validation

  • Simulate all stages before training
  • Test with representative sample documents
  • Verify cumulative effects of multiple stages

Processing Workflow Summary

1. Content Ingestion
   └── Sources: Websites, Documents, Connectors

2. Extraction Phase
   └── Apply extraction strategy based on content type
   └── Generate initial chunks

3. Transformation Phase (Document Workbench)
   └── Stage 1 → Stage 2 → Stage N
   └── Each stage processes and passes to next

4. Enrichment Phase (Workbench)
   └── Chunks processing stages
   └── Field mapping, scripts, API calls, LLM enrichment

5. Indexing Phase
   └── Vector embedding generation
   └── Storage in vector database

6. Verification
   └── Content Browser review
   └── Manual edits if needed