Ingest Data from Spreadsheets

Search AI ingests spreadsheets in .xlsx and .csv formats through Markdown Extraction. When uploading a spreadsheet, select Markdown Extraction as the extraction model. Search AI detects the file type automatically and processes it accordingly.

Supported formats: .xlsx (Excel 2007+) and .csv
Unsupported formats: Legacy .xls files are not supported — convert to .xlsx before uploading.

How It Works

When an .xlsx or .csv file is uploaded, Search AI processes it in four steps.

Step 1: Pre-processing
The file is identified by type. For .xlsx files:

Empty columns are removed.
Blank cells and embedded line breaks within cells are normalised.

Step 2: Conversion to markdown
The file is converted to structured markdown.

Tables are identified by their boundaries. A blank row or blank column signals the end of one table and the start of the next.
Any content found after that boundary is treated as a separate table.
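
The blank-row boundary rule can be illustrated with a minimal sketch. This is not Search AI's implementation. It assumes a sheet has already been read into a list of rows (each a list of cell values), it handles only blank-row boundaries and ignores blank columns, and the names `is_blank` and `split_tables` are hypothetical.

```python
from typing import List, Optional

Row = List[Optional[str]]


def is_blank(row: Row) -> bool:
    """A row is blank when every cell is None or whitespace-only."""
    return all(cell is None or str(cell).strip() == "" for cell in row)


def split_tables(rows: List[Row]) -> List[List[Row]]:
    """Split rows into tables, treating each run of blank rows as a boundary."""
    tables: List[List[Row]] = []
    current: List[Row] = []
    for row in rows:
        if is_blank(row):
            if current:  # close the table in progress
                tables.append(current)
                current = []
        else:
            current.append(row)
    if current:  # flush the last table
        tables.append(current)
    return tables


# Two tables on one sheet, separated by a single blank row.
sheet = [
    ["Region", "Revenue"],
    ["EMEA", "120"],
    [None, None],
    ["Product", "Units"],
    ["Widget", "40"],
]
tables = split_tables(sheet)
```

Here `tables` holds two entries, one per logical table, and the first row of each entry is the row that would be treated as its header.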

Step 3: Chunking
The markdown is split by sheet first. Each sheet is then chunked independently using the token limit.

The sheet name is stored as the chunk title, and the file name as the record title.
Tables that exceed the token limit (5,000 tokens by default) are split at row boundaries.
The original header row is re-inserted at the top of every continuation chunk, so each chunk is a self-contained, valid markdown table.
If a sheet contains multiple tables separated by blank rows, each is treated as a separate table in the markdown output and chunked independently.
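
The row-boundary split with header re-insertion can be sketched as follows. This is an illustration under stated assumptions, not the actual chunker: `count_tokens` is a crude whitespace word count standing in for whatever tokenizer Search AI uses, and `chunk_table` is a hypothetical name. Only the 5,000-token default comes from the description above.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def chunk_table(lines: List[str], max_tokens: int = 5000) -> List[str]:
    """Split a markdown table at row boundaries, repeating the header in each chunk.

    `lines` is the table as markdown lines: header, separator, then data rows.
    """
    header, body = lines[:2], lines[2:]
    header_cost = sum(count_tokens(line) for line in header)
    chunks: List[str] = []
    current: List[str] = []
    used = header_cost
    for row in body:
        cost = count_tokens(row)
        # Start a new chunk when adding this row would exceed the budget.
        if current and used + cost > max_tokens:
            chunks.append("\n".join(header + current))
            current, used = [], header_cost
        current.append(row)
        used += cost
    if current:
        chunks.append("\n".join(header + current))
    return chunks


table = [
    "| ID | Name |",
    "| --- | --- |",
    "| 1 | Ada |",
    "| 2 | Grace |",
    "| 3 | Alan |",
]
# A tiny budget is used here only to force a split in the example.
chunks = chunk_table(table, max_tokens=20)
```

Because every chunk begins with the header and separator lines, each one renders as a valid markdown table on its own, which is the property the continuation chunks rely on.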

Step 4: Indexing
The resulting chunks are stored and indexed for search.

What Works Out of the Box

For well-structured Excel files with clear column headers and consistent data, Search AI supports:

Lookup and Retrieval — Locate a specific record, value, or row based on a query.
Semantic Search — Search across columns filled with rich text.
Sheet-Scoped Queries — Sheet names are indexed as titles for data chunks, enabling search within a specific sheet in the workbook.
Cross-Sheet Retrieval — All sheets within the same workbook are included in a single search index, enabling knowledge sharing across sheets.
Formula Results — Computed values from formulas are indexed so users see results rather than formula text. The Excel file must have been opened and saved after the last data entry — files generated programmatically without being opened in Excel may contain empty formula cells.
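
Before uploading a programmatically generated workbook, it can help to check whether its formula cells carry saved values. The sketch below is one possible pre-upload check using only the Python standard library. It assumes the standard (transitional) .xlsx layout, where each worksheet is an XML part under xl/worksheets/ and a formula cell stores its formula in an `f` element and its cached result in a `v` element. The function name `formulas_without_values` is hypothetical, and the check is not part of Search AI.

```python
import zipfile
import xml.etree.ElementTree as ET
from typing import BinaryIO, List, Tuple, Union

# Namespace used by worksheet XML in standard (transitional) .xlsx files.
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}


def formulas_without_values(xlsx: Union[str, BinaryIO]) -> List[Tuple[str, str]]:
    """Return (worksheet part, cell reference) for formula cells with no cached value."""
    missing: List[Tuple[str, str]] = []
    with zipfile.ZipFile(xlsx) as archive:
        for name in archive.namelist():
            if not (name.startswith("xl/worksheets/") and name.endswith(".xml")):
                continue
            root = ET.fromstring(archive.read(name))
            for cell in root.iterfind(".//m:c", NS):
                has_formula = cell.find("m:f", NS) is not None
                value = cell.find("m:v", NS)
                if has_formula and (value is None or not (value.text or "").strip()):
                    missing.append((name, cell.get("r", "?")))
    return missing
```

A non-empty result suggests the file should be opened and saved in Excel before upload. A formula that legitimately evaluates to an empty string may also be flagged, so treat the output as a hint rather than a verdict.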

How Search AI Handles .xlsx Files

Feature	Behavior	Notes / Caveats
Multiple sheets	Each sheet is extracted independently	Sheet name preserved as section header
Formula results	Shows computed value	Cells may be empty if the file was never saved in Excel — re-save before uploading. Volatile formula values like `TODAY()` and `NOW()` are captured as a snapshot at time of last save and are stale from the moment of ingestion.
Large tables	Auto-split at row boundaries	Header row re-inserted at the top of each chunk
Multiple tables	Tables separated by blank rows are extracted separately	Each table is treated as an independent chunk
Merged cells	Processed without errors	Value appears in the top-left cell only; remaining cells in the range are empty
Mixed-layout sheets	Sections with different column structures are split into separate tables	Content is preserved but does not appear as a unified view
Very large workbooks	Supported	Extraction can take up to 20 minutes for very large files
Empty rows or columns	Removed	Keeps the extracted output clean
Indented row hierarchies	Flattened	Parent-child relationships are lost, which may reduce context

Wide tables (many columns) can reduce search quality. Consider splitting wide datasets into narrower tables before uploading.
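
One way to split a wide dataset before uploading is to group related columns into narrower tables that all repeat a shared key column. The sketch below assumes the rows are already loaded as dictionaries (for example via `csv.DictReader`). The function name `split_wide_table` and the column names are illustrative only.

```python
from typing import Dict, List, Sequence


def split_wide_table(
    rows: List[Dict[str, str]], key: str, groups: Sequence[Sequence[str]]
) -> List[List[Dict[str, str]]]:
    """Split each row into narrower rows, one table per column group.

    The key column is repeated in every table so rows can still be matched up.
    """
    tables = []
    for columns in groups:
        tables.append([{key: row[key], **{c: row[c] for c in columns}} for row in rows])
    return tables


wide = [
    {"Employee ID": "E1", "Name": "Ada", "Team": "Core", "Salary": "100", "Bonus": "10"},
    {"Employee ID": "E2", "Name": "Grace", "Team": "Tools", "Salary": "110", "Bonus": "12"},
]
profile, pay = split_wide_table(
    wide, "Employee ID", [["Name", "Team"], ["Salary", "Bonus"]]
)
```

Each resulting table could then be written to its own sheet, or to the same sheet with a blank row between tables, so that it is extracted and chunked independently.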

Limitations

The following are not supported:

Legacy format (.xls) — Convert to .xlsx before uploading.
Encrypted or password-protected files — Cannot be processed.
Numeric aggregations — SUM, AVERAGE, COUNT, etc. cannot be performed.
Pivot tables — Extracted as static snapshots reflecting the last saved state. Re-pivoting, filtering, and slicing are not available.
Cell formatting — Cell colors, fonts, borders, conditional formatting, and number formats are not captured.
Visual elements — Charts, sparklines, and embedded images are not extracted.
Cell metadata — Comments, hyperlinks, and data types (dates, booleans, currencies) are not preserved.
Scripts — Macros, VBA scripts, and named ranges are not supported.
Indentation and parent-child relationships — Flattened during extraction, which may lead to context loss.

Best Practices — Authoring Excel Files for Search AI

Consider the following guidelines for effective extraction from spreadsheets.

Always use the first row as the header row.
The system treats the first row containing data as the header row. This is assumed by position, not inferred from content or formatting. The header row is the only row that carries over into every continuation chunk when a table is split. If your first row contains data rather than column names, every chunk will use that data row as its header and the actual column names will be absent.

Use plain language column headers.
Column headers are the primary semantic signal for retrieval. A header like Rev_Adj_Wt_Avg tells the system nothing about the column, and therefore nothing about the values in the rows below it. Write headers in full, descriptive language, such as Adjusted Weighted Average Revenue. If an abbreviation is necessary, include the full form: ARR (Annual Recurring Revenue).

One table per sheet, or separate tables with a blank row between them.
The system identifies table boundaries using blank rows and blank columns. If two tables are adjacent without a blank row between them, they will be extracted as a single table with a broken schema. Separate each logical table with at least one blank row.

Avoid merged cells wherever possible.
A merged cell's value appears only in the top-left cell of the range; all other cells in the merge are blank after extraction. This affects both column headers (sub-headers lose their parent label) and data rows (a group label spanning multiple rows appears only on the first row of the group).

Do not encode meaning in colour or formatting.
Cell background colour, font colour, borders, and conditional formatting are all lost at extraction. If a column uses red to mean overdue and green to mean complete, that information does not exist in the extracted data. Status, category, and any other meaning should always be in a text column.

Keep tables narrow where possible.
Very wide tables, with many columns per row, produce long individual row chunks that dilute the semantic signal. If a table has many columns, consider whether it can be split into narrower related tables or whether some columns are better stored as separate metadata.

Ensure formulas are saved with computed values.
The system extracts the computed result of a formula, not the formula itself. This only works if the file was opened and saved in Excel after the last data entry. Files generated programmatically and never opened in Excel will have empty formula cells.
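
As an illustration of the merged-cell guidance above, a group label that spans several rows can be copied into every row of its group before upload, so that no row loses its label after extraction. The sketch below assumes the data rows (excluding the header row) are already loaded as lists of cell values, with the cells left empty by the merge read as None or an empty string. The name `fill_down` is hypothetical.

```python
from typing import List, Optional


def fill_down(rows: List[List[Optional[str]]], column: int) -> List[List[Optional[str]]]:
    """Copy the last non-empty value in `column` into the empty cells below it."""
    filled = []
    last = None
    for row in rows:
        row = list(row)  # do not mutate the caller's data
        if row[column] not in (None, ""):
            last = row[column]
        else:
            row[column] = last
        filled.append(row)
    return filled


# "EMEA" and "APAC" were merged across two rows each in the original sheet.
rows = [
    ["EMEA", "Q1", "120"],
    [None, "Q2", "135"],
    ["APAC", "Q1", "90"],
    ["", "Q2", "95"],
]
result = fill_down(rows, column=0)
```

After this step every row carries its own region label, so a chunk that starts mid-group still contains the context needed for retrieval.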
