Search AI supports the .xlsx and .csv formats via Markdown Extraction. When uploading a spreadsheet, use Markdown Extraction as the extraction model. Search AI automatically detects the file type and processes it appropriately.
- Supported formats: .xlsx (Excel 2007+) and .csv
- Unsupported formats: Legacy .xls files are not supported; convert to .xlsx before uploading.
How It Works
When an .xlsx or .csv file is uploaded, Search AI processes it in four steps.
Step 1: Pre-processing
The file is identified by type. For .xlsx files:
- Empty columns are removed.
- Blank cells and embedded line breaks within cells are normalised.
- Tables are identified by their boundaries. A blank row or blank column signals the end of one table and the start of the next.
- Any content found after that boundary is treated as a separate table.
- The sheet name is stored as the chunk title, and the file name as the record title.
- Tables that exceed the token limit (5,000 tokens by default) are split at row boundaries.
- The original header row is re-inserted at the top of every continuation chunk, so each chunk is a self-contained, valid markdown table.
- If a sheet contains multiple tables separated by blank rows, each is treated as a separate table in the markdown output and chunked independently.
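The splitting behaviour described above can be sketched in a few lines of Python. This is an illustration only, not Search AI's implementation: `chunk_table`, `to_markdown_row`, and `estimate_tokens` are hypothetical names, and the word-count token estimate is an assumption standing in for whatever tokenizer the service actually uses.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    # Assumption: one token per whitespace-separated word. A real tokenizer differs.
    return len(text.split())


def to_markdown_row(cells: List[str]) -> str:
    return "| " + " | ".join(cells) + " |"


def chunk_table(header: List[str], rows: List[List[str]],
                token_limit: int = 5000) -> List[str]:
    """Split a table at row boundaries, repeating the header in every chunk."""
    head_lines = [to_markdown_row(header),
                  to_markdown_row(["---"] * len(header))]
    head_cost = sum(estimate_tokens(line) for line in head_lines)

    chunks: List[str] = []
    current: List[str] = []
    cost = head_cost
    for row in rows:
        line = to_markdown_row(row)
        line_cost = estimate_tokens(line)
        # Close the chunk before this row if adding it would exceed the limit.
        # A row is never cut in half, so an oversized row gets its own chunk.
        if current and cost + line_cost > token_limit:
            chunks.append("\n".join(head_lines + current))
            current, cost = [], head_cost
        current.append(line)
        cost += line_cost
    if current:
        chunks.append("\n".join(head_lines + current))
    return chunks
```

Because the header lines are prepended to every chunk, each returned string is a complete markdown table on its own, which is the property the continuation chunks rely on.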
What Works Out of the Box
For well-structured Excel files with clear column headers and consistent data, Search AI supports:
- Lookup and Retrieval: Locate a specific record, value, or row based on a query.
- Semantic Search: Search across columns filled with rich text.
- Sheet-Scoped Queries: Sheet names are indexed as titles for data chunks, enabling search within a specific sheet in the workbook.
- Cross-Sheet Retrieval: All sheets within the same workbook are included in a single search index, so one query can retrieve content from any sheet.
- Formula Results: Computed values from formulas are indexed so users see results rather than formula text. The Excel file must have been opened and saved after the last data entry; files generated programmatically without being opened in Excel may contain empty formula cells.
How Search AI Handles .xlsx Files
| Feature | Behaviour | Notes / Caveats |
|---|---|---|
| Multiple sheets | Each sheet is extracted independently | Sheet name preserved as section header |
| Formula results | Shows computed value | Cells may be empty if the file was never saved in Excel — re-save before uploading. Volatile formula values like TODAY() and NOW() are captured as a snapshot at time of last save and are stale from the moment of ingestion. |
| Large tables | Auto-split at row boundaries | Header row re-inserted at the top of each chunk |
| Multiple tables | Tables separated by blank rows are extracted separately | Each table is treated as an independent chunk |
| Merged cells | Processed without errors | Value appears in the top-left cell only; remaining cells in the range are empty |
| Mixed-layout sheets | Sections with different column structures are split into separate tables | Content is preserved but does not appear as a unified view |
| Very large workbooks | Supported | Extraction can take up to 20 minutes for very large files |
| Empty rows or columns | Removed | Removed to keep the output clean |
| Indented row hierarchies | Flattened | Parent-child relationships are flattened |
Wide tables (many columns) can reduce search quality. Consider splitting wide datasets into narrower tables before uploading.
Limitations
The following are not supported:
- Legacy format (.xls): Convert to .xlsx before uploading.
- Encrypted or password-protected files: Cannot be processed.
- Numeric aggregations: SUM, AVERAGE, COUNT, etc. cannot be performed.
- Pivot tables: Extracted as static snapshots reflecting the last saved state. Re-pivoting, filtering, and slicing are not available.
- Cell formatting: Cell colors, fonts, borders, conditional formatting, and number formats are not captured.
- Visual elements: Charts, sparklines, and embedded images are not extracted.
- Cell metadata: Comments, hyperlinks, and data types (dates, booleans, currencies) are not preserved.
- Scripts: Macros, VBA scripts, and named ranges are not supported.
- Indentation and parent-child relationships: Flattened during extraction, which may lead to context loss.
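The first two limitations can usually be caught before upload. The sketch below is a hypothetical local check, not a Search AI feature: `precheck` and its return strings are invented for illustration. It relies on the assumption that a valid .xlsx is a ZIP package, whereas legacy .xls files and password-protected workbooks are typically stored as OLE compound files.

```python
import zipfile
from pathlib import Path

# Signature found in the first 8 bytes of OLE compound files.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def precheck(path: str) -> str:
    """Return "ok" or a short reason why the file is unlikely to be processed."""
    p = Path(path)
    suffix = p.suffix.lower()
    if suffix == ".csv":
        return "ok"
    if suffix == ".xls":
        return "legacy .xls: convert to .xlsx"
    if suffix != ".xlsx":
        return "unsupported extension"
    # A regular .xlsx is a ZIP archive of XML parts.
    if zipfile.is_zipfile(str(p)):
        return "ok"
    with p.open("rb") as fh:
        if fh.read(8) == OLE_MAGIC:
            return "not a plain .xlsx: possibly encrypted or a renamed .xls"
    return "not a valid .xlsx package"
```

The check only inspects the container, so it cannot confirm that a file will extract cleanly; it just flags the cases that are known not to work.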
Best Practices — Authoring Excel Files for Search AI
Consider the following guidelines for effective extraction from spreadsheets.
Always use the first row as the header row.
The system always treats the first row containing data as the header row. This is assumed by position, not inferred from content or formatting. The header row is the only row that carries over into every continuation chunk when a table is split. If your first row contains data rather than column names, every chunk will use that data row as its header and the actual column names will be absent.
Use plain language column headers.
Column headers are the primary semantic signal for retrieval. A header like Rev_Adj_Wt_Avg tells the system nothing about the column, and therefore nothing about the rows below it. Write headers in full, descriptive language, such as Adjusted Weighted Average Revenue. If an abbreviation is necessary, include the full form: ARR (Annual Recurring Revenue).
One table per sheet, or separate tables with a blank row between them.
The system identifies table boundaries using blank rows and blank columns. If two tables are adjacent without a blank row between them, they will be extracted as a single table with a broken schema. Separate each logical table with at least one blank row.
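To see why the blank row matters, here is a minimal sketch of boundary detection on rows only. It is an assumption about the general technique, not Search AI's code: `split_tables` and `is_blank` are hypothetical helpers, and blank-column boundaries are left out for brevity.

```python
from typing import List, Optional

Row = List[Optional[str]]


def is_blank(row: Row) -> bool:
    # A row counts as blank when every cell is missing or whitespace only.
    return all(cell is None or str(cell).strip() == "" for cell in row)


def split_tables(rows: List[Row]) -> List[List[Row]]:
    """Group consecutive non-blank rows into tables; blank rows end a table."""
    tables: List[List[Row]] = []
    current: List[Row] = []
    for row in rows:
        if is_blank(row):
            if current:
                tables.append(current)
                current = []
        else:
            current.append(row)
    if current:
        tables.append(current)
    return tables
```

With a blank row between them, two blocks come back as two tables, each with its own first row available as a header. Without it, both blocks land in one table and the second block's header becomes an ordinary data row under the first block's columns.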
Avoid merged cells wherever possible.
A merged cell’s value appears only in the first (top-left) cell of the range; all other cells in the merge are blank after extraction. This affects both column headers (sub-headers lose their parent label) and data rows (a group label spanning multiple rows will only appear on the first row of the group).
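One possible workaround, applied to the data before it is saved for upload, is to unmerge the cells and repeat the group label on every row. The sketch below shows the idea on plain Python lists; `fill_down` is a hypothetical helper and not something Search AI provides.

```python
from typing import List


def fill_down(rows: List[List[str]], column: int) -> List[List[str]]:
    """Copy the last non-empty value in `column` into the blank cells below it."""
    filled: List[List[str]] = []
    last = ""
    for row in rows:
        new_row = list(row)  # copy so the input rows are left unchanged
        if new_row[column].strip():
            last = new_row[column]
        else:
            new_row[column] = last
        filled.append(new_row)
    return filled
```

For example, a region column that reads `EMEA`, blank, `APAC`, blank after unmerging becomes `EMEA`, `EMEA`, `APAC`, `APAC`, so every row carries its own group label when it is chunked.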
Do not encode meaning in colour or formatting.
Cell background colour, font colour, borders, and conditional formatting are all lost at extraction. If a column uses red to mean overdue and green to mean complete, that information does not exist in the extracted data. Status, category, and any other meaning should always be in a text column.
Keep tables narrow where possible.
Very wide tables — many columns per row — produce long individual row chunks that dilute the semantic signal. If a table has many columns, consider whether it can be split into narrower related tables or whether some columns are better stored as separate metadata.
Ensure formulas are saved with computed values.
The system extracts the computed result of a formula, not the formula itself. This only works if the file was opened and saved in Excel after the last data entry. Files generated programmatically and never opened in Excel will have empty formula cells.
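A file can be checked for this condition before upload. An .xlsx worksheet part stores each cell as a `c` element, with the formula in an `f` child and the cached result in a `v` child. The sketch below lists formula cells that have no cached value. `formulas_without_values` is a hypothetical helper, and reading the part from `xl/worksheets/sheet1.xml` is an assumption about the usual package layout rather than a guarantee.

```python
import xml.etree.ElementTree as ET
from typing import List

# Main SpreadsheetML namespace used by worksheet parts.
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}


def formulas_without_values(sheet_xml: bytes) -> List[str]:
    """Return references (such as "C1") of formula cells lacking a cached value."""
    root = ET.fromstring(sheet_xml)
    missing: List[str] = []
    for cell in root.iterfind(".//m:c", NS):
        has_formula = cell.find("m:f", NS) is not None
        value = cell.find("m:v", NS)
        has_value = value is not None and bool((value.text or "").strip())
        if has_formula and not has_value:
            missing.append(cell.get("r", "?"))
    return missing
```

A typical use would be `formulas_without_values(zipfile.ZipFile(path).read("xl/worksheets/sheet1.xml"))`, repeated for each worksheet part. A non-empty result suggests the workbook should be opened and saved in Excel before uploading.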