Custom HTML Page Crawling

Use this feature to crawl specific content from a web page, exclude unnecessary or irrelevant data, and generate metadata and sub-items from the page content.

Key features 

  • Exclude unnecessary sections from the HTML page, such as the header, sidebar, and footer.
  • Extract content as metadata, such as the publish date and author.
  • Define sections as sub-items. For example, on an FAQ page, each answer can be stored as a sub-item.

Schema:

[
  {
    "for": {
      "urls": [
        ".*"
      ]
    },
    "exclude": [
      {
        "type": "CSS|XPATH",
        "path": "css|xpath selector"
      },
      {
        "type": "CSS|XPATH",
        "path": "css|xpath selector"
      }
    ],
    "metadata": {
      "meta1": {
        "type": "CSS|XPATH",
        "path": "css|xpath selector"
      }
    },
    "subItems": {
      "MySubItemName": {
        "type": "CSS|XPATH",
        "path": "css|xpath selector"
      }
    }
  },
  {
    "for": {
      "types": [
        "MySubItemName"
      ]
    },
    "metadata": {
      "meta1": {
        "type": "CSS|XPATH",
        "path": "css|xpath selector"
      },
      "title": {
        "type": "CSS|XPATH",
        "path": "css|xpath selector"
      },
      "uri": {
        "type": "CSS|XPATH",
        "path": "css|xpath selector"
      },
      "meta2": {
        "type": "CSS|XPATH",
        "path": "css|xpath selector"
      }
    }
  }
]

Elements of Custom Configuration:

Filters (for)

  • The for element specifies the pages or items to which the settings are to be applied. This element is mandatory. 
  • This field specifies a matching value for either URLs (pages) or types (sub-items). The value can be specified as a regular expression.

Filter Examples:

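  1. To select all the pages on the site, use the following filter.
"for": {
  "urls": [
    ".*"
  ]
}

  2. To select only HTML pages, use the following filter. The double backslash is JSON escaping; the resulting regular expression is \.html.
"for": {
  "urls": [
    "\\.html"
  ]
}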

Exclusion

Use the exclude element to remove one or more parts of the page body from the indexed content.

Example: To index only the title, the question, and the answers, remove the top bar, the header, the top advertisement, and the sidebar from the indexed content. Use the corresponding selectors and define the exclude element as shown below. Here, #herobox is the selector for the advertisement.

"exclude": [
  {
    "type": "CSS",
    "path": "div.topbar"
  },
  {
    "type": "CSS",
    "path": "#header"
  },
  {
    "type": "CSS",
    "path": "#herobox"
  },
  {
    "type": "CSS",
    "path": "#sidebar"
  }
]

Metadata

The metadata element defines how to retrieve specific metadata from the current page. It is a map of selectors in which each key is the metadata name. In addition to type and path, each metadata field can have the following properties:

  • isBoolean: When this property is set to true, the field returns true if the selector matches at least one element on the page, and false otherwise.
  • isAbsolute: By default, selectors in sub-item settings are evaluated on the sub-item body. When this property is set to true, the selector is evaluated on the parent page instead of the current element. Use it only in sub-item settings, to retrieve metadata that exists only on the parent page.

Example: Use the following elements to retrieve this metadata for a question:

  • The number of votes for the question.
  • The date and time the question was asked.
  • Whether there is at least one answer to the question.

"metadata": {
  "questionVotes": {
    "type": "XPATH",
    "path": "//*[@id='question']//div[@class='vote']/span/text()"
  },
  "questionAskedDate": {
    "type": "CSS",
    "path": "#question div.user-action-time > span::attr(title)"
  },
  "questionHasAnswer": {
    "type": "CSS",
    "path": "div.answer",
    "isBoolean": true
  }
}
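
The isAbsolute property is easier to see in a sub-item settings block. The following is a minimal sketch rather than a tested configuration: it assumes a sub-item type named answer, as in the Q&A example on this page, and the field names questionTitle and answerVotes and both of their selectors are hypothetical. Because questionTitle sets isAbsolute to true, its selector would be evaluated on the parent page, while answerVotes would be evaluated on the answer body, which is the default.

```json
{
  "for": {
    "types": [
      "answer"
    ]
  },
  "metadata": {
    "questionTitle": {
      "type": "CSS",
      "path": "#question-header h1",
      "isAbsolute": true
    },
    "answerVotes": {
      "type": "CSS",
      "path": "div.vote > span"
    }
  }
}
```

Without isAbsolute, the questionTitle selector would likely find nothing, since the question header is not part of an individual answer element.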

SubItems

Sub-items are used to retrieve multiple source items from a single web page. The subItems element defines how to retrieve sub-items from the web page.

Example: On a Q&A website, a page may contain a question, answers, and comments. You could define answer and comment sub-items, while the main item would contain only the question. The following defines the answer sub-item.

"subItems": {
  "answer": {
    "type": "CSS",
    "path": "#answers div.answer"
  }
}

Note: The full page is still indexed (taking into account applicable web scraping tasks) and becomes a source item. For example, if a Q&A page contains five answers to a question and answers are defined as sub-items, the index will contain six items for that page: one for the page itself and five for the answer sub-items.
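
Putting the elements together, a configuration for the Q&A example might look like the sketch below. The first block reuses the filter, exclusion, metadata, and sub-item selectors from the examples on this page. The second block applies only to answer sub-items; its answerAuthor field and selector are hypothetical and are included only to illustrate the types filter.

```json
[
  {
    "for": {
      "urls": [
        ".*"
      ]
    },
    "exclude": [
      { "type": "CSS", "path": "div.topbar" },
      { "type": "CSS", "path": "#header" },
      { "type": "CSS", "path": "#herobox" },
      { "type": "CSS", "path": "#sidebar" }
    ],
    "metadata": {
      "questionVotes": {
        "type": "XPATH",
        "path": "//*[@id='question']//div[@class='vote']/span/text()"
      },
      "questionHasAnswer": {
        "type": "CSS",
        "path": "div.answer",
        "isBoolean": true
      }
    },
    "subItems": {
      "answer": {
        "type": "CSS",
        "path": "#answers div.answer"
      }
    }
  },
  {
    "for": {
      "types": [
        "answer"
      ]
    },
    "metadata": {
      "answerAuthor": {
        "type": "CSS",
        "path": "div.user-details a"
      }
    }
  }
]
```

With a configuration of this shape, a page with five answers would be expected to produce six indexed items, as described in the note above.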
