Keyword Extraction is a technique to automatically detect important words from the text stored in a field.
The Keyword Extraction stage in the Index Pipeline identifies a set of keywords in a source field and saves them in a target field, so that they can be used to better identify the search user's intent. SearchAssist supports several NLP algorithms for extracting keywords.
You can:
- Define a condition for the keyword extraction stage. The keywords will be extracted only from the documents that satisfy the given condition.
- Re-order or delete keyword extractions.
- Simulate the changes before saving them.
Train your app each time you change any index configuration. Training builds the index based on the updated configurations.
Configuration
To configure keyword extraction, follow these steps:
- Click the Indices tab on the top.
- On the left pane, under the Index Configuration section, click Workbench.
- On the Workbench (Index Configuration) page, in the Stages column, click the + icon.
- In the right column, select Keyword Extraction from the Stage Type drop-down list.
- Enter a name in the Stage Name field.
- Enter a condition in the Condition field. You can combine multiple conditions using the AND/OR connectors. The stage is applied only to documents that satisfy the condition. See the Conditions section below for details.
- Select the field you want to extract keywords from as the Source Field.
- Define where you want to store the extracted keywords as the Target Field. This field is created by the application.
- Choose a model from the Choose Model drop-down list. See the Models section below for details.
- Click Simulate to verify the configurations. The simulator displays the Source and the number of documents to which the mapping was applied, and the result. You can change the Source (if not mentioned in the condition) and the number of documents.
- Once done, click Save Configuration on the top-right.
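The data flow that these steps configure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SearchAssist's implementation: `run_keyword_stage` and `extract_keywords` are hypothetical names, and a simple word-frequency count stands in for the supported models.

```python
import re
from collections import Counter

# Small illustrative stopword list (assumption); a real extractor would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "for", "on", "your", "from"}

def extract_keywords(text, top_n=3):
    """Stand-in extractor: rank non-stopword tokens by frequency."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
    return [word for word, _ in Counter(tokens).most_common(top_n)]

def run_keyword_stage(docs, source_field, target_field, condition=lambda doc: True):
    """Write keywords from source_field into target_field for documents that match."""
    for doc in docs:
        if condition(doc) and doc.get(source_field):
            doc[target_field] = extract_keywords(doc[source_field])
    return docs

docs = [
    {"contentType": "web", "content": "Reset your password. Password reset links expire."},
    {"contentType": "faq", "content": "Billing cycles and invoices."},
]
# Only the document whose contentType is "web" satisfies the condition.
run_keyword_stage(docs, "content", "keywords", lambda d: d["contentType"] == "web")
print(docs[0]["keywords"])   # keywords stored in the target field
print("keywords" in docs[1])  # False: the condition did not match
```

The second document is left untouched, which mirrors how the stage skips documents that do not satisfy the condition.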
Models
The following models are supported:
- Topic Rank – A method that extracts keyphrases from the most important topics of a document.
- Position Rank – A method that takes into account both how frequently words or phrases occur and where they occur in a document.
- Multi-partite Rank – A keyphrase extraction method that encodes topical information within a multipartite graph structure.
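To give a feel for how position and frequency can be combined, the sketch below computes a position-based weight of the kind used in the published PositionRank method: each word accumulates the inverse of every position at which it occurs, so early and frequent words score highest. This is a simplified illustration and not the model that SearchAssist runs; the full method goes on to use these weights to bias a graph-ranking step, which is omitted here. The function name `position_weights` is hypothetical.

```python
import re
from collections import defaultdict

def position_weights(text):
    """Sum 1/position over each word's occurrences, then normalize to sum to 1."""
    words = re.findall(r"[a-z]+", text.lower())
    raw = defaultdict(float)
    for position, word in enumerate(words, start=1):
        raw[word] += 1.0 / position
    if not raw:
        return {}
    total = sum(raw.values())
    return {word: score / total for word, score in raw.items()}

weights = position_weights("keyword extraction finds keyword candidates early")
ranked = sorted(weights, key=weights.get, reverse=True)
print(ranked[:2])  # "keyword" ranks first: it appears twice, once at position 1
```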
Conditions
A condition has one of the following formats: ctx.field_name==value or ctx.field_name!=value. The field_name can be obtained from the Fields table under Index Configuration.
For example, use ctx.contentType=="web" to restrict extraction to content from a web source.
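The effect of a single condition on a document can be sketched as follows. The parser below is a hypothetical illustration that handles only one ctx.field_name==value or ctx.field_name!=value expression; how SearchAssist actually evaluates conditions, including the AND/OR connectors, is not covered here.

```python
import re

# Matches ctx.<field> followed by == or != and an optionally quoted value.
CONDITION_RE = re.compile(r'^ctx\.(\w+)\s*(==|!=)\s*"?([^"]*)"?$')

def matches(condition, doc):
    """Return True if the document satisfies a single ctx.field==value / != condition."""
    m = CONDITION_RE.match(condition.strip())
    if m is None:
        raise ValueError(f"unsupported condition: {condition!r}")
    field, op, value = m.groups()
    actual = doc.get(field)
    # A missing field is treated as "not equal" to any value (assumption).
    equal = actual is not None and str(actual) == value
    return equal if op == "==" else not equal

web_doc = {"contentType": "web", "title": "Pricing"}
faq_doc = {"contentType": "faq", "title": "Refunds"}
print(matches('ctx.contentType=="web"', web_doc))  # True
print(matches('ctx.contentType=="web"', faq_doc))  # False
print(matches('ctx.contentType!="web"', faq_doc))  # True
```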