Custom HTML Page Crawling

Use this feature to crawl specific content from a web page, exclude irrelevant data, and generate metadata and sub-items from the page content.

Key features

Exclude unnecessary sections of the HTML page, such as the header, sidebar, and footer.
Extract content as metadata, such as the publish date and author.
Define sections as sub-items. For example, on an FAQ page, each answer can be stored as a sub-item.

Schema:

[
  {
    "for": {
      "urls": [
        ".*"
      ]
    },
    "exclude": [
      {
        "type": "CSS|XPATH",
        "path": "css|xpath selector"
      },
      {
        "type": "CSS|XPATH",
        "path": "css|xpath selector"
      }
    ],
    "metadata": {
      "meta1": {
        "type": "CSS|XPATH",
        "path": "css|xpath selector"
      }
    },
    "subItems": {
      "MySubItemName": {
        "type": "CSS|XPATH",
        "path": "css|xpath selector"
      }
    }
  },
  {
    "for": {
      "types": [
        "MySubItemName"
      ]
    },
    "metadata": {
      "meta1": {
        "type": "CSS|XPATH",
        "path": "css|xpath selector"
      },
      "title": {
        "type": "CSS|XPATH",
        "path": "css|xpath selector"
      },
      "uri": {
        "type": "CSS|XPATH",
        "path": "css|xpath selector"
      },
      "meta2": {
        "type": "CSS|XPATH",
        "path": "css|xpath selector"
      }
    }
  }
]

Elements of Custom Configuration:

Filters (for)

The for element is mandatory. It specifies the pages or sub-items to which the settings apply.
It holds matching values for either URLs (pages) or types (sub-items). Each value can be a regular expression.

Filter Examples:

To select all the pages on the site, use the following filter.

"for": {
  "urls": [
    ".*"
  ]
}

To select only the HTML pages, use the following filter.

"for": {
  "urls": [
    "\\.html"
  ]
}
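
The same element can also target sub-items instead of pages. As a minimal sketch, assuming a sub-item type named answer has been defined under subItems elsewhere in the configuration, the following filter applies the settings to those sub-items only.

"for": {
  "types": [
    "answer"
  ]
}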

Exclusion

Use the exclude element to remove one or more parts of the page body.

Example: To index only the title, the question, and the answers, exclude the top bar, the header, the top advertisement (the #herobox element), and the sidebar. Define the exclude element with the corresponding selectors, as shown below.

"exclude": [
  {
    "type": "CSS",
    "path": "div.topbar"
  },
  {
    "type": "CSS",
    "path": "#header"
  },
  {
    "type": "CSS",
    "path": "#herobox"
  },
  {
    "type": "CSS",
    "path": "#sidebar"
  }
]
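
The schema also allows XPATH selectors for exclusions. As a sketch, assuming the sidebar is the element with the ID sidebar, as in the example above, the following entry should select the same element as the last CSS entry.

{
  "type": "XPATH",
  "path": "//*[@id='sidebar']"
}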

Metadata

The metadata element defines how to retrieve specific metadata from the current page. It’s a map of selectors, each key representing the metadata name. Each metadata field can have the following properties:

isBoolean: When this property is set to true, the selector returns true if the page contains a matching element, and false otherwise.
isAbsolute: By default, selectors in sub-item settings are evaluated against the sub-item body. When this property is set to true, the selector is evaluated against the parent page instead of the current element. Use this property only in sub-item settings, to retrieve metadata that exists only on the parent page.

Example: Use the following element to retrieve this metadata for a question:

The number of votes for the question.
The date and time the question was asked.
Whether there is at least one answer to the question.

"metadata": {
  "questionVotes": {
    "type": "XPATH",
    "path": "//*[@id='question']//div[@class='vote']/span/text()"
  },
  "questionAskedDate": {
    "type": "CSS",
    "path": "#question div.user-action-time > span::attr(title)"
  },
  "questionHasAnswer": {
    "type": "CSS",
    "path": "div.answer",
    "isBoolean": true
  }
}
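
The example above does not use isAbsolute. As a hedged sketch, the following sub-item rule collects one value from the answer itself and one from the parent page. It assumes that an answer sub-item type is defined. The metadata names and selectors (answerDate, questionTitle, #question-header) are hypothetical and depend on the markup of the site.

{
  "for": {
    "types": [
      "answer"
    ]
  },
  "metadata": {
    "answerDate": {
      "type": "CSS",
      "path": "div.user-action-time > span::attr(title)"
    },
    "questionTitle": {
      "type": "XPATH",
      "path": "//*[@id='question-header']//h1/text()",
      "isAbsolute": true
    }
  }
}

Here answerDate is read from the sub-item body, while questionTitle is read from the parent page because isAbsolute is set to true.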

SubItems

Sub-items let you retrieve multiple source items from a single web page. The subItems element defines how to retrieve sub-items from the web page.

Example: In a Q&A website, a page may contain a question, answers, and comments. You could define answer and comment sub-items, while the main item would contain only the question.

"subItems": {
  "answer": {
    "type": "CSS",
    "path": "#answers div.answer"
  }
}

Note: The full page is still indexed (taking into account applicable web scraping tasks) and becomes a source item. For example, suppose a Q&A page contains five answers to a question and answers are defined as sub-items. The index will then contain six items for that page: one for the page itself and five for the answer sub-items.
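
Several sub-item types can be declared side by side. As a sketch, assuming comments on the page match the hypothetical selector div.comment, the following defines both sub-item types mentioned in the Q&A example.

"subItems": {
  "answer": {
    "type": "CSS",
    "path": "#answers div.answer"
  },
  "comment": {
    "type": "CSS",
    "path": "div.comment"
  }
}

By the counting described in the note above, a page with five answers and two comments would presumably yield eight items: one for the page, five answer sub-items, and two comment sub-items.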
