Manage Content of a Web Page

After you add web content to the application, keep it up to date, because website content can change over time. You can schedule periodic web crawling and edit the crawler configuration so that the indexed content stays in sync with the data on the website.

Schedule Web Crawling

The scheduler lets you set up a job that re-crawls the configured website periodically. To schedule a web crawling job, follow these steps:

On the Indices page, click Content on the left pane.
On the Content list view page, select the respective source from the list.
On the source dialog box, click the Configuration tab.
On the Configuration tab, turn on the Schedule toggle.
Set the Date, Time, and Frequency.
To crawl all the domains, turn on the Crawl Everything toggle.
To crawl only selected domains, turn off the Crawl Everything toggle.
When you turn off the Crawl Everything toggle, the Allow List toggle turns on automatically. Enter the URLs to allow in the Allow URLs field.
To block URLs instead, turn off the Allow List toggle.
When you turn off the Allow List toggle, the Block List toggle turns on automatically. Enter the URLs to block in the Block URLs field.
Select Crawl Settings:

JavaScript-rendered
Use Cookies
Respect robots.txt

Click Save.
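
The settings above can be pictured as a small data model. The following Python sketch is illustrative only and is not the product's implementation or API: the class name `CrawlSchedule`, its fields, the frequency names, and the prefix-based URL matching are all assumptions made for this example. It shows how a start date, a frequency, the mutually exclusive allow and block lists, and the Respect robots.txt setting might combine to decide when a crawl runs and which URLs it visits.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional
from urllib.robotparser import RobotFileParser

# Hypothetical frequency names; the product's actual options may differ.
FREQUENCY_INTERVALS = {
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}


@dataclass
class CrawlSchedule:
    """Hypothetical model of the Schedule settings (not the product's API)."""
    start: datetime                  # Date and Time fields
    frequency: str = "daily"         # Frequency field
    url_mode: str = "everything"     # "everything", "allow", or "block"
    urls: List[str] = field(default_factory=list)  # Allow URLs or Block URLs
    respect_robots_txt: bool = True  # Respect robots.txt crawl setting

    def next_run(self, now: datetime) -> datetime:
        """Return the first scheduled run at or after `now`."""
        if now <= self.start:
            return self.start
        interval = FREQUENCY_INTERVALS[self.frequency]
        # Ceiling division: floor-divide the negative elapsed time, then negate.
        periods = -((self.start - now) // interval)
        return self.start + periods * interval

    def should_crawl(self, url: str,
                     robots: Optional[RobotFileParser] = None) -> bool:
        """Apply robots.txt rules first, then the allow or block list."""
        if self.respect_robots_txt and robots is not None:
            if not robots.can_fetch("*", url):
                return False
        # Assumption: list entries are matched as URL prefixes.
        matches = any(url.startswith(prefix) for prefix in self.urls)
        if self.url_mode == "allow":
            return matches
        if self.url_mode == "block":
            return not matches
        return True  # Crawl Everything


schedule = CrawlSchedule(
    start=datetime(2024, 1, 1, 9, 0),
    frequency="daily",
    url_mode="allow",
    urls=["https://example.com/docs/"],
)

# Parse robots.txt rules from memory so the example needs no network access.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /docs/private/"])

print(schedule.next_run(datetime(2024, 1, 3, 12, 0)))
print(schedule.should_crawl("https://example.com/docs/intro", robots))
print(schedule.should_crawl("https://example.com/docs/private/keys", robots))
print(schedule.should_crawl("https://example.com/blog/post", robots))
```

In this sketch only one URL list is active at a time, mirroring how the Allow List and Block List toggles switch each other off in the steps above. How the product actually matches URLs (exact match, prefix, or pattern) is not stated here, so treat the prefix matching as a placeholder.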

Edit Crawler Configuration

To edit a web crawling source, follow these steps:

On the Indices page, click Content on the left pane.
On the Content list view page, select the respective source from the list.
On the source dialog box, make the required changes.
Click Save.
