After you add content to the application, you need to keep it up to date, because website content is rarely static. You can schedule periodic web crawling and edit the crawler configuration so that the indexed content stays in sync with the data on the website.
Schedule Web Crawling
The scheduler lets you set up a job that re-crawls the configured website periodically. To schedule a web crawling job, follow these steps:
- On the Indices page, click Content on the left pane.
- On the Content list view page, select the respective source from the list.
- On the source dialog box, click the Configuration tab.
- On the Configuration tab, turn on the Schedule toggle.
- Set the Date, Time, and Frequency.
- To crawl all the domains, turn on the Crawl Everything toggle.
- To crawl only selected domains, turn off the Crawl Everything toggle.
- When you turn off the Crawl Everything toggle, the Allow List toggle turns on automatically. Enter the URLs to crawl in the Allow URLs field.
- To block URLs instead, turn off the Allow List toggle.
- When you turn off the Allow List toggle, the Block List toggle turns on automatically. Enter the URLs to exclude in the Block URLs field.
- Select the required Crawl Settings:
  - JavaScript-rendered
  - Use Cookies
  - Respect robots.txt
- Click Save.
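The way the Crawl Everything, Allow List, and Block List toggles interact with the Respect robots.txt setting can be sketched in code. This is only an illustrative model, not the product's implementation: the `CrawlScope` class, the `should_crawl` function, and the use of simple URL-prefix matching are all assumptions made for the example. The robots.txt check uses Python's standard `urllib.robotparser` module.

```python
from dataclasses import dataclass, field
from urllib.robotparser import RobotFileParser


@dataclass
class CrawlScope:
    """Hypothetical model of the crawl-scope toggles (not the product's API)."""
    crawl_everything: bool = True
    allow_list: bool = False
    allow_urls: list = field(default_factory=list)
    block_urls: list = field(default_factory=list)

    def in_scope(self, url: str) -> bool:
        # Crawl Everything on: every URL is in scope.
        if self.crawl_everything:
            return True
        # Allow List on: only URLs under a listed prefix are in scope.
        # Prefix matching is an assumption; the product may match differently.
        if self.allow_list:
            return any(url.startswith(prefix) for prefix in self.allow_urls)
        # Allow List off means Block List is on: skip listed prefixes only.
        return not any(url.startswith(prefix) for prefix in self.block_urls)


def should_crawl(url, scope, robots_lines=None, user_agent="*"):
    """Return True if the URL is in scope and, when robots.txt lines are
    supplied (modelling Respect robots.txt), permitted by those rules."""
    if not scope.in_scope(url):
        return False
    if robots_lines is None:
        return True
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse in-memory rules; no network access
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    allow_only_docs = CrawlScope(
        crawl_everything=False,
        allow_list=True,
        allow_urls=["https://example.com/docs/"],
    )
    print(should_crawl("https://example.com/docs/intro", allow_only_docs))
    print(should_crawl("https://example.com/blog/post", allow_only_docs))
```

In this sketch, the allow list and block list are mutually exclusive, which mirrors the toggle behavior described in the steps above: turning one off turns the other on.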
Edit Crawler Configuration
To edit a web crawling source, follow these steps:
- On the Indices page, click Content on the left pane.
- On the Content list view page, select the respective source from the list.
- On the source dialog box, make the required changes.
- Click Save.