Smart Document Crawler
Within Smart Document Access, you can configure the crawler to index the content of a website.
The crawler runs in two phases: link navigation, which maps the site structure without extracting content, and content extraction, which generates indexed PDF documents.
Activation and structure creation
In an existing subfolder, select the "Configure Smart Crawler" option.
Enable the "Enable on subfolder" toggle.
Select "Add configuration".
Assign a name to the configuration and select "Save".
At this stage you can also import a previously downloaded configuration or download an existing one.
Setting up the Crawler
Selecting "Edit fields" opens the Crawler configuration.
General
Enter the starting URL of the site in the "Address" field.
Add other URLs in "Additional pages" if necessary.
Navigation
Enable the "Exclude off-domain sites" option to limit navigation to links within the main domain.
Enable the "Include PDF, doc, docx documents" option to also navigate documents attached to web pages.
In the "Addresses to include" field, enter the list of pages to navigate.
In the "Addresses to exclude" field, enter the list of pages to avoid.
You can use regular expressions (regex) so that a single entry matches multiple pages.
Set the maximum depth and the maximum number of pages to navigate.
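The navigation settings above can be pictured with a short Python sketch. It is only an illustration of how such rules might combine, not the Crawler's actual implementation. The function names, the in-memory `links` dictionary standing in for a real site, the exact-host domain check, and the choice to treat an empty include list as "allow everything" are all assumptions.

```python
import re
from collections import deque
from urllib.parse import urlparse


def should_navigate(url, start_url, include, exclude, exclude_off_domain=True):
    """Return True if url passes the domain, exclude, and include checks."""
    # Assumption: "off-domain" means a different host than the starting URL.
    if exclude_off_domain and urlparse(url).netloc != urlparse(start_url).netloc:
        return False
    if any(re.search(pattern, url) for pattern in exclude):
        return False
    # Assumption: an empty include list allows every remaining URL.
    return not include or any(re.search(pattern, url) for pattern in include)


def navigate(start_url, links, include, exclude, max_depth, max_pages):
    """Breadth-first walk over links, a dict mapping a URL to the URLs it links to."""
    seen = [start_url]
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not follow links beyond the maximum depth
        for target in links.get(url, []):
            if len(seen) >= max_pages:
                return seen  # page limit reached
            if target not in seen and should_navigate(target, start_url, include, exclude):
                seen.append(target)
                queue.append((target, depth + 1))
    return seen
```

In this sketch, an include pattern such as `/docs/` keeps navigation inside a documentation area, while an exclude pattern such as `/blog/` skips an entire branch of the site.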
Extraction
In the "Extraction" section, you configure the rules the Crawler will follow to transform web pages into documentation.
In the "Filter addresses to include" field, enter the list of pages to extract and convert into documentation.
The pages to extract may coincide with the pages to navigate, but they do not have to.
In the "Filter addresses to exclude" field, enter the list of pages not to extract.
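The difference between navigated and extracted pages can be sketched as a second filter applied to the navigation result. This is a hypothetical illustration: the function name and the sample URLs are assumptions, and the Crawler's real matching rules may differ.

```python
import re


def select_for_extraction(navigated, include, exclude):
    """Keep navigated URLs that match an include pattern and no exclude pattern."""
    return [
        url
        for url in navigated
        if any(re.search(pattern, url) for pattern in include)
        and not any(re.search(pattern, url) for pattern in exclude)
    ]


# Hypothetical result of the navigation phase.
navigated = [
    "https://example.com/",
    "https://example.com/docs/",
    "https://example.com/docs/install",
    "https://example.com/docs/changelog",
]

# Extract only pages below /docs/, except the changelog.
extracted = select_for_extraction(navigated, [r"/docs/.+"], [r"/changelog$"])
```

Here the home page and the documentation index are navigated so that their links can be discovered, but only `https://example.com/docs/install` is selected for conversion.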
In the "Remove elements" field, add preset selectors for the elements to strip from each page before extraction.
You can also specify elements to remove manually by providing a CSS, XPath, or Puppeteer-compatible selector.
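To show the effect of removing elements before extraction, here is a minimal Python sketch. It is not how the Crawler processes pages: it uses the standard library's `xml.etree.ElementTree`, which requires well-formed markup and supports only a subset of XPath (no CSS or Puppeteer selectors). The function name and the sample page are assumptions.

```python
import xml.etree.ElementTree as ET


def remove_elements(xhtml, xpaths):
    """Return the markup with every element matched by the XPath expressions removed."""
    root = ET.fromstring(xhtml)
    for xpath in xpaths:
        # ElementTree has no parent pointers, so build a child -> parent map.
        parents = {child: parent for parent in root.iter() for child in parent}
        for element in root.findall(xpath):
            parent = parents.get(element)
            if parent is not None:
                # Note: this also drops any text that directly follows the element.
                parent.remove(element)
    return ET.tostring(root, encoding="unicode")


page = (
    "<html><body>"
    "<nav>Menu</nav>"
    "<div class='cookie-banner'>Accept</div>"
    "<main><p>Content</p></main>"
    "</body></html>"
)
cleaned = remove_elements(page, [".//nav", ".//div[@class='cookie-banner']"])
```

After the call, `cleaned` still contains the main content, but the navigation menu and the cookie banner are gone, which is the kind of clutter you would typically keep out of the generated documents.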
Scheduling
Select "Enable scheduling" and set the crawler execution frequency (weekly or monthly).
Then configure the day and start time.
You can select multiple days of the week or month to run the crawler repeatedly.
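As a rough illustration of a weekly schedule with several days, the following Python sketch computes the next execution time. The function name and parameters are hypothetical, and weekdays follow Python's convention (Monday is 0, Sunday is 6); the product's own scheduler may behave differently, for example with respect to time zones.

```python
from datetime import datetime, time, timedelta


def next_weekly_run(now, weekdays, start):
    """Return the first datetime after now that falls on one of weekdays at start."""
    # Offsets 0 to 7 cover today and the same weekday next week.
    for offset in range(8):
        candidate = datetime.combine(now.date() + timedelta(days=offset), start)
        if candidate.weekday() in weekdays and candidate > now:
            return candidate
    raise ValueError("weekdays must contain at least one value from 0 to 6")


# Wednesday 3 January 2024, 10:00; schedule: Mondays and Thursdays at 02:30.
upcoming = next_weekly_run(datetime(2024, 1, 3, 10, 0), {0, 3}, time(2, 30))
```

With these inputs, `upcoming` is Thursday 4 January 2024 at 02:30. A monthly schedule could be sketched the same way by comparing the day of the month instead of the weekday.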
Viewing documents
To verify the PDF documents generated by the crawler, go back to the subfolder in Smart Document Access.
The list of indexed documents will be available in the section dedicated to the configured subfolder.
Each document lists the URL it was generated from as its "Source file".