Smart Document Crawler

Within Smart Document Access, you can configure the crawler to index the content of a website.

The crawler operates in two distinct phases: first, a link navigation phase that maps the site structure without extracting content; then, a content extraction phase that generates indexed PDF documents.

Activation and structure creation

In an existing subfolder, select the "Configure Smart Crawler" option.

Enable the "Enable on subfolder" toggle.

Select "Add configuration".

Assign a name to the configuration and select "Save".

Note: At this stage, you can import previously downloaded configurations or download existing ones.

Setting up the Crawler

Selecting "Edit fields" opens the Crawler configuration.

General

Enter the starting URL of the site in the "Address" field.

Add other URLs in "Additional pages" if necessary.

Navigation

Enable the "Exclude off-domain sites" option to limit navigation to links within the main domain.

Enable the "Include PDF, doc, docx documents" option to also crawl documents attached to web pages.

In the "Addresses to include" field, enter the list of pages to navigate.

In the "Addresses to exclude" field, enter the list of pages to avoid.

Note: You can use regular expressions (regex) to specify multiple pages.
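
As an illustration of how include and exclude patterns can work together, here is a minimal Python sketch. It is not the crawler's actual implementation: the function name `is_allowed`, the example patterns, and the matching rules (substring search, an empty include list meaning "everything", and exclusion taking precedence over inclusion) are all assumptions made for the example.

```python
import re

def is_allowed(url, include_patterns, exclude_patterns):
    """Return True if the URL passes the include/exclude filters.

    Assumed rules (hypothetical, not confirmed by the product):
    - an empty include list allows every URL;
    - a URL matching any exclude pattern is always rejected.
    """
    included = not include_patterns or any(
        re.search(pattern, url) for pattern in include_patterns
    )
    excluded = any(re.search(pattern, url) for pattern in exclude_patterns)
    return included and not excluded

# Hypothetical patterns for a site hosted at example.com.
include = [r"^https://example\.com/docs/.*"]
exclude = [r"/docs/archive/", r"\.zip$"]

urls = [
    "https://example.com/docs/intro",
    "https://example.com/docs/archive/2019",
    "https://example.com/blog/post",
    "https://example.com/docs/files/manual.zip",
]

allowed = [url for url in urls if is_allowed(url, include, exclude)]
print(allowed)
```

Testing patterns against a handful of known URLs in this way, before saving them in the configuration, can help catch an expression that is too broad or too narrow.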

Set the maximum depth and the maximum number of pages to navigate.
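
To show how the depth limit, the page limit, and the off-domain restriction interact, the following Python sketch walks a small in-memory link graph breadth-first. Everything in it is an assumption for illustration: the `LINKS` graph stands in for real page fetches, the start page is counted as depth 0, and "same domain" is checked by comparing exact host names. The actual crawler may count depth or compare domains differently.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph: each page maps to the links found on it.
LINKS = {
    "https://example.com/": [
        "https://example.com/a",
        "https://example.com/b",
        "https://other.org/x",
    ],
    "https://example.com/a": ["https://example.com/a/1"],
    "https://example.com/b": [],
    "https://example.com/a/1": ["https://example.com/a/1/deep"],
}

def navigate(start_url, max_depth, max_pages, same_domain_only=True):
    """Breadth-first walk that stops at max_depth and max_pages."""
    start_host = urlparse(start_url).netloc
    visited = []                      # pages in the order they were navigated
    seen = {start_url}                # avoids queueing the same URL twice
    queue = deque([(start_url, 0)])   # (url, depth); the start page is depth 0
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue                  # do not follow links beyond the limit
        for link in LINKS.get(url, []):
            if link in seen:
                continue
            if same_domain_only and urlparse(link).netloc != start_host:
                continue              # skip off-domain links
            seen.add(link)
            queue.append((link, depth + 1))
    return visited

print(navigate("https://example.com/", max_depth=1, max_pages=10))
print(navigate("https://example.com/", max_depth=3, max_pages=2))
```

In this sketch, a low maximum depth keeps the walk close to the starting URL, while the page limit caps the total work regardless of how deep the site is.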

Extraction

In the "Extraction" section, you configure the rules the Crawler will follow to transform web pages into documentation.

In the "Filter addresses to include" field, enter the list of pages to extract and convert into documentation.

Note: The pages to extract may coincide with the pages to navigate.

In the "Filter addresses to exclude" field, enter the list of pages not to extract.

Add preset selectors in the "Remove elements" field.

Note: Elements to remove before extraction can also be specified manually, by providing a CSS, XPath, or Puppeteer-compatible selector.
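
The effect of removing elements before extraction can be sketched in Python with the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset. This is only an illustration under strong assumptions: the sample page is well-formed markup (real HTML often is not), the selectors and class names are hypothetical, and the crawler evaluates selectors with its own engine rather than with this code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
PAGE = """<html><body>
<nav class="menu">Home</nav>
<div id="content"><p>Useful text.</p></div>
<div class="cookie-banner">Accept cookies</div>
<footer>Copyright</footer>
</body></html>"""

def remove_elements(markup, xpaths):
    """Parse the markup and drop every element matched by the given XPaths."""
    root = ET.fromstring(markup)
    for xpath in xpaths:
        # ElementTree has no parent pointers, so build a child-to-parent map.
        parent_of = {child: parent for parent in root.iter() for child in parent}
        for element in root.findall(xpath):
            parent = parent_of.get(element)
            if parent is not None:    # the root itself cannot be removed
                parent.remove(element)
    return root

cleaned = remove_elements(
    PAGE,
    [".//nav", ".//div[@class='cookie-banner']", ".//footer"],
)
# Collapse whitespace to see what text would remain for extraction.
remaining_text = " ".join("".join(cleaned.itertext()).split())
print(remaining_text)
```

Navigation menus, cookie banners, and footers are typical candidates for removal, since they repeat on every page and add noise to the generated documents.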

Scheduling

Select "Enable scheduling" and set how often the crawler runs (weekly or monthly). Then configure the day and start time. You can select multiple days of the week or month to run the crawler repeatedly.
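
To make the weekly case concrete, the sketch below computes the next run time for a schedule with several weekdays and one start time. It is a hypothetical model, not the product's scheduler: the function name `next_weekly_run`, the weekday numbering (Monday = 0, as in Python's `datetime`), and the use of naive datetimes without time zones are all assumptions.

```python
from datetime import datetime, time, timedelta

def next_weekly_run(now, weekdays, start):
    """Return the first run strictly after `now`.

    weekdays: set of ints, Monday=0 ... Sunday=6 (assumed numbering).
    start: time of day at which the crawler starts.
    """
    # Offsets 0..7 cover the case where today is the only scheduled day
    # and today's start time has already passed.
    for offset in range(8):
        candidate_date = (now + timedelta(days=offset)).date()
        candidate = datetime.combine(candidate_date, start)
        if candidate_date.weekday() in weekdays and candidate > now:
            return candidate
    raise ValueError("weekdays must contain at least one value in 0..6")

# Hypothetical schedule: Mondays and Thursdays at 02:30.
now = datetime(2024, 1, 3, 10, 0)   # a Wednesday
print(next_weekly_run(now, {0, 3}, time(2, 30)))
```

A monthly schedule could be modelled the same way by comparing the day of the month instead of the weekday.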

Viewing documents

To verify the PDF documents generated by the crawler, go back to the subfolder in Smart Document Access.

The list of indexed documents will be available in the section dedicated to the configured subfolder.

Each document lists the URL it was generated from as its "Source file".
