Smart Document Crawler

Within Smart Document Access, you can configure the crawler to index the content of a website.

The crawler operates in two distinct phases: a first link navigation phase, which maps the site structure without extracting content, and a second content extraction phase, which generates indexed PDF documents.

Activation and structure creation

In an existing subfolder, select the "Configure Smart Crawler" option.

Enable the "Enable on subfolder" toggle.

Select "Add configuration".

Assign a name to the configuration and select "Save".

Note: At this stage you can import previously downloaded configurations or download existing ones.

Setting up the Crawler

Select "Edit fields" to open the crawler configuration.

General

Enter the starting URL of the site in the "Address" field.

Add other URLs in "Additional pages" if necessary.

Enable the "Exclude off-domain sites" option to limit navigation to links within the main domain.

Enable the "Include PDF, doc, docx documents" option to also navigate documents attached to web pages.

In the "Addresses to include" field, enter the list of pages to navigate.

In the "Addresses to exclude" field, enter the list of pages to avoid.

Note: You can use regular expressions (regex) to match multiple pages.
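
The navigation options above can be pictured as a URL filter. The following is only a sketch: the function name, the exact-hostname comparison, the rule that exclusions take priority over inclusions, and the use of Python's regex dialect are all assumptions made for illustration, not the crawler's documented behavior.

```python
import re
from urllib.parse import urlparse


def should_navigate(url, start_url, include=(), exclude=(), exclude_off_domain=True):
    """Hypothetical filter deciding whether a link would be followed.

    Assumptions (not confirmed by the product documentation):
    - "main domain" means the exact hostname of the starting URL;
    - an exclude match always wins over an include match;
    - an empty include list means "every page is allowed".
    """
    if exclude_off_domain and urlparse(url).hostname != urlparse(start_url).hostname:
        return False
    if any(re.search(pattern, url) for pattern in exclude):
        return False
    if include and not any(re.search(pattern, url) for pattern in include):
        return False
    return True


# Example patterns: navigate the documentation area but skip its archive.
include_patterns = [r"/docs/"]
exclude_patterns = [r"/docs/archive/"]
```

With these example patterns, a single expression such as `/docs/` covers every page under that path, which is the main reason to prefer a regex over listing pages one by one.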

Set the maximum depth and the maximum number of pages to navigate.
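
To get a feel for how the two limits interact, here is a minimal breadth-first sketch over an in-memory link map. It assumes that the starting page counts as depth 0 and as the first page toward the page limit, and that the limit is at least 1. The crawler's actual traversal order and counting rules are not described in this guide, so treat the names and semantics as hypothetical.

```python
from collections import deque


def navigate(links, start, max_depth, max_pages):
    """Breadth-first walk over `links` (page -> list of linked pages).

    Stops following links below `max_depth` and stops entirely once
    `max_pages` pages have been collected. Illustrative only.
    """
    visited = [start]
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links from pages at the depth limit
        for target in links.get(url, []):
            if target in seen:
                continue
            if len(visited) >= max_pages:
                return visited  # page budget exhausted
            seen.add(target)
            visited.append(target)
            queue.append((target, depth + 1))
    return visited


# A tiny hypothetical site structure.
site = {"/": ["/a", "/b"], "/a": ["/a/1"], "/b": ["/b/1"], "/a/1": ["/a/1/x"]}
```

In this sketch a low depth keeps the crawl near the starting page, while the page limit caps the total regardless of how deep the site goes.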

Extraction

In the "Extraction" section, configure the rules the crawler follows to convert web pages into documents.

In the "Filter addresses to include" field, enter the list of pages to extract and convert into documentation.

Note: The pages to extract can be the same as the pages to navigate.

In the "Filter addresses to exclude" field, enter the list of pages not to extract.
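
The relationship between the two phases can be sketched as a second filter applied to the pages found during navigation. The function below is hypothetical and assumes the same matching rules as a plain regex search, with exclusions taking priority. With no extraction filters, every navigated page is extracted, which is the case where the two sets coincide.

```python
import re


def select_for_extraction(navigated, include=(), exclude=()):
    """Pick, from the navigated pages, those that would become documents.

    Illustrative assumption: exclude patterns win, and an empty include
    list selects every navigated page.
    """
    selected = []
    for url in navigated:
        if any(re.search(pattern, url) for pattern in exclude):
            continue
        if include and not any(re.search(pattern, url) for pattern in include):
            continue
        selected.append(url)
    return selected


# Pages found in the navigation phase (example data).
navigated_pages = [
    "https://example.com/",
    "https://example.com/docs/",
    "https://example.com/docs/setup",
    "https://example.com/docs/changelog",
]
```

Here the index pages are navigated only to discover links, while a filter such as `/docs/.+` restricts extraction to the pages that hold actual content.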

Add preset selectors in the "Remove elements" field.

Note: You can also specify the elements to remove before extraction manually, by providing a CSS, XPath, or Puppeteer-compatible selector.
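
As an illustration of what removing elements before extraction means, the sketch below drops matched elements from a small, well-formed page and returns the remaining text. It uses Python's standard `xml.etree.ElementTree`, which supports only a limited XPath subset and requires well-formed markup. The crawler's own selector engine is different, and the function name, selectors, and sample page are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET


def strip_elements(markup, xpaths):
    """Remove every element matched by any XPath, then return the text left.

    Illustrative only: real web pages are rarely well-formed XML, and the
    crawler also accepts CSS and Puppeteer-compatible selectors.
    """
    root = ET.fromstring(markup)
    for xpath in xpaths:
        # Rebuild the child -> parent map on each pass because the tree changes.
        parents = {child: parent for parent in root.iter() for child in parent}
        for element in root.findall(xpath):
            parent = parents.get(element)
            if parent is not None:
                parent.remove(element)
    # Collapse whitespace in whatever text remains.
    return " ".join("".join(root.itertext()).split())


# Hypothetical page with a navigation menu and a cookie banner around the content.
page = (
    "<html><body><nav>Menu</nav><main><p>Hello</p></main>"
    "<div class='cookie-banner'>Accept</div></body></html>"
)
selectors = [".//nav", ".//div[@class='cookie-banner']"]
```

Removing menus, banners, and footers this way keeps repeated page chrome out of the generated documents, so the indexed text reflects only the page's main content.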

Scheduling

Select "Enable scheduling" and set how often the crawler runs (weekly or monthly). Then configure the day and start time. You can select multiple days of the week or month to run the crawler more than once per period.
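
To preview what a weekly schedule with several days produces, the following sketch lists the upcoming run times. It assumes weekdays are numbered 0 (Monday) to 6 (Sunday) as in Python's `datetime`, and that a run whose start time has already passed on the current day is skipped. The function name and these conventions are hypothetical and do not describe the product's scheduler.

```python
from datetime import datetime, time, timedelta


def next_weekly_runs(after, weekdays, start, count):
    """Return the next `count` run times strictly after `after`.

    `weekdays` is a set of integers, 0 = Monday ... 6 = Sunday.
    `start` is the time of day at which each run begins.
    """
    if not weekdays or not all(0 <= day <= 6 for day in weekdays):
        raise ValueError("weekdays must be a non-empty set of values from 0 to 6")
    runs = []
    day = after.date()
    while len(runs) < count:
        candidate = datetime.combine(day, start)
        if day.weekday() in weekdays and candidate > after:
            runs.append(candidate)
        day += timedelta(days=1)
    return runs


# Example: runs every Monday and Thursday at 09:00, evaluated on Monday 2024-01-01 at 10:00.
upcoming = next_weekly_runs(datetime(2024, 1, 1, 10, 0), {0, 3}, time(9, 0), 3)
```

A monthly schedule could be sketched the same way by checking `day.day` against a set of day-of-month numbers instead of `day.weekday()`.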

Viewing documents

To verify the PDF documents generated by the crawler, go back to the subfolder in Smart Document Access.

The list of indexed documents will be available in the section dedicated to the configured subfolder.

Each document lists the URL it was generated from as its "Source file".
