Version: v2

AI-enabled data extraction

State-of-the-art AI-enabled data extraction

AI extractor

ScrapingAnt provides AI-enabled data extraction: for every extraction request, ScrapingAnt extracts the data from the page into a structured JSON object using our AI technology.

The only thing you need to do is specify the data you want to extract from the page. The extraction input is free-form text that describes the data to extract.

While processing, ScrapingAnt's AI model will convert the free-form input parameters into camelCase JSON property names and extract the data from the page into a JSON object with the same structure.
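To get a feel for the property names to expect, the naming step can be approximated locally. The sketch below is only an illustration: the real conversion is performed by the AI model and may differ, and `to_camel_case` and `expected_keys` are hypothetical helpers, not part of the ScrapingAnt API. It assumes a flat, comma-separated description.

```python
import re


def to_camel_case(description: str) -> str:
    """Approximate a camelCase name for one free-form property description."""
    words = re.findall(r"[A-Za-z0-9]+", description)
    if not words:
        return ""
    first, *rest = words
    # First word stays lowercase; each following word is capitalized.
    return first.lower() + "".join(w[:1].upper() + w[1:].lower() for w in rest)


def expected_keys(extract_properties: str) -> list[str]:
    """Guess the JSON keys for a flat, comma-separated property list."""
    return [to_camel_case(part) for part in extract_properties.split(",") if part.strip()]


print(expected_keys("product title, price, full description"))
```

Treat the result as a guess to check against the actual response, not as a guarantee of the returned key names.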

Our state-of-the-art technology allows extracting the needed data from any web page, even if the page structure changes. It also allows extracting data from pages with dynamic content, such as Single-Page Applications (SPAs), using ScrapingAnt's cloud browser.

How AI extractor works

AI extractor uses the same web scraping technology as the general endpoint, but with additional AI processing. It uses the Markdown transformation to extract the text from the page and then processes it using the AI model.

By using a Markdown-transformed version of the page, AI extractor can handle URLs, images, and other non-text content. It also allows the text content to be processed more efficiently.

AI extractor parameters

AI extractor uses a separate AI-enabled endpoint:

https://api.scrapingant.com/v2/extract

It uses the same request structure as the general endpoint, with an additional extract_properties parameter: free-form text that describes the data to extract.

The basic request to the AI extractor requires 3 parameters:

  • url - URL of the page to extract data from
  • x-api-key - ScrapingAnt API key
  • extract_properties - free-form text that describes the data you want to extract

In the common case, we expect extract_properties to be a comma-separated list of the data you want to extract. For example:

product title, price, full description

Still, you can extend your request with additional details: the input is processed by the AI model as well, so it can handle more sophisticated expressions. For example:

product title, price(number), full description, reviews(list: review title, review content)
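If the property list is assembled in code, a small helper can keep the expression consistent. This is a minimal sketch, assuming the notation shown in the example above (a type hint in parentheses, `list:` for repeated items); since the input is free-form, this is one convention rather than a required syntax, and `describe_properties` is a hypothetical helper name.

```python
def describe_properties(fields) -> str:
    """Build an extract_properties string from a simple field specification.

    Each item is either a plain name, or a (name, hint) pair where the hint
    is a type such as "number" or a list of sub-field names.
    """
    parts = []
    for field in fields:
        if isinstance(field, str):
            parts.append(field)
            continue
        name, hint = field
        if isinstance(hint, (list, tuple)):
            # Repeated items: describe the sub-fields of each list entry.
            hint = "list: " + ", ".join(hint)
        parts.append(f"{name}({hint})")
    return ", ".join(parts)


spec = [
    "product title",
    ("price", "number"),
    "full description",
    ("reviews", ["review title", "review content"]),
]
print(describe_properties(spec))
```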

Like all other API parameters, extract_properties should be URL-encoded and sent to the API as a query parameter.
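As an illustration of the encoding step, the query string can be built with Python's standard library. This is a sketch rather than an official client; `build_extract_url` is a hypothetical helper, and the three parameter names are the ones listed above.

```python
from urllib.parse import quote, urlencode

EXTRACT_ENDPOINT = "https://api.scrapingant.com/v2/extract"


def build_extract_url(page_url: str, extract_properties: str, api_key: str) -> str:
    """Return the extract endpoint URL with all parameters percent-encoded."""
    params = {
        "url": page_url,
        "extract_properties": extract_properties,
        "x-api-key": api_key,
    }
    # quote_via=quote encodes spaces as %20 instead of the default "+".
    return f"{EXTRACT_ENDPOINT}?{urlencode(params, quote_via=quote)}"


print(build_extract_url("https://example.com", "title, content", "<YOUR-API-KEY>"))
```

Commas, spaces, parentheses, and colons in extract_properties are all percent-encoded this way, so more elaborate expressions survive the trip through the query string.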

AI extractor request example

The simplest request that extracts the title and content of the web page:

curl --request GET \
--url 'https://api.scrapingant.com/v2/extract?url=https%3A%2F%2Fexample.com&extract_properties=title%2C%20content&x-api-key=<YOUR-API-KEY>'

This request uses the following extract_properties parameter value:

title, content

The output of this request is the following JSON object:

{
  "title": "Example Domain",
  "content": "This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission."
}

The AI extractor derives the JSON property names from your free-form description and returns the extracted data in a JSON object with the same structure.
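The same request can be sketched in Python using only the standard library. This is an illustrative example, not an official client: the `extract` function, its `opener` parameter (there so the HTTP call can be swapped out in tests), and the 60-second timeout are assumptions made for this sketch, and error handling is omitted.

```python
import json
import urllib.request
from urllib.parse import quote, urlencode

EXTRACT_ENDPOINT = "https://api.scrapingant.com/v2/extract"


def extract(page_url, extract_properties, api_key,
            opener=urllib.request.urlopen, timeout=60):
    """Call the AI extractor endpoint and return the parsed JSON object."""
    query = urlencode(
        {
            "url": page_url,
            "extract_properties": extract_properties,
            "x-api-key": api_key,
        },
        quote_via=quote,  # encode spaces as %20, as in the curl example
    )
    # The response body is the JSON object with the extracted properties.
    with opener(f"{EXTRACT_ENDPOINT}?{query}", timeout=timeout) as response:
        return json.load(response)
```

Calling `extract("https://example.com", "title, content", "<YOUR-API-KEY>")` with a valid key issues the same GET request as the curl command and returns the result as a Python dictionary.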

AI extractor cost

The AI extractor cost is calculated based on the number of characters in the Markdown version of the page and the number of output characters.

Learn more here: AI extractor cost

AI extractor temporary limitations

  • AI extractor works only with the Markdown extracted from the page's HTML. It doesn't work with styles, JS, or HTML tags.
  • AI extractor is multi-language, but it works best when the input parameters are described in English, so that it can produce proper JSON property names.
  • Nested JSON output structures are supported, but they require more sophisticated input parameters.

Check out the AI extractor best practices for more information.