Web Data Extraction Techniques

Sequentum Enterprise makes it easy to extract data from most websites without requiring much prior knowledge about web data extraction techniques. However, you'll be able to build better web data extraction agents if you know some basic techniques. Some very difficult websites will require in-depth knowledge, but this user guide can help you gain more understanding and direct you to additional resources.

The following topics are important if you want to become proficient at web data extraction, but they are not necessarily a prerequisite for the successful use of Sequentum Enterprise on all websites. Click the links to learn more about each topic:

  • HTML Content - Web pages are driven by HTML, which is the basic language for building websites.

  • Dynamic Websites - Data extraction on dynamic websites can be challenging, so it helps to have a general understanding of how JavaScript works, since most dynamic websites use it.

  • XPath and Selection Techniques - Most web data extraction tools extract data from a website by selecting web elements on the web page. XPath is a language for specifying which elements to select.

  • Regular Expressions - XPath can select a web element such as a paragraph of text, but you may be interested in only a small part of the web element content. Regular Expressions is a language for extracting small bits of text from a larger text element.

HTML Content

HTML stands for HyperText Markup Language, the standard markup language for creating web pages. Content in an HTML document is marked up with tags that appear in angle brackets, such as <html>. Typically, these tags come in pairs, with one on each end of the content that they represent (such as <h1> and </h1>). The first tag in a pair is the start tag, and the second tag is the end tag (also known as the opening tag and the closing tag). Some tags that represent empty elements don't come in pairs, such as <img>.
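To see start tags, end tags, and empty elements in practice, here is a minimal sketch that uses Python's standard html.parser module to list the tags in a small document. The sample HTML and the TagLogger class name are made up for illustration; this is not how Sequentum Enterprise parses pages.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Records start tags, end tags, and text in the order they appear."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Ignore whitespace-only text between tags.
        if data.strip():
            self.events.append(("text", data.strip()))


# Hypothetical sample page: <h1> comes as a pair, <img> has no end tag.
SAMPLE_HTML = '<html><body><h1>Welcome</h1><img src="logo.png"></body></html>'

logger = TagLogger()
logger.feed(SAMPLE_HTML)
logger.close()

for kind, value in logger.events:
    print(kind, value)
```

In the printed list, h1 appears as both a start and an end event, while img appears only as a start event because it is an empty element.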

The purpose of a web browser is to read HTML documents and compose them into visible web pages. The browser does not display the HTML tags, but rather interprets the tags and displays content on the page that corresponds to that tag. HTML describes the structure of a web page semantically, with cues for presentation. This distinguishes it as a markup language rather than a programming language.

HTML elements are the building blocks of any website, including embedded images, objects, and interactive forms. HTML provides the structure for a page by denoting structural semantics for text such as headings, paragraphs, lists, links, quotes, and other items. It can also contain scripts written in languages such as JavaScript, which control the behavior of HTML web pages.

Sequentum Enterprise uses XPath to select specific HTML tags and then extract content from those tags. An HTML tag can contain both text and attributes. For example, an HTML tag that displays an image will contain an src attribute that specifies the URL of the image to display. Sequentum Enterprise can extract both tag text and tag attributes and may perform certain actions on the content it extracts. For example, it may extract the src attribute from an <img> HTML tag and then use the URL to download the image.
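The difference between tag text and tag attributes can be sketched in a few lines of Python. This example uses the standard xml.etree.ElementTree module, which supports only a limited subset of XPath and requires well-formed markup. The sample page, the PAGE_URL value, and the extract_image_urls helper are assumptions made for illustration and do not reflect Sequentum Enterprise's internal implementation.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical page address, used to resolve relative image URLs.
PAGE_URL = "https://example.com/products/"

# Hypothetical, well-formed sample markup.
SAMPLE = """
<html>
  <body>
    <h1>Products</h1>
    <img src="images/widget.png" alt="Widget" />
    <img src="/static/gadget.png" alt="Gadget" />
  </body>
</html>
"""

root = ET.fromstring(SAMPLE)


def extract_image_urls(root, base_url):
    """Return the absolute URL of every <img> element that has a src attribute."""
    urls = []
    for img in root.findall(".//img"):
        src = img.get("src")  # attribute value, not tag text
        if src:
            urls.append(urljoin(base_url, src))
    return urls


heading = root.find(".//h1").text  # tag text
image_urls = extract_image_urls(root, PAGE_URL)

print(heading)
for url in image_urls:
    print(url)
```

A downloader could then fetch each resolved URL, which corresponds to the "extract the src attribute, then download the image" step described above.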

There are many websites that have HTML tutorials. Here is one example:

Dynamic Websites

Using client-side scripts, a dynamic web page can continue to load more content after the initial content has loaded and the page elements are available to the user. The most common language for client-side scripting is JavaScript, and it may use AJAX (Asynchronous JavaScript and XML) to load additional content onto a web page asynchronously. It may also modify existing content on a web page, such as enabling or disabling content when you click on particular web elements.

To extract data correctly, Sequentum Enterprise needs to detect any dynamic changes on a web page. For example, if you want to extract additional data that AJAX loads onto a web page, you'll want to configure Sequentum Enterprise to wait for AJAX to finish loading the new content before it starts extracting it.

Sequentum Enterprise is excellent at the automatic detection of dynamic changes. However, sometimes JavaScript behaves unusually, and you may need to make adjustments to properly extract dynamic content. For example, Sequentum Enterprise can detect when JavaScript completes an AJAX load of dynamic content, but it cannot detect exactly when the JavaScript has finished displaying that content, so it simply waits for a few milliseconds. If the JavaScript takes an unusually long time to display the dynamic content, you may need to use the timeout feature of Sequentum Enterprise to add a short extra wait for the JavaScript to display the dynamic content (typically a few additional milliseconds).
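The general idea of waiting for dynamic content, with a timeout as a safety limit, can be sketched as a polling loop. This is only an illustration in plain Python: the wait_until function, the page dictionary, and the simulated AJAX load are all hypothetical, and the sketch does not represent the timeout feature of Sequentum Enterprise.

```python
import threading
import time


def wait_until(condition, timeout=2.0, poll_interval=0.05):
    """Poll condition() until it returns True or the timeout (seconds) expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll_interval)
    # One last check in case the content arrived during the final sleep.
    return condition()


# Hypothetical stand-in for a web page whose content arrives asynchronously.
page = {"content": None}


def simulate_ajax_load():
    page["content"] = "<div id='results'>42 items</div>"


# The "AJAX" content shows up 0.2 seconds after the page is first available.
threading.Timer(0.2, simulate_ajax_load).start()

loaded = wait_until(lambda: page["content"] is not None, timeout=2.0)
print("content loaded:", loaded)
```

If the content takes longer than the timeout to appear, wait_until returns False, which is the situation where you would lengthen the wait.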

Familiarity with JavaScript can make it much easier to configure a web data extraction agent to extract data from dynamic websites when Sequentum Enterprise is unable to configure the agent automatically. You can learn more from various JavaScript tutorials available on the web, such as:

XPath and Selection Techniques

Proper selection technique is a critical aspect of web data extraction. The most basic selection technique is to point-and-click on elements in the web browser panel, which is the easiest way to add commands to an agent. XPath is a common syntax for selecting elements in HTML and XML documents. Each time you click on an element in the web browser, Sequentum Enterprise works in the background to calculate the selection XPath.
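As a rough illustration of why precise selection matters, the following sketch compares a broad XPath with a narrower one. It uses Python's standard xml.etree.ElementTree module, which implements only a subset of XPath. The sample markup and class names are invented for this example, and the paths shown are not necessarily the ones Sequentum Enterprise would calculate.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup with two products and one advert.
SAMPLE = """
<html>
  <body>
    <div class="product">
      <span class="name">Widget</span>
      <span class="price">9.99</span>
    </div>
    <div class="product">
      <span class="name">Gadget</span>
      <span class="price">19.50</span>
    </div>
    <div class="ad"><span class="name">Sponsored</span></div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE)

# Broad path: every span with class "name", including the advert.
all_names = [e.text for e in root.findall(".//span[@class='name']")]

# Narrower path: only div elements with class "product", then
# relative paths to pick the name and price inside each one.
products = root.findall(".//div[@class='product']")
records = [
    {
        "name": p.find("span[@class='name']").text,
        "price": p.find("span[@class='price']").text,
    }
    for p in products
]

print(all_names)
print(records)
```

The broad path also picks up the advert's name, while the narrower path returns only the two products.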

Sequentum Enterprise has a variety of tools that help you create precise XPaths without needing to know XPath syntax at all. If you want finer control over element selection, it's worthwhile to learn some specifics about XPath syntax. Although this user guide doesn't include an XPath tutorial, there are many tutorials available on the web. We recommend this as a good place to learn more about XPath:

We also recommend this reference for common XPath patterns, which is popular among Selenium users:

Regular Expressions

With Regular Expressions, you can write expressions that look for specific character sequences within strings and then extract small text strings out of larger ones.

Sequentum Enterprise uses XPath to select web elements on a web page and then extracts content from those web elements. You may want only some parts of the extracted content, or you may want to transform it. For example, a single web element may contain the entire address of a company, but you may want to extract the content into separate elements such as a street address, city, zip code, and state. You can use Regular Expressions to split the address text into separate text strings.
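The address example can be sketched with Python's re module. The pattern below assumes a US-style address written as "street, city, ST 12345"; real addresses vary widely, so treat this as an illustration and not a general-purpose parser. The ADDRESS_PATTERN and split_address names and the sample address are made up for this example.

```python
import re

# Assumed layout: "<street>, <city>, <2-letter state> <5-digit zip or zip+4>"
ADDRESS_PATTERN = re.compile(
    r"^\s*(?P<street>[^,]+?)\s*,"
    r"\s*(?P<city>[^,]+?)\s*,"
    r"\s*(?P<state>[A-Z]{2})\s+"
    r"(?P<zip>\d{5}(?:-\d{4})?)\s*$"
)


def split_address(text):
    """Return a dict of address parts, or None if the text does not match."""
    match = ADDRESS_PATTERN.match(text)
    return match.groupdict() if match else None


# Hypothetical text extracted from a single web element.
parts = split_address("123 Main Street, Springfield, IL 62704")
print(parts)
```

Named groups keep each part labeled, so the street address, city, state, and zip code can be stored as separate fields. Text that does not fit the assumed layout returns None instead of a partial match.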

There are many tutorial websites that teach Regular Expressions. Here is one example:
