Sequentum Enterprise makes it easy to extract data from most websites without requiring much prior knowledge about web data extraction techniques. However, you'll be able to build better web data extraction agents if you know some basic techniques. Some very difficult websites will require in-depth knowledge, but this user guide can help you gain more understanding and direct you to additional resources.
The following topics are important if you want to become proficient at web data extraction, but they are not necessarily a prerequisite for the successful use of Sequentum Enterprise on all websites. Click the links to learn more about each topic:
- HTML Content - Web pages are driven by HTML, which is the basic language for building websites.
- XPath and Selection Techniques - Most web data extraction tools extract data from a website by selecting web elements on the web page. XPath is a query language for selecting those elements.
- Regular Expressions - XPath can select a web element such as a paragraph of text, but you may be interested in only a small part of the web element content. Regular Expressions are a pattern language for extracting small bits of text from a larger text element.
HTML Content
HTML stands for HyperText Markup Language - the standard markup language for creating web pages. An HTML document consists of content marked up with tags, which are written in angle brackets, such as <html>. Typically, tags come in pairs, with one on each end of the content that they represent (such as <h1> and </h1>). The first tag in a pair is the start tag, and the second tag is the end tag (also known as the opening tag and the closing tag). Some tags that represent empty elements don't come in pairs, such as <img>.
The purpose of a web browser is to read HTML documents and compose them into visible web pages. The browser does not display the HTML tags, but rather interprets the tags and displays content on the page that corresponds to that tag. HTML describes the structure of a web page semantically, with cues for presentation. This distinguishes it as a markup language rather than a programming language.
Sequentum Enterprise uses XPath to select specific HTML tags and then extract content from those tags. An HTML tag can contain both text and attributes. For example, an HTML tag that displays an image will contain an src attribute that specifies the URL of the image to display. Sequentum Enterprise can extract both tag text and tag attributes and may perform certain actions on the content it extracts. For example, it may extract the src attribute from an <img> HTML tag and then use the URL to download the image.
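To make the difference between tag text and tag attributes concrete, here is a minimal sketch that uses Python's standard html.parser module. It is only an illustration of the concept, not how Sequentum Enterprise works internally. The TagCollector class, the sample HTML, and the example.com URL are all made up for this example.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects <h1> text and <img> src attributes (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.headings = []     # text found between <h1> and </h1>
        self.image_urls = []   # values of the src attribute on <img> tags
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
        elif tag == "img":
            # attrs is a list of (name, value) pairs for this tag
            src = dict(attrs).get("src")
            if src:
                self.image_urls.append(src)

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        # Keep only the text that appears inside an <h1> element
        if self._in_h1 and data.strip():
            self.headings.append(data.strip())


# Made-up sample page: one paired tag (<h1>) and one empty element (<img>)
SAMPLE_HTML = """
<html>
  <body>
    <h1>Acme Widgets</h1>
    <img src="https://example.com/images/widget.png" alt="A widget">
  </body>
</html>
"""

collector = TagCollector()
collector.feed(SAMPLE_HTML)
print(collector.headings)
print(collector.image_urls)
```

The heading text comes from the content between a start tag and an end tag, while the image URL comes from an attribute on a single tag that has no end tag.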
There are many websites that have HTML tutorials. Here is one example:
To extract data correctly, Sequentum Enterprise needs to detect any dynamic changes on a web page. For example, if you want to extract additional data that AJAX loads onto a web page, you'll want to configure Sequentum Enterprise to wait for the AJAX request to finish loading the new content before it starts extracting that content.
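The general idea of waiting for dynamic content can be sketched as a polling loop: check for the content repeatedly and stop when it appears or when a timeout expires. This is a generic illustration in Python, not a description of how Sequentum Enterprise detects page changes. The wait_for function and the FakePage class are hypothetical names invented for this sketch, and the delay simply simulates an AJAX call.

```python
import time


def wait_for(condition, timeout=5.0, interval=0.1):
    """Poll condition() until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("content did not appear before the timeout")
        time.sleep(interval)


class FakePage:
    """Hypothetical stand-in for a page whose content arrives after a delay."""

    def __init__(self, delay):
        self._ready_at = time.monotonic() + delay

    def find_rows(self):
        # Returns an empty list until the simulated AJAX call has finished
        if time.monotonic() < self._ready_at:
            return []
        return ["row 1", "row 2"]


page = FakePage(delay=0.3)
rows = wait_for(page.find_rows, timeout=2.0)
print(rows)
```

Extracting immediately would have returned an empty list. Waiting first is what allows the late-arriving rows to be captured.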
XPath and Selection Techniques
Proper selection technique is a critical aspect of web data extraction. The most basic selection technique is to point-and-click on elements in the web browser panel, which is the easiest way to add commands to an agent. XPath is a common syntax for selecting elements in HTML and XML documents. Each time you click on an element in the web browser, Sequentum Enterprise works in the background to calculate the selection XPath.
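If you are curious what a selection XPath looks like, the following sketch runs two simple XPath expressions with Python's standard xml.etree.ElementTree module. It is meant only to show the syntax and is an assumption-laden illustration: the sample page is made up, ElementTree supports only a subset of XPath, and it requires well-formed markup, which real web pages often are not.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample page with two product blocks
SAMPLE_PAGE = """
<html>
  <body>
    <div class="product">
      <h2>Blue Widget</h2>
      <span class="price">19.99</span>
    </div>
    <div class="product">
      <h2>Red Widget</h2>
      <span class="price">24.50</span>
    </div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE_PAGE.strip())

# Select every <h2> that is a direct child of a <div class="product">
names = [h2.text for h2 in root.findall(".//div[@class='product']/h2")]

# Select every <span class="price"> anywhere in the document
prices = [span.text for span in root.findall(".//span[@class='price']")]

print(names)
print(prices)
```

Each expression describes a path through the document: ".//" searches at any depth, a tag name picks elements of that type, and a predicate in square brackets such as [@class='price'] narrows the match by attribute value.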
Sequentum Enterprise has a variety of tools that help you create precise XPaths without needing to know XPath syntax at all. If you want finer control over element selection, it's worthwhile to learn some specifics about XPath syntax. Although this user guide doesn't include an XPath tutorial, there are many tutorials available on the web. We recommend this as a good place to learn more about XPath:
We also recommend this reference for common XPath patterns, which is popular among Selenium users:
Regular Expressions
With Regular Expressions, you can write expressions that look for specific character sequences within strings and then extract small text strings out of larger ones.
Sequentum Enterprise uses XPath to select web elements on a web page and then extracts content from those web elements. You may want only part of the extracted content, or you may want to transform it. For example, a single web element may contain the entire address of a company, but you may want to split that content into separate fields such as street address, city, state, and zip code. You can use Regular Expressions to split the address text into separate text strings.
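The address example can be sketched with Python's standard re module. This is a minimal illustration that assumes a US-style address written as "street, city, ST 12345". The pattern, the split_address function, and the sample address are invented for this sketch, and real addresses vary far more than this pattern allows.

```python
import re

# Assumed layout: "<street>, <city>, <2-letter state> <5-digit zip or ZIP+4>"
ADDRESS_PATTERN = re.compile(
    r"^(?P<street>[^,]+),\s*"       # everything up to the first comma
    r"(?P<city>[^,]+),\s*"          # everything up to the second comma
    r"(?P<state>[A-Z]{2})\s+"       # two uppercase letters
    r"(?P<zip>\d{5}(?:-\d{4})?)$"   # 5 digits, optionally followed by -4 digits
)


def split_address(text):
    """Return a dict of address parts, or None if the text does not match."""
    match = ADDRESS_PATTERN.match(text.strip())
    return match.groupdict() if match else None


parts = split_address("123 Main Street, Springfield, IL 62704")
print(parts)
```

Named groups such as (?P<city>...) label each captured piece, so one match yields all four fields at once. Text that does not fit the assumed layout returns None instead of a partial result.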
There are many tutorial websites that teach Regular Expressions. Here is one example: