Web Data Extraction Limitations

Web data extraction can be challenging, especially when you collect data from complex, dynamic websites. If you're new to web data extraction, begin with an easy website: one that is mostly static and uses little, if any, AJAX or JavaScript.

After you become familiar with the navigation paths of your target website, you need to identify a good start URL. Sometimes this is simply the home page of the website, but often the best URL is one of the sub-pages, such as the product listing page. Once you have this URL, copy it and paste it into the address bar of Sequentum Enterprise.

NOTICE:

Some websites allow navigation without any corresponding change in the visible URL. In such cases, you may not have a start URL that points directly to your start webpage, and so you’ll need to add preliminary steps to your agent to navigate to that webpage.

Web data extraction can also be challenging if you don't have the proper tools. You are largely at the mercy of the target website, which can change at any time without notice. The website may contain faulty JavaScript that causes it to crash or behave unexpectedly. The server that hosts the website may crash, or the website may undergo maintenance. Many problems can occur during a lengthy web data extraction session, and you have very little influence over any of them. Sequentum Enterprise offers an array of advanced error-handling and stability features that can help you manage many of the problems that a web data extraction agent is likely to encounter.
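As a rough illustration of the kind of error handling that helps with temporary failures such as a server crash or maintenance window, the following Python sketch retries a failed page load with an increasing delay. It is not Sequentum Enterprise's implementation; the function name, parameters, and retry policy are assumptions made for this example.

```python
import time


def fetch_with_retry(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on failure with exponential backoff.

    `fetch` is any callable that returns page content or raises an exception.
    All names here are hypothetical and chosen for illustration only.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception as error:  # a real agent would catch narrower error types
            last_error = error
            if attempt < max_attempts - 1:
                # Wait 1x, 2x, 4x, ... the base delay before trying again.
                sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"giving up on {url} after {max_attempts} attempts") from last_error
```

Passing the page loader in as `fetch` keeps the retry logic independent of how pages are actually downloaded, and the `sleep` parameter lets the delay be replaced when testing.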

In addition to unreliable websites, some web data extraction tasks are especially difficult to complete, including the following:

Extracting data from complex websites
Extracting data from websites that use deterrents
Extracting huge amounts of data
Extracting data from non-HTML content

Extracting Data From Complex Websites

If you develop web data extraction agents for a large number of different websites, you will probably find that around 50% of the websites are very easy, 30% are moderately difficult, and 20% are very challenging. For a small percentage of websites, it will be effectively impossible to extract meaningful data. It may take two weeks or more for a web data extraction expert to develop an agent for such a website, so the cost of developing the agent is likely to outweigh the value of the data you might be able to extract.

Extracting Data From Websites Using Deterrents

Web data extraction will always be challenging for any website with active deterrents in place. If you must log in to access the content that you want to extract, the website can always suspend your account and make it impractical to create new accounts.

Some websites use browser fingerprinting to identify you and block your access. Fingerprinting uses JavaScript to examine your browser and computer specifications and build a positive identification, which makes it extremely difficult to circumvent.

Another deterrent used by websites that are wary of crawlers or scrapers is CAPTCHA. Sequentum Enterprise includes tools you can use to overcome CAPTCHA protection, but you'll incur additional costs for a third-party service to process CAPTCHAs automatically. See CAPTCHA Blocking for more information.

The most common protection technique is for a website to identify you by your IP address and block your access. You can usually circumvent this technique by using a proxy rotation service, which hides your actual IP address and uses a new IP address every time you request a web page from the website. See IP Blocking & Proxy Servers for more information.
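To illustrate the idea of using a different IP address for each page request, here is a minimal Python sketch of round-robin proxy selection. The class name and the proxy addresses are hypothetical (the addresses come from a range reserved for documentation), and real proxy rotation services may choose addresses differently, for example at random.

```python
from itertools import cycle


class ProxyRotator:
    """Hand out proxies from a fixed pool in round-robin order."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(proxies)

    def next_proxy(self):
        """Return the proxy to use for the next page request."""
        return next(self._pool)


# Hypothetical proxy addresses (reserved documentation IP range).
rotator = ProxyRotator([
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
])

# Each page request is paired with the next proxy in the pool.
for page in ["/products?page=1", "/products?page=2", "/products?page=3", "/products?page=4"]:
    print(page, "via", rotator.next_proxy())
```

With three proxies and four requests, the fourth request wraps around to the first proxy again.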

NOTE: For ethical and legal reasons, we recommend that you avoid websites that actively take measures to block your access, even if you are able to circumvent the protection.

Extracting Huge Amounts of Data

A web data extraction tool must actually visit a web page to extract data from it. Downloading a web page takes time, and it could take weeks or months to load and extract data from millions of web pages. For example, it's virtually impossible to extract all product data from Amazon.com because there are too many web pages.
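The arithmetic behind such estimates can be sketched in a few lines of Python. The page count, the load time per page, and the number of parallel sessions below are assumed figures for illustration, not measurements.

```python
def estimated_days(page_count, seconds_per_page, parallel_sessions=1):
    """Estimate wall-clock days needed to load `page_count` pages."""
    total_seconds = page_count * seconds_per_page / parallel_sessions
    return total_seconds / 86_400  # seconds in a day


# Assumed figures: 5 million pages at 2 seconds per page.
print(round(estimated_days(5_000_000, 2.0), 1))                        # one session
print(round(estimated_days(5_000_000, 2.0, parallel_sessions=10), 1))  # ten sessions
```

Under these assumptions, a single session needs roughly 116 days, and ten parallel sessions still need about 12 days, before accounting for retries, throttling, or blocked requests.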

Extracting Data From Non-HTML Content

Some websites are built entirely in Flash. A Flash application is a small-footprint software application that runs in the web browser. Sequentum Enterprise can only work with HTML content, so it can download the Flash file but cannot interact with the Flash application or extract data from within it.

Many websites provide data in the form of PDF files and other file formats. Although it cannot extract data directly from such files, Sequentum Enterprise can easily download them, convert them into HTML documents using third-party converters, and then extract data from the conversion output. The document conversion happens quickly in real time, so it will seem as though you are performing a direct extraction. It's important to realize that PDF documents and most other file formats don't contain content that converts easily into structured HTML. To work around this, you can use the Regular Expressions feature of Sequentum Enterprise to parse the conversion output.
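As an illustration of how regular expressions can recover structured fields from loosely formatted conversion output, the Python sketch below parses a few values from sample text. The sample text, field names, and patterns are hypothetical, and the sketch uses Python's standard re module rather than the Regular Expressions feature of Sequentum Enterprise.

```python
import re

# Hypothetical text as it might appear after converting a PDF invoice.
converted_text = """
Invoice No: INV-20431
Date: 2023-04-17
Total Due: $1,284.50
"""

INVOICE_PATTERN = re.compile(r"Invoice No:\s*(?P<number>[A-Z]+-\d+)")
DATE_PATTERN = re.compile(r"Date:\s*(?P<date>\d{4}-\d{2}-\d{2})")
TOTAL_PATTERN = re.compile(r"Total Due:\s*\$(?P<total>[\d,]+\.\d{2})")


def extract_fields(text):
    """Pull structured fields out of loosely formatted conversion output."""
    fields = {}
    patterns = (
        ("number", INVOICE_PATTERN),
        ("date", DATE_PATTERN),
        ("total", TOTAL_PATTERN),
    )
    for name, pattern in patterns:
        match = pattern.search(text)
        # Leave the field as None when the pattern is not found.
        fields[name] = match.group(name) if match else None
    return fields


print(extract_fields(converted_text))
```

Anchoring each pattern on a nearby label such as "Total Due:" tends to be more robust than relying on position, because converted documents rarely preserve a consistent layout.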