After you get familiar with the navigation paths for your target website, you need to identify a good start URL. Sometimes this is simply the home page of the website, but often the best start URL is a sub-page, such as the product listing page. Once you have this URL, copy it and paste it into the address bar of Sequentum Enterprise.
Some websites allow navigation without any corresponding change in the visible URL. In such cases, you may not have a start URL that points directly to your start webpage, and so you’ll need to add preliminary steps to your agent to navigate to that webpage.
In addition to unreliable websites, another challenge is that some web data extraction tasks are especially difficult to complete, including the following:
- Extracting data from complex websites
- Extracting data from websites that use deterrents
- Extracting huge amounts of data
- Extracting data from non-HTML content
Extracting Data From Complex Websites
If you are developing web data extraction agents for a large number of different websites, you will probably find that around 50% of the websites are very easy, 30% are moderately difficult, and 20% are very challenging. For a small percentage of websites, it will be effectively impossible to extract meaningful data. It may take two weeks or more for a web data extraction expert to develop an agent for such a website, so the cost of developing the agent is likely to outweigh the value of the data you might be able to extract.
Extracting Data From Websites Using Deterrents
Web data extraction will always be challenging for any website with active deterrents in place. If you must log in to access the content that you want to extract, then the website can always suspend your account and make it impractical to create new accounts.
Another deterrent used by websites that are wary of crawlers or scrapers is CAPTCHA. Sequentum Enterprise includes tools you can use to overcome CAPTCHA protection, but you'll incur additional costs for a third-party service to do automatic CAPTCHA processing. See CAPTCHA Blocking for more information.
The most common protection technique is using your IP address to identify and block your access to a website. You can usually circumvent this technique by using a proxy rotation service, which hides your actual IP address and uses a new IP address every time you request a web page from a website. See IP Blocking & Proxy Servers for more information.
NOTE: Ethically and legally, we recommend that you avoid websites that are actively taking measures to block your access, even if you are able to circumvent the protection.
Extracting Huge Amounts of Data
A web data extraction tool must actually visit a web page to extract data from it. Downloading a web page takes time, and it could take weeks or even months to load and extract data from millions of web pages. For example, it's virtually impossible to extract all product data from Amazon.com, since there are too many web pages.
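To get a feel for the time involved, the following Python sketch does the arithmetic. It is only a rough illustration: the function name, the page count, the two-second average load time, and the number of parallel sessions are all assumed figures chosen for the example, not measurements or Sequentum Enterprise settings.

```python
def estimate_crawl_days(page_count, seconds_per_page, parallel_sessions=1):
    """Return the estimated wall-clock days needed to load every page once."""
    if parallel_sessions < 1:
        raise ValueError("parallel_sessions must be at least 1")
    total_seconds = page_count * seconds_per_page / parallel_sessions
    return total_seconds / 86_400  # 86,400 seconds in a day


# Hypothetical workload: 5 million pages at an assumed 2 seconds per page.
single = estimate_crawl_days(5_000_000, 2.0)
parallel = estimate_crawl_days(5_000_000, 2.0, parallel_sessions=10)
print(f"1 session:   {single:.1f} days")    # 1 session:   115.7 days
print(f"10 sessions: {parallel:.1f} days")  # 10 sessions: 11.6 days
```

Even under these optimistic assumptions, a single session would need close to four months, which is why the size of the target data set should be estimated before an agent is built.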
Extracting Data From Non-HTML Content
Some websites are built entirely in Flash, in which case the website is a small-footprint software application that runs in the web browser. Sequentum Enterprise can only work with HTML content, so it can download the Flash file itself, but it can't interact with the Flash application or extract data from within it.
Many websites provide data in the form of PDF files and other file formats. Though it cannot directly extract data from such files, Sequentum Enterprise can download those files, convert them into HTML documents using third-party converters, and then extract data from the conversion output. The document conversion happens quickly, in real time, so it will seem as though you are performing a direct extraction. Keep in mind that PDF documents and most other file formats don't contain content that converts easily into structured HTML. To work around this, you can use the Regular Expressions feature of Sequentum Enterprise to parse the conversion output.
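The general idea of pulling structured fields out of loosely structured conversion output can be sketched in plain Python. This is not the Regular Expressions feature of Sequentum Enterprise itself; the sample text, the field names, and the patterns below are all hypothetical and only illustrate the technique.

```python
import re

# Hypothetical text, as it might look after a PDF has been converted.
converted_text = """
Invoice No: INV-20431
Date: 2021-03-15
Total Due: $1,284.50
"""

# One pattern per field; each captures the value that follows a known label.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice No:\s*(\S+)"),
    "date": re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})"),
    "total": re.compile(r"Total Due:\s*\$([\d,]+\.\d{2})"),
}


def extract_fields(text):
    """Return a dict of field name -> captured value (None when not found)."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


fields = extract_fields(converted_text)
print(fields)
# {'invoice_number': 'INV-20431', 'date': '2021-03-15', 'total': '1,284.50'}
```

Anchoring each pattern on a label that appears in the converted text makes the extraction tolerant of layout noise around the values, which is usually what conversion output contains.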