The following diagram shows the key steps for building a web data extraction agent. Links to other topics explain each of these steps in further detail.
In this section, we identify the data elements on your target website and create a web data extraction agent. We'll work through each of the above steps with examples that match common web data extraction usage, so you can get comfortable building your own agents.
We cover these topics in this section:
- Choosing a Start URL
- Select the Content to Capture
- Refine Your Data (optional)
- Output Data Format
- Test Your Agent
- Run Your Agent
After you get comfortable creating an agent, you can learn more about the Sequentum Enterprise Editor.
The Start URL is the place where you begin data collection and corresponds to the starting point of your web data extraction agent.
In the following sections, we'll use the Cruise Direct website for our example.
Note: In this example, we start from the Cruise Direct home page. However, if the data you require is not located on a website home page, you can start the agent from a website sub-page. A more specific Start URL makes the agent more efficient, so it’s worth taking the time to choose one.
1. We start by pasting the start web page URL from the target website (http://www.cruisedirect.com) into the Sequentum Enterprise Address Bar.
2. Next, click the Blue Play button to load the Cruise Direct home page.
Note: you can also press the Enter key to load the Cruise Direct home page.
Sequentum Enterprise with Cruise Direct home page loaded
In the next section, Select the Content to Capture, we will continue to use the Cruise Direct website data for our example.
In the previous section, we selected our Start URL and loaded the web page into Sequentum Enterprise. Next, you can select the data you want to capture and start building your web data extraction agent. In our Cruise Direct example, we plan to search for available cruise vacations and then extract details about each cruise.
1. First, we need to perform a search to retrieve the data for the available cruises. To do this, we select the orange Search button element with the mouse, then click one more time to display the Sequentum Enterprise Message window.
Agent Explorer with new Search command linked to the new Search page
3. We are now ready to add commands to our agent to extract the cruise data. As we have a number of data elements in tables, we will use a list to simplify the extraction for us. To capture a data element, move your mouse precisely over the data element you want until you see the data-capture box around it. We start by selecting the first cruise name.
First Cruise Line data element selected within Sequentum Enterprise
4. Then click List in the Configure Agent Command panel to activate the list selection mode.
Activating List selection mode from the Configure Agent Command panel
5. In list selection mode, we can add web data elements to the list by clicking similar data elements. Now we'll click on the second cruise name and you will see Sequentum Enterprise has selected the remaining data elements on the page. Note: if any cruise data elements remain unselected, simply click on these to add to the list.
Second Cruise Line data element selected while in List selection mode
6. We now click Save to save the list and exit list selection mode. The Web Element list command defines the list area, so any elements within this area are now included within the list.
7. To capture the cruise name text, we click on any selected element to display the Sequentum Enterprise Message window. From the Sequentum Enterprise Message window, choose the Capture Text option to add the web element command to capture the cruise names. We have now added new web element list and web element commands to the Agent Explorer and Sequentum Enterprise has set default names for these commands.
8. To edit the names for the commands, we click the respective Edit icons and set the names of the commands to ‘Cruise Name List’ and ‘Cruise Name’. Then click the Green Tick to save.
Agent Explorer with new Cruise Name List and Cruise Name commands
9. Now we plan to extract the individual cruise web elements from each table. First, click on the Departs web element. Sequentum Enterprise now automatically selects the Departs web element for all cruises because it is already defined as a list.
10. Next, click on the Departs web element one more time to display the Sequentum Enterprise Message window. Now choose the Capture Text option from the Sequentum Enterprise Message window to add the command to the Agent so we can capture the individual Departs web elements.
11. After that, we click the Edit icon to change the name of the command to "Departs" then save it.
12. Now we do the same for the Ship, Destination, Duration, and Ports of Call web elements then set the respective names of the commands and save them.
13. We also want to capture all the price information in the pricing tables, so as before, we select the first web element (Date) in the pricing table. We then click List in the Configure Agent Command panel to activate the list selection mode and click on more Date web elements to generate the list.
14. Click one more time on one of the Date web elements to display the Sequentum Enterprise Message Window. Then choose the Capture Text option to add the command to the Agent.
15. Add commands for the ‘inside’, ‘outside’, ‘balcony’ and ‘suite’ web elements by clicking twice on each of the web elements.
16. Change the names of the new commands so your agent looks like the image below.
Agent Explorer showing all Capture Text commands
17. Thus far we have created the Agent to extract all the cruise information on the first page. We need to set it up to iterate through all the search result pages. To do this, we need to use the Follow Pagination command to follow each of the pages. Scroll down the page and select the Next link. Then click one more time on the selected element to display the Sequentum Enterprise Message window.
Sequentum Enterprise Message window with Follow Pagination option selected
18. Now we choose the Follow Pagination option to add the pagination command to the Agent.
Sequentum Enterprise has added the pagination command to the Agent and loads the next page on the second browser tab.
19. When we click on the pagination command we can see all the search list commands inside of the pagination command. This means our agent will now iterate through all the search result pages to extract this information.
Agent Explorer showing the contents of the Pagination command
20. We have now finished building the Agent so we should save it. To save the Agent, choose File > Save in the Sequentum Enterprise menu, and then enter the Agent Name “cruisedirect”. Then click the Save button to commit your changes.
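The nesting shown in the Agent Explorer (a pagination command that contains a list command, which in turn contains the Capture Text commands) behaves like two nested loops. The following is a minimal Python sketch of that idea only. It is not Sequentum Enterprise code: the `PAGES` data, the `extract_all` function, and the sample values are hypothetical and stand in for the pages the agent loads in its browser.

```python
# Hypothetical in-memory stand-in for the search result pages.
# "next" holds the index of the following page, or None on the last page.
PAGES = [
    {"cruises": [{"Cruise Name": "Sample Cruise A", "Departs": "Sample Port A"}], "next": 1},
    {"cruises": [{"Cruise Name": "Sample Cruise B", "Departs": "Sample Port B"}], "next": None},
]


def extract_all(pages, start=0):
    """Walk every result page (pagination) and every cruise on it (list)."""
    rows = []
    index = start
    while index is not None:            # Follow Pagination: repeat per page
        page = pages[index]
        for cruise in page["cruises"]:  # Cruise Name List: repeat per cruise
            # Capture Text commands: one captured value per field
            rows.append({
                "Cruise Name": cruise["Cruise Name"],
                "Departs": cruise["Departs"],
            })
        index = page["next"]            # None when there is no Next link
    return rows


all_rows = extract_all(PAGES)
```

The outer loop ends when a page has no Next link, which mirrors how the agent stops once the last search result page has been processed.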
In the next section, Refine Your Data, we use Sequentum Enterprise's Content Transformation method to change the extracted price data.
In the previous section, we added commands to our agent to capture all the cruise price content we require. The prices include a $ sign, but we want to get rid of that $ sign, so we're left with just an integer.
1. First edit the "Interior" command in the Agent Explorer.
2. Next, we scroll down to the capture sample window and highlight only the price number. You should notice that the Transformation Script button has now changed to a Generate Transformation button.
Selecting text to Transform in the Configure Agent Command panel
3. Now click on the Generate Transformation button, and we can now see only the price number in the Transformed window.
Transformed text in the Configure Agent Command panel
4. Click Save to save the transformation.
5. Repeat the steps above for Ocean View, Balcony Price, and Suite Price.
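To make the effect of this transformation concrete, here is a minimal Python sketch of the same idea: keep only the number from a captured price. It is an illustration under assumptions, not the script that Sequentum Enterprise generates; the `price_to_int` function name and the sample price strings are hypothetical.

```python
import re


def price_to_int(text):
    """Return the first number in a price such as '$1,299' as an integer.

    Returns None when the text contains no digits.
    """
    match = re.search(r"\d[\d,]*", text)   # first run of digits and commas
    if match is None:
        return None
    return int(match.group(0).replace(",", ""))  # drop thousands separators


interior = price_to_int("$499")
suite = price_to_int("$1,299")
missing = price_to_int("N/A")
```

Returning `None` for text without digits is one possible design choice; it keeps a missing price distinguishable from a price of zero.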
In the next section, Output Data Format, we look at the data output formats available for the extracted web data and show how to change and configure a new export target.
Sequentum Enterprise can export extracted web data as a report or to numerous different database types. Data output options include Excel, CSV, XML, PDF, JSON, Parquet, or a Database (SQL Server, MySQL, Oracle, OleDB).
You can also use a Sequentum Enterprise export script to completely customize the data export to your own database structures. This is useful when you want data updates to be reflected dynamically, such as on a website or portal.
Sequentum Enterprise can export data to Excel 2003+ and take advantage of features in Excel 2007+ such as outlining and embedded images.
Data is exported automatically to your chosen export destination when a data extraction project completes, so you don't have to export data manually. However, you can always export extracted data manually at any time to any export destination.
The following steps show how easy it is to choose the data export type you want to use.
1. We start by clicking on Sequentum Enterprise's Data menu and then clicking the Export Target Settings button.
Sequentum Enterprise Data menu
2. After clicking the Export Target Settings button the Data Export Configuration Window displays. This window allows you to change and configure a new export target. The default option is Excel 2003.
Click on the list on the left to change the export target, or configure multiple exports by checking Enable Export. You can also change the default destination folder where the data files are written. Other options include appending the current export to existing files and including a timestamp in the filename. File Name Transformation is also available, where you can write a script to fully rename and customize the export filename.
Sequentum Enterprise Data Configuration window
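As an illustration of what a timestamped export filename can look like, the following Python sketch builds one from an agent name and a run time. It does not use the Sequentum Enterprise scripting interface; the `export_file_name` function, its parameters, and the timestamp pattern are assumptions chosen for this example.

```python
from datetime import datetime
from pathlib import Path


def export_file_name(agent_name, extension, when, folder="."):
    """Build a path such as <folder>/<agent_name>_<YYYYMMDD_HHMMSS>.<extension>."""
    stamp = when.strftime("%Y%m%d_%H%M%S")  # sortable timestamp
    return Path(folder) / f"{agent_name}_{stamp}.{extension}"


# Example with a fixed run time so the result is reproducible.
path = export_file_name("cruisedirect", "csv", datetime(2024, 1, 2, 3, 4, 5))
```

A timestamp in the name keeps each export in its own file, which is the alternative to appending every run to one existing file.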
In the next section, Test Your Agent, we use Sequentum Enterprise's Debugging features to test that the "cruisedirect" Agent is functioning as expected.
Once you have finished developing your agent, it is important to test it to ensure the correct data is being extracted and in the format you require. Sequentum Enterprise has a sophisticated debugging engine which enables you to carefully analyze all aspects of your agent and the data being extracted. It can also help you to pinpoint any trouble spots in your agent code so you can quickly resolve them. For more detail on the Debugger features available in Sequentum Enterprise, refer to Testing/Debugging an Agent.
Now let's debug the Cruise Direct agent to check that the commands are working properly.
To do the test run, click the ‘Debug’ menu option at the top left of the screen and then click the ‘Start’ arrow button to start the debugging.
Sequentum Enterprise Debugger running the Cruise Direct Agent
During the debugging, we can observe and check that Sequentum Enterprise goes through each of the commands in sequence and processes each of the web pages to extract the required data.
Part way through running the agent, we will click the ‘Stop’ button to stop the debugging. Then click ‘View Export Data’ to check that the output results are correct.
Sequentum Enterprise default view of the export data
To see the export data in Excel, we can simply click on the ‘Open Exported Spreadsheet’ button to open the Excel spreadsheet.
Viewing the Cruise Direct export results in an Excel Spreadsheet
The extracted data contains one row for each departure date, but you could also choose to save the date and price information in a separate data table.
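The single-table layout can be pictured as repeating the cruise-level fields on every pricing row. The Python sketch below illustrates that flattening with made-up sample values; the `flatten` function and the `Prices` key are hypothetical and are not part of Sequentum Enterprise.

```python
def flatten(cruise):
    """Return one row per pricing entry, each carrying the cruise-level fields."""
    shared = {key: value for key, value in cruise.items() if key != "Prices"}
    return [{**shared, **price} for price in cruise["Prices"]]


# Made-up sample data for illustration only.
sample = {
    "Cruise Name": "Sample Cruise",
    "Ship": "Sample Ship",
    "Prices": [
        {"Date": "2024-01-05", "Inside": 499},
        {"Date": "2024-02-09", "Inside": 549},
    ],
}

rows = flatten(sample)  # two rows, both with the same Cruise Name and Ship
```

Saving the date and price information in a separate data table would instead keep `shared` and the `Prices` entries apart, linked by a key, so the cruise-level fields are stored only once.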
In the next section, Scheduling, you will learn how to set up an agent so it can be automatically run at intervals of your choosing.
Sequentum Enterprise provides an Agent scheduling facility that enables you to run your agent automatically at predetermined times. The schedule can repeat every hour, day, month, year, and so on.
Once you have completed your agent, you can access this facility from the Agent Settings menu at the top of the Sequentum Enterprise application. Click on the Schedule menu option to display the Scheduling window.
Sequentum Enterprise Scheduling Window
For details on how to configure Sequentum Enterprise's Scheduling feature, refer to Scheduling in the Sequentum Enterprise Editor section.
In the next section, Run Your Agent, we will show you how to run your finished web data extraction Agent.
Now that we have finished developing and testing the Cruise Direct Agent, it is ready to use.
To run your agent, click on the Run menu at the top left of the Sequentum Enterprise application, then click the Run Agent arrow selection.
Sequentum Enterprise Run menu selections
Note: if you have already scheduled your Agent to run at a later time or date, you can just leave it on your Internet-enabled PC or server and it will run automatically. Refer to Scheduling for more detail.