Building Your First Agent

The following diagram shows the key steps for building a web data extraction agent. Linked topics explain each of these steps in further detail.

Figure: Basic web data extraction agent creation process

In this section, we identify data elements on a target website and create a web data extraction agent that captures them. We'll work through each of the steps above with examples that match common web data extraction use cases, so you can get comfortable building your own agents.

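We cover these topics in this section: Choosing a Start URL, Select the Content to Capture, Refine Your Data (Optional), Output Data Format, Test Your Agent, Scheduling, and Run Your Agent.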

After you get comfortable creating an agent, you can learn more about the Sequentum Enterprise Editor.

Choosing a Start URL

The Start URL is the place where you begin data collection and corresponds to the starting point of your web data extraction agent.

In the following sections, we'll use the Cruise Direct website for our example.

 http://www.cruisedirect.com

Note: In this example, we start from the Cruise Direct home page. However, if the data you require is not located on the home page, you can start the agent from a sub-page instead. Starting closer to the data makes the agent more efficient, so it’s worth taking the time to be specific.

1. We start by pasting the Start URL of the target website (http://www.cruisedirect.com) into the Sequentum Enterprise Address Bar.

2. Next, click the blue Play button to load the Cruise Direct home page.

Note: You can also press the Enter key to load the page.

Figure: Sequentum Enterprise with Cruise Direct home page loaded

In the next section, Select the Content to Capture, we will continue to use the Cruise Direct website data for our example.

Select the Content to Capture

In the previous section, we selected our Start URL and loaded the web page into Sequentum Enterprise. Next, you can select the data you want to capture and start building your web data extraction agent. In our Cruise Direct example, we plan to search for available cruise vacations and then extract details about each cruise.

1. First, we need to perform a search to retrieve the data for the available cruises. To do this, we select the orange Search button element with the mouse, then click it one more time to display the Sequentum Enterprise Message window.

Figure: Sequentum Enterprise Message Window
 
2. From the message window, we choose the Click on the Web Element option to add a new command to the agent that will execute the search and display the search results on a new web page. Notice that Sequentum Enterprise has added our first command to the Agent Explorer - in this case, to execute the search and display the search results.
 
Figure: Agent Explorer with new Search command linked to the new Search page

3. We are now ready to add commands to our agent to extract the cruise data. As we have a number of data elements in tables, we will use a list to simplify the extraction for us. To capture a data element, move your mouse precisely over the data element you want until you see the data-capture box around it. We start by selecting the first cruise name.

Figure: First Cruise Line data element selected within Sequentum Enterprise

4. Then click List in the Configure Agent Command panel to activate the list selection mode.

Figure: Activating List selection mode from the Configure Agent Command panel

5. In list selection mode, we can add web data elements to the list by clicking similar data elements. Now we'll click on the second cruise name, and Sequentum Enterprise selects the remaining data elements on the page. Note: If any cruise data elements remain unselected, simply click them to add them to the list.

Figure: Second Cruise Line data element selected while in List selection mode

6. We now click Save to save the list and exit list selection mode. The Web Element list command defines the list area, so any elements within this area are now included within the list.

7. To capture the cruise name text, we click on any selected element to display the Sequentum Enterprise Message window, then choose the Capture Text option to add the web element command that captures the cruise names. The Agent Explorer now contains a new web element list command and a new web element command, and Sequentum Enterprise has given both commands default names.

8. To rename the commands, we click their respective Edit icons and set the names to ‘Cruise Name List’ and ‘Cruise Name’. Then click the green tick to save.

Figure: Agent Explorer with new Cruise Name List and Cruise Name commands

9. Now we plan to extract the individual cruise web elements from each table. First, click on the Departs web element. Sequentum Enterprise now automatically selects the Departs web element for all cruises because it is already defined as a list.

10. Next, click on the Departs web element one more time to display the Sequentum Enterprise Message window. Now choose the Capture Text option from the Sequentum Enterprise Message window to add the command to the Agent so we can capture the individual Departs web elements.

11. After that, we click the Edit icon to change the name of the command to "Departs" then save it.

12. Now we do the same for the Ship, Destination, Duration, and Ports of Call web elements, then set the respective names of the commands and save them.

13. We also want to capture all the price information in the pricing tables, so as before, we select the first web element (Date) in the pricing table. We then click List in the Configure Agent Command panel to activate the list selection mode and click on more Date web elements to generate the list.

14. Click one more time on one of the Date web elements to display the Sequentum Enterprise Message Window. Then choose the Capture Text option to add the command to the Agent.

15. Add commands for the ‘inside’, ‘outside’, ‘balcony’ and ‘suite’ web elements by clicking twice on each of the web elements.

16. Change the names of the new commands so your agent looks like the image below.

Figure: Agent Explorer showing all Capture Text commands

17. So far, we have created the Agent to extract all the cruise information on the first page. We now need to set it up to iterate through all the search result pages, which we do with the Follow Pagination command. Scroll down the page and select the Next link. Then click one more time on the selected element to display the Sequentum Enterprise Message window.

Figure: Sequentum Enterprise Message window with Follow Pagination option selected

18. Now we choose the Follow Pagination option. Sequentum Enterprise adds the pagination command to the Agent and loads the next page in the second browser tab.

19. When we click on the pagination command we can see all the search list commands inside of the pagination command. This means our agent will now iterate through all the search result pages to extract this information.

Figure: Agent Explorer showing the contents of the Pagination command

20. We have now finished building the Agent so we should save it. To save the Agent, choose File > Save in the Sequentum Enterprise menu, and then enter the Agent Name “cruisedirect”. Then click the Save button to commit your changes.
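Conceptually, the agent built in the steps above behaves like two nested loops: an outer loop that follows the Next link from page to page, and an inner loop that visits each cruise in the list and captures its fields. The sketch below illustrates that structure in plain Python over in-memory sample data. It is not Sequentum Enterprise code; the page layout, the sample values, and the `crawl` helper are all hypothetical.

```python
# Illustrative only: models the Follow Pagination > List > Capture Text
# nesting with made-up in-memory pages instead of a real website.

# Each hypothetical "page" holds a list of cruises and a link to the next page.
PAGES = {
    "page1": {
        "cruises": [
            {"Cruise Name": "Sample Cruise A", "Departs": "Miami"},
            {"Cruise Name": "Sample Cruise B", "Departs": "Seattle"},
        ],
        "next": "page2",
    },
    "page2": {
        "cruises": [{"Cruise Name": "Sample Cruise C", "Departs": "Tampa"}],
        "next": None,  # no Next link on the last page
    },
}


def crawl(start):
    """Follow 'next' links from the start page and collect one row per cruise."""
    rows = []
    current = start
    while current is not None:          # outer loop: Follow Pagination
        page = PAGES[current]
        for cruise in page["cruises"]:  # inner loop: the cruise list
            rows.append(dict(cruise))   # capture the fields of one cruise
        current = page["next"]
    return rows


rows = crawl("page1")
for row in rows:
    print(row["Cruise Name"], "-", row["Departs"])
```

The loop ends when a page has no Next link, which mirrors how a pagination command stops once there are no further result pages to follow.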

In the next section, Refine Your Data, we use Sequentum Enterprise's Content Transformation method to change the extracted price data.

Refine Your Data (Optional)

In the previous section, we added commands to our agent to capture all the cruise price content we require. The captured prices include a $ sign, which we want to remove so that only the number remains.

1. First, edit the "Interior" command in the Agent Explorer.

2. Next, we scroll down to the capture sample window and highlight only the price number. You should notice that the Transformation Script button has now changed to a Generate Transformation button.

Figure: Selecting text to Transform in the Configure Agent Command panel

3. Now click the Generate Transformation button. Only the price number appears in the Transformed window.

Figure: Transformed text in the Configure Agent Command panel

4. Click Save to save the transformation.

5. Repeat the steps above for Ocean View, Balcony Price, and Suite Price.
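The effect of this kind of transformation can be illustrated outside the product with a regular expression. The Python sketch below is an assumption about the sort of pattern involved, not the script that Sequentum Enterprise actually generates; the `strip_currency` name and the sample values are made up for illustration.

```python
import re

# Hypothetical helper: keep only the numeric part of a captured price string.
# The pattern matches digits with optional thousands separators and decimals.
PRICE_PATTERN = re.compile(r"\d[\d,]*(?:\.\d+)?")


def strip_currency(text):
    """Return the first number found in text without '$' or commas, or None."""
    match = PRICE_PATTERN.search(text)
    if match is None:
        return None
    return match.group(0).replace(",", "")


print(strip_currency("$399"))       # 399
print(strip_currency("$1,249.50"))  # 1249.50
print(strip_currency("N/A"))        # None
```

Returning `None` for text with no digits keeps missing prices distinguishable from a price of zero.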

In the next section, Output Data Format, we look at the data output formats available for the extracted web data and show how to change and configure a new export target.

Output Data Format

Sequentum Enterprise can export extracted web data as a report or to numerous different database types. Data output options include Excel, CSV, XML, PDF, JSON, Parquet, or a Database (SQL Server, MySQL, Oracle, OleDB).

You can also use a Sequentum Enterprise export script to completely customize the data export to your own database structures. This is useful when you want data updates to be reflected dynamically, such as on a live website or portal.

Sequentum Enterprise can export data to Excel 2003+ and take advantage of features in Excel 2007+ such as outlining and embedded images.

Data is exported automatically to your chosen export destination when a data extraction project completes, so you don't have to export data manually. However, you can always export extracted data manually at any time to any export destination.
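To give a feel for how the same extracted records look in two of the file formats listed above, the following Python sketch writes a pair of invented sample rows as CSV and as JSON using only the standard library. It is an illustration of the formats, not of how Sequentum Enterprise performs its exports, and the sample values are hypothetical.

```python
import csv
import io
import json

# Invented sample rows standing in for extracted data.
rows = [
    {"Cruise Name": "Sample Cruise A", "Departs": "Miami", "Interior": "399"},
    {"Cruise Name": "Sample Cruise B", "Departs": "Seattle", "Interior": "429"},
]

# CSV: a header row followed by one line per record.
csv_buffer = io.StringIO()
writer = csv.DictWriter(
    csv_buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
)
writer.writeheader()
writer.writerows(rows)
csv_text = csv_buffer.getvalue()

# JSON: the same records as a list of objects.
json_text = json.dumps(rows, indent=2)

print(csv_text)
print(json_text)
```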

The following steps show how easy it is to choose the data export type you want to use.

1. We start by clicking on Sequentum Enterprise's Data menu and then clicking the Export Target Settings button.

Figure: Sequentum Enterprise Data menu

2. After you click the Export Target Settings button, the Data Export Configuration window displays. This window allows you to change and configure a new export target. The default option is Excel 2003.

Click an entry in the list on the left to change the export target, or configure multiple exports by checking Enable Export. You can also change the default destination folder for the exported data file(s). Other options include appending the current export to existing files and including a timestamp in the file name. A File Name Transformation option is also available, which lets you write a script to fully customize the export file name.

Figure: Sequentum Enterprise Data Configuration window
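As an illustration of the timestamp option, the following Python sketch builds an export file name from an agent name and a date and time. The naming pattern, the `export_file_name` helper, and the `exports` folder are assumptions made for this example; the names that Sequentum Enterprise produces, and the scripts used for File Name Transformation, may look different.

```python
from datetime import datetime
from pathlib import PurePosixPath


def export_file_name(agent_name, extension, when, folder="exports"):
    """Build a path such as exports/cruisedirect_20240131_093000.csv."""
    stamp = when.strftime("%Y%m%d_%H%M%S")
    return PurePosixPath(folder) / f"{agent_name}_{stamp}.{extension}"


# A fixed timestamp keeps the example reproducible; a real export would
# typically use datetime.now().
path = export_file_name("cruisedirect", "csv", datetime(2024, 1, 31, 9, 30, 0))
print(path)  # exports/cruisedirect_20240131_093000.csv
```

Putting the timestamp in year-month-day order means the files sort chronologically by name, which is one reason to prefer a timestamped name over appending to a single file.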

In the next section, Test Your Agent, we use Sequentum Enterprise's debugging features to test that the "cruisedirect" Agent is functioning as expected.

Test Your Agent

Once you have finished developing your agent, it is important to test it to ensure that the correct data is being extracted in the format you require. Sequentum Enterprise has a sophisticated debugging engine that enables you to carefully analyze all aspects of your agent and the data being extracted. It can also help you pinpoint any trouble spots in your agent so you can quickly resolve them. For more detail on the Debugger features available in Sequentum Enterprise, refer to Testing/Debugging an Agent.

Now let's debug the Cruise Direct agent to check that the commands are working properly.

To do the test run, click the ‘Debug’ menu option at the top left of the screen and then click the ‘Start’ arrow button to start debugging.

Figure: Sequentum Enterprise Debugger running the Cruise Direct Agent

During the debugging, we can observe and check that Sequentum Enterprise goes through each of the commands in sequence and processes each of the web pages to extract the required data.

Part way through running the agent, we will click the ‘Stop’ button to stop the debugging. Then click ‘View Export Data’ to check that the output results are correct.

Figure: Sequentum Enterprise default view of the export data

To see the export data in Excel, we can simply click on the ‘Open Exported Spreadsheet’ button to open the Excel spreadsheet.

Figure: Viewing the Cruise Direct export results in an Excel Spreadsheet

The extracted data contains one row for each departure date, but you could also choose to save the date and price information in a separate data table.
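The difference between these two layouts can be sketched with plain Python data structures. The cruise name, dates, and prices below are invented sample values, and the key names `cruise_id` and `prices` are hypothetical. The sketch only contrasts one-row-per-departure-date output with a parent table plus a separate child table.

```python
# Invented sample data: one cruise with two departure dates.
cruises = [
    {
        "cruise_id": 1,
        "Cruise Name": "Sample Cruise A",
        "prices": [
            {"Date": "2024-03-01", "Interior": "399"},
            {"Date": "2024-03-08", "Interior": "429"},
        ],
    }
]

# Layout 1: a single flat table, one row per departure date.
# Cruise-level fields repeat on every row.
flat_rows = [
    {"Cruise Name": c["Cruise Name"], **p}
    for c in cruises
    for p in c["prices"]
]

# Layout 2: two tables linked by cruise_id, so cruise fields are stored once.
cruise_table = [
    {"cruise_id": c["cruise_id"], "Cruise Name": c["Cruise Name"]}
    for c in cruises
]
price_table = [
    {"cruise_id": c["cruise_id"], **p}
    for c in cruises
    for p in c["prices"]
]

print(len(flat_rows), len(cruise_table), len(price_table))  # 2 1 2
```

The flat layout is convenient for a spreadsheet, while the two-table layout avoids repeating the cruise details and tends to suit database export targets.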

In the next section, Scheduling, you will learn how to set up an agent so that it runs automatically at intervals of your choosing.

Scheduling

Sequentum Enterprise provides an Agent scheduling facility that enables you to run your agent automatically at predetermined times. For example, you can schedule it to run every hour, day, month, or year.
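To make the idea of interval scheduling concrete, the Python sketch below computes the first few run times for a fixed interval. It does not reflect how Sequentum Enterprise implements scheduling; the `next_runs` helper and the sample start time are hypothetical.

```python
from datetime import datetime, timedelta


def next_runs(start, interval, count):
    """Return `count` run times beginning at `start`, spaced by `interval`."""
    return [start + interval * i for i in range(count)]


start = datetime(2024, 1, 1, 8, 0)
hourly = next_runs(start, timedelta(hours=1), 3)
daily = next_runs(start, timedelta(days=1), 3)

for run in hourly:
    print("hourly:", run.isoformat())
for run in daily:
    print("daily: ", run.isoformat())
```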

Once you have completed your agent, you can access this facility from the Agent Settings menu at the top of the Sequentum Enterprise application. Simply click the Schedule menu option to display the Scheduling window.

Figure: Sequentum Enterprise Scheduling Window

For details on how to configure Sequentum Enterprise's Scheduling feature, refer to Scheduling in the Sequentum Enterprise Editor section.

In the next section, Run Your Agent, we will show you how to run your finished web data extraction Agent.

Run Your Agent

Now that we have finished developing and testing the Cruise Direct Agent, it is ready to use.

To run your agent, click the Run menu at the top left of the Sequentum Enterprise application, then click the Run Agent arrow selection.

Figure: Sequentum Enterprise Run menu selections

Note: If you have already scheduled your Agent to run at a later time or date, you can just leave it on your Internet-enabled PC or server and it will run automatically. Refer to Scheduling for more detail.
