Skip to main content

Extracting Data From Non-HTML Documents

Websites generally provide most of their content in HTML format, but some websites may also provide content in other formats - such as PDF or Microsoft Word documents. Since Sequentum Enterprise can only process HTML documents, it will simply download any non-HTML document. Sequentum Enterprise can help you extract text and images from within a PDF or Word document by converting such documents into HTML.

To have Sequentum Enterprise convert your non-HTML document, you will need to provide an external document converter. The Sequentum Enterprise public website provides a list of open source programs that you can use for this purpose. Please remember that we don't support these tools, and you must comply with the license for any conversion tool.

Limitations

The design of most file formats, including PDF and Word files, doesn't include ease-of-conversion to HTML. So, the conversion output is considerably more difficult to manage than standard HTML. In many cases, you'll have to select the entire HTML page and then use Regular Expressions to extract the target content.

Installing a Document Converter

Sequentum Enterprise uses a custom script that makes a call to an external document converter, and you can configure this script to call any type of program. The default script can handle the three document converters currently available for download on the Sequentum Enterprise website.

Installing the PDF To HTML Converter

Follow these steps to install the PDF-to-HTML document converter:

  1. Download the pdftohtml.zip file from the Sequentum Enterprise website:

https://contentgrabber.com/web-scraping-tools

2. Extract the content of the zip file into the default Sequentum Enterprise Converters folder, My Documents\Content Grabber\Agents\Shared\Converters. You can also copy the converter into the corresponding Public Documents folder if you need the converter to be available for all users on the computer.

3. The direct path to the document converter should now be:

My Documents\Content Grabber\Agents\Shared\Converters\pdftohtml\pdftohtml.exe 

Installing the Docx To HTML Converter

Follow these steps to install the Docx to HTML document converter:

  1. Download the docxtohtml.zip file from the Sequentum Enterprise website:

https://contentgrabber.com/web-scraping-tools

2. Extract the content of the zip file into the default Sequentum Enterprise Converters folder, My Documents\Content Grabber\Agents\Shared\Converters. You can also copy the converter into the corresponding Public Documents folder if you need the converter to be available for all users on the computer.

3. The direct path to the document converter should now be:

My Documents\Content Grabber\Agents\Shared\Converters\pdftohtml\pdftohtml.exe

Installing the Excel To HTML Converter

Follow these steps to install the Excel to HTML document converter:

  1. Download the exceltohtml.zip file from the Sequentum Enterprise website:

https://contentgrabber.com/web-scraping-tools

2.Extract the content of the zip file into the default Sequentum Enterprise Converters folder, My Documents\Content Grabber\Agents\Shared\Converters. You can also copy the converter into the corresponding Public Documents folder if you need the converter to be available for all users on the computer.

3. The direct path to the document converter should now be:

My Documents\Content Grabber\Agents\Shared\Converters\exceltohtml\exceltohtml.exe

Using a Document Converter

You'll need to add the custom script (that calls the document converters) to a Download Document command, and that same command must also perform the download of the document.

DocDownloadScript.png

 

The default conversion script looks like this:

using System;
using System.IO;
using Sequentum.ContentGrabber.Api;
public class Script
{
//See help for a definition of ConvertDocumentToHtmlArguments.
public static bool ConvertDocumentToHtml(ConvertDocumentToHtmlArguments args)
{
    if(args.DocumentType.Equals("pdf", StringComparison.OrdinalIgnoreCase))
    {
          args.ScriptUtilities.ExecuteCommandLine(@"Converters\pdftohtml\pdftohtml.exe",
args.DocumentFilePath, args.HtmlFilePath, "-noframes -nodrm");
    }
    else if(args.DocumentType.Equals("docx", StringComparison.OrdinalIgnoreCase))
    {
        args.ScriptUtilities.ExecuteCommandLine(@"Converters\docxtohtml\docxtohtml.exe",
args.DocumentFilePath, args.HtmlFilePath, "");
    }
    else if(args.DocumentType.Equals("xlsx", StringComparison.OrdinalIgnoreCase) || args.DocumentType.Equals("xls", StringComparison.OrdinalIgnoreCase))
    {
            if(args.IsDebug)
            {
                    args.ScriptUtilities.ExecuteCommandLine(@"Converters\exceltohtml\exceltohtml.exe",
args.DocumentFilePath, args.HtmlFilePath, "-noimages -rows 100");
            }
            else
            {
                    args.ScriptUtilities.ExecuteCommandLine(@"Converters\exceltohtml\exceltohtml.exe" args.DocumentFilePath, args.HtmlFilePath, "-noimages");
            }
    }
    else if(args.DocumentType.Equals("csv", StringComparison.OrdinalIgnoreCase) || args.DocumentType.Equals("txt", StringComparison.OrdinalIgnoreCase))
    {
            if(args.IsDebug)
            {
                    args.ScriptUtilities.ExecuteCommandLine(@"Converters\exceltohtml\exceltohtml.exe",
args.DocumentFilePath, args.HtmlFilePath, "-text -delimiter , -noimages -rows 100");
            }
            else
            {
                    args.ScriptUtilities.ExecuteCommandLine(@"Converters\exceltohtml\exceltohtml.exe", args.DocumentFilePath, args.HtmlFilePath, "-text -delimiter , -noimages");
            }
    }
    if(!File.Exists(args.HtmlFilePath))
            return false;
    return true;
}

Extracting Content From a Converted Document

After converting a document to HTML, that document goes into the agent data folder and you can use a Navigate URL command to open the HTML document. The Download Document command that did the conversion will also store the path to the document, and then the Navigate URL command can use that path to get the file URL to the HTML document.

Report.png

An agent with a URL command that links to a converted HTML document

The Navigate URL command uses data from the Download Document command, so the Download Documentcommand must execute first. Also, both the commands must have the same parent command or the Navigate URL command must be a child command of the command that contains the Download Document command.

You can execute the Navigate URL command in the editor to open the converted HTML document, but you must first execute the Download Document command to make sure the HTML document is available.

PDFInCG.png

A Navigate URL command using data captured by a Download Document command

JavaScript errors detected

Please note, these errors can depend on your browser setup.

If this problem persists, please contact our support.