Extracting Data From Non-HTML Documents
Websites generally provide most of their content in HTML format, but some websites may also provide content in other formats - such as PDF or Microsoft Word documents. Since Sequentum Enterprise can only process HTML documents, it will simply download any non-HTML document. Sequentum Enterprise can help you extract text and images from within a PDF or Word document by converting such documents into HTML.
To have Sequentum Enterprise convert your non-HTML document, you will need to provide an external document converter. The Sequentum Enterprise public website provides a list of open source programs that you can use for this purpose. Please remember that we don't support these tools, and you must comply with the license for any conversion tool.
Limitations
The design of most file formats, including PDF and Word files, doesn't include ease-of-conversion to HTML. So, the conversion output is considerably more difficult to manage than standard HTML. In many cases, you'll have to select the entire HTML page and then use Regular Expressions to extract the target content.
Installing a Document Converter
Sequentum Enterprise uses a custom script that makes a call to an external document converter, and you can configure this script to call any type of program. The default script can handle the three document converters currently available for download on the Sequentum Enterprise website.
Installing the PDF To HTML Converter
Follow these steps to install the PDF-to-HTML document converter:
Download the pdftohtml.zip file from the Sequentum Enterprise website:
https://contentgrabber.com/web-scraping-tools
2. Extract the content of the zip file into the default Sequentum Enterprise Converters folder, My Documents\Content Grabber\Agents\Shared\Converters. You can also copy the converter into the corresponding Public Documents folder if you need the converter to be available for all users on the computer.
3. The direct path to the document converter should now be:
My Documents\Content Grabber\Agents\Shared\Converters\pdftohtml\pdftohtml.exe
Installing the Docx To HTML Converter
Follow these steps to install the Docx to HTML document converter:
Download the docxtohtml.zip file from the Sequentum Enterprise website:
https://contentgrabber.com/web-scraping-tools
2. Extract the content of the zip file into the default Sequentum Enterprise Converters folder, My Documents\Content Grabber\Agents\Shared\Converters. You can also copy the converter into the corresponding Public Documents folder if you need the converter to be available for all users on the computer.
3. The direct path to the document converter should now be:
My Documents\Content Grabber\Agents\Shared\Converters\pdftohtml\pdftohtml.exe
Installing the Excel To HTML Converter
Follow these steps to install the Excel to HTML document converter:
Download the exceltohtml.zip file from the Sequentum Enterprise website:
https://contentgrabber.com/web-scraping-tools
2.Extract the content of the zip file into the default Sequentum Enterprise Converters folder, My Documents\Content Grabber\Agents\Shared\Converters. You can also copy the converter into the corresponding Public Documents folder if you need the converter to be available for all users on the computer.
3. The direct path to the document converter should now be:
My Documents\Content Grabber\Agents\Shared\Converters\exceltohtml\exceltohtml.exe
Using a Document Converter
You'll need to add the custom script (that calls the document converters) to a Download Document command, and that same command must also perform the download of the document.
The default conversion script looks like this:
using System;
using System.IO;
using Sequentum.ContentGrabber.Api;
public class Script
{
//See help for a definition of ConvertDocumentToHtmlArguments.
public static bool ConvertDocumentToHtml(ConvertDocumentToHtmlArguments args)
{
if(args.DocumentType.Equals("pdf", StringComparison.OrdinalIgnoreCase))
{
args.ScriptUtilities.ExecuteCommandLine(@"Converters\pdftohtml\pdftohtml.exe",
args.DocumentFilePath, args.HtmlFilePath, "-noframes -nodrm");
}
else if(args.DocumentType.Equals("docx", StringComparison.OrdinalIgnoreCase))
{
args.ScriptUtilities.ExecuteCommandLine(@"Converters\docxtohtml\docxtohtml.exe",
args.DocumentFilePath, args.HtmlFilePath, "");
}
else if(args.DocumentType.Equals("xlsx", StringComparison.OrdinalIgnoreCase) || args.DocumentType.Equals("xls", StringComparison.OrdinalIgnoreCase))
{
if(args.IsDebug)
{
args.ScriptUtilities.ExecuteCommandLine(@"Converters\exceltohtml\exceltohtml.exe",
args.DocumentFilePath, args.HtmlFilePath, "-noimages -rows 100");
}
else
{
args.ScriptUtilities.ExecuteCommandLine(@"Converters\exceltohtml\exceltohtml.exe" args.DocumentFilePath, args.HtmlFilePath, "-noimages");
}
}
else if(args.DocumentType.Equals("csv", StringComparison.OrdinalIgnoreCase) || args.DocumentType.Equals("txt", StringComparison.OrdinalIgnoreCase))
{
if(args.IsDebug)
{
args.ScriptUtilities.ExecuteCommandLine(@"Converters\exceltohtml\exceltohtml.exe",
args.DocumentFilePath, args.HtmlFilePath, "-text -delimiter , -noimages -rows 100");
}
else
{
args.ScriptUtilities.ExecuteCommandLine(@"Converters\exceltohtml\exceltohtml.exe", args.DocumentFilePath, args.HtmlFilePath, "-text -delimiter , -noimages");
}
}
if(!File.Exists(args.HtmlFilePath))
return false;
return true;
}
}
Extracting Content From a Converted Document
After converting a document to HTML, that document goes into the agent data folder and you can use a Navigate URL command to open the HTML document. The Download Document command that did the conversion will also store the path to the document, and then the Navigate URL command can use that path to get the file URL to the HTML document.
The Navigate URL command uses data from the Download Document command, so the Download Documentcommand must execute first. Also, both the commands must have the same parent command or the Navigate URL command must be a child command of the command that contains the Download Document command.
You can execute the Navigate URL command in the editor to open the converted HTML document, but you must first execute the Download Document command to make sure the HTML document is available.