Convert Document to HTML Scripts

A Convert Document to HTML script converts a downloaded document into an HTML page, so that Sequentum Enterprise can extract data from the document the same way it does from any other HTML page.

Please see the topic Extracting Data From Non-HTML Documents for more information.

A Convert Document to HTML script can be added to a Download Document command by selecting the Convert to HTML configuration tab and setting the option Convert to HTML.

The following example is the default Convert Document to HTML script. This script checks the type of the downloaded document and uses an appropriate document converter to convert the document to an HTML page:

using System;
using System.IO;
using Sequentum.ContentGrabber.Api;
public class Script
{
	//See help for a definition of ConvertDocumentToHtmlArguments.
	public static bool ConvertDocumentToHtml(ConvertDocumentToHtmlArguments args)
	{
		if(args.DocumentType.Equals("pdf", StringComparison.OrdinalIgnoreCase))
		{
			args.ScriptUtilities.ExecuteCommandLine(Path.Combine(args.Agent.DirectoryName, @"..\Shared\Converters\pdftohtml\pdftohtml.exe"),
				args.DocumentFilePath, args.HtmlFilePath, "-noframes -nodrm");
		}
		else if(args.DocumentType.Equals("docx", StringComparison.OrdinalIgnoreCase))
		{
			args.ScriptUtilities.ExecuteCommandLine(Path.Combine(args.Agent.DirectoryName, @"..\Shared\Converters\docxtohtml\docxtohtml.exe"),
				args.DocumentFilePath, args.HtmlFilePath, "");
		}
		else if(args.DocumentType.Equals("xlsx", StringComparison.OrdinalIgnoreCase) || args.DocumentType.Equals("xls", StringComparison.OrdinalIgnoreCase))
		{
			if(args.IsDebug)
			{
				args.ScriptUtilities.ExecuteCommandLine(Path.Combine(args.Agent.DirectoryName, @"..\Shared\Converters\exceltohtml\exceltohtml.exe"),
					args.DocumentFilePath, args.HtmlFilePath, "-noimages -rows 100");
			}
			else
			{
				args.ScriptUtilities.ExecuteCommandLine(Path.Combine(args.Agent.DirectoryName, @"..\Shared\Converters\exceltohtml\exceltohtml.exe"),
					args.DocumentFilePath, args.HtmlFilePath, "-noimages");
			}
		}
		else if(args.DocumentType.Equals("csv", StringComparison.OrdinalIgnoreCase) || args.DocumentType.Equals("txt", StringComparison.OrdinalIgnoreCase))
		{
			if(args.IsDebug)
			{
				args.ScriptUtilities.ExecuteCommandLine(Path.Combine(args.Agent.DirectoryName, @"..\Shared\Converters\exceltohtml\exceltohtml.exe"),
					args.DocumentFilePath, args.HtmlFilePath, "-text -delimiter , -noimages -rows 100");
			}
			else
			{
				args.ScriptUtilities.ExecuteCommandLine(Path.Combine(args.Agent.DirectoryName, @"..\Shared\Converters\exceltohtml\exceltohtml.exe"),
					args.DocumentFilePath, args.HtmlFilePath, "-text -delimiter , -noimages");
			}
		}
		if(!File.Exists(args.HtmlFilePath))
			return false;
		return true;
	}
}

The equivalent default script in Python:

import clr
import os
from Sequentum.ContentGrabber.Api import *

def ConvertDocumentToHtml(args):
	if args.DocumentType.lower() == 'pdf':
		args.ScriptUtilities.ExecuteCommandLine(os.path.join(args.Agent.DirectoryName, '..\\Shared\\Converters\\pdftohtml\\pdftohtml.exe'), args.DocumentFilePath, args.HtmlFilePath, '-noframes -nodrm')
	elif args.DocumentType.lower() == 'docx':
		args.ScriptUtilities.ExecuteCommandLine(os.path.join(args.Agent.DirectoryName, '..\\Shared\\Converters\\docxtohtml\\docxtohtml.exe'), args.DocumentFilePath, args.HtmlFilePath, '')
	elif args.DocumentType.lower() == 'xlsx' or args.DocumentType.lower() == 'xls':
		if args.IsDebug:
			args.ScriptUtilities.ExecuteCommandLine(os.path.join(args.Agent.DirectoryName, '..\\Shared\\Converters\\exceltohtml\\exceltohtml.exe'), args.DocumentFilePath, args.HtmlFilePath, '-noimages -rows 100')
		else:
			args.ScriptUtilities.ExecuteCommandLine(os.path.join(args.Agent.DirectoryName, '..\\Shared\\Converters\\exceltohtml\\exceltohtml.exe'), args.DocumentFilePath, args.HtmlFilePath, '-noimages')
	elif args.DocumentType.lower() == 'csv' or args.DocumentType.lower() == 'txt':
		if args.IsDebug:
			args.ScriptUtilities.ExecuteCommandLine(os.path.join(args.Agent.DirectoryName, '..\\Shared\\Converters\\exceltohtml\\exceltohtml.exe'), args.DocumentFilePath, args.HtmlFilePath, '-text -delimiter , -noimages -rows 100')
		else:
			args.ScriptUtilities.ExecuteCommandLine(os.path.join(args.Agent.DirectoryName, '..\\Shared\\Converters\\exceltohtml\\exceltohtml.exe'), args.DocumentFilePath, args.HtmlFilePath, '-text -delimiter , -noimages')
	if not os.path.exists(args.HtmlFilePath):
		return False
	return True

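The branching in the default script can also be written as a lookup table, which may be easier to extend when you add a converter for another document type. The following is a minimal sketch, not part of the Sequentum API: the names CONVERTERS, ROW_LIMITED and select_converter are illustrative, and the paths and options are copied from the default script.

```python
# Hypothetical table-driven variant of the default script's branching.
# Executable paths are relative to the Shared\Converters folder and,
# like the options, are copied from the default script.
CONVERTERS = {
    'pdf':  ('pdftohtml\\pdftohtml.exe', '-noframes -nodrm'),
    'docx': ('docxtohtml\\docxtohtml.exe', ''),
    'xlsx': ('exceltohtml\\exceltohtml.exe', '-noimages'),
    'xls':  ('exceltohtml\\exceltohtml.exe', '-noimages'),
    'csv':  ('exceltohtml\\exceltohtml.exe', '-text -delimiter , -noimages'),
    'txt':  ('exceltohtml\\exceltohtml.exe', '-text -delimiter , -noimages'),
}

# Document types that the default script limits to 100 rows in debug mode.
ROW_LIMITED = ('xlsx', 'xls', 'csv', 'txt')


def select_converter(document_type, is_debug):
    """Return (relative_exe_path, options), or None for an unsupported type."""
    key = document_type.lower()
    if key not in CONVERTERS:
        return None
    exe, options = CONVERTERS[key]
    if is_debug and key in ROW_LIMITED:
        options += ' -rows 100'
    return exe, options
```

In a real script, the returned path would be joined to the Shared\Converters folder under args.Agent.DirectoryName and passed to args.ScriptUtilities.ExecuteCommandLine together with args.DocumentFilePath and args.HtmlFilePath, exactly as the default script does.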
The function should return True if the conversion succeeds and False if it fails.
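The default script treats the conversion as successful when the HTML file exists. A slightly stricter check is sketched below. The helper name conversion_succeeded is hypothetical, and the non-empty test is an added assumption rather than something the default script does.

```python
import os


def conversion_succeeded(html_file_path):
    """Return True only if a non-empty file exists at html_file_path."""
    # The default script checks existence only; rejecting an empty file
    # is an extra assumption made for this sketch.
    return os.path.isfile(html_file_path) and os.path.getsize(html_file_path) > 0
```

A script could end with return conversion_succeeded(args.HtmlFilePath) in place of the final existence check.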

Please note that Sequentum Enterprise does not have any built-in tools to convert non-HTML documents to HTML, so you must use third-party tools. You may find some of them here.
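Before wiring a third-party converter into an agent, it can help to try its command line on its own. The sketch below uses Python's standard subprocess module outside Sequentum Enterprise. The functions run_converter and demo are hypothetical, and the "converter" is a stand-in that simply writes a small HTML file, so replace the command with the real tool and its arguments.

```python
import os
import subprocess
import sys


def run_converter(command, html_file_path):
    """Run a converter command line and report whether it produced the HTML file."""
    exit_code = subprocess.call(command)
    return exit_code == 0 and os.path.exists(html_file_path)


def demo(html_file_path):
    # Stand-in "converter": Python writes a tiny HTML file to the path it is given.
    # Replace this list with the real tool and its arguments.
    command = [sys.executable, '-c',
               "import sys; open(sys.argv[1], 'w').write('<html></html>')",
               html_file_path]
    return run_converter(command, html_file_path)
```

Inside an agent script, the same command would be run with args.ScriptUtilities.ExecuteCommandLine, as in the default script.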

An instance of the ConvertDocumentToHtmlArguments class is provided by Sequentum Enterprise and has the following functions and properties:

| Property or Function | Description |
|---|---|
| string DocumentFilePath | The path of the document that needs to be converted to HTML. |
| string DocumentType | The type of document that needs to be converted to HTML. For example, if the document is a PDF document, the document type will be pdf. |
| string HtmlFilePath | The file path the script should use for the converted HTML file. |
| Agent Agent | The current agent. |
| ScriptUtils ScriptUtilities | A script utility class with helper methods. See Script Utilities for more information. |
| Command Command | The current agent command being executed. |
| IContainer ParentContainer | The parent container command of the current command. |
| IConnection DatabaseConnection | The current internal database connection used by the agent. This connection is already open and should not be closed by your script. |
| IHtmlNode HtmlNode | The extracted HTML node. |
| IInternalDataRow DataRow | The current internal data row containing the data that has been extracted so far in the current container command. |
| bool IsDebug | True if the agent is running in debug mode. |
| bool IsSchemaOnly | If true, only the data schema is required, so you can optimize processing by only returning the data schema with no data. |
| IInputData InputDataCache | All input data available to the current command. |
| void WriteDebug(string debugMessage, DebugMessageType messageType = DebugMessageType.Information) | Writes log information to the agent log. This method has no effect if agent logging is disabled, or if called during design time. |
| void WriteDebug(string debugMessage, bool showMessageInDesignMode, DebugMessageType messageType = DebugMessageType.Information) | Writes log information to the agent log. This method has no effect if agent logging is disabled, or if called during design time. |
| void Notify(bool alwaysNotify) | Triggers notification at the end of an agent run. If alwaysNotify is set to false, this method only triggers a notification if the agent has been configured to send notifications on critical errors. |
| void Notify(string message, bool alwaysNotify) | Triggers notification at the end of an agent run, and adds the message to the notification email. If alwaysNotify is set to false, this method only triggers a notification if the agent has been configured to send notifications on critical errors. |
| GlobalDataDictionary GlobalData | Global data dictionary that can be used to store data that needs to be available in all scripts and after agent restarts. Input Parameters are also stored in this dictionary. |
| IConnection GetDatabaseConnection(string connectionName) | Returns the specified database connection. The database connection must have been previously defined for the agent or be a shared connection for all agents on the computer. Your script is responsible for opening and closing the connection by calling the OpenDatabase and CloseDatabase methods. |
| IInputDataRow GetInputData() | If the current command is a data provider, the data for that command is returned. Otherwise, this function searches the command's parents and returns the first found input data. |
| IInputDataRow GetInputData(Command command) | If the specified command is a data provider, the data for that command is returned. Otherwise, this function searches the command's parents and returns the first found input data. |
| IInputDataRow GetInputData(string commandName) | If the specified command is a data provider, the data for that command is returned. Otherwise, this function searches the command's parents and returns the first found input data. |
| IInputDataRow GetInputData(Guid commandId) | If the specified command is a data provider, the data for that command is returned. Otherwise, the function throws an error. |
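As an illustration of how WriteDebug and Notify might be combined with the return value, the sketch below logs the outcome of a conversion and requests a notification when the HTML file is missing. FakeArgs is a hypothetical stand-in that mimics only the members used here, so the code can be run outside an agent, and report_result is likewise an illustrative name.

```python
import os


class FakeArgs(object):
    """Hypothetical stand-in for ConvertDocumentToHtmlArguments, for use outside an agent."""

    def __init__(self, document_type, html_file_path):
        self.DocumentType = document_type
        self.HtmlFilePath = html_file_path
        self.messages = []       # records WriteDebug calls
        self.notifications = []  # records Notify calls

    def WriteDebug(self, debugMessage):
        self.messages.append(debugMessage)

    def Notify(self, message, alwaysNotify):
        self.notifications.append((message, alwaysNotify))


def report_result(args):
    """Log the outcome of a conversion and request a notification on failure."""
    if os.path.exists(args.HtmlFilePath):
        args.WriteDebug('Converted %s document to %s' % (args.DocumentType, args.HtmlFilePath))
        return True
    message = 'Could not convert %s document' % args.DocumentType
    args.WriteDebug(message)
    # alwaysNotify=False: notify only if the agent is configured to send
    # notifications on critical errors (see the Notify description above).
    args.Notify(message, False)
    return False
```

In an agent script, report_result(args) would be called with the real arguments object after the converter has run, and its result returned from ConvertDocumentToHtml.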
