CAPTCHA Blocking
A website can implement CAPTCHA blocking by using a web form that the user must submit to gain access to restricted areas of the site. The web form is usually quite simple, consisting of an image element and a text box element. The image displays characters that the user must enter into the text box exactly as they appear in the image. A human user can read the text in the CAPTCHA image, but a web scraping agent requires character recognition software to discern the characters in the image.
Sequentum Enterprise supports both manual and automatic data extraction from websites that implement CAPTCHA blocking. Automatic data extraction requires an account with a third-party CAPTCHA recognition service, which typically charges a small fee for each CAPTCHA image it processes. Manual data extraction is free but requires you to manually decode CAPTCHA images while running a data extraction agent.
Manual CAPTCHA Configuration
You have two options when configuring manual CAPTCHA. The easiest one to configure is the option we describe below. The other option uses the same approach as automatic CAPTCHA configuration, but instead of using a script to resolve the CAPTCHA, a window is displayed allowing the user to manually resolve the CAPTCHA.
Manual CAPTCHA processing is easy to configure, but requires you to manually decode CAPTCHA images while the agent is running. The agent will pause and display the browser window where a user can view the CAPTCHA image and enter the CAPTCHA text in the text box. You can configure the agent to automatically submit the web form after the user has entered the CAPTCHA text, or you can manually submit the form while the agent is paused.
If CAPTCHA blocking is part of a larger registration form, you can process the CAPTCHA part manually and let the agent process the rest of the form automatically. In this case you should let the agent submit the form automatically rather than relying on the user to submit the form while the agent is paused.
Follow these steps to pause an agent when a CAPTCHA image is displayed:
1. Add an Execute Script command to your agent.
2. Select the CAPTCHA image element in the web browser. This sets the command's web selection.
3. Select the default script type Pause Agent.
4. Select the default script condition If Selection Exists.
5. Save the command.
This command will pause the agent when the CAPTCHA image element exists on the web page, and allow a user to enter the CAPTCHA text.
You must add this command to all locations in the agent where CAPTCHA blocking could be encountered.
Important: Manual CAPTCHA processing relies on a human user to decode the CAPTCHA image, so an agent using manual CAPTCHA configuration cannot be run from the scheduler or API, or in any other fully automated way.
Automatic CAPTCHA Configuration
Automatic CAPTCHA processing requires an account with a third-party CAPTCHA recognition service. The third-party recognition service must provide a .NET API, and you must add an OCR script that uses this API to call the service. See the section below for two examples of CAPTCHA recognition services.
Follow these steps to configure an agent for automatic CAPTCHA processing:
1. Add a new Group Commands command. This group of commands will handle CAPTCHA.
2. Add an Execute Script command to the group. This command will skip CAPTCHA processing if the CAPTCHA image doesn't exist in the web page.
   - Select the CAPTCHA image in the web browser. This sets the command's web selection.
   - Select the default script type Exit Command.
   - Select the default command Parent Command.
   - Select the default condition If Selection Missing.
3. Add a Download Image command to the group. This command will download the CAPTCHA image and use an OCR script to decode the image into plain text.
   - Select the CAPTCHA image in the web browser. This sets the command's web selection.
   - Open the OCR tab and check the option Convert image to text.
   - Add an OCR script. See below for script examples. If you want to manually resolve the CAPTCHA, you can check the option Convert image manually in which case you should not specify a script.
4. Add a Set Form Field command to the group. This command will use the converted image text to set the CAPTCHA text box.
   - Select the CAPTCHA form field on the web page. This sets the command's web selection.
   - Clear the command option Use default input.
   - Set the data provider to Captured Data and select the CAPTCHA image command from step 3.
5. Add a Navigate Link command to the group. This command will submit the CAPTCHA form.
   - Select the CAPTCHA form submit button. This sets the command's web selection.
6. Add an Execute Script command to the group. This command will retry CAPTCHA processing if the CAPTCHA image still exists on the web page. If the CAPTCHA image still exists we assume it's because the CAPTCHA recognition service decoded the CAPTCHA image incorrectly and we'll try again.
   - Select the CAPTCHA image in the web browser. This sets the command's web selection.
   - Select the default script type Retry Command.
   - Select the default command Parent Command.
   - Select the default condition If Selection Exists.
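The command group configured above amounts to a retry loop: exit if there is no CAPTCHA, otherwise decode, fill in, submit, and try again while the image is still present. A rough Python sketch of that control flow follows. It is not Sequentum's scripting API: the five callables are hypothetical stand-ins for the commands in the group, and max_attempts is an added assumption, since the steps above do not state a retry limit.

```python
from typing import Callable


def resolve_captcha(
    captcha_exists: Callable[[], bool],   # hypothetical: is the CAPTCHA image on the page?
    download_image: Callable[[], bytes],  # hypothetical stand-in for Download Image
    ocr: Callable[[bytes], str],          # hypothetical stand-in for the OCR script
    set_field: Callable[[str], None],     # hypothetical stand-in for Set Form Field
    submit: Callable[[], None],           # hypothetical stand-in for Navigate Link
    max_attempts: int = 3,                # assumption: retry cap, not taken from the steps
) -> bool:
    """Return True once the CAPTCHA image is gone, False if attempts run out."""
    for _ in range(max_attempts):
        if not captcha_exists():
            return True  # corresponds to Exit Command / If Selection Missing
        text = ocr(download_image())
        set_field(text)
        submit()
        # Looping back corresponds to Retry Command / If Selection Exists.
    return not captcha_exists()
```

Passing the page interactions in as callables keeps the sketch independent of any particular browser or recognition service, so the loop can be exercised with simple stubs.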
Sequentum has integrated support for the following CAPTCHA recognition services. We are not affiliated with these companies in any way and don't charge any additional fees for these services.
CAPTCHA OCR Scripts
If you use a CAPTCHA recognition service that Sequentum does not have integrated support for, you need to use a Custom Script. To configure it, select 'Custom Script' from the Service Provider list and write your OCR Service script.
Your CAPTCHA recognition service should provide you with an example of such a script.
Resolve CAPTCHA composite command
To simplify the process, consider using a Resolve CAPTCHA composite command to add a number of sub-commands that can be used to process standard CAPTCHA images.
Troubleshooting
If CAPTCHA processing fails even though the correct CAPTCHA text was entered, the following issue may be the cause.
Sequentum cannot get the CAPTCHA image directly from the web browser, so it downloads the image a second time. The web server may disallow that second download, especially if you are using a proxy rotation service that may assign a new IP address for the image download. To overcome this problem, use a Download Screenshot command instead of a Download Image command, because a screenshot does not require a second image download.
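One way to picture this failure mode is the toy model below, written in Python. The server behavior is an assumption made purely for illustration (real sites vary): it remembers which IP address loaded the page and refuses a separate image request from any other address. ToyCaptchaServer and its methods are hypothetical names, and the addresses come from a documentation-only range.

```python
class ToyCaptchaServer:
    """Toy model (assumption): the CAPTCHA image is tied to the IP address
    that loaded the page, and image requests from other IPs are refused."""

    def __init__(self) -> None:
        self._page_ip = {}  # session id -> IP address that loaded the page

    def load_page(self, session: str, ip: str) -> None:
        self._page_ip[session] = ip

    def fetch_image(self, session: str, ip: str) -> int:
        """Return an HTTP-style status code for a separate image request."""
        return 200 if self._page_ip.get(session) == ip else 403


server = ToyCaptchaServer()
server.load_page("s1", ip="203.0.113.10")

# Second download from the same IP address is accepted.
same_ip_status = server.fetch_image("s1", ip="203.0.113.10")

# Second download after the proxy rotated to a new IP address is refused.
rotated_ip_status = server.fetch_image("s1", ip="203.0.113.77")
```

A screenshot avoids the second request altogether, so in this model the rotated address never reaches the server.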
Reporting CAPTCHA failure to service provider
The service provider may sometimes return an incorrect CAPTCHA result. For example, if your CAPTCHA image contains uppercase letters but the service returns lowercase letters, the problem is likely on the service provider's side: the 'Case Sensitive' feature may not have been enabled for your service account. Contact your service provider about issues of this type.
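Before contacting the provider, it can help to confirm that the wrong results differ only by letter case. The helper below is a generic Python sketch for comparing a CAPTCHA you have read yourself with the text the service returned; classify_captcha_result is a hypothetical name and is not part of Sequentum or any recognition service API.

```python
def classify_captcha_result(expected: str, returned: str) -> str:
    """Compare the text read by a human with the text the service returned."""
    if returned == expected:
        return "match"
    if returned.casefold() == expected.casefold():
        # Same characters, different case: points at a case-sensitivity setting.
        return "case-only mismatch"
    return "different text"
```

If a sample of failed CAPTCHAs mostly falls into the case-only category, that is reasonable evidence to include when asking the provider about the 'Case Sensitive' setting.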