Fingerprinting & Anonymization Techniques
Some websites impose request rate limits on, or block access by, automated web scraping tools (non-human users) by employing fingerprinting methods that keep track of visiting users' identities. Fingerprinting is the capability of a website to identify or re-identify a visiting user, user agent or device via configuration settings or other observable characteristics. A similar definition is provided by RFC 6973.
The intent of this article is to give a brief introduction to some of the most common fingerprinting methods employed by websites, and at the same time, introduce Sequentum Enterprise users to a set of configurable anonymization-specific settings built into the software.
Types of Fingerprinting
Passive fingerprinting - browser fingerprinting based on characteristics observable in the contents of Web requests, without the use of any code executed on the client. Passive fingerprinting would trivially include cookies (often unique identifiers sent in HTTP requests), the set of HTTP request headers and the IP address and other network-level information. The User-Agent string, for example, is an HTTP request header that typically identifies the browser, renderer, version and operating system. For some populations, the User-Agent and IP address will often uniquely identify a particular user's browser.
Active fingerprinting - on top of Passive fingerprinting, this type includes techniques where a site runs JavaScript or other code on the local client to observe additional characteristics about the browser, user, device or other context. Techniques for active fingerprinting might include accessing the window size, enumerating fonts or plug-ins, evaluating performance characteristics, reading from device sensors, accessing client-side storage, and rendering graphical patterns. Key to this distinction is that active fingerprinting takes place in a way that is potentially detectable on the client.
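To make the passive case concrete, here is a minimal Python sketch of how a site could derive an identifier from nothing but the request itself. It is an illustration only: the function name passive_fingerprint, the choice of headers, and the use of SHA-256 are assumptions, not a description of any particular site's method.

```python
import hashlib

# Headers assumed (for illustration) to feed the fingerprint.
FINGERPRINT_HEADERS = ("User-Agent", "Accept", "Accept-Language", "Accept-Encoding")


def passive_fingerprint(ip_address: str, headers: dict[str, str]) -> str:
    """Hash the IP address and selected request headers into a stable ID."""
    # Normalize header names so the lookup is case-insensitive, as in HTTP.
    normalized = {name.lower(): value for name, value in headers.items()}
    parts = [ip_address]
    for name in FINGERPRINT_HEADERS:
        parts.append(normalized.get(name.lower(), ""))
    # Join with a separator that is unlikely to appear in header values.
    material = "\x1f".join(parts)
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


request_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Accept-Language": "en-US,en;q=0.9",
}
first_visit = passive_fingerprint("203.0.113.7", request_headers)
second_visit = passive_fingerprint("203.0.113.7", request_headers)
rotated = passive_fingerprint(
    "203.0.113.7", {**request_headers, "User-Agent": "OtherBrowser/2.0"}
)
print(first_visit == second_visit)  # True: same IP and headers, same ID
print(first_visit == rotated)       # False: changing the User-Agent changes the ID
```

An active fingerprint could be sketched the same way by adding client-reported values (window size, font list, canvas output) to the hashed material; those values would have to be collected by code running in the browser.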
Parser Mode vs. Dynamic Browser Mode Fingerprinting
Parser Mode - If an Agent is configured to load pages in the HTML/JSON/XML Parser, web pages are not rendered in a dynamic browser, so JavaScript and other client-side code is not executed. Hence, the only applicable type of fingerprinting is Passive fingerprinting.
Dynamic Browser Mode - Web pages are rendered in a web browser, so both Passive and Active fingerprinting are applicable.
Bypassing Passive Fingerprinting
The relevant information elements in a Passive fingerprint are the set of HTTP request headers (including User-Agent and Cookies) and the IP address. The following is a list of information elements and the corresponding means to bypass fingerprinting:
IP Address - Use a proxy server to hide your IP address. When you use a proxy server, you do not visit the target website directly, but instead, request that the proxy server visit the website for you. Refer to the article How to Configure Proxy Servers for more details.
User-Agent - You can configure an Agent to use a custom or predefined User-Agent string when extracting data. You can configure this in the Agent's Properties > Agent > User Agent. Sequentum Enterprise includes a long built-in list of valid User-Agent strings that an Agent can be configured to rotate through. User-Agent rotation can be configured in Properties > Anonymization.
Cookies - You can configure an Agent or a Link/URL action command to clear site cookies, session cookies, or all cookies before loading a web page. This option can also be configured globally across all Link/URL action commands in the Agent's Properties > Anonymization.
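For readers who want to see what these three measures amount to outside Sequentum Enterprise, the following is a minimal sketch using only the Python standard library. It is not how Sequentum Enterprise implements them; the User-Agent pool, the function name build_anonymized_opener, and the proxy address are all hypothetical.

```python
import http.cookiejar
import random
import urllib.request

# Hypothetical pool of User-Agent strings; a real pool would hold current browser values.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ExampleBrowser/2.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/3.0",
]


def build_anonymized_opener(proxy_url, rng=random):
    """Return (opener, cookie_jar) using a proxy, an empty cookie jar, and a random User-Agent."""
    cookie_jar = http.cookiejar.CookieJar()  # new jar, so no cookies carry over
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url}),
        urllib.request.HTTPCookieProcessor(cookie_jar),
    )
    # Replace the default "Python-urllib/x.y" User-Agent header.
    opener.addheaders = [("User-Agent", rng.choice(USER_AGENTS))]
    return opener, cookie_jar


# Placeholder proxy address; no request is sent in this sketch.
opener, cookie_jar = build_anonymized_opener("http://proxy.example:8080")
print(opener.addheaders)
print(len(cookie_jar))  # 0
```

Calling opener.open(url) would route the request through the configured proxy. Building a new opener before each page load has the same effect as clearing all cookies and rotating the User-Agent per request.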
Bypassing Active Fingerprinting
The relevant information elements in an Active fingerprint include all elements listed under Passive fingerprinting above, plus other browser or device properties that can be accessed using Active fingerprinting methods. The following list describes the browser or device properties and resources that a website might access to identify a visiting user, together with the options available in Sequentum Enterprise for additional anonymization.
Storage - You can configure an Agent or a Link/URL action command to clear storage before loading a web page. This option can also be configured globally across all Link/URL action commands in the Agent's Properties > Anonymization.
Web Browser Profile - An Agent can be configured to use random web browser profiles while extracting data. Rotation of web browser profiles can be configured in the Agent's Properties > Anonymization > Rotate Web Browser Profile.
Browser Size - In Dynamic Browser Mode, an Agent can be configured to randomize the size of the browser every time a web page is loaded. This can be configured in the Agent's Properties > Anonymization > Randomize Browser Size.
Canvas Reading - Canvas reading in HTML5 is used by some websites to fingerprint a browser. Canvas Reading is disabled by default in Sequentum Enterprise, and it is strongly suggested to keep this option disabled. However, there are cases where a web page fails to load when canvas reading is disabled. In such cases, Sequentum Enterprise offers an option to generate a random (spoofed) canvas string when the code in the web page requests one. You can access these settings in the Agent's Properties > Web Browser > Allow Canvas Reading at Run Time and Properties > Web Browser > Rotate Canvas String. Note that you will also have to enable Canvas Reading in Application Settings to set the behavior for design time.
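The ideas behind browser size randomization and canvas string rotation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Sequentum Enterprise's implementation: the base sizes, the jitter range, the hex format of the spoofed string, and both function names are hypothetical.

```python
import random
import secrets

# Assumed base window sizes (width, height); not taken from the product.
BASE_SIZES = [(1366, 768), (1440, 900), (1920, 1080)]
MAX_JITTER = 40  # assumed maximum shrink in pixels on each axis


def random_browser_size(rng=random):
    """Pick a base size and shrink it by a small random amount."""
    width, height = rng.choice(BASE_SIZES)
    return width - rng.randint(0, MAX_JITTER), height - rng.randint(0, MAX_JITTER)


def spoofed_canvas_string(num_bytes=16):
    """Return a random hex string standing in for a canvas read-out."""
    return secrets.token_hex(num_bytes)


width, height = random_browser_size()
print(width, height)
print(spoofed_canvas_string())
```

Because both values change on every call, a site that hashes the window size or the canvas read-out into its fingerprint would see a different result on each page load.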