
Creating Web Cache Using Sequentum Enterprise

Sequentum Enterprise allows you to download and store the entire DOM of the website that you are scraping. The Cache is stored in Lite DB inside the Cache folder in the agent folder, and you can read it and modify your agent using the Cache Settings available under Agent Settings. Using the Cache file, you can add new columns and extract content that you missed while the agent was running.

Set the following properties in the agent properties to create the Cache Lite DB file automatically every time the SE agent runs:

Content Cache → Write Cache → True

Retention Days: the number of days to keep the Cache files. The default value is 7 days.

contentcache.png

Alternatively, if you are running the agent manually, you can choose to write the Cache file from the Run Window, and you can read the Cache from Run Window → Cache Options.

runwindow_cache.png

To modify your agent at design time, select your Cache file under Agent Settings → Cache → Cache File → Cache Settings → Cache Folder Settings window.

cache_folder.png

The cached page is loaded in SE with the prefix cache://. For example, the Start URL for the https://sequentum.com website is loaded as cache://https://sequentum.com.

cache_prefix.png
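The prefix convention described above can be sketched in a few lines of Python. This is only an illustration of the URL mapping stated in this article, not SE code: the helper names `to_cache_url` and `to_live_url` are hypothetical and are not part of any Sequentum API.

```python
# Illustration of the cache:// prefix convention described above.
# to_cache_url / to_live_url are hypothetical helper names, not SE functions.
CACHE_PREFIX = "cache://"


def to_cache_url(url: str) -> str:
    """Return the cache-prefixed form of a URL (leaves cached URLs unchanged)."""
    return url if url.startswith(CACHE_PREFIX) else CACHE_PREFIX + url


def to_live_url(url: str) -> str:
    """Strip the cache prefix to recover the original live URL."""
    return url[len(CACHE_PREFIX):] if url.startswith(CACHE_PREFIX) else url


cached = to_cache_url("https://sequentum.com")
print(cached)               # cache://https://sequentum.com
print(to_live_url(cached))  # https://sequentum.com
```

A mapping like this can be handy when comparing URLs exported from a cached run with URLs from a live run, since the two differ only by the prefix.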
  • The cache feature lets you re-run your agent against the local cache, so a small change does not require re-running the entire crawl against the public website, which would be slow. Re-running against the live website may also produce different data because websites change frequently.

  • Change the data type, then re-run from the cache to regenerate the data.

  • Change the transformation logic and re-run.

  • Add a column and extract new content from the cached pages.

  • View and manipulate the cached content so that you can define rules to extract new columns.

Some Important Points on the Cache Files 

It is important to understand that cache files contain HTML only. They do not contain any external files such as images, CSS, or JavaScript. SE does not load anything from the Internet when you load an agent from a cache file. Therefore, web pages that come from a cache file will never look the same as the live website: you will not see images, there will be no styling, the page layout may look broken, and some HTML elements may be hidden.

This functionality is only intended for making very basic changes to the agent, for example, adding a new field. It is not intended for serious agent development or debugging. The idea is that you make minor changes to the agent, save it, and re-run it in Run mode to extract the missing data.

  • When you load the Cache file using the Cache Settings window, the Load Images property under the Web Browser section of the agent command properties is automatically set to “Never”. When you reload the agent, the Load Images property returns to its previous value. Selecting a cache file is not supposed to save any cache settings into the agent permanently; the settings are applied only temporarily, so reloading the agent loads the original agent with its original settings. This is similar to using a config file, where the settings are applied to the agent but are not saved permanently with it.

  • Some websites rely heavily on JavaScript to render the web page properly. As mentioned above, cache files contain HTML only and do not contain JavaScript files, so SE does not execute the JavaScript that makes page elements visible. In such cases, you need to add new commands by entering the XPath manually, or you can find and select the HTML elements on the Nodes tab.

  • It is also important to understand that sessions are only supported in Run mode. There is no session support in Debug mode, and this has always been the case. Therefore, certain session-related things simply won’t work in Debug mode. It is recommended to use the first session cache file; otherwise, you won’t be able to execute some commands manually because those pages do not exist in the last or any other session cache file.
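The point above about JavaScript-dependent pages can be illustrated outside SE: an element that a script would normally reveal is still present in the cached HTML, so an XPath expression can reach it even though it is not visible. The sketch below uses only Python's standard library on a made-up, well-formed snippet; the markup, the `price` id, and the value are all hypothetical, and this is not how SE itself evaluates XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical cached markup: the price block stays hidden because the
# script that would reveal it on the live site is not part of the cache.
cached_html = """
<html>
  <body>
    <h1>Sample product</h1>
    <div id="price" style="display:none">19.99</div>
  </body>
</html>
"""

root = ET.fromstring(cached_html.strip())

# ElementTree supports a limited XPath subset; select the element by id.
node = root.find(".//div[@id='price']")
price = node.text if node is not None else None
print(price)  # 19.99
```

The same idea applies when entering an XPath manually in SE: target the element by its attributes or position in the HTML rather than relying on it being visible in the rendered page.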
