Sequentum Enterprise (SE) allows you to download and store the entire DOM of the pages that you are scraping. The Cache is stored in a Lite DB file inside the Cache folder in the agent folder, and you can read the Cache and modify your agent against it using the Cache Settings available under Agent Settings. With the Cache file, you can add new columns and extract content that you forgot to capture while the agent was running.
To create the Cache Lite DB file automatically every time the SE agent runs, set the following agent properties:
- Content Cache → Write Cache → True.
- Retention Days: the number of days to keep the Cache files. The default value is 7 days.
Alternatively, if you are running the agent manually, you can choose to write the Cache file, or to read from an existing Cache, under Run Window → Cache Options.
To modify your agent against the Cache at design time, select your Cache file under Agent Settings → Cache → Cache File → Cache Settings → Cache Folder Settings window.
A cached page is loaded in SE with the prefix cache://. For example, the Start URL for the https://sequentum.com website is loaded as cache://https://sequentum.com.
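The prefix convention described above can be sketched in a few lines of Python. This is only an illustration of the address format, assuming the cached address is the literal string cache:// followed by the original URL, as in the example. The function names are hypothetical and are not part of SE.

```python
# Illustration only: these helpers are hypothetical and not an SE API.
# Assumption: a cached address is "cache://" followed by the original URL.
CACHE_PREFIX = "cache://"


def to_cache_url(url: str) -> str:
    """Return the cache-style address for a live URL."""
    if url.startswith(CACHE_PREFIX):
        return url  # already a cache address, leave it unchanged
    return CACHE_PREFIX + url


def from_cache_url(url: str) -> str:
    """Return the original live URL for a cache-style address."""
    if url.startswith(CACHE_PREFIX):
        return url[len(CACHE_PREFIX):]
    return url  # not a cache address, leave it unchanged


if __name__ == "__main__":
    print(to_cache_url("https://sequentum.com"))
    print(from_cache_url("cache://https://sequentum.com"))
```

A helper like this can be handy when comparing URLs logged during a cached run with URLs logged during a live run.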
The cache feature allows you to:
- Rerun your agent against the local cache, so that a small change does not require re-running the entire crawl against the public website, which would be slow. Re-running against the live website may also return different data because websites change frequently.
- Change a data type and re-run from the cache to regenerate the data.
- Change the transformation logic and re-run.
- Add a column and extract new content from the cached pages.
- View and manipulate the cached content so that rules to extract new columns can be defined.
Some Important Points on the Cache Files
- This functionality is only intended for making very basic changes to the agent, for example, adding a new field. It is not intended for serious agent development or debugging. The idea is that you make minor changes to the agent, save it, and re-run it in Run mode to extract the missing data.
- When you load the Cache file using the Cache Settings window, the Load Images property under the Web Browser section of the agent command properties is automatically set to “Never”. When you reload the agent, the Load Images property returns to its previous value. Selecting a cache file applies the cache settings only temporarily and does not save them into the agent, so reloading the agent restores the original agent with its original settings. This is similar to using a config file, where the settings are applied to the agent but are not saved permanently with it.
- Sessions are only supported in Run mode. There is no session support in Debug mode, and this has always been the case, so certain session-related things will not work in Debug mode. It is recommended to use the first session cache file. Otherwise, you will not be able to execute some commands manually, because the pages they need do not exist in the last or any other session cache file.