The University Computing Centre based at the University of Zagreb (SRCE) and the National and University Library (NUL) in Zagreb started their co-operation on the 'Design of the System for Harvesting and Archiving Legal Deposit of Croatian Web Publications' project back in 2003 with the main goal of building a system that would be able to gather and archive the selected web publications located within Croatian Web space, as well as preserving the original web sites as much as possible. The main output of this project is the system named DAMP (Digital Archive for Web Publications), which has been developed by SRCE and has been connected to the NUL's Catalogue (CROLIST). It should be emphasised that the specific use of certain web technologies in the design of certain web sites or publications is the cause of many difficulties in the archive process and, in certain cases, of unsuccessful archiving processes. It is therefore vital to constantly improve the system and at the same time educate the publishers in order to avoid innovative but non-standard use of web technologies. In this article, however, we will concentrate on the architecture and functional model of the DAMP system.

DAMP has six main parts:
Basic data about web publications (e.g. title, URL) that are suitable for archiving are collected by DAMP and are transmitted via the automated data exchange with CROLIST. These data are stored in the database. DAMP's administrator (i.e. librarian) then defines the frequency and modality of gathering these websites via a web interface. The DAMP system also allows the administrator to define the depth of gathering, the parts of the web tree to be gathered, any exceptions (e.g. parts of the tree that will not be gathered), any data types to be gathered and any other parameters.
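To make these settings concrete, the following is a minimal sketch of how per-publication gathering settings might be represented and checked. It is not DAMP's actual data model: the class `GatherConfig`, its field names and the `accepts` method are all hypothetical, and only the kinds of setting (depth, parts of the tree, exceptions, data types) come from the description above.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class GatherConfig:
    """Hypothetical per-publication settings; all names are illustrative."""
    start_url: str
    max_depth: int = 3                                     # depth of gathering
    include_prefixes: list = field(default_factory=list)   # parts of the web tree to gather
    exclude_prefixes: list = field(default_factory=list)   # exceptions (not gathered)
    allowed_types: set = field(default_factory=set)        # data types; empty means any

    def accepts(self, url: str, depth: int, content_type: str = "") -> bool:
        """Return True if a resource falls within the configured limits."""
        if depth > self.max_depth:
            return False
        path = urlsplit(url).path or "/"
        # If include prefixes are given, the path must match one of them.
        if self.include_prefixes and not any(
            path.startswith(p) for p in self.include_prefixes
        ):
            return False
        # Exceptions always win over inclusions.
        if any(path.startswith(p) for p in self.exclude_prefixes):
            return False
        # The type check only applies when the content type is already known.
        if self.allowed_types and content_type and content_type not in self.allowed_types:
            return False
        return True
```

For example, a configuration with `include_prefixes=["/journal/"]` and `exclude_prefixes=["/journal/forum/"]` would accept `/journal/issue1.html` but reject anything under `/journal/forum/`.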

The main role of the scheduler is to start gathering certain web sites at particular times. More precisely, the scheduler puts jobs in a queue at a defined time.
Collecting the web pages is the main role of the subsystem called the "gatherer". There can be more than one instance of a gatherer active in the system. The "controller" controls the work of the gatherers, giving each one a job from the scheduler queue. When the gatherers finish gathering, they deliver the collected data to the controller, which stores it in the data storage component of DAMP. The administrator can access the archive using a web interface, which allows him/her to examine the results of each gathering. During the data exchange, the DAMP system provides CROLIST with an inventory of archived publications, complete with URLs.
The whole workflow is presented in Figure 1 and the architecture of the DAMP system is presented in Figure 2.
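The interplay between scheduler, controller and gatherers described above can be illustrated with a small sketch. It assumes a single process with in-memory queues, gatherers modelled as plain callables, and round-robin assignment of jobs; the class and method names are hypothetical and do not reflect DAMP's actual implementation.

```python
import heapq
from collections import deque


class Scheduler:
    """Moves jobs into a FIFO queue once their start time has arrived."""

    def __init__(self):
        self._pending = []      # min-heap of (start_time, sequence, job)
        self._seq = 0           # tie-breaker so jobs themselves are never compared
        self.queue = deque()    # jobs that are ready to be gathered

    def schedule(self, start_time, job):
        heapq.heappush(self._pending, (start_time, self._seq, job))
        self._seq += 1

    def tick(self, now):
        """Enqueue every job whose start time is not later than `now`."""
        while self._pending and self._pending[0][0] <= now:
            _, _, job = heapq.heappop(self._pending)
            self.queue.append(job)


class Controller:
    """Hands queued jobs to gatherers in turn and stores what they return."""

    def __init__(self, scheduler, gatherers):
        self.scheduler = scheduler
        self.gatherers = gatherers   # callables: job -> collected data
        self.storage = {}            # stands in for the data storage component

    def dispatch(self):
        turn = 0
        while self.scheduler.queue:
            job = self.scheduler.queue.popleft()
            gatherer = self.gatherers[turn % len(self.gatherers)]
            self.storage[job] = gatherer(job)
            turn += 1
```

In a real deployment the gatherers would presumably run concurrently and report back asynchronously; the sequential loop here only shows the direction in which jobs and results flow.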
DAMP's file system is organised in a hierarchical structure starting with the directories that correspond to the year of archiving. Each of these directories is divided into subdirectories, one per publication, named with the publication's unique identifier (ID). Each publication directory is further split into subdirectories that correspond to the particular gatherings. After successful gathering, a file named "damp.xml" is placed in these "gathering directories". This XML formatted file contains a complete list of all the successfully and unsuccessfully gathered resources. Resources are described by filename, URL, gathering time, data type, file checksum, HTTP status code and Content-type HTTP header. The starting directory of a specific web publication - with complete file structure - is stored at the same level. The data that link the publication with the directories and files in the file system are stored in the database. The organisation of DAMP's file system is shown in Figure 3 (below).
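As an illustration of this layout, the sketch below builds a gathering directory path and a manifest in the spirit of "damp.xml". The article does not give the actual schema, the naming of gathering directories or the checksum algorithm, so the element and attribute names, the date-based directory name and the use of SHA-256 are all assumptions made for the example.

```python
import hashlib
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath


def gathering_dir(root, year, publication_id, gathering_name):
    """Compose <root>/<year>/<publication ID>/<gathering> as described above."""
    return PurePosixPath(root) / str(year) / str(publication_id) / gathering_name


def build_manifest(resources):
    """Return manifest XML as a string; element names are invented, not DAMP's."""
    top = ET.Element("gathering")
    for r in resources:
        # The HTTP status code is kept as an attribute, the rest as child elements.
        item = ET.SubElement(top, "resource", status=str(r["status"]))
        for key in ("filename", "url", "time", "data_type", "content_type"):
            ET.SubElement(item, key).text = r[key]
        # Checksum algorithm is an assumption; the article does not name one.
        ET.SubElement(item, "checksum").text = hashlib.sha256(r["body"]).hexdigest()
    return ET.tostring(top, encoding="unicode")
```

With these assumptions, `gathering_dir("/archive", 2005, 123, "2005-07-01")` yields `/archive/2005/123/2005-07-01`, and the manifest would be written into that directory next to the publication's starting directory.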

The gathering process is initialised when the controller sends a job to the gatherer. As noted above, the gatherer puts the job's starting URL in a resource queue so that the process can start. Every URL in the queue is visited and any hyperlinks from the resulting page are extracted. The gatherer normalises each hyperlink and, if the hyperlink passes all tests and its depth from the site entry point does not exceed the defined limit, adds it to the queue. This process is repeated until all entries from the queue are processed. After that, the gatherer generates the "damp.xml" file and transforms the hyperlinks on each gathered page to build a browsable copy of the publication. All files, as well as the information about the job, are sent to the controller, which stores them in the data storage component (i.e. archive). The gathering process is illustrated with the diagram in Figure 4 (below).
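The queue-driven loop described above amounts to a breadth-first traversal with a depth limit. The following is a minimal sketch under stated assumptions: the page fetcher is passed in as a function (`fetch`, hypothetical) so the example runs without network access, normalisation is reduced to resolving relative links and dropping fragments, and the only "test" applied is that a link stays on the starting host. DAMP's real tests, the generation of "damp.xml" and the rewriting of hyperlinks are not shown.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def normalise(base_url, href):
    """Resolve a link against its page and drop any fragment."""
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    return absolute


def gather(start_url, fetch, max_depth):
    """Breadth-first gathering; `fetch(url)` returns HTML text or None."""
    host = urlsplit(start_url).netloc
    queue = deque([(start_url, 0)])   # resource queue of (URL, depth)
    seen = {start_url}
    pages = {}
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:              # unsuccessful gathering of this resource
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = normalise(url, href)
            if link in seen:
                continue
            # Assumed tests: same host, and depth within the defined limit.
            if urlsplit(link).netloc != host or depth + 1 > max_depth:
                continue
            seen.add(link)
            queue.append((link, depth + 1))
    return pages
```

Passing a dictionary's `get` method as `fetch` is enough to try the loop on a small in-memory site; a real gatherer would issue HTTP requests and record the status code and headers for each resource.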

Gathering parameters for each publication are defined by the administrator. When the controller sends a job definition to the gatherer, the parameters are also included. The system has nine parameters used to facilitate the gathering process. Here is the list with a short explanation:
Development of the DAMP system is an ongoing task. Further work will include the development of a user interface for public access, with a proper system of authentication and authorisation. We also plan to incorporate the following features in our system:
Special care must be devoted to the permanent conservation of archived content and the protection of the archive's integrity. Due to the constant change in web technologies, it is vital to keep improving the basic components of the system, especially the gatherer, enabling it to cope with new and upcoming web technologies.
The first version of the DAMP system proved to be a useful tool for archiving a limited and well defined web space. In further development we will try to enhance its capabilities, but it will still require the administrator to set up the right gathering parameters. One must accept that in some cases the use of particular web technologies will cause archiving difficulties, and in certain cases even render the archiving process unsuccessful. Therefore, it is also important to educate the publishers in order to avoid innovative but non-standard use of web technologies.
Contact: damp@srce.hr
All rights reside with the author(s). WIDWISAWN recognises the author(s) as the copyright holder(s) unless specifically directed otherwise.
WIDWISAWN is delivered via the SAPIENS e-publishing service and is hosted by the Centre for Digital Library Research at the University of Strathclyde
Last updated: Jul 2005