NDIIPP Partnership Technical Architecture Survey
Collation – 1/2006
The following information has been extracted from the survey information provided at
the July, 2005 Partner Meeting. It is recognized that some information is incomplete or
Functions, Processes and Procedures
Ingest/Acquisition of Content/Content File Validation and Technical Analysis
[CDL] JHOVE (http://hul.harvard.edu/jhove/)
[EMORY] Consists of the LOCKSS OAI-PMH-driven content ingestion and replication
software component (http://www.lockss.org/index.html) as well as a hardware
component that makes use of Linux systems administration tools for allocating disk space
among institutional nodes.
[ICPSR] Virtual Data Center (VDC) (http://thedata.org/)
[UCSB] Working to develop a registry of geospatial formats based on the ADL Object
(http://www.alexandria.ucsb.edu/~gjanee/thesaurus/specification.html). See also
Archival Storage/Institutional Repository
[CDL] Storage Resource Broker / CDL Repository
[EMORY] Archival Storage is the OAIS layer that will be investigated least by the
current MetaArchive Partnership. The MetaArchive Network would like to add a bridge
between the content ingestion and replication system with a modular component that will
enforce archival storage (format migration or emulation and another layer of data
integrity checks that is informed by or communicates with the integrity checks contained
within the LOCKSS software).
[ICPSR] VDC, through the repository service component. The Repository stores and
manages digital objects and the administrative metadata (such as the object's owner, or
last time of access) associated with them. A repository access protocol allows for
maintenance, hiding the details of their storage (currently a SQL database) from the rest
of the system. The repository itself treats every object as a MIME-typed BLOB. All
knowledge about complex objects (objects that cannot be rendered by a browser without
pre-processing) is encapsulated inside the User Interface Service (UIS) (see Access
[UIUC] Under evaluation are DSpace, Fedora, Eprints, Greenstone, and the OCLC Digital
Archive, with DSpace used as the initial, primary repository. The project will also
evaluate the OCLC Web Archivist's Workbench (WAW), which is being developed as
another part of the UIUC project.
[ICPSR] A central catalog will be provided at the Harvard-MIT Data Center (HMDC)
with a registry of persistent identifiers, and a union-catalog created through harvesting.
The union catalog will enable searching, will itself be searchable through Z39.50 and
harvestable through OAI-PMH, and will link to the local holdings.
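To illustrate the harvesting interface mentioned above, the following minimal Python sketch builds an OAI-PMH ListRecords request URL. The base URL in the usage note is hypothetical and not taken from the survey; verb, metadataPrefix, from and set are standard OAI-PMH request arguments. It shows only how such a request could be formed, not how the HMDC catalog is actually harvested.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL (sketch; base_url is supplied by the caller)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        # Selective harvesting: only records changed on or after this date.
        params["from"] = from_date
    if set_spec:
        # Optional set membership filter.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)
```

For example, list_records_url("http://example.org/oai", from_date="2005-07-01") requests all Dublin Core records changed since that date from a hypothetical endpoint.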
VDC, through the User Interface Service (UIS), the gateway to the other service
components (repository, name resolution and indexing service components). The UIS is
implemented as a set of Java servlets, each of which encapsulates access to a particular
services and objects. Each object or service is itself described in XML, and XSL is used
to render the object.
[NCSU] Any content exposed through the NC OneMap access delivery framework will
be represented via geospatial web services using Open Geospatial Consortium (OGC)
specifications. The current NC OneMap viewer implementation makes both image and
vector data available in a format-independent manner via the OGC Web Map Server
(WMS) specification. The NC OneMap application acts as a cascading map server,
making WMS requests to distributed state, local, and federal data servers and combining
the resulting information into a single map image sent to the user. The WMS
specification is limited in that it does not deliver the underlying data, just a map image
that is the result of the request. The Web Feature Server (WFS) specification will be
explored as a means to stream the actual data content to the end user. NC OneMap's web
services also stream content to the National Map.
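The WMS request pattern described above can be sketched as follows. This is a minimal Python illustration, assuming WMS version 1.1.1 GetMap parameters; the server URL, layer names and bounding box used in the usage note are hypothetical and are not NC OneMap values. A cascading server would issue one such request per upstream data server and composite the returned images, which is why the result is a picture rather than the underlying data.

```python
from urllib.parse import urlencode


def wms_getmap_url(base_url, layers, bbox, width, height,
                   srs="EPSG:4326", image_format="image/png"):
    """Build a WMS 1.1.1 GetMap URL; the response is a rendered image, not data."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": ",".join(layers),
        "STYLES": "",  # empty value requests each layer's default style
        "SRS": srs,
        "BBOX": ",".join(str(v) for v in bbox),  # minx, miny, maxx, maxy
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": image_format,
    }
    return base_url + "?" + urlencode(params)


def cascade_requests(server_urls, layers, bbox, width, height):
    """One GetMap URL per upstream server (hypothetical helper for a cascading client)."""
    return [wms_getmap_url(url, layers, bbox, width, height) for url in server_urls]
```

For example, wms_getmap_url("http://example.org/wms", ["roads", "counties"], (-84.5, 33.8, -75.4, 36.6), 800, 400) yields a single request for an 800 by 400 pixel PNG covering that bounding box.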
[UCSB] The Alexandria Digital Library (ADL) middleware server
(http://www.alexandria.ucsb.edu/~gjanee/middleware/) is written in Java and Python and
can be run as a web application inside a servlet container, as an RMI server, or both.
Distributed with the server is the "Bucket99 driver," a configurable component that
allows relational databases to be viewed as collections.
Web Crawling/File Indexing
[CDL] Under development: A Web Crawl service for initiating and monitoring web
crawls and for processing web crawl results. This service will interact with a new CDL
resource manager (Web Crawl Manager), which in turn will interact
with Internet Archive's Heritrix (http://crawler.archive.org/) web crawler, utilizing
Heritrix's JMX interface.
eXtensible Text Framework (XTF): a flexible indexing and query tool written by CDL
that supports searching across collections of heterogeneous data and presents results in a
highly configurable manner. Details at http://www.cdlib.org/inside/projects/xtf/.
Under development: A curation service for defining web crawl collections, scheduling
crawls, packaging web crawl results with archival metadata, submitting crawl results to
the preservation repository, etc.
[ICPSR] VDC, through the Index Server (IS). The IS manages indexing and searching
(queries) of the descriptive metadata associated with each object. Index servers act with a
large amount of independence – they are assigned sets of identifiers that they are
responsible for indexing. In addition, the index servers asynchronously resolve the
identifiers to a repository, retrieve the metadata component of these objects, and build
indices based on this metadata.
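The indexing flow described above (assigned identifiers, resolution to a repository, metadata retrieval, index construction) can be sketched in Python as follows. This is an illustrative sketch only, not VDC code: the class name, the in-memory dictionaries standing in for the name resolution and repository services, and the sample identifiers are all assumptions, and the loop runs synchronously whereas the text describes asynchronous resolution.

```python
from collections import defaultdict


class IndexServer:
    """Indexes descriptive metadata for the identifiers assigned to it (sketch)."""

    def __init__(self, assigned_ids, resolve, fetch_metadata):
        self.assigned_ids = list(assigned_ids)
        self.resolve = resolve                # identifier -> repository name
        self.fetch_metadata = fetch_metadata  # (repository, identifier) -> dict
        self.index = defaultdict(set)         # term -> set of identifiers

    def build(self):
        for identifier in self.assigned_ids:
            repository = self.resolve(identifier)
            metadata = self.fetch_metadata(repository, identifier)
            for value in metadata.values():
                for term in str(value).lower().split():
                    self.index[term].add(identifier)

    def search(self, term):
        return sorted(self.index.get(term.lower(), set()))


# Hypothetical stand-ins for the name resolution and repository services.
LOCATIONS = {"hdl:1/a": "repo-1", "hdl:1/b": "repo-2"}
REPOSITORIES = {
    "repo-1": {"hdl:1/a": {"title": "Voter Turnout Survey", "year": 1998}},
    "repo-2": {"hdl:1/b": {"title": "Household Income Survey", "year": 2001}},
}


def make_demo_server():
    return IndexServer(
        assigned_ids=LOCATIONS,
        resolve=LOCATIONS.__getitem__,
        fetch_metadata=lambda repo, ident: REPOSITORIES[repo][ident],
    )
```

Passing the resolver and fetcher in as functions keeps the index server independent of where the objects are stored, which mirrors the independence described in the text.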
[CDL] Under development: A rights management service for identifying and recording
rights metadata for crawled content.
Persistent Identifier Manager
[CDL] NOID (Nice Opaque Identifier) Minting and Binding Tool. The NOID utility
written by CDL creates minters (identifier generators) and accepts commands that
operate them. A NOID minter is a lightweight database designed for efficiently
generating, tracking, and binding unique identifiers, which are produced without
replacement in random or sequential order, and with or without a check character that can
be used for detecting transcription errors. CDL utilizes the Archival Resource Key
(ARK) identifier, a naming scheme for persistent access to digital objects (including
images, texts, data sets, and finding aids), currently being tested and implemented by the
California Digital Library (CDL) for collections that it manages. Details on ARK at
http://www.cdlib.org/inside/diglib/ark/ and NOID at
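The idea of a minter that issues identifiers without replacement and appends a check character can be sketched as follows. This Python sketch is an assumption-laden illustration, not the NOID utility: the class and function names are invented, the identifiers are minted in sequential order only, and the position-weighted checksum over a 29-character alphabet is written in the spirit of NOID's check character without any claim to reproduce NOID's exact output.

```python
ALPHABET = "0123456789bcdfghjkmnpqrstvwxz"  # 29 characters: digits plus consonants


def check_char(identifier):
    """Position-weighted checksum; characters outside ALPHABET count as zero."""
    total = 0
    for position, ch in enumerate(identifier, start=1):
        value = ALPHABET.find(ch)
        total += position * (value if value >= 0 else 0)
    return ALPHABET[total % len(ALPHABET)]


def is_valid(identifier):
    """True if the final character matches the checksum of the rest."""
    return len(identifier) > 1 and check_char(identifier[:-1]) == identifier[-1]


class SequentialMinter:
    """Mints each identifier once, in sequential order, and records bindings."""

    def __init__(self, prefix="", length=4):
        self.prefix = prefix
        self.length = length
        self.counter = 0
        self.bindings = {}  # identifier -> {element: value}

    def mint(self):
        n = self.counter
        if n >= len(ALPHABET) ** self.length:
            raise RuntimeError("minter exhausted")
        self.counter += 1
        digits = []
        for _ in range(self.length):
            n, remainder = divmod(n, len(ALPHABET))
            digits.append(ALPHABET[remainder])
        body = self.prefix + "".join(reversed(digits))
        identifier = body + check_char(body)
        self.bindings[identifier] = {}
        return identifier

    def bind(self, identifier, element, value):
        self.bindings[identifier][element] = value
```

Because the modulus 29 is prime and the identifiers here are shorter than 29 characters, a single mistyped character or a swap of two adjacent, different characters changes the checksum, which is the kind of transcription error the check character is meant to catch.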
[ICPSR] VDC, through the name resolution system (NRS) manages identifiers for each
digital object. Each distinct intellectual work stored in the system will be supplied with a
persistent identifier using the CNRI handle system (http://www.handle.net) (and
additional identifiers later identified by the partnership) and a format-independent
cryptographic hash or digital signature.
[NCSU] Unique, non-semantic, auto-generated identifiers will be used to track items in
the workflow management database and will be included with item metadata. NOTE:
There is no universal unique identifier scheme for data resources in the geospatial
industry space. Upon ingest, DSpace-provided Handles will also be stored in the
workflow management database. The DSpace Handles are initially seen as redundant.
Separate collection identifiers will connect individual data items with broader collections.
[CDL] Under development: Develop a standard encoding format for representing crawled
content and associated metadata, and utilize this format for moving content between
CDL-CF services, and between CDL's repositories and partners' repositories. This
format will likely combine METS metadata encoding and the next generation of Internet
Archive's web archive file format (ARC). An ARC file is the concatenation of many
datastreams, whereby a separation between datastreams is created by a section that
provides metadata, mainly crawling-related, in a text-only format (see
[EMORY] The MetaArchive Metadata Schema (MACDMS) is derived primarily from
the UKOLN RSLP Collection Description Profile
(http://www.ukoln.ac.uk/metadata/rslp/) but has been augmented with a local at-risk
ranking and with LOCKSS specific metadata tags.
[ICPSR] The basic object managed in the VDC system is the study. Each study comprises
a metadata object and a set of associated data objects. The metadata object follows the
Data Documentation Initiative (DDI) standard
(http://www.icpsr.umich.edu/DDI/index.html), an XML-based standard for social science
metadata, and contains all of the structural metadata for that study, and the descriptive
metadata for the corresponding (abstract) intellectual work. The associated data objects
consist of text files (usually for supplementary documentation), MIME-typed BLOBs
(Binary Large Objects), and/or structured quantitative databases. The metadata object
acts to document the study and to tie the associated data objects together.
[NCSU] Metadata will be stored in the following forms: 1) as FGDC records or ESRI
Profile FGDC records, 2) as Qualified Dublin Core, Library Application Profile (roughly)
within the DSpace database (Oracle), and 3) as METS records (stored with the content as
DSpace bitstreams) containing the FGDC records and additional elements.
The Federal Geographic Data Committee (FGDC) Content Standard for Digital
Geospatial Metadata will apply to most acquired content. Harmonization of the FGDC
standard with the ISO Draft Technical Specification ISO 19139 (Geographic information
– Metadata) is ongoing.
Additional metadata elements which have either been extracted from the FGDC records,
disambiguated from FGDC elements, or have been created in addition to FGDC elements
will be stored in a separate MySQL database along with administrative and repository
ingest workflow elements related to the item. A combination of ESRI FGDC elements
and additional elements from the MySQL database will be mapped to: 1) the DSpace
Simple Archive Format, which uses the Library Application Profile (roughly) of
Qualified Dublin Core, and 2) METS records for submission as DSpace bitstreams along
with the data items. While current plans involve the use of METS, developments related
to GeoDRM and the prospective use of MPEG 21 as a content packaging framework will
be watched closely. Longer term, it is expected that the project would adopt whatever
content packaging scheme is adopted by the geospatial industry.
[PDPT] PBCore (http://www.utah.edu/cpbmetadata/) and METS
[UCSB] ADL metadata
File Format Standards
Not requested in the survey
[CDL] CDL Common Framework (CDL-CF) is an open, services-oriented architecture
written in Java. Consistent exposure of services through SOAP and Java Client API. All
CDL-CF services are exposed as web services, using SOAP with attachments via HTTP.
[EMORY] LOCKSS software installation is completely closed except to the nodes housed
at the member institutions. The trusted relationship among servers has had the side effect
of our institutions looking to this system as a test of the Shibboleth open-source platform
for inter-institutional identity and trust relationships (http://shibboleth.internet2.edu/).
[NCSU] The project will explore a future component focusing on the issue of integrating
preserved content into existing geospatial data discovery and access systems, notably the
NC OneMap framework at the state level (http://www.cgia.state.nc.us/nconemap/) and
Geospatial One-Stop (http://www.geo-one-stop.gov/) and the National Map at the
national level (http://nationalmap.gov/).
[EMORY] The hardware is constructed from off-the-shelf components from EMC and
This is a standard storage area network utilizing Intel-based hardware and Fibre Channel
SATA disk storage.
1 Dell PowerEdge 1850 Server (Storage) (Processor: 1 Intel Xeon, 800 MHz front
side bus – Memory: 1 GB DDR2 – Storage: two 73 GB 15K SCSI drives (mirrored))
1 Dell/EMC AX100 Storage System (3 TB SATA Storage – 2 active processors)
1 Dell PowerEdge 1850 Server (Firewall) (Processor: 1 Intel Xeon, 800 MHz front
side bus – Memory: 2 GB DDR2 – Storage: two 73 GB 10K SCSI drives (mirrored))
1 Dell PowerConnect 2616 16 Port GigE Unmanaged Switch
1 Dell PowerEdge 4210 Rack Mount System
[ICPSR] The VDC node at HMDC is currently hosted on redundant Red Hat Linux
Opteron-based servers, using Xsan (http://www.apple.com/xsan/) technology for
redundant storage and DLT tape jukeboxes for backup. Each partner will use additional
hardware and software for ingest, local storage, and dissemination, tailored to the
partner's needs and workflow. This currently includes: Solaris and other Unix servers,
SAS and SPSS statistical software for data ingest and manipulation, DLT and DVD
[NCSU] 2 Nexsan ATABeast systems, each with forty-two 400GB ATA drives, 7200
rpm, Fibre to ATA connectivity, 512 MB cache. Capacity will be 16.8 TB per system,
with a total of 12.4 TB of redundant, usable space. One system will be deployed offsite
and will replicate content on the other system. Additional drive upgrades in an existing
ATABeast system provide additional auxiliary space. The disk space will be managed in
1.5-2 TB partitions, with tape backups. The storage environment will be managed by a
Sun Fire V440 server cluster (four 1.593GHz UltraSPARC IIIi units).
[PDPT] WGBH DAM (Digital Asset Management) implementation
(http://daminfo.wgbh.org/): Sun Fire servers and StorEdge server to run Artesia for DAM
and Oracle database server. For OAIS conformance, the software requirements are
Java for the framework implementation and a PostgreSQL server to host the DSpace
repository's backend database.
[UCSB] UCSB and Stanford are using a variety of systems, both disk and tape: Isilon,
Centera, etc. Archivas is a common denominator.
[EMORY] Operating system on the Storage Area Network: Red Hat Enterprise Linux AS
[ICPSR] Red Hat Linux at the VDC
[NCSU] Solaris 9
Preservation Planning and Strategies
[CDL] Normalization of Data on Ingest: Dates may be normalized either on ingest or in
[EMORY] Redundant Data Storage: Redundancy is spread out over six different
institutions utilizing the backbone of the Internet2 Abilene network and the local
connections of the Southern Crossroads (SoX) network consortium and the Mid-Atlantic
Crossroads (MAX) network consortium (See Abilene Map
Bit Preservation: MetaArchive Network is only working on bit preservation. This is
accomplished using MD5 checksums along with the LOCKSS polling algorithm that
slowly checks each host node's content against each other for faults, additions or
subtractions. This decentralized model does not necessarily preserve bits within the
framework of an individual institution's access but provides a cost sharing model for bit
preservation and access within the MetaArchive Network.
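The general idea of checksum-based comparison across nodes can be sketched as follows. This Python sketch is a deliberate simplification and is not the LOCKSS polling algorithm, which is considerably more elaborate: here every node's copy is hashed with MD5 and a simple majority decides which digest is treated as the consensus. The node names and function names are hypothetical.

```python
import hashlib
from collections import Counter


def md5_hex(data):
    """MD5 digest of a byte string, as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


def audit_replicas(replicas):
    """Compare copies held by several nodes.

    replicas maps a node name to the bytes that node holds.
    Returns (consensus_digest, nodes_that_disagree); if no digest is held
    by a strict majority, returns (None, all node names).
    """
    digests = {node: md5_hex(content) for node, content in replicas.items()}
    consensus, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(replicas):
        return None, sorted(replicas)
    disagreeing = sorted(node for node, d in digests.items() if d != consensus)
    return consensus, disagreeing
```

A node reported as disagreeing would be a candidate for repair from a node that holds the consensus copy; with six member institutions, one or two damaged copies can still be outvoted.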
[ICPSR] Migration Strategies: The VDC system provides built-in support for format
Normalization of Data on Ingest: Data will be normalized on ingest, which will involve
the conversion of data in proprietary statistical formats to human-readable text+XML
formats, and will also involve the conversion (or creation of preservation copies) of
documentation in proprietary formats to preservation formats, such as XML, plain text,
and PDF/A. Descriptive statistics are calculated to ensure that the data has been
successfully normalized. When a data object in a proprietary format such as SAS or SPSS
is ingested into the system, it is converted to a set of plain-text files and XML files that
completely capture all of the data and metadata embedded in the original file.
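The check described above, in which descriptive statistics confirm that normalization preserved the data, can be sketched as follows. This minimal Python sketch assumes the normalized output is a plain-text CSV file with a header row and that the original values have already been read from the proprietary file by some other tool; the function names, the column name in the usage note and the choice of statistics are illustrative assumptions rather than the VDC implementation.

```python
import csv
import io
import math
import statistics


def describe(values):
    """A small set of descriptive statistics for a numeric variable."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
    }


def read_column(csv_text, column):
    """Read one numeric column from normalized plain-text (CSV) output."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [float(row[column]) for row in reader]


def normalization_matches(original_values, normalized_csv, column, rel_tol=1e-9):
    """True if the statistics agree before and after normalization."""
    normalized_values = read_column(normalized_csv, column)
    if len(original_values) != len(normalized_values):
        return False
    if not original_values:
        return True  # nothing to compare
    before = describe(original_values)
    after = describe(normalized_values)
    return all(math.isclose(before[k], after[k], rel_tol=rel_tol) for k in before)
```

For example, normalization_matches([1.0, 2.5, 4.0], "income\n1.0\n2.5\n4.0\n", "income") passes, while a file in which one value was altered or dropped during conversion fails. Matching statistics are evidence of a faithful conversion, not proof of it.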
Redundant Data Storage: As much of the content as possible will be located on a VDC
system at Harvard that will mirror data at each individual partner's sites.
[NCSU] Migration Strategies: The emergence of a widely accepted and understood
Geography Markup Language (GML) application schema (http://opengis.net/gml/) might
trigger the migration of additional vector content at a future point. There is some interest
in developing mechanisms, based on data inventories, to track format market strength in
order to assess decline in support of particular formats, informing migration triggers.
Normalization of Data on Ingest: Existing FGDC metadata will need to be normalized.
Malformed content will need to be restructured using openly available FGDC tools known
as 'cns' (Chew and Spit) and 'mp' (Metadata Parser). Metadata will be imported using
the National Park Service Metadata Toolkit on top of ESRI's ArcCatalog for
synchronization of elements and subsequent export as ESRI Profile FGDC XML records.
Managing Time-Versioned Content: In one possible approach, serial item representations
of data items would be assigned Handles which serve as serial item identifiers. A
separate database table would be used to manage serial representations, including storage
of volatile layer-related information which would be inappropriate for inclusion in
metadata stored with repository objects. The serial representation would be actively
managed over time, and would provide the basis for Handle-based call-backs from the
time-versioned items--after they have separated from the archive--for “get current
metadata,” “get current object,” and “get current DRM” requests. The role of the Handle
would be to redirect to the current location of the serial object manager, which would
resolve to the current context of the individual serial object. This approach is still
[UCSB] Migration Strategies: Along with each archival data object we intend to archive
sufficient format, semantic, and contextual information that enables access mechanisms
for the object to be re-created at any point in the future, and that allows the object to be
used for the scientific purposes for which it was intended when originally created. This
approach is effectively a foundation for other preservation approaches (periodic
Define standard, public data model that outlives both storage systems and archival system
[UIUC] Examining the intersections of different strategies, as embodied by the digital
repository software they are evaluating. This project also tests a new preservation
strategy, called the Arizona Model, a rationale based on archival principles for selecting
and managing web-harvested digital materials as a hierarchy of aggregates rather than as
individual items. The Arizona Model is implemented by the Web Archives Workbench
tools developed by this project.