Delivering Digital Repositories with Open Solutions Whitepaper

DELIVERING DIGITAL REPOSITORIES WITH OPEN SOLUTIONS

by Carl Grant, President & Co-Founder of CARE Affiliates, Inc.
White Paper, Version 8.0, November 2007
Sun Microsystems, Inc.

Abstract

Libraries and information organizations are in need of solutions to manage and preserve the diverse and rapidly increasing digital content that is being placed in their care. Through the use of new open solutions including application software, Sun hardware, operating systems, storage solutions and consulting services, these organizations are now able to rapidly deploy a solution that meets the challenges.

Table of Contents

Introduction
Open Solution Architectures
Architecture Overview
Architecture - Application Software
Architecture - Hardware - Server / OS
Architecture - Storage
Hardware - Storage - Implementation Restrictions
Continuing Digital Repository Preservation/Archival Challenges
Conclusion
Glossary
Suggested Readings and Websites

DIGITAL REPOSITORY GROWTH

One measure of repository growth is through OAIster, a union catalog of digital resources.
Using OAI-PMH (the Open Archives Initiative Protocol for Metadata Harvesting), OAIster harvests metadata records from repositories. These charts show their growth:

Chapter 1 Introduction

Libraries today are under pressure to handle a vast and rapidly growing amount of digital data. According to one estimate, “humankind creates 5 exabytes of stored data - in paper, the equivalent of 500,000 new Libraries of Congress each year. More than 90 percent of those 5 exabytes were stored on a hard disk.” [1] Libraries are increasingly turning to digital repositories to store the portion of this digital data that is deemed worthy of future scholarly research and therefore of preservation. (See the sidebar for growth charts of repositories.)

Digital or institutional repositories are defined as “a managed collection of digital objects, institutional in scope, with consistent data and metadata structures, enabling resource discovery by the ‘Communities of Practice’ for whom the objects are of interest.” [2] Repositories are critical to the development and management of organization-wide digital content, and they bring greater value to the organization’s intellectual treasury. Generally speaking, repositories are intended to be broad in scope and to cover a wide range of multimedia digital objects. Examples include research papers, dissertations and data sets; scientific visualizations of data sets; PowerPoint/OpenOffice files; web-based e-learning content, such as past course content and blogs; video; and photos and animations. In short, a repository can hold literally anything an organization creates and wants to save for ongoing, active collaborative use or long-term preservation, including national heritage archiving projects.

There are a number of challenges to be addressed in providing a large-scale repository solution, including:

1. Objects must always be retrievable (Access).
2. Cost and complexity must be contained as systems scale (Economic sustainability).
3. Data must be findable through sophisticated metadata handling (Discovery).
4. Data integrity must be assured (Trust).
5. Seamless scaling must be provided (Scale & extensibility).

[1] The Search by John Battelle. Portfolio (2005), p. 276.
[2] Geoff Payne, Manager of the ARROW (Australian Research Repositories Online to the World) Project.

[Sidebar charts: # of searches, # of repositories, # of countries. Source: www.oaister.org/stats.html]

As libraries implement repository technology, they face a variety of decisions about the direction and services to be offered to their users. These include decisions about the types of digital objects to be accepted; the commitment to preserve, migrate and maintain those objects; and the types of transformations, metadata and searching that will be supported in order to allow discovery and delivery of objects to end users.

One particularly challenging and complex issue for all libraries, especially those that move quickly to create digital repositories, is preservation and archiving. The issues involved are so complex that they are frequently not addressed in the initial planning. If not considered, they are potential land mines for a project. Thus they remain large opportunities. At the most basic level, libraries have to be sure to address these four primary areas of preservation/archiving (Wheatley, 2004):

1. Data can be maintained without being lost, damaged or altered.
2. Data can be found and extracted for or by a user.
3. Data can be interpreted and understood by the end user.
4. Objectives 1-3 can be achieved in perpetuity.

Beyond these, as a project moves towards actualization, there are specific, identified categories of threat to preservation, such as technological obsolescence, media decay and software obsolescence, which must be addressed. Specifically, these include, as identified by Rosenthal et al. [3]:

1. Media failure.
2. Hardware failure.
3. Software failure.
4. Operator errors.
Solutions for these issues are beginning to appear or are under development. Specifically, hardware manufacturers like Sun Microsystems have started to address items 1 and 2 above (media failure and hardware failure) through new components like the Sun StorageTek™ 5800 system, which is examined in this paper.

To implement a digital repository and, while doing so, lay a solid groundwork for a preservation/archival solution, one can select components from an ever-growing list of available options. Digital repository solutions are clearly a combination of hardware, proprietary or open source software, and open standards. The architecture used to assemble these components deserves examination, because open solutions are available that provide the best technology available today to deal with each of the areas identified above.

[3] Requirements for Digital Preservation Systems by David S. H. Rosenthal, Thomas Robertson, Tom Lipkis, Vicky Reich and Seth Morabito. D-Lib Magazine.

Open Computing

Great ideas come from everywhere, and great ideas should be shared. Those simple concepts are the cornerstones of Sun’s open computing philosophy. Open computing is typically defined as an environment in which components (software, hardware and human) can be mixed and matched to build a system, because the interfaces between them are described by open standards that are fully defined, publicly available and maintained according to industry consensus. Sun’s decision to open source its flagship Solaris™ Operating System and the Sun StorageTek™ 5800 system API, and Sun’s creation of the Preservation and Archiving Special Interest Group (Sun PASIG), are examples of Sun’s commitment to open collaboration.
Chapter 2 Open Solution Architectures

In planning the architecture for a digital repository solution, especially one with strong attention to preservation, there is a growing trend towards “open source software”. While this term can quickly become overworked and even stretched in what it encompasses, there are solid reasons for the trend, including:

• The growth of open source into enterprise and corporate software environments. Examples include WordPress (blogging/website software which, as of Sept 7, 2007, had been downloaded nearly 900K times), OpenOffice from Sun (an open source alternative to MS Office, currently estimated to have around 50M users), Sun’s OpenSolaris™ OS and Firefox (a web browser like MS Internet Explorer which, as of Sept 7, had around 400M users). In addition, there are products like Apache (a web server), MySQL (a relational database) and Thunderbird (an email package like MS Outlook).

• Vendor consolidation. Mergers and acquisitions are hardly new to the technology market, but the effect on customers is nearly always the same: concern and uncertainty about the future of the applications they rely upon. Inevitably, some customers are stranded on a product for which they have neither source code nor other support options, so they are forced to migrate to a new product. This has caused many to see open source software as a comfortable way to maintain long-term control over their future.

• Easier to procure. Open source offers a streamlined procurement process. Open source is freely available, so many organizations, after doing some basic homework, simply download the application, install it and test it against their needs. In the proprietary world this option may not exist at all, or may consist of a “limited time or functionality” version; an open source product can be downloaded and run for as long as needed.
This can be done with multiple products. Once a product is selected, and if commercial support is needed, an RFP can then be issued, and it becomes a much more focused procurement tool. Ultimately, open source allows a much more thorough examination of the product at a much lower cost, and with much better results.

• Greater reliability. One of the real advantages of open source is that the product receives a much greater level of peer review, not only of the specifications for new features but also of the code written to implement those features. The result is, quite simply, a more robust and reliable product.

• No vendor lock-in. When customers adopt open source, they have access to the source code, which means that a vendor cannot lock them in. If their vendor is bought, sold or consolidated for whatever reason, they can move to a new vendor who will continue to enhance and maintain the product. Customers can therefore determine when they upgrade or migrate and which features they use or don’t use.

• Support options. This is what many customers were waiting for before they could move to open source. In the early days of open source, customers needed their own programming resources to use it, and many customers that did not meet this requirement set the open source option aside. This is no longer the case. Commercial entities now exist that provide support options just like those obtained with proprietary software, at a reasonable cost. These companies handle data conversion, installation, training, support, maintenance, ongoing development, customization and all the other services customers have come to expect with proprietary software. Of course, the other big advantage here is that a customer has options.
If they don’t like the support or cost of their present vendor, they can switch to another. This introduces an inherent competition to provide better service, as service becomes key to the vendor retaining customers.

• Development options (user focused, group focused, wide participation). Customers have generally grown frustrated with trying to get needed developments from proprietary vendors. The process is slow and costly, and many times what gets delivered is not what was needed. In addition, many developments get stuck behind new contract developments or ROI analyses that slow the company’s delivery of the needed feature. The bottom line is a massive amount of frustration for customers. Open source models address these problems by giving customers the option of hiring their own programming resources or those of another company.

• More efficient use of financial resources. Moving to open source clearly doesn’t mean everything is free. However, one huge cost (licensing) is removed, and customers have potentially created a competitive market for the other costs.

• Community. Products and ideas developed in open environments benefit tremendously from the community aspects of the open model. Ideas, best practices, use cases and source code are all openly shared with those interested in participating in the community. This benefits all members of the community and generally means that ideas develop faster. Furthermore, wider participation results in better-formed solutions. Sun’s Preservation and Archiving Special Interest Group (www.sun-pasig.org) is just one example of this kind of community.

While open source solutions are one available path, they are not the only one. In some cases, solutions that are open through access to open APIs and/or compliance with open standards can satisfactorily meet the needs of organizations.
“Sun has had a longstanding commitment to both open computing standards and community development. We created the Sun PASIG to give institutions involved in repositories and archives a forum to globally share best practices, experiences, and project information. Sun will contribute its expertise around key data management and storage technologies and collaborate with leaders in this area for the benefit of the broader archiving community.”
- Art Pasquinelli, Sun Education Market Strategist

Chapter 3 Architecture Overview

The architecture for today’s repositories, while sophisticated and ever changing, is increasingly a plug-compatible set of choices that use open standards, Service Oriented Architecture and APIs to allow the rapid development and deployment of solutions highly customized to the needs of the organization. For example, at a high level, a sample total-solution framework using a Fedora-based repository and Sun foundation technologies would include:

1. Web-based client module (FEZ, ELATED, VITAL).
   a. HTTP-based protocol
2. Fedora
   a. Interface Layer
      i. API-A(ccess) supporting both REST and SOAP
      ii. API-M(anagement) supporting both REST and SOAP
   b. Application Logic Layer (implements requests in terms of the object model).
3. Solaris™ or OpenSolaris™ (OS)
4. Sun server(s) such as Sun UltraSPARC® and x64/x86 servers
5. IP network
6. Storage Layer(s)
   a. Datacenter, Midrange or Workgroup disk
   b. Enterprise Archive (StorageTek 5800 system).

While this is one example, similar examples can be configured using other repository packages, including DigiTool, DPS, VITAL, DSpace or EPrints.

“Oxford has chosen Sun Honeycomb systems [also known as Sun StorageTek 5800 system] to form the storage component of the Libraries’ Digital Asset Management System which will underpin all future activities involving digital object repositories.
The Honeycomb architecture and roadmaps make it particularly suitable not just for the large scale storage of digital objects but also, more crucially, for the longer term preservation of those objects. In particular, the use of storage beans to place distributed processing capacity close to storage allows verification and transformation activities to be carried out in a very scalable manner, avoiding the bandwidth limitations which would occur if these tasks involved a conventional round-trip from storage to processing resources. Additionally, the aim to make the Honeycomb API open and implementable over a variety of storage platforms should enable increased longevity of systems in terms of both technology change and capacity scaling.”
- Neil Jefferies, R&D Project Manager, Systems & eResearch Service, Oxford University Libraries

Chapter 4 Architecture - Application Software

Application software for digital repository solutions is now widely available and, when matched carefully to the needs of the institution, provides effective solutions. The critical word in the preceding sentence is “needs”. Here all organizations planning repository services can give themselves a significant advantage by taking the time to understand their needs and to plan how to meet them. It is particularly important to think not only about today’s needs but well into the future, considering issues such as growth, scaling and other environments that will want to interface with the repository, not only at the client interface level but also at the content level. Planning has a huge payback!

Some of the more commonly adopted solutions include: DSpace (www.dspace.org), Fedora (www.Fedora.info), EPrints (www.eprints.org), DigiTool (www.exlibrisgroup.com/digitool.htm), DPS (www.exlibrisgroup.com/Preservation.htm) and VITAL (www.vtls.com/Products/vital.shtml).

Several of these solutions (DSpace, Fedora and EPrints) consist entirely of open source software, representing the growing trend in repositories towards open solutions. Those that are not completely open source are either based on an open source repository engine coupled with a proprietary application software layer (VITAL) or offer openly accessible APIs using XML interfaces (DigiTool and DPS). Functional comparisons of many of these products can be found on the Web to aid in identifying a solution appropriate to the organization’s needs; alternatively, consulting services are available through Sun to assist in the selection.

Once the organization has identified the application(s) it wishes to work with, an IT manager and repository manager can quickly begin assembling a solution by downloading onto their Sun system either a total open source solution like DSpace or EPrints, or alternatively FEZ as the client (use) module and Fedora as the repository engine. These packages come with installation programs and directions that make it possible to get the applications installed and running quickly.

Jonathan Schwartz, CEO & President, Sun, on why Sun is supporting open source and open computing:

“No amount of fear can stop the rise of free media, or free software (they are the same, after all). The community is vastly more innovative and powerful than a single company. And you will never turn back the clock on elementary school students and developing economies and aid agencies and fledgling universities - or the Fortune 500 - that have found value in the wisdom of the open source community.
Open standards and open source software are literally changing the face of the planet - creating opportunity wherever the network can reach.”

Chapter 5 Architecture - Hardware - Server / OS

While servers and operating systems are not the subject of this paper, a wide range of solutions is clearly available. Sun can, of course, deliver a complete end-to-end solution in this area, do so with great cost efficiency and, particularly for large-scale repositories, provide a very scalable solution. Sun’s four-tier logical architecture encompasses the server and operating system and supports these logical tiers:

1. Client tier. The client tier consists of application logic accessed directly by an end user through a user interface. The logic in the client tier could include browser-based clients, Java components running on a desktop computer, or Java™ 2 Platform, Micro Edition (J2ME™ platform) mobile clients running on a handheld device.
2. Presentation tier. The presentation tier consists of application logic that prepares data for delivery to the client tier and processes requests from the client tier for delivery to back-end business logic. In the case of repositories, this could be the layer that transforms a digital object to another format before delivering it to the end user.
3. Business service tier. The business service tier consists of the repository engine that performs the main functions of the application: processing data, data models (if used), object relationships, version control, web services, etc.
4. Data tier. The data tier consists of services that provide persistent data used by business logic. The data can be application data stored in a database management system, or it can be resource and directory information stored in a data store. The data services can also include data feeds from external sources.

What is object storage?
An Object-based Storage Device (OSD) enables the creation of self-managed, shared and secure storage for storage networks. This moves lower-level functionalities such as space management into the storage device itself, where the device is accessed through a standard object interface.
- Wikipedia

Chapter 6 Architecture - Storage

Given the storage needs created by the explosion of digital data, coupled with the demands of preservation/archiving, the challenging question in the solution architecture was finding an appropriate fixed content data storage system that could be easily integrated with the application software, operating system, server and high-transaction disk storage. Solutions like the StorageTek 5800 system are an answer to that challenge: a new third-generation, highly interoperable, fixed content storage system that offers scalability, security, reliability, data integrity and an embedded metadata system built on open standards. The key differentiating factors of the StorageTek 5800 system are the extensive metadata facilities that describe the object being stored and the architectural support for processing locally the format of the object being retrieved or stored, all at a very reasonable price point. This new system fills the critical gap in the overall hardware architecture by meeting the archival/preservation needs of a large-scale repository, including scaling, rapid retrieval, backup, rapid recovery and long-term archival integrity.

“The StorageTek 5800 system is the industry’s first commercially available fixed content storage system using software that will be open sourced. It offers higher data integrity, resilience and failure tolerance than competitors’ storage system designs. The arbitrary metadata indexing and search capabilities make the StorageTek 5800 system a very powerful solution for digital repositories, such as the ones that archive the world’s most precious artifacts.
For those wishing to save substantial storage management costs while being assured of high data integrity, the answer is the open StorageTek 5800 system.”
- Graham Lovell, Sun, Senior Director Storage Servers, Systems Group

The StorageTek 5800 system provides a flexible infrastructure for organizations focused on better use of huge repositories of unstructured data, and allows applications to be deployed more efficiently by leveraging compute and memory resources inside the storage environment itself. The StorageTek 5800 system is a symmetrically clustered system that incorporates clustered servers for both processing and storage functions, and allows custom data services and metadata (data about the data) management (including query) to be deployed directly within the storage environment.

What is an OID and why is it important?

OID stands for Object ID. It is a unique identifier for each stored object, included in the system metadata. The OID is returned by the API when an object is stored and is used to retrieve the object. It is also returned when queries are made against user metadata that has been associated with the OID. In repositories, the OIDs can be maintained in the digital repository’s metadata, as in Fedora, SRB, VITAL, etc. They can also be maintained in an application database such as Oracle, if the application has been modified to write to the StorageTek 5800 system API.

Sophisticated mechanisms are available for scaling, optimizing, organizing and finding data among hundreds of millions of files. Data integrity and failure tolerance are greatly improved over other storage system designs through the use of RAID 6, self-healing and continuous data validation. Data persistence is handled through Reed-Solomon RAID 6 and self-healing. This means you must lose 3 drives in order to lose data, and you must lose those drives within a 12-hour window.
As configured, any one disk failure is handled by the 60 other disks. Bottom line: the recovery time is shorter and fewer resources are needed. Finally, through the use of data scrubbing/checksumming, you are assured that the data stored is the data written.

Including the StorageTek 5800 system in the architecture of the repository solution is smooth and, in most instances, requires no new code to be written. Better yet, the product doesn’t have to be bought in order to test it with the rest of the architectural components. Sun provides a free, downloadable SDK (sun.com/download/products.xml?id=465eed06) which provides complete emulation of the StorageTek 5800 system. The emulator can be tested using the APIs included in the SDK. When ready to deploy the real thing, the only change needed is to change the IP address to that of the actual device.

Let’s look at some specific examples where it is possible today to deploy the StorageTek 5800 system:

• DSpace’s current architecture supports the StorageTek 5800 system as a file storage system layer. The best solution, however, is to use Storage Resource Broker (SRB) as the interface, which allows heterogeneous storage, sharing and replication across multiple organizations.

• Fedora-based repositories (VITAL, FEZ, ELATED). With these solution components, the Fedora foundation makes it extremely easy to use the StorageTek 5800 system. While objects stored on the StorageTek 5800 system are assigned an OID, Fedora can continue to call objects using the object PID assigned by the Fedora system. The OID is stored in the extended metadata, and the translation to the OID is done automatically via a metadata query to the StorageTek 5800 system. Because Fedora never uses the OID directly, no special code is required. Using the open source API, the repository software can support query capabilities using either the system metadata or the extended user-defined metadata schema.
If there are standardized queries, it is possible to define them in advance to optimize retrieval. Since Fedora allows its low-level storage to be replaced with the StorageTek 5800 system, any of the client modules (VITAL, FEZ or ELATED) can be used: Fedora handles the entire interface with the StorageTek 5800 system, which becomes transparent to the client module. In addition, with the introduction of Fedora Version 2.2, it will be possible for the Dublin Core metadata of a digital content object to be pushed into the StorageTek 5800 system. This takes advantage of a key feature of the StorageTek 5800 system by offloading metadata handling and computing to its internal metadata management system.

“The integration of StorageTek 5800 system with Fedora is a big step forward for information durability. Sun Honeycomb (aka Sun StorageTek 5800 system) brings a robust, self-healing storage layer to the Fedora repository system. Sun’s open-source orientation is particularly complementary to the Fedora Commons mission.”
- Daniel Davis, Chief Software Architect, Fedora Commons

“We are using Honeycomb [aka StorageTek 5800 system] as an integral part of the Stanford Digital Repository because we needed a cost efficient, fault-tolerant, scalable tier of online storage with preservation intelligence. Honeycomb’s design allowed it to fit neatly into SDR’s component architecture; its underlying hardware platform met our performance, reliability and cost thresholds, and its software and APIs allowed for straightforward integration with other parts of our overall system.
After comparing Honeycomb to other systems on the marketplace and an estimate on what it would take to develop equivalent functionality on our own, we determined that Sun’s product had the best fit for our environment, and the most favorable cost-benefit ratio for both initial acquisition and long-term maintenance.”
- Michael Keller, University Librarian, Stanford University

Other repository products or implementations of Fedora may elect to modify their software if they choose to bypass querying the StorageTek metadata. In those cases the OID, when assigned, may be handed back to the application software for storage in the external metadata of the repository application.

For those using or interfacing applications that do not want to write to the API or maintain the Object ID (OID) of the data objects, gateway applications such as StorageSwitch and SRB provide that service. So even traditional archiving applications can now use content addressable storage devices without committing to a full port to the API. The application talks to the gateway through standard interfaces like CIFS and NFS, and the gateway writes to the StorageTek 5800 system API.

In addition, the StorageTek 5800 system architecture supports the ability to run data services inside the storage box against the data as it is stored or retrieved, or in a bulk operation. When released, this would mean the capability to store other types of XML metadata within the StorageTek 5800 system. With that capability will come additional capabilities for storage, access, deletion, modification and transformation.

Finally, for large-scale repositories that are planning massive ingestion of objects, either at startup or on an ongoing basis, it is important to note that the StorageTek 5800 system can handle ingestion tasks directly for reasons of speed and/or efficiency. In those instances, objects can be ingested directly into the StorageTek 5800 system.
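The hand-back of the OID to the repository application, described above, can be pictured with a small sketch. This is an illustration only, written in Python against a toy in-memory store. It does not use the StorageTek 5800 system API or its SDK, and the class names, method names and OID format (ToyObjectStore, ToyRepository, store, retrieve, ingest, fetch, a random hex string) are all hypothetical. The SHA-256 check stands in loosely for the idea of checksumming and is not a description of how the product validates data.

```python
import hashlib
import uuid


class ToyObjectStore:
    """In-memory stand-in for an object store. NOT the StorageTek 5800 API."""

    def __init__(self):
        # OID -> (data bytes, SHA-256 digest recorded at write time)
        self._objects = {}

    def store(self, data: bytes) -> str:
        """Store the bytes and return a newly assigned OID."""
        oid = uuid.uuid4().hex  # hypothetical OID format
        self._objects[oid] = (data, hashlib.sha256(data).hexdigest())
        return oid

    def retrieve(self, oid: str) -> bytes:
        """Return the bytes for an OID after re-checking the recorded digest."""
        data, digest = self._objects[oid]
        if hashlib.sha256(data).hexdigest() != digest:
            raise IOError(f"checksum mismatch for OID {oid}")
        return data


class ToyRepository:
    """Repository application that keeps the PID -> OID mapping itself."""

    def __init__(self, store: ToyObjectStore):
        self._store = store
        self._pid_to_oid = {}  # the "external metadata" held by the application

    def ingest(self, pid: str, data: bytes) -> None:
        # The store hands the OID back; the application records it under the PID.
        self._pid_to_oid[pid] = self._store.store(data)

    def fetch(self, pid: str) -> bytes:
        # Clients only ever see the PID; the OID lookup stays internal.
        return self._store.retrieve(self._pid_to_oid[pid])


repo = ToyRepository(ToyObjectStore())
repo.ingest("demo:1", b"scanned manuscript page")
print(repo.fetch("demo:1"))
```

In this arrangement the storage layer knows nothing about PIDs, and client modules know nothing about OIDs. A gateway of the kind mentioned above would play the role of ToyRepository on behalf of an application that cannot be modified.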
Because the StorageTek 5800 system is an open solution, it is important to note that all new code being written to support the implementations above is being contributed back to the open source code base. This will ensure that when new customers wish to install the StorageTek 5800 system with any of the three open source repository packages (Fedora, DSpace or EPrints), the solution will exist out of the box.

Chapter 7 Hardware - Storage - Implementation Restrictions

As with any major new technology, it is good to ask where implementation restrictions might exist. The StorageTek 5800 system has been in field test sites since late 2006, so it is possible to use actual field experience to answer this question.

For instance, it is certainly worth noting that the StorageTek 5800 system is not designed to be a transaction server or to handle, by itself, high-transaction data. If such demand is expected for data stored on the StorageTek 5800 system, it should be configured with a caching server in between, so that frequently used data can be cached on that external server to minimize retrieval times.

It is also important to note that while some repository solutions can be configured to use the StorageTek 5800 system internal metadata system, it is certainly not a requirement. Some repository solutions may not be capable of using this technology, or you may simply choose to use external metadata storage because it is already handled through Oracle or other applications. Simply using the OID assigned by the StorageTek 5800 system to each object is a perfectly valid way to interface with external metadata systems.

It should also be noted that the StorageTek 5800 system, in the current release, is not a compliant device (i.e., it does not meet regulation S.E.C.
17A, Sarbanes-Oxley or HIPAA, although it does meet advisory compliance regulations). Also, it should be noted that the system does not allow the direct deletion of digital objects. When all the metadata instances associated with a given object are removed, the system automatically garbage collects the data. In upcoming releases of the product, when retention periods are supported, this behavior would change, as the policies will dictate the removal of the data object(s).

“Emulation” in the preservation context. The term emulation can have a slight variation in meaning when used in the context of repositories or preservation and archiving. Per the Trustworthy Repositories Audit and Certification (TRAC) guidelines, issued by the Center for Research Libraries, it refers to digital objects and the ability to “produce a supportable environment to enable the proprietary software to run” that is needed to make the object available and usable for future generations.

Chapter 8 Continuing Digital Repository Preservation/Archival Challenges

The issues of preservation/archiving in large-scale repositories are still rapidly evolving. While the systems of today, employing architecture such as that outlined in this paper, will provide a solid foundation going forward, there are areas where we will continue to see challenges and development. These include, as identified by Richard Jones, et al. (footnote 4):

1. “Migration: i.e. migrating file formats to a supported format at ingest or migrate the format to a required or requested one at delivery. Both will need to be stored and maintained to aid preservation.
2. Viewers: instead of migrating file formats, we could reduce the preservation activity by providing tools that know how to render stored formats. This would require preservation of viewers.
3.
Emulation: similar to viewers, we might choose to develop tools that emulate a software platform on which items were created.
4. Universal Virtual Computer (UVC): an entire system designed as a preservable platform upon which emulation, viewing or migrating tools may be developed.
5. Technical metadata: as a supporting set of tools for digital preservation, technical metadata could be enhanced to encompass information such as the representation information for a file format and provide linkage to supporting databases of file formats and rendering tools.”

Others add to this list by including items like:

6. Planning for the horizontal dimension, i.e. the ability for repositories to support a diverse breadth of content and uses. Current efforts to amalgamate content have proved challenging. With the growing size and number of objects to be accommodated, amalgamation is looking extremely unlikely to be the ultimate solution. This has direct implications for the preservation/archiving dimension of large-scale repositories, as these tend to be thought of as places where objects will reside at some stage of their life cycle, when they are either no longer in high demand or their uniqueness requires the extra care provided by sophisticated long-term storage like the Sun StorageTek 5800 system. Yet richer architectures and/or multiple types or instances of repository solutions may be needed to fully support the preservation needs of the complex digital environment. One solution might be to employ multiple repository solutions in order to fulfill the needs of a repository user constituency, which may include a combination of in-house and external options.

4 The Institutional Repository by Richard Jones, Theo Andrew and John MacColl. Chandos Publishing, Oxford, UK (2006), pages 82-83.

7.
In addition, intellectual property (IP) is another area where preservation efforts will need to continue to develop in the future. A JISC report done in the UK recommended that specific issues be addressed in dealing with IP and preservation (Jones and Beagrie, 2001). Specifically, permissions related to preservation will be needed for:
a. copying;
b. future migration of content to new formats; and
c. emulation.

With regard to the threats to preservation identified earlier, content addressable storage like the StorageTek 5800 system clearly provides substantial answers for many of the issues of media failure and the storage portion of hardware failures, as well as failure points of the file system software. Challenges remain to be addressed in many areas, including such basic non-automated processes as the development of best practices or guidelines.

Chapter 9 Conclusion

When an organization is ready to start architecting either a custom or out-of-the-box digital repository solution, there are action steps that need to be undertaken:

1. Begin by planning the repository service. This is a critically important step to having a successful repository service! It includes making sure you understand the needs to be fulfilled at many different levels. Probably the most important is to understand the needs of the community of users. Then look at the staffing and budget required to fulfill those needs, rights management issues, the metadata to be utilized, and an overall marketing and launch plan for when the repository services are in place. Each of those steps involves substantial discussion and consensus building, normally with many individuals from many different sectors of both the community of users and the organization. If the expertise doesn’t exist in-house to complete this step, contact Sun, which has both in-house consultants and a list of recommended third-party consultants to help develop a repository plan. If needed, these same consultants can help select from the wide array of available repository-specific solutions those that will best integrate together to support the defined plan.
2. With the plan firmly shaped, match the functional requirements established to the architectural requirements for the hardware and software. Select the server, the operating system, and the data center, mid-range or workgroup disk to be utilized. And again, look to Sun to provide a wide variety of choices and recommendations.
3. If the repository architecture needs to address the large-scale repository needs of preservation/archiving, then content addressable storage solutions exist, but only one offers the total data integrity, scalability, internal metadata handling and low cost of ownership: that is the Sun StorageTek 5800 system.
4. Next, if a totally open source solution is being used, begin downloading and installing the application software. For instance, this might be FEZ as the client module and Fedora to serve as the repository server system software. If a solution is being utilized that involves open but proprietary solution-based software, then work with that vendor to have their product installed.
5. Finally, configure the hardware/software using the built-in capabilities, start the repository and begin deploying digital repository services to the organization per the marketing/rollout plan developed earlier.

Delivering digital repository services built with open solutions is now possible!

Most common mistakes made in implementing repository services.
1. Selecting wrong product architecture.
2. Not planning for scalability.
3. Not writing policies/guidelines.
4. Not involving the right constituencies.
5. Selecting the wrong product for the needs.
6. Not planning a marketing/rollout campaign.
7. Not planning for rights management.
8.
Not planning for preservation.

About the author:

Carl Grant is a librarian and business person who has worked in libraries, or companies automating libraries, for over 30 years. He has worked for a number of library automation companies, including: Data Research Associates, Inc. (DRA, now part of SIRSI/Dynix), where he was the Vice President of International Business; Innovative Interfaces, as their Vice President of Sales and Marketing; Ameritech Library Services (now part of SIRSI/Dynix), where he was Vice President of Marketing, Product Management and International Business; Ex Libris (USA), as President, during which time Ex Libris became a leading vendor of automation systems in North America; and VTLS, as President and COO, where he was responsible for the Australian Research Repositories Online to the World (ARROW) project using VITAL and Fedora. In 2007, he co-founded CARE Affiliates, Inc., and serves as its President. This company specializes in open source software and provides consulting, selection, implementation, maintenance, support and development services around selected open source solutions, including Fedora-based repositories. He is also the Immediate Past Chair of the National Information Standards Organization (NISO) and serves on the Fedora Advisory Board. He speaks and publishes about libraries, repositories and automation to audiences around the world.

Chapter 10 Glossary

API — Application Programming Interface. A set of routines, protocols, and tools used for building software applications.
Client — An application that runs on a personal computer or workstation and relies on a server to perform some operations.
Data object — A stored file associated with an object ID (OID).
DigiTool — An enterprise solution for the management of digital assets in libraries and academic environments. See www.exlibrisgroup.com/digitool.htm for more information.
DPS — An enterprise solution for the preservation of digital assets in libraries and academic environments. Contact Ex Libris at www.exlibrisgroup.com for more information.
DSpace — A digital repository system, DSpace captures, stores, indexes, preserves and redistributes an organization’s research material in digital formats. See www.dspace.org for more information.
ELATED — ELATED is a lightweight, general-purpose application for managing digital files. ELATED is built on top of Fedora and could be used as a digital assets management system, an institutional repository, or to meet other collection archiving, publishing and searching needs. Available on SourceForge.
EPrints — EPrints open source software is a flexible platform for building high quality, high value repositories. It is used to set up repositories of research outputs of literature, scientific data, theses and reports or multimedia artefacts from collections, exhibitions and performances. See www.eprints.org for more information.
Exabyte — Approximately one quintillion bytes (10^18). The abbreviation for exabyte is EB.
Fedora — Flexible Extensible Digital Object and Repository Architecture. See www.fedora.info for more information.
FEZ — Fez is a PHP / MySQL front end to the Fedora repository software. It is developed by the University of Queensland Library as an open source project hosted on SourceForge. There are mailing lists, forums and download links on the SourceForge site.
HTTP — HyperText Transfer Protocol. Underlying protocol used by the World Wide Web. HTTP defines how messages are formatted and transmitted, and what actions Web servers and browsers should take in response to various commands.
IP — Two definitions: 1) Internet Protocol and 2) Intellectual Property. Typically, when talking about repositories, it refers to the legal rights surrounding use of a digital object.
Metadata — Extra information about the data object.
Describes how and when and by whom a particular set of data was collected, and how the data is formatted. There are two main types of metadata in the Sun StorageTek™ 5800 system: system and user metadata.
OAI-PMH — The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a low-barrier mechanism for repository interoperability. Data Providers are repositories that expose structured metadata via OAI-PMH. Service Providers then make OAI-PMH service requests to harvest that metadata. OAI-PMH is a set of six verbs or services that are invoked within HTTP.
Object — Any item that can be individually selected and manipulated.
OID — Object ID. A unique identifier for each stored object, included in the system metadata.
Query — A request for information from a database.
SRB — Storage Resource Broker, a product from General Atomics, Nirvana Division. For information on General Atomics, see http://www.ga.com/index.php. For information on Nirvana products, go to http://www.nirvanastorage.com/.
System metadata — Metadata that includes a unique identifier for each stored object, called the OID, as well as information on creation time (ctime), data length, and data hash. It is automatically maintained by the system.
User metadata — Metadata that is added by the user of the StorageTek 5800 system. User metadata consists of name=value pairs. The name is defined in the system schema as being of a certain type (for example, a string), and the value is associated with the name at the time data is stored.
VITAL — VTLS Information Technology for Advanced Learning. VITAL is an institutional repository solution designed for universities, libraries, museums, archives and information centers based on Fedora. See www.vtls.com/Products/vital.shtml for more information.
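As an illustration of the OAI-PMH glossary entry, the following minimal sketch builds the HTTP request URLs a Service Provider would send to a Data Provider. The six verb names come from the OAI-PMH 2.0 protocol; the `build_oai_request` helper and the `repository.example.org` base URL are hypothetical and do not belong to any product named in this paper.

```python
from urllib.parse import urlencode

# The six verbs defined by OAI-PMH 2.0.
OAI_PMH_VERBS = (
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
)


def build_oai_request(base_url: str, verb: str, **arguments: str) -> str:
    """Return the URL for an OAI-PMH request (hypothetical helper)."""
    if verb not in OAI_PMH_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    # The verb and its arguments travel as ordinary HTTP query parameters.
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


if __name__ == "__main__":
    # Hypothetical Data Provider endpoint, used only for illustration.
    print(build_oai_request(
        "http://repository.example.org/oai",
        "ListRecords",
        metadataPrefix="oai_dc",
    ))
```

A harvester would typically start with `Identify` to learn about the repository and then page through `ListRecords` responses to collect the metadata.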
Chapter 11 Suggested Readings and Websites

ALA Task Force on Digitization. (2007) “Principles for Digitized Content” Available at: www.ala.org/ala/washoff/contactwo/oitp/digtask.cfm
Bailey Jr., Charles W. (2007) “SPEC Kit 292: Institutional Repositories” Association of Research Libraries. Available at: www.arl.org/bm~doc/spec292web.pdf
Bisson, Casey. (2007) “Open Source Software for Libraries” Library Technology Reports, May/June 2007, vol. 43, no. 3. American Library Association.
CLIR. (2007) “Census of Institutional Repositories in the United States: MIRACLE Project Research Findings” Available at: www.clir.org/pubs/abstract/pub140.abst.html
CPIT. (2006) “Technical Evaluation of selected Open Source Repository Solutions” Available at: www.eduforge.org/docman/view.php/131/1062/Repository%20Evaluation%20Document.pdf
Crow, Raym. (2002) “The Case for Institutional Repositories” Available at: www.arl.org/sparc/bm~doc/ir_final_release_102.pdf
D-LIB, www.dlib.org (Articles too numerous to list; search for “repositories” or “repository”.)
DiBona, Chris, Editor. (2006) “Open Sources 2.0” O’Reilly
Golden, Bernard. (2005) “Succeeding with Open Source” Addison-Wesley
Jones, Catherine. (2007) “Institutional Repositories: Content and Culture in an Open Access Environment” Chandos Publishing (Oxford) Ltd
Jones, Richard, Theo Andrew and John MacColl. (2006) “The Institutional Repository” Chandos Publishing (Oxford) Ltd
Lagoze, Carl, and Herbert Van de Sompel. (2001) “The Open Archives Initiative: Building a low-barrier interoperability framework.” Joint Conference on Digital Libraries 2001. Available at: http://www.cs.cornell.edu/lagoze/papers/oai-jcdl.pdf
Mellon Foundation. (2006) “A Technology Analysis of Repositories and Services” Available at: www.ldp.library.jhu.edu/repository/documents/Analysis_Final_Report.pdf
SPARC. (2002) “SPARC Institutional Repository Checklist & Resource Guide” Available at: www.arl.org/sparc/bm~doc/IR_Guide_&_Checklist_v1.pdf
Stevenson, Jane and JORUM Team. (2005) “Preservation Watch Report” Available at: www.jorum.ac.uk/docs/pdf/Digital_Preservation_Report.pdf
Sun. (2007) Sun StorageTek™ 5800 system. Available at: sun.com/storagetek/disk_systems/enterprise/5800/
Open Source web site for Sun StorageTek 5800 system. Available at: http://www.opensolaris.org/os/project/honeycomb/

Sun Microsystems, Inc. 4150 Network Circle, Santa Clara, CA 95054 USA Phone 1-650-960-1300 or 1-800-555-9SUN (9786) Web sun.com
© 2007 Sun Microsystems, Inc. All rights reserved. Sun, Sun Microsystems, the Sun logo, StorageTek, OpenSolaris, Java, J2ME are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. Information subject to change without notice. Printed in USA SunWIN# 519158 Lit# SYWP13617-0 11/07
