The University of Edinburgh

					ITC: 16.11.10                                                                   B

                                   Information Services

                          Information Technology Committee

                                     16 November 2010

                                 Research Storage Update

Brief description of the paper

This item consists of four linked papers. The main paper describes how the other papers fit
together to report on the work of the two working groups in the data storage and
management area.

Action requested

For discussion and advice on the next steps.

Resource implications

Does the paper have resource implications? Yes (see Appendix D).

Risk Assessment

Does the paper include a risk analysis? No

Equality and Diversity

Does the paper have equality and diversity implications? No

Originator of the paper

Brian Gilmore
Director, IT Infrastructure, Information Services

Freedom of information

Can this paper be included in open business? Yes

Research Storage Update
This item consists of four linked papers.

The first paper from Jeff Haywood entitled ‘Research Data management and storage’
was issued as a basis for consultation with College and Support groups in October
requesting input by the end of November. This is labelled Appendix A.

This first paper referred to two other papers which have been included for
convenience. The first of those, entitled ‘Draft proposed policy for management of
research data’ is the output from the Research Data Management group and is labelled
Appendix B. The second of the two papers referred to is from the Research Data
Storage working group and is labelled Appendix C.

The final paper is a draft implementation plan to achieve the data storage
requirements and is labelled Appendix D. This would have significant resource
implications.
The committee is invited to discuss any of the issues arising from the above.

Brian Gilmore
Director, IT Infrastructure, Information Services
                                                                   Appendix A
Research data management and storage
Dear colleagues

I should like to bring to your attention two documents for comment by your College
and its relevant Committees. They can be downloaded from the URL provided, in
preference to circulation as email attachments.

The two documents set out the thinking and conclusions of two linked Working
Groups that were set up, through the Library Committee and the IT Committee, to
help us find a realistic way forward in dealing with the complex challenge of digital
research data. One group explored research data storage (RDS) and the other research
data management (RDM), two parts of a whole, and the groups communicated with
each other throughout their work. Some of your College staff may well have seen
earlier drafts of these documents, and, as we had committed to consult widely once
we had made progress, this call for comments is part of our fulfilment of that commitment.

Research data storage and management is a very active topic amongst the research
intensive universities, government and its agencies. Many PIs are already well aware
of the issues. Edinburgh has the expertise and resources to become a 'leading light' in
this area, but time is short and the internal pressure from academics for support is
rising in all three Colleges.

Both of these documents indicate an important direction of travel for the University of
Edinburgh. The outcome of the RDM document is a draft University policy for
managing research data, taking account of increasing funding agency demands for
compliance, the open access agenda, and the shared responsibilities of the University
and PIs. The RDS document identifies the types of data that need storage and the
features that an effective University-wide storage service would need to offer. The
staff effort and hardware/software implications of these documents are substantial and
an implementation plan will be prepared to accompany these documents.

I suggest that the consultation is placed on the forthcoming agendas of the College
committees with oversight of research, IT and library matters, and perhaps also be
considered at the College SMT level.

I should welcome comments by the end of November. Please send these to Cuna
Ekmekcioglu or use the consultation wiki. If you wish someone to talk
to you or your committees about our work on RDS and RDM please contact Cuna,
who will arrange for an appropriate person to attend.

Jeff Haywood
Vice Principal of Knowledge Management, CIO & Librarian to the University
Oct 2010
                                                                             Appendix B

    Draft proposed policy for management of

    research data

    Added by Cuna Ekmekcioglu; last edited 12 October 2010.

    Draft policy
    It shall be the University's policy that:

        •    Research data should be managed to the highest standards throughout the research
             data lifecycle as part of the University's commitment to research excellence.
        •    The University should provide training, support and advice, as well as mechanisms
             and services for storage, backup, registration, deposit and retention of research data
             assets in support of current and future access, during and after completion of
             research projects.
        •    Responsibility for research data management through a sound research data
             management plan during any research project or programme lies primarily with PIs.
        •    All new research proposals from date xxxxx? must include research data
             management plans or protocols that explicitly address data capture, management,
             integrity, confidentiality, retention, sharing and publication.
        •    Research data management plans must ensure that research data is available for
             access and re-use where appropriate and under appropriate safeguards.
        •    The legitimate interests of the subjects of research data must be protected.
        •    Research data of future historical interest, and all research data that represent
             records of the University, including data that substantiate research findings, should be
             offered and assessed for deposit and retention in an appropriate national or
             international data service or domain repository, or a University repository. Such
             research data deposited elsewhere should be registered with the University.

    Research Data Management Working Group report

ITC: 16.11.10                                           Draft, July 2010       B

                                                                Appendix C

  Research Data Storage Working Group Draft Report


      Status of this document: This is a working document. It
      follows the consultation as described below.

      It is now being distributed widely for comment.



  Table of Contents

        Summary of recommendations
  1.    Introduction
  2.    Aim and scope of consultation
  3.    Methodology
  4.    Key findings of consultation
  5.    Policy issues
  6.    Recommendations

  Appendix 1: Members of the Group
  Appendix 2: Overviews from Colleges
                CHSS
                CMVM
                CSE
                Overview of existing IS storage
  Appendix 3: Data on storage from 2007 Research Computing Consultation

  Summary of recommendations
  The group identified the following requirements that most users agreed were essential
  University services:
     1. Archiving of research data
     2. Accessibility of research data to all virtual collaborators, facilitating extra-
        institutional collaboration
     3. Globally-accessible, cross-platform file store
     4. Backup and synchronisation of data on mobile devices
     5. Establishing networks of knowledge
     6. Federated structure for local data storage
  Recommendations are given in detail in Section 6.

  1.        Introduction
  At international, national and local levels, there is intense interest in how to manage the
  rapidly expanding volume and complexity of research data. Concern is both for the
  shorter term – ensuring competitive advantage through secure and easy-to-use access,
  and for the longer term – ensuring enduring access and usability to the research
  community into the future and compliance with legislation. The UK government and
  research funding bodies are debating with the HE community how best to address this
  large and complex problem, and have funded various initiatives to explore options (e.g.
  the United Kingdom Research Data Service and the Digital Curation Centre).
  Most Research Councils now mandate or encourage Data Management Policies and
  deposit of data.

  The Research Information Network (RIN) has published a framework of principles and
  guidelines for the stewardship of digital research data.
  The UK Research Integrity Office (UKRIO) has prepared a standard code of practice for
  research, which is regularly reviewed to take into account changes in legislation and to
  reflect national and international best practice. The University of Edinburgh has recently
  adopted the UK Research Integrity Office’s new Code of Practice for Research.
  Scholarly journals in increasing numbers are requiring that access be provided to
  underlying data sets.

  The JISC Support of Research Committee has various programmes dealing with
  research data, the latest being the JISC Managing Research Data Programme. This
  programme seeks to expand effective data management and data sharing, to benefit
  research and the HE sector more generally. The JISC is working towards developing a
  national strategy with key stakeholders (Research Councils, Funding Councils,
  Institutions etc.) in order to help establish the foundations for an effective UK research
  data infrastructure.

  The UK Research Data Service (UKRDS) project started with the objective of
  assessing the feasibility and costs of developing and maintaining a national shared
  digital research data service for the UK Higher Education sector. This was seen by the
  project sponsors as forming a crucial component of the UK's e-infrastructure for
  research and innovation, which would add significantly to the UK's global
  competitiveness.
  The feasibility study concluded that embedding the skills, capability and organisation
  into the HEI research management process was the best approach, and that a relatively
  small national service structure would be needed to foster this through channelling
  training, tools and good practice developed by existing national and international skill
  centres. The University of Edinburgh could not take part in the first phase of this project,
  since this was restricted to English HEIs, but has noted its interest and expects to be
  involved at some point.

  The Partnership for Accessing Data in Europe (PARADE) hopes to build efficient
  services addressing data management needs of multiple research communities. The
  PARADE consortium consists of several user communities and national partners,
  including representatives of Edinburgh University. The consortium aims to link with
  various European initiatives addressing data with an intention to work together to build a
  pan-European collaboration.

  What is clear is that there will be no external solution that will remove from each
  university the requirement to provide storage and management procedures for the data
  of its own research activities.
  At the University of Edinburgh, a consultation on the computing requirements of the
  research community was first conducted (Cuna Ekmekcioglu, Research Computing
  Consultation Report, December 2007, available from the Edinburgh DataShare
  repository). Key findings of this consultation indicated a need
  for larger storage space on servers; more robust archiving services; simple, secure and
  preferably an automatic data back-up service; and a high demand for training and
  awareness raising across the University.
  Second, a pilot implementation of the JISC Data Audit Framework project was carried
  out (Cuna Ekmekcioglu and Robin Rice, Edinburgh Data Audit Implementation Project,
  January 2009). The study focused primarily on research data management rather than storage
  requirements. The findings at Edinburgh were that there was inadequate storage space
  and lack of clarity about roles and responsibility for research data management by
  University research staff. The project noted a need for storage and backup procedures
  including provision for business continuity arrangements. A formal procedure was
  needed for data transfer when staff and students leave the institution.
  Solutions for the University of Edinburgh will only be successful if they come from a
  partnership of individual researchers, Schools, Colleges and Information Services. Each
  has expertise and resources that can be brought to bear to the benefit of all. This
  Review is set in that spirit and is a collaborative exploration of the current state of play
  within Edinburgh and of options for short term and longer term actions.
  Last year, the Research Computing Advisory Group (RCAG) consulted with a
  representative sample of staff and research students and produced a strategy plus
  implementation roadmap, which recommended to the Vice Principal that addressing
  research data storage and management was a high priority requirement.
  The oversight of research computing has now been made the responsibility of the re-
  instated IT Committee, and as part of its 2009-10 Work plan, it is taking up the challenge
  of producing a review of data storage and management, starting its work with research
  data (leaving learning and teaching data and corporate data to a later date). Two groups
  have been set up, with close links between them: (i) the Research Data Storage Working
  Group and (ii) the Research Data Management Working Group. This document presents
  the work of the first of these groups.
  2.        Aim and scope of consultation

  The group is primarily tasked with developing recommendations for research data
  storage. The remit is taken directly from the document constituting the working groups:
  •    Review current practice within the University of Edinburgh;
  •    Review what is known about current practice within peer universities;
  •    Review current national and international developments that seek to influence or
        provide services in the field;
  •    Develop, with suitable consultation, options with appraisal of their cost and
        feasibility, with a risk analysis of actions or inactions in this field.

  3.      Methodology
  The fundamental question posed to researchers is:

  "What could be done in respect of storage services, facilities and policies to make
  your research at Edinburgh easier, better, more competitive or more successful?"

  The group first collected information using prior knowledge within the group, and existing
  documents from both within and external to the University. Consultation was then
  carried out in three phases.
  The first phase of consultation involved visiting College-level groups to assimilate
  information known by those groups on behalf of the schools and institutes which they
  represent:
  •  CSE: College Computing and IT strategy (CC&IT) and the Computing Professionals
       Advisory Group (CCPAG)
  •  CHSS: Computing Strategy Committee (CSC) and CCPAG
  •  CMVM: IT Working Group (ITWG)
  Overviews from three colleges are given in detail in Appendix 2.
  Following these visits, this draft document was produced, capturing the priorities arising.
  Already at this stage several common high-priority themes had emerged.

  The draft document was then distributed to the respective College Research
  Committees and advice was sought from the Digital Curation Centre. The University
  ITC was also shown this draft for comment and feedback.
  In this second phase of consultation the document is now open for general comment
  within the University before the final version is produced.
  The objective of reviewing current practice in peer universities will also commence now.

  4.      Key findings of consultation

  4.1     Key attributes of data storage

  As a preamble we have identified the following list of attributes of data storage:
  •       Volume (kilobytes ↔ terabytes)

  •       Cost (cheap commodity ↔ high-quality SAN)
  •       Lifetime (weeks, months, years, forever…)
  •       Primary repository (must be robust) ↔ scratch (can easily recreate any lost data)
  •       Not backed up ↔ backed up
  •       Line-proximity (online disk ↔ tape)
  •       Physical-proximity (defined by bandwidth × latency from storage to compute)
  •       Visibility level (public access outside of UoE ↔ highly secure)
  •       Value (£) of data
  •       Value (irreplaceability) of data
  We recognise that the importance of these attributes to different user groups will vary.
  Some groups will care deeply about the low level storage details, some will care only
  about the definition of the service as expressed through an MOU. Therefore, across the
  University we expect requirements to be phrased at different levels.
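For illustration only, the attribute checklist above could be captured as a simple per-dataset record, so that requirements can be phrased consistently at whatever level a group cares about. The field names below are hypothetical, not an agreed schema.

```python
# A sketch of a per-dataset storage-requirement record mirroring the
# attribute list above. Field names and values are illustrative only.
from dataclasses import dataclass

@dataclass
class StorageRequirement:
    volume_gb: float        # volume (kilobytes up to terabytes)
    lifetime: str           # "weeks", "months", "years", "forever"
    primary: bool           # primary repository (robust) vs scratch
    backed_up: bool         # not backed up vs backed up
    online: bool            # online disk vs near-line tape
    visibility: str         # "public" through "highly secure"
    monetary_value: float   # value (£) of the data
    irreplaceable: bool     # value in terms of irreplaceability

# Example: a modest but irreplaceable primary dataset
req = StorageRequirement(volume_gb=50, lifetime="years", primary=True,
                         backed_up=True, online=True, visibility="internal",
                         monetary_value=10_000, irreplaceable=True)
```

A group concerned only with service-level terms could fill in just the high-level fields; a group with low-level storage concerns could extend the record.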

  4.2     Scale of storage requirements
  Extrapolating from figures presented in the Research Computing Consultation report
  (Appendix 3) the majority (~80%) of researcher requirements would be met by a filestore
  allocation of 100GB. Increasing this to 200GB and above would capture more use cases.

  Research which required very large datasets could be hosted within such a filestore, but
  would not be accommodated by the default allocation.

  Assuming 5000 users of a centralised research filestore with an average usage of
  100GB, this sets the scale of a filestore at the order of half a petabyte. (Naturally, a
  more sophisticated space-planning approach should be adopted, including the use of
  appropriate contention factors.)

  Storage costs depend on implementation, with resiliency, capacity, facility and
  performance all affecting cost. Work on storage implementation for research from the
  IS ECDF suggests a capital cost of around £650 per TB for a tiered storage system.
  When implementing storage systems of this scale, novel technologies to reduce
  operational costs (such as disk spin-down or tape as storage tier) should be considered.
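The sizing and cost figures above follow from simple arithmetic; the sketch below uses the numbers quoted in this section and is illustrative only (it applies no contention factor).

```python
# Back-of-envelope sizing for a central research filestore, using the
# figures quoted in this section: 5000 users, a 100GB default allocation,
# and ~£650 capital cost per TB for tiered storage.
USERS = 5000
ALLOCATION_GB = 100
CAPITAL_PER_TB = 650  # pounds

total_tb = USERS * ALLOCATION_GB / 1000   # taking 1 TB = 1000 GB
total_pb = total_tb / 1000
capital_cost = total_tb * CAPITAL_PER_TB

print(f"Raw capacity: {total_tb:.0f} TB ({total_pb:.1f} PB)")
print(f"Capital cost at £{CAPITAL_PER_TB}/TB: £{capital_cost:,.0f}")
```

With contention planning (allocation exceeding expected use), the provisioned capacity and cost would be lower than this raw figure.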

  4.3     Services
  We describe services below. It should be noted that there is no implication as to where
  the physical implementation for such services should be realised. They may be central,
  federated, or provided by a third party. This decision would come from detailed
  implementation planning. We make a further comment upon this principle later.
  Archiving
  Archiving services have emerged so far as the top priority. There are many reasons for
  archiving, and indeed archiving can mean different things to different people:


  •       To preserve research data in case it is ever required again (the problem of
          departing staff came up often)
  •       To securely safeguard irreplaceable research data
  •       To comply with funding agency requirements
  •       To take snapshots
  •       When you just don't know if you will need it again
  •       To meet requirements arising from University data management policy
  •       To meet requirements arising from legal compliance (e.g. FOI, EIR)
  Just as important as the physical archive is the ability to store, alongside the data, the
  metadata needed to find and retrieve it.
  The policies which govern the management of research data, and the procedures
  which implement them, will be critical to the success of such a service.
  It is recognised that the institutional cost of the data management processes is much
  higher than the cost of providing an archive service (Neil Beagrie, Julia Chruszcz and
  Brian Lavoie, Keeping Research Data Safe (Phase 2)).
  Globally accessible cross-platform file store 
  Almost universally users agreed that a globally accessible cross-platform file store
  (GACPFS) would bring benefits. Uses include:
  •       Research outputs, papers and other research documents. This is because a very
          large proportion of "research data" isn't particularly special in data terms and the
          home store is the most convenient place to put it and organise it.
  •       e-lab-books and e-log-books where researchers will keep their working records
          which must be secure.
  •       Raw data that is not so large as to need more specialised storage.
  •       Ad hoc backup copying.
  •       To make data available to multiple computers for processing (e.g. develop on A
          and run production on B).
  •       Confidential or sensitive data may need to be handled differently.
  Typically (but no implication of implementation choice is implied here) this is satisfied by
  a shareable network file system. This should be accessible both from within and without
  the University network domain. The requirements are:
  •       Access from different platforms including laptops, desktops, central IS supported
          machines, ECDF, from research centres outwith the university.
  •       Accessible from multiple operating systems.
  •       Accessible from outside the university domain.
  •       Must be backed up.
  •       Must allow for secure storage when required.
  •       There should be adequate space which can grow in a simple and flexible way.

  •         Must be aware of the large range of file sizes, from a few kilobytes to many gigabytes.
  •         If possible, the option of automated encryption/decryption should be available.
  •         Handling of sensitive and confidential data, including while it is in use, not just
            while it is being stored.
  •         Life cycle of sensitive and confidential data – who has access?
  It is worth expanding on the space point. It is understood that space costs money, and
  that it is unlikely that every single researcher in the university could be allocated 100s to
  1000s of dedicated Gigabytes whether they used it or not. Equally, it would be wrong to
  start from an antiquated viewpoint of providing only a trivial space (meaning small
  compared to the average laptop disk) and requiring an explicit, complex payment
  procedure to expand it each time. This would soon render a service useless. Thus a
  more forward looking policy needs to be developed which would understand the
  differences between allocation and actual use, and use contention planning. A suitable
  costing and payment model will be vital, and would have to be agreed before beginning
  any provisioning.
  Laptop backup/synchronisation 
  More and more crucial work is “on the move” from laptops, including papers and
  documents being prepared. This will only increase and it is important to researchers to
  ensure that this work is backed up. This may be achieved by automatic
  backup/synchronisation to a university file system as described above.
  In the short term this could be achieved, as a stop-gap, through use of a globally
  accessible file system to which users transfer data manually ("by hand"). In this respect,
  “Dropbox” like functionality may be required (Dropbox client enables users to drop any
  file into a designated folder that is then synchronised to the cloud and to any other of the
  user's computers and devices with the Dropbox client).
  However, it is recognised that users will, in many cases, not wish to remember explicitly
  to do this, and in reality an automated "behind the scenes" service is needed. For
  example, Mac users use Time Machine to automatically save up-to-date copies of their
  files on their Mac.
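As a sketch of the manual stop-gap described above, the following hypothetical routine performs a one-way copy of a local folder to a mounted file store. The paths and logic are illustrative assumptions; a real service would run automatically in the background and handle conflicts, deletions and encryption.

```python
# Minimal one-way folder synchronisation: copy files that are missing
# from, or newer than their counterpart in, the destination folder.
# This is an illustrative sketch, not a production backup tool.
import shutil
from pathlib import Path

def sync_folder(src: Path, dst: Path) -> list[str]:
    """Copy new/updated files from src to dst; return relative paths copied."""
    copied = []
    dst.mkdir(parents=True, exist_ok=True)
    for f in src.rglob("*"):
        if not f.is_file():
            continue
        target = dst / f.relative_to(src)
        # Copy when the target is absent or older than the source file
        if not target.exists() or f.stat().st_mtime > target.stat().st_mtime:
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, target)  # copy2 preserves modification times
            copied.append(str(f.relative_to(src)))
    return copied
```

In the automated scenario the same routine would be triggered by a scheduler or by file-system change notifications rather than run by hand.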
  Federated structure for local data storage 
  It has been discussed, and there is evidence from other reports (Neil Beagrie, Julia
  Chruszcz and Brian Lavoie, Keeping Research Data Safe, Phases 1 and 2), that HEIs
  should consider federated structures for local data storage within their institutions,
  comprising data stores at the departmental level and additional storage and services at
  the institutional level. This should therefore be considered at the University of Edinburgh.
  It has been made clear there are many very good reasons why primary research storage
  is, and should remain, local to the research. Such reasons include grant conditions, near
  line requirements for processing, resilience against connections going down, and
  individual preference.
  There are storage systems owned by different groups in the university (colleges, IS) and
  some mechanism is needed for data to flow between them.

  Much of this is the business of the data management group, but it is essential that data
  should be migratable across storage systems operating at different levels within the
  University.
  An important principle we identify is that it should be up to the research groups to
  determine what is best for their research in respect of the location of their primary
  data store.
  However, there are many stages in the life cycle of research data, and many secondary
  and tertiary services required. We have already identified archiving for example. Thus
  rather than simply continuing with ad-hoc arrangements between heterogeneous local
  storage and a central archive service, the generalisation is to consider a model which
  federates local storage with college level and university level storage to provide a range
  of hierarchical storage management possibilities.
  This is best illustrated by example. The primary active data may be kept locally. This is
  annotated by the local data administrator to indicate if and when such data becomes
  inactive and can be migrated to, say, a college server. Eventually data may become
  archivable. The administrator also annotates the data to indicate the level of security
  and backup required. These secondary services can then be implemented on the most
  appropriate storage at the most appropriate level of centralisation. It is axiomatic that the
  access control authority associated with this must be devolved to local data
  administrators.
  Many research areas are critically dependent upon storage of confidential data (e.g.
  patient derived data) requiring a high level of security. For others it is desirable that the
  data is almost publicly visible for the benefit of collaborators outside of the University.
  We have also already noted the call for simple to use data encryption.
  Clearly the detailed design of such a system would require a significant amount of
  consultation and work outside the scope of this report. It should therefore be the focus of
  a specific design group involving peers across the university.
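The annotation-driven tiering in the example above can be sketched as follows: a local data administrator flags datasets, and the flagged ones become candidates for migration to college- or university-level storage. The tier names and catalogue fields are purely illustrative assumptions.

```python
# Sketch of annotation-driven hierarchical storage management: a local
# administrator marks datasets 'inactive' or 'archivable', and a policy
# pass decides the migration target. Tier names are hypothetical.

def migration_candidates(catalogue):
    """Return (dataset_name, target_tier) pairs for flagged datasets."""
    moves = []
    for d in catalogue:
        if d["archivable"]:
            moves.append((d["name"], "university-archive"))
        elif d["inactive"]:
            moves.append((d["name"], "college-server"))
    return moves

catalogue = [
    {"name": "survey-2009", "inactive": True,  "archivable": False},
    {"name": "trial-raw",   "inactive": True,  "archivable": True},
    {"name": "live-run",    "inactive": False, "archivable": False},
]
print(migration_candidates(catalogue))
# [('survey-2009', 'college-server'), ('trial-raw', 'university-archive')]
```

A real system would also carry the security and backup annotations mentioned above, and access control decisions would remain with the local data administrators.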

  4.4    Databases
  Research Data is often stored within databases. Databases are used to help store,
  query, organise, visualise and manage data. Databases also help express how data
  items relate to one another by defining the relationships between data collections. For
  the purposes of this working group, there are two main types of database systems.

  The first type is "Desktop Computer Application" software. These are file system based
  databases. This includes software titles such as MS Excel, MS Access, MS Visual
  FoxPro and FileMaker Pro. These databases are essentially files that are stored
  on a file system and require a single software application to open these files to directly
  view, manipulate and save their contents. The database is only "live and operational"
  whilst a user has the data open within the software application. Generally these are
  single user database systems that can only easily be updated by one user at a time.

  The second type is "Client-Server Database Server" software. This type of database
  server runs as a network service / daemon. This includes products such as MS SQL
  Server, MySQL Server, Oracle Database Server and PostgreSQL. These
  are all ANSI SQL-compatible servers. The client software issues queries to the
  database server, often using a standard query language known as SQL, to view, modify
  or delete data within the databases hosted on the database server.

  When using databases to support research projects, there are several considerations
  to make:
  •     Backup: It is important to ensure that the databases can be backed up for disaster
        recovery purposes. It can be difficult to backup files if they are always open.
        Special backup agents or backup procedures may be required to help facilitate
        and automate these procedures.
  •     Snapshots: It is important to ensure that snapshots of the data are made
        whenever key project events occur; this could include system upgrades, data
        analysis, project presentations, final analysis and publication.
  •     Archive: Databases can grow to very large sizes and sometimes data will need to
        be archived off to ensure the system remains responsive. Data is also archived
        when a project comes to an end and the databases no longer need to be made
        available for access. Research data often needs to be retained for up to 30 years.
        Archived data should be as free from proprietary formats as possible.
  •     Replication: Databases can often hold vital data that needs to have very high
        availability. Database servers often provide mechanisms for data to be replicated
        between database servers; these servers could be in placed in different
        geographical locations to ensure safety from fire and flood.
  •     Triggers: Databases can also contain automated logic / business procedures that
        allow the automated execution of procedures when particular events occur; such
        as the insertion of data into a table.
  •     Audit Logs: It can be vital that database systems track and maintain a history of
        "who did what and when". This is especially important when systems need to be
        compliant with complex research and data legislation.
•     Security: Some desktop database applications and most database servers allow
      for very precise configuration of access control and permissions. A user can be
      granted granular permissions to read, modify or delete data within whole
      databases, tables or even just single fields.
  •     Documentation: All Research Data should be fully documented with the minimum
        of a data dictionary that defines and explains each field in each table.
Large research projects often benefit from having their core data stored within SQL
database servers. Lower-risk projects may suit desktop database applications. Data
storage for database servers needs to be fast and highly available.
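The "Triggers" and "Audit Logs" points above can be sketched together. The table and column names here are purely illustrative, again using SQLite for convenience; an AFTER INSERT trigger records what changed and when in a separate audit table.

```python
import sqlite3

# Illustrative sketch: an AFTER INSERT trigger writes a row into an
# audit table, giving a "who did what and when" history automatically.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE measurements (id INTEGER PRIMARY KEY, reading REAL);
CREATE TABLE audit_log (
    id        INTEGER PRIMARY KEY,
    action    TEXT,
    row_id    INTEGER,
    logged_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TRIGGER log_insert AFTER INSERT ON measurements
BEGIN
    INSERT INTO audit_log (action, row_id) VALUES ('insert', NEW.id);
END;
""")
conn.execute("INSERT INTO measurements (reading) VALUES (3.2)")
conn.commit()
rows = conn.execute("SELECT action, row_id FROM audit_log").fetchall()
print(rows)  # [('insert', 1)]
```

A production audit trail would also record the authenticated user, which requires the client-server model described earlier.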

  4.5    Communication and collaboration
  Networks of knowledge
An interesting concept has arisen. It concerns the knowledge that exists about
storage within the University, rather than the physical provision of storage itself.
Many groups will build adequate local data stores for a variety of good reasons. While
they will be competent to do "something", they will not necessarily be in a position to
benefit from the combined University knowledge of best practice. Naturally such
knowledge is already shared at some level, but mainly on an ad hoc and best-efforts
  basis. The suggestion is that “Networks of Knowledge” should become first class
  functions of IT staff (meaning that this is recognised as a primary function of the role as
  much as running a physical service is, and that such recognition feeds into
  advancement in the normal way).

  Specific examples which have been mentioned are:
•     Local storage solutions: for the design and construction of modest local research
      storage facilities, and to benefit from the experience of previous projects.
•     Databases: many areas of research depend upon databases, and yet in many
      cases the use is very simplistic and inefficient. Harnessing the University's
      knowledge here would be very helpful.
  •     For preparation of grant applications: many grant applications go in with a
        canonical £xxx for computing, but no other detail. It could be very helpful to have
        access to consultants to provide a more meaningful assessment of computing and
        storage requirements and costs, including costs for central service provision.
  Enabling extra-institutional collaboration
  This was identified as being important to many across colleges. Probably the majority of
  high profile research is conducted in collaborations which span institutions and
  countries. The natural virtual-group is therefore not identified with a physical institution.
  Accessibility of data to all virtual collaborators, regardless of where collaborators are, or
  where the data is physically stored is a high priority requirement for many.
Pooling initiatives will benefit from ease of sharing across institutions. We can see
clear evidence of this in the growing use of "Google"-like services (mail, document
sharing, calendars) to get around current limitations of university systems. Another
example is the use of a versioning system (e.g. Subversion), which provides a
standard, institution-independent means of access with fine-grained access control.

  An important point emphasised during consultations is that the authentication system
  must recognise that no individual university is at the “centre” of a research collaboration.
  Furthermore, researchers are generally not prepared to go through the process of
  obtaining an account at another institution and having to remember a different username
  and password. These facts pose problems for managing identities - one cannot expect
  all users to have identities allocated by the University of Edinburgh.
  Some researchers use Google services (e.g. Googledocs) for inter-institutional
  collaboration, in part because many collaborators are already likely to have a Google
  account, and it is institution-independent.
  For web-based resources, Shibboleth is a standards-based system providing a
  federated identity capability. With Shibboleth, a user who is identified by and
  authenticated at one institution can access another institution’s secure web-based
  resources (subject to being authorised to do so) without further challenge. OpenID is
  somewhat similar, but does not provide as high a level of trust.
  Eduroam is a system providing authenticated access to WiFi networks in educational
  institutions across the world without requiring local credentials.
However, Shibboleth, OpenID and Eduroam are limited to Web or WiFi use, and do not
work with file access protocols such as NFS or CIFS, where (among other requirements)
programs running on a user's behalf must be able to access files using the user's
identity.

  A standards-based solution for controlling access to filestore using a federated identity
  system is needed, but it is not clear that this exists. The nearest may be the AFS
  system, which uses Kerberos to supply identification information.

  4.6    Special Collections
Within the University there exist sets of records (some not yet in digital form, such
as wax cylinders) which represent data and information of great potential scientific
value. In many cases this data is unique (there is no other copy). This is predominantly,
but not only, within the CHSS sector, where the relevant funding councils do not
routinely make resources available to curate such data. Researchers within
Edinburgh regard it as a priority that Edinburgh acts as a steward of such data for the
global scholarly community. Within CMVM there are large longitudinal study data sets in
this category. This has been passed on to the Data Management Working Group for further
consideration.

  4.7    Other storage related issues
  Awareness raising and training
In some sectors academics are not aware of the information and tools available, or
where exactly to go for assistance and advice; some may need simple file management
education. How can we capture the needs of individuals? This is one of the most
frequently raised issues in every consultation conducted with the research community,
and the University should therefore deal with it urgently.
  Trust in the service
An element of the adoption of centrally provided services revolves around trust in both
the reliability and the responsiveness of the service.
Ease of use
At least as important as the technological quality of a service is its ease of use. If a
service is easy to use and reliable it will be used; if it is cumbersome it will not be.
  Instrument Data
Everyone is familiar with user-generated data, but there is a very different category of
service-, device- or instrument-generated data. File sizes and quantities can vary, with
the scale often much greater than for user-generated data. It is normal for these
devices to be an integral part of the research infrastructure and critical to
research. Such devices and services either need to stage their data and replicate
it to a secondary data storage service, or be provided with a block-level data service
that makes the remote storage appear local to the service (SAN/NAS/iSCSI).
They can be bespoke and/or proprietary third-party systems.
  This type of data can often pose greater challenges for preserving data, service/device
  configuration and essentially the 'viewer' for the data. These devices upon which this
  type of data is generated may also benefit from dovetailing into other current University
  of Edinburgh strategies such as Service Virtualisation.
  Examples are:
  •     Microscopes
  •     MRI equipment
  •     Medical imaging PACS systems
  •     Lab Information Management Systems (LIMS)

  •     Weather stations
  This is looking to be a lower priority area simply because most such instruments have
  proprietary data acquisition systems for which the overhead of integrating this with
  central storage would be too great.
  5.     Policy issues
  This section collects together requirements arising which have implications for University
  policy, where high level commitment would be needed from the University management
  (colleges, schools, research groups or institutes, information services).

  5.1    Sustainable funding models
  The biggest non technical issue arising is that of achieving sustainable funding models
  for any service provision aggregated on a scale larger than a school. It is very difficult to
  find models which are acceptable to researchers, schools, colleges and the university
  centre all at once as the viewpoints are very different.

  At the one extreme researchers and research groups tend to see only the marginal cost
of provision when considering building their own solution. Typically power is not charged
for, and staff are often perceived to be free, even though there is an opportunity cost of
those staff not doing something else of higher value to the research. Thus local provision
  appears to be cheap compared to any central provision even though the same number
  of kW-hours and staff-hours are involved. This is not merely a perception issue, for the
  researcher will be forced to make the choice which is actually cheapest to them in £ on
  their grant. University wide arguments of lower aggregated cost have no bearing on this.

  At the other extreme it may be assumed centrally that full costs for computing are a
  matter of FEC and the research councils will pay. However this is not generally true and
  is not consistent across research councils. This is fundamentally because no consistent
  model for charging for computing under FEC exists. This is because FEC does not work
  for a resource where the cost per unit resource falls so quickly with time as it does with
  IT. If one attempts to recover the full capital cost in an FEC rate then the price charged
  to researchers rapidly becomes uncompetitive. If one instead charges a competitive
  "spot rate" to researchers then not enough money is recovered to refresh the facility.
This is a fundamental flaw in FEC, but given that it exists, the University should
develop a consistent policy to adapt to the situation.
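The pricing dilemma above can be sketched numerically. All figures here are invented for illustration: a facility bought for £100k and amortised flat over 5 years, against a market price per unit of capacity assumed to fall 30% a year.

```python
# Illustrative arithmetic only (all figures invented): a facility bought
# for 100k and amortised flat over 5 years must charge a fixed FEC rate,
# while the market "spot" price for the same capacity falls each year.
capital = 100_000.0              # initial purchase cost (assumed)
years = 5
fec_rate = capital / years       # flat amortisation: 20,000 per year

decline = 0.30                   # assumed annual fall in hardware price
for year in range(1, years + 1):
    spot = (capital / years) * (1 - decline) ** (year - 1)
    print(year, round(fec_rate), round(spot))
# By year 5 the spot price (~4,800) is roughly a quarter of the FEC rate
# (20,000): recover full capital and the price is uncompetitive; charge
# the spot rate and not enough is recovered to refresh the facility.
```

The exact decline rate does not matter to the argument; any resource whose unit cost falls quickly produces the same divergence.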

  The University as a whole needs to find a way to bring together these different
  viewpoints to achieve a more holistic approach, whilst ensuring that we remain
  competitive with peer universities. Issues are:
  •     Marginal cost to researchers of local provision is a real effect because not all
        money is the same.
  •     Opportunity cost of using local research staff to provide IT does not cost £ in the
        short term, but costs research output in the longer term.
•     Full cost at point-of-use means it is likely that some services will effectively be
      unaffordable and will not be used.
•     Comparing like-with-like: e.g. if power costs accrue to IS services but not to
      Schools, then a useful cost comparison is difficult.
  Putative Model 1: Dual Funding

  In this model the problem is resolved by setting a cost at point-of-use to be competitive

with the local solution and competitive with peers (it may be a little more, to reflect the
better service, but it should not be larger by significant factors). The difference is agreed
to come from research support funds (i.e. sustainability and/or QR funds). For this latter
component it would be for the centre and colleges to decide how to implement this (i.e.
central top slice, or attributed pro-rata out to schools, or a combination).

Advantages:
•      More likely to be used
•      Competitive with peers
•      Releases research staff time
•      The pay-at-point-of-use element ensures the service remains fit for purpose (or it will
         not be used) and hence ensures that the central element is being targeted at what
         researchers want
Disadvantages:
•      Opportunity cost of the central/college research support funds
  Putative Model 2:

  [This section will follow in consultation with Jeff Haywood.]

  5.2        Third party suppliers
There was a discussion on the possible use of third party storage suppliers (e.g.
EDUSERV, Amazon).

  As a matter of principle we believe that the University should embrace the idea that
provision of physical IT implementation is an evolving area. A long time ago researchers
provided their own networks, yet now it is not questioned that connectivity is centrally
provided. Likewise we can imagine that in the future the role of university IT divisions
  will evolve and concentrate upon defining, brokering, and managing the university
  interface to services, but not necessarily also hosting the physical hardware upon which
  they are implemented. There are already several examples in the community (e.g. email
  outsourcing) and one may easily envisage outsourced storage in the future.

  It is clear that data protection issues would have to be addressed - under what
  conditions would third party hosting satisfy (i) university requirements (ii) data protection
  requirements (iii) grant requirements ?

As an example: it would a priori seem possible to provide an archive service which is
  defined through an SLD and which is back-ended by a third party supplier. It would not
  yet be appropriate to decide “up front” that any service should be moved wholesale to
  such a third party, but it would seem sensible to pilot such implementation schemes as
  part of developing new services, with a view to gathering experience for the future. In
  this respect there may be a good opportunity to work with a JISC initiative in the near
  future and we believe this should be followed up.

  5.3        Green IT


We must maintain awareness of sustainability issues, including hardware, power
consumption and efficiency, with reference to the University's sustainability policy.

  6.     Recommendations (Draft)
  6.1    Recommendation 1: Archiving of research data
  As a top priority a Technical Design group should be set up, consisting of IS staff and
  relevant stakeholders, with a remit to design and cost an archiving service.

  6.2   Recommendation 2: Accessibility of research data to all virtual
  collaborators, facilitating extra-institutional collaboration
Accessibility of data to all virtual collaborators, regardless of where the collaborators
are or where the data is physically stored, is a high-priority requirement. Ideally, a
standards-based solution for controlling access to data using a federated identity
system is needed. The University should therefore embrace this de facto situation and
adapt accordingly in terms of the services it supports.

  6.3    Recommendation 3: Globally accessible cross-platform file store
  Almost universally users agreed that a globally accessible cross-platform file store would
  bring benefits. This should be taken within the remit of the technical group already
  proposed for the archiving service.
6.4   Recommendation 4: Backup/synchronisation of data on mobile devices
More and more crucial work is done "on the move" on laptops, including papers and
documents being prepared. This will only increase, and it is important to researchers
that this work is backed up. The University should consider possible ways of
automating the backup/synchronisation of files from laptops to a networked file system.
  6.5    Recommendation 5: Establishing networks of knowledge
  There is a lack of awareness of information and available tools among some groups of
  staff and there is no clear place they can go for assistance and advice. This is one of the
  most frequently raised issues in every consultation with the research community so far.
The University should deal with this urgently by initiating a programme of awareness-
raising and training support on research data management and storage. This should
be delivered by College Consultant Groups, who can act as consultants and brokers to
other experts.

  6.6    Recommendation 6: Federated structure for local data storage
It has been discussed, and there is evidence from other external consultation reports,
that HEIs should consider federated structures for local data storage within their
institution, comprising data stores at the departmental level and additional storage and
services at the institutional level. Such a federated system of local and central data
stores should have as its top priority the provision of the services mentioned in
Recommendations 1-4. The detailed design of such a system would require a significant
amount of consultation and work outside the scope of this report. The University should
therefore consider setting up a specific design group involving peers across the
university to deal with it.

  Appendix 1
  Members of the Research Data Storage Working Group (RDSWG)
  Members of the RDSWG Group are:

  •   Peter Clarke (Chair)
  •   Chris Adie (IS, CSE ALD)
  •   Abdul Majothi (IS, CHSS ALD)
  •   Marshall Dozier (IS, CMVM ALD)
  •   Tony Weir (IS)
  •   Prof Rodriguez (CHSS)
  •   David Reimer (CHSS)
  •   Colin Higgs (CSE)
  •   Paul Palmer (CSE)
  •   David Perry (CMVM)
  •   Mayank Dutia (CMVM)
  •   Cuna Ekmekcioglu (IS)

  The group gratefully acknowledges the assistance of Cuna Ekmekcioglu in project
  managing the preparation of the report.

  Appendix 2


  Overview from CHSS

  Research data across the College of Humanities and Social Science has a strikingly
  diverse character from subject to subject. Consequently, storage of digital data sets
  presents a complex challenge. Simply in terms of scale, at one end of the spectrum the
  ‘lone scholar’ might work with a limited body of texts. At the other end, are vast data sets
  of demographic details, or digitized sound or image files, or lexical databases of the
  entire vocabulary of a particular language.
  Requirements for access to this data likewise are varied. Some data might require
  frequent consultation, so that the boundary lines between ‘storage’ and ‘management’
  become blurred. Other data sets might need curation, but would not receive much active
  attention. And of these, some data will be sensitive, perhaps priceless and irreplaceable,
  while other sets might be more expendable. Further, in common with other parts of the
  university, different digital formats for data may require different storage solutions.

To a large extent, the needs of many researchers for data storage in CHSS are met by
the existing SAN facilities. Given the diversity of need and practice, it remains unclear
whether even this fundamental provision is being used most efficiently. Even so, with
some sectors already pushing the bounds of their allocation, there is clearly a need for
further data storage beyond this 'front-line' facility.
  At this point, the complexities implied by the sketch above reassert themselves.
  Associated with the range of scale and value of data across the College come attendant
issues of cost, funding, and security. 'Cost' brings with it the problem of raw price:
'cheap' may be desirable, but it might not be 'best'. And in any event, who pays for
storage, when funding models in Humanities research in particular do not cater as
readily for storage needs as might projects in, say, Science and Engineering?

  If data is being used from external sources, this sometimes brings with it requirements
  for guarantees of ‘security’ that go beyond simply the robustness of the equipment to
  legal issues as well. Examples include the NHS and working with the police.
  Overview from CMVM

  Data Storage needs to be appropriately provided in terms of cost, capacities, access
  controls and speeds. However there are many secondary issues of large data stores
  that quickly come to light after their provision to users. Data needs to be carefully
curated and documented, with a potential need for users to adhere to some basic rules
of records management that quite often appear to be absent. New users are good at
announcing their arrival and expressing their requirements, but at the end of their 'tour
of duty' their departure can often be silent. This poses the problem of data retention,
archiving and identifying who the longer-term data owner may be. Procedures for
archiving and documenting data (data dictionaries, meta-data, etc.) are considered to
be less refined than the provision and protection of operational data stores. If nobody
claims ownership within a set time and requests the data's preservation or archiving,
can we simply delete the data?

As locally provided data stores grow in size, data backup procedures are moving
from the more traditional Grandfather-Father-Son tape rotation procedures, which can
offer longer-term data recovery, to simpler 4-8 week tape rotation periods with shorter
data recovery windows. Essentially, tape backup is being used for disaster recovery
within a short data-history time window, and data archiving services are becoming
more essential.
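The Grandfather-Father-Son scheme mentioned above can be sketched as a retention rule. The particular policy here (daily "son" tapes kept 7 days, Sunday "father" tapes kept 8 weeks, month-start "grandfather" tapes kept indefinitely) is an assumption for illustration; real schedules vary.

```python
from datetime import date, timedelta

# Illustrative Grandfather-Father-Son retention rule (policy assumed):
#   son:         every daily tape from the last 7 days
#   father:      Sunday tapes, kept for 8 weeks (56 days)
#   grandfather: tapes from the first of each month, kept long-term
def gfs_keep(backup_day: date, today: date) -> bool:
    age = (today - backup_day).days
    if age <= 7:
        return True
    if backup_day.weekday() == 6:      # Sunday
        return age <= 56
    return backup_day.day == 1

today = date(2010, 11, 16)
kept = [d for d in (today - timedelta(days=n) for n in range(90))
        if gfs_keep(d, today)]
print(len(kept))  # tapes retained from the last 90 days
```

The point of the scheme is visible in the numbers: only a small, slowly growing set of tapes must be kept to provide both short-term and long-term recovery points.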

Some of these data backup and archive issues may be addressed by utilising tiered
data storage services. It is already realised that, even in the relatively short histories
of the College servers, approximately 70% of the data has not been accessed in over
24 months.
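A sketch of how such cold data might be identified follows. The 24-month threshold mirrors the observation above; note that the scan relies on file access times, which are unreliable on filesystems mounted with noatime.

```python
import os
import tempfile
import time

# Hypothetical sketch: list files not accessed for a given number of
# months as candidates for migration to a cheaper archive tier.
def archive_candidates(root, months=24):
    cutoff = time.time() - months * 30 * 86400
    found = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_atime < cutoff:
                    found.append(path)
            except OSError:
                pass  # file vanished or unreadable; skip it
    return found

# Demonstration on a throwaway directory with one artificially aged file.
root = tempfile.mkdtemp()
stale = os.path.join(root, "results-2008.dat")
open(stale, "w").close()
os.utime(stale, (time.time() - 3 * 365 * 86400,) * 2)  # fake old atime
print(archive_candidates(root))  # the stale file is listed
```

A real tiering service would act on this information automatically, leaving a stub or catalogue entry in place of the migrated file.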

Research governance and legislation dictate that we must be able to demonstrate
that the provenance (audit/history/change) of data is fully documented and/or
preserved. This can include ensuring that raw data is kept separate from processed
data; alternatively, maintaining a history of how database records and fields have
changed over time and a log of "who did what and when". Good Laboratory Practice,
Good Clinical Practice and ISO 27001 are all examples of the type of compliance
regimes that we need to adhere to daily.

  The University has a great resource in many of its older datasets and longitudinal
  studies. It is vital that we ensure robust mechanisms for their preservation and integrity.

Much of the data collated within CMVM has patient / human identifiable fields within
it. Due to the nature of the research, with Ethics Committee and Data Protection /
Information Commissioner approval, this is permissible, but where data crosses
institutional boundaries we may need to be careful to automatically anonymise the
data.

Collaboration between universities and their research groups is an ever-growing norm.
This occurs for many reasons. The sharing of knowledge and experience is one, but
much of the research conducted within CMVM also involves recruiting large cohorts of
patients or volunteers into clinical research trials, observational studies and medical
audits. These projects depend on such collaborations to ensure that recruitment
achieves suitable power for statistical analysis. Infrastructure developed to facilitate
such collaborations needs to have suitable levels of user authentication, data
encryption, data transfer, data storage robustness (fault tolerance, backup and
archive), data processing and manipulation, and data provenance / change audit
histories built in. Each of these items is essentially a building block for any IT
solution developed. Research groups will use centrally provided 'building blocks' if they
are effectively supported, fully documented, appropriately priced, and meet their
technical / legislative / political requirements.

The price of the original enterprise-class University of Edinburgh SAN was prohibitively
expensive for many research groups, even when planning and requesting funds via grant
applications. The newer SAN/NAS services being provided by ECDF with SATA-based
storage devices, costing approximately 250 GBP per year, with the various backup
services potentially only doubling the annual cost, are certainly helping to meet
expectations. There is probably a factor by which we can multiply the cost of a
canonical "1TB drive from PC World" and still meet expectations whilst enhancing the
robustness of the storage; it is probably somewhere between x2 and x5 the price.

Research data must never have its primary store on portable storage media (USB pen
drives, USB hard drives, laptops, etc.). There has been some discussion about the
desire for software agents that allow users to check out or check in data from a large
central store. Obviously data protection measures would need to be incorporated, such
as (total) storage device encryption or anonymisation.

Furthermore, it is important to consider how data should "follow people around" as
researchers work between or across institutes and geographic locations: secure,
platform-agnostic remote access to data.

Generally, researchers create IT infrastructure to support a research project. This will
normally include both a software application and storage for the data it generates;
sometimes it may also include lab equipment or specialist hardware devices. This type
of infrastructure could benefit from being able to synchronise its data to central data
stores, or from having the application hosted as a virtual machine on a VM host located
near to central storage (share- or block-level). When archiving data from such
systems, it may also be important to consider the 'viewer' or application required to
view the data again; virtualisation could help here, as the virtual machine could be
archived alongside the data.

There is also a concern about the current guidelines on data encryption from Information
Services. If all research data essentially belongs to the University, yet research data is
primarily stored on encrypted devices, then the University should have an obligation to
provide an encryption key management service. This would be in line with the
guidelines within ISO 27001. How else can we decrypt data for which we do not own
the encryption key?
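The key-management obligation described above can be sketched as follows. This is a purely illustrative toy (the class name and device identifier are invented), not a design: a real service would protect escrowed keys with an HSM, strict access controls and audit logging.

```python
import secrets

# Illustrative sketch of key escrow: whenever a device is issued an
# encryption key, a copy is lodged with a central key-management
# service, so the University can still decrypt the data if the key
# holder leaves. All names here are hypothetical.
class KeyEscrowService:
    def __init__(self):
        self._escrow = {}                 # device_id -> escrowed key

    def issue_key(self, device_id: str) -> bytes:
        key = secrets.token_bytes(32)     # e.g. an AES-256 key
        self._escrow[device_id] = key     # lodge the copy centrally
        return key

    def recover_key(self, device_id: str) -> bytes:
        return self._escrow[device_id]    # would be audited in practice

kms = KeyEscrowService()
issued = kms.issue_key("laptop-0042")
assert kms.recover_key("laptop-0042") == issued
```

The essential policy point is the single line that lodges the copy: without it, encrypted research data is only as recoverable as the individual who holds the key.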

  Data storage services provided to research groups would need to have clear service
  level agreements / service level descriptors available for them. There must be clear
  guidelines provided with a level of trust brokering whereby the service can be managed
  locally by the research group yet the 'metal / storage' is in fact somewhere else.

  There could be a centre of research excellence that would provide consultation to
  research groups; enabling research groups to do something better within the realm of
  what they already do.

  There is a large void in the provision of robust data archiving procedures used within
  research groups. As before, within CMVM much of the research data needs to be
  retained and managed for up to 15 years after the research project concludes. This
  would normally mean that a lot of local knowledge may be lost due to staff redeployment
  / relocation and therefore robust data documentation / meta-data needs to be stored
  along with these datasets.

It is important that data stores are fully cross-platform compatible. The current OS
ratios are roughly 70:20:10 for PC:Mac:Unix. Integration with existing authentication
mechanisms (Active Directory or EASE) needs to be perpetuated. The way this
storage is presented needs to be carefully planned and managed. Endless Samba
shares out in the ether will not work well; they may need to be collated using a
Distributed File System (DFS) service. Local IT professionals will want to be able to
control and manage the access control lists for the data stores presented to their
research communities locally, in a quick and simple way, just like Access Control List
(ACL) security modifications to NTFS data stores. Furthermore, block-level storage
services may be required to allow existing infrastructure to expand its 'apparently
local' file storage systems.

  Overview from CSE

CSE is quite diverse in terms of its storage requirements, the technical involvement of
research staff, and policies with respect to in-house vs central provision. Storage
  requirements can be very large (10s to 100s of TBytes for large simulation outputs) and
  much storage is required near-line for compute purposes. In many cases data needs to
  be available to large national or international collaborations as easily as it is to internal
  groups. The significant technical experience of members of CSE leads to a tendency to
  self provide for reasons of cost, agility, and resilience.

  The following were expressed during initial consultation with CSE and by subsequent
  submissions to the CSE representatives.
 • Cross platform: All respondents indicated strongly that cross-platform access to data is a must. There is, of course, a bias in the way returns were solicited, in that this feature was explicitly mentioned.
 • Cost: Everyone mentioned cost as a factor, and everyone wants storage to be as cheap as possible. It is clear, though, that SAN storage costs too much for many purposes.
 • IS-hosted/locally hosted split: There is variation in whether IS storage is used at all, with some using a small amount of storage on the scieng server and some using storage on eddie, but the amounts of IS-hosted storage were generally not large, with the majority of storage provided internally by the school (Physics was the exception among the 5 returns, with 13.5 TB provided by IS and 8 TB internally). The main reasons for not using the storage provided were cost (for anything involving the SAN) and lack of cross-platform access (the scieng server).
 • Backup: It would seem that everyone expects some level of backup to be applied to even the most basic service, so this must be factored into even the lowest-cost solution.
 • Archive: Many respondents indicated that archiving was a problem and that they would like to see something provided for data archiving purposes.
 • Collaboration: Researchers very often work in widely distributed collaborative teams. Data access mechanisms (such as network file systems) should at least be technically capable of allowing external contributors and flexible access control.
 • Remote access: Storage must be accessible from around the world. This implies a security model good enough that there is no need to block access at the edge of EdLAN.
 • Attractiveness relative to local disk: Many schools are fighting end users' impulse to store important data on local hard disks. There should be some "official" form of data storage that is more attractive to the end user than this, so that they feel less tempted to do so. That probably means a cheap storage service of some kind.
 • Basic + plugins: Could there be cheap basic services with a number of pay-for options providing other features?
 • Federated system: Could there be a storage "service" which is not a single, centrally run service at all, but a federation of services, some run by IS and some by schools, all compatible with each other and seamlessly connected? AFS might be made to work this way, for example.


  Overview of existing IS research storage provision

  Information Services currently provides general file store via the College file servers and
  ECDF network attached storage (NAS) gateways.

  Each College has a dedicated file server, implemented as a two node Windows cluster
  with storage held on the University's central Storage Area Network. This provides a
  highly resilient infrastructure. The file servers host a mixture of home directory space
  and shared file space for specific groups. Access to this storage is available as a
  Windows share (CIFS). Charges are on a capital and recurrent model.

ECDF provides central research IT facilities to the University. The ECDF compute cluster (eddie) provides access to the cluster file system via standard file-sharing protocols. The primary purpose of this access is to upload data to be processed by compute jobs running on the cluster and to download results, but research groups do use it as a generic file store. Data is usually organised in shared group areas; personal areas are less common. Storage is provided to the cluster file system from multiple resilient storage servers connected to shared research SAN storage. Access is via Samba, NFSv3, SSH, and FTP. Data can be selectively published via HTTP(S), EASE-protected if necessary. This service is charged on a rental model.
Work is currently under way to reshape this service to better meet the requirements of research projects.

  Appendix 3
  Data on storage from 2007 research computing consultation
A research computing online survey conducted in 2007 returned 503 responses over a period of 2 months. Of the total, 292 responses were from CSE, 136 from CHSS, and 74 from CMVM. 72% of responses were from academic/research staff and the remainder from PhD students.
Seventeen percent of the research staff and students who participated in the survey state that they work with data ≤100 MB, whereas the largest group (43%) has to store data ≤100 GB (Figure 1). Their data is mainly backed up on memory sticks, CD/DVD, and local hard drives. Some data is backed up on school or college servers.

Twenty-six percent of the data collected by research staff and students would be impossible to re-create in the event of a loss. According to 44% of responses, it would take months to years to recreate the lost data (Figure 2).
Sixty percent of the consulted staff and students want to retain data, whereas 30% state that they are obliged to retain it (Figure 3). Sixteen percent would like to retain data for over 20 years (Figure 4).

[Bar chart omitted: staff responses by volume of data stored (<100 MB, <1 GB, <100 GB, <1 TB, <10 TB, >100 TB)]

Figure 1: Amount of data stored by staff

Source: Cuna Ekmekcioglu, Research Computing Consultation Report, December 2007. Available from the Edinburgh DataShare repository.


[Bar chart: staff responses by work needed to recreate data (couple of days: 65; weeks: 82; months: 152; years: 71; impossible to re-create: 133)]

Figure 2: Work needed to recreate the data in the event of a loss


[Bar chart omitted: staff responses on whether they retain data (no data to retain, wish to retain, obliged to retain)]

Figure 3: Who retains data?


[Bar chart omitted: staff responses by period data needs to be retained (<5 years, <10 years, <20 years, >20 years)]

Figure 4: How long data needs to be retained

ITC: 16.11.10                                                                                  B
                                                                                  Appendix D

Research Data Storage – Implementation Plan

1 Summary
This paper:
− proposes a storage implementation for the emerging requirements of the Research Data Storage Working Group (RDSWG), and for the services implied by the Research Data Management Group (RDMG), by extending the storage services currently delivered by ECDF
− proposes that the implementation of general research storage and the storage used for core service delivery should be provided via a common storage implementation
− provides indicative costs for an initial research storage implementation and for refreshing the core service storage environments
− points to the next steps for implementation of both

2 Strategy
To date, separate storage environments have been implemented for research data and for core service delivery. These environments have been geographically separated and procured via different processes; however, the infrastructure has been under common management and has largely been implemented using similar technologies.
Currently, volume research data is hosted at the Advanced Compute Facility (ACF) at the Easter Bush Estate. This data centre requires a yearly power-down, which results in all services hosted at the site being unavailable for up to one day. Such service disruption is expected to be unacceptable for a large-scale research file store.
The storage environments which support the University's core services, such as Corporate services, WebCT and MyEd, were procured in 2005 and will become unsupported over the next two to three years. Approximately one third of the storage used by the core service storage platforms currently delivers the College file servers. This service overlaps with the research file store, and a review of these services could determine the possibility of common delivery or confirm requirements for a separate implementation.
Implementing storage to meet the requirements of the Research Data Storage and Management Groups at a time when a refresh of the core service storage environments is required provides the opportunity to build a common storage environment.
The storage required for High Performance Computing (HPC) would remain closely coupled to
the compute clusters, though a simple mechanism must be maintained for transfer of data from
the general research filestore to HPC storage to allow researchers to run computation against
their data sets.

3 Current Storage Services
Information Services currently provides general file store via the College file servers and large-scale research filestore via the Edinburgh Compute and Data Facility (ECDF) storage service.
ECDF has developed a file store (ECDF NAS) using a scalable cluster filesystem with automated tiering policies. It is now becoming common for the filesystem technologies deployed in High Performance Computing to be used as the building blocks for large-scale filestores.
The service is implemented using different storage tiers, with a policy engine within the filesystem automatically migrating files to the appropriate tier. This allows recently accessed data to be held on faster and more resilient storage, whilst data which has not been accessed for a period of time (e.g. > 6 months) is migrated to slower and less resilient storage. This ensures that the active data in the filestore is highly available, can be managed without interruption to service, and has no single point of failure in its data path. Data which is not being accessed will automatically be migrated to less expensive storage tiers, which may need to be taken offline for short periods for maintenance.
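The tiering behaviour described above amounts to a simple decision rule, sketched below. The tier names ("fast", "bulk") and the 180-day threshold are illustrative placeholders, not the actual policy of the ECDF service.

```python
# Minimal sketch of a tier-placement rule: files untouched for longer
# than a threshold become candidates for migration to a cheaper tier.
from datetime import datetime, timedelta

def target_tier(last_access, now, threshold_days=180):
    """Return the pool a file should live in, given its last access time."""
    if now - last_access > timedelta(days=threshold_days):
        return "bulk"   # slower, less resilient, cheaper storage
    return "fast"       # highly available, no single point of failure

now = datetime(2010, 11, 16)
print(target_tier(datetime(2010, 10, 1), now))   # accessed recently -> fast
print(target_tier(datetime(2009, 12, 1), now))   # cold for ~a year -> bulk
```

In a production cluster filesystem this rule would be expressed in the filesystem's own policy language and evaluated by its policy engine, rather than in external code.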

Recent investigation has indicated that commercial cloud storage lacks the maturity to deliver the requirements of a large research file store, and that there is no compelling cost driver. However, the ability of cloud storage to deliver benefit to the University should be kept under review, e.g. for possible inclusion as a lower tier of storage within a unified platform if costs were to fall.
A review of scalable network attached storage (NAS) solutions over the past year could find no packaged vendor solution providing the flexibility and price to satisfy the research community's requirements.

4 Proposed Implementation
The RDSWG identified the following recommendations:
   1. Archiving of research data
   2. Accessibility of research data to all virtual collaborators, facilitating extra-institutional collaboration
   3. Globally accessible, cross-platform file store
   4. Backup and synchronisation of data on mobile devices
   5. Establishing networks of knowledge
   6. Federated structure for local data storage
To address the infrastructure required for an archive service and to provide a cross-platform file store, the ECDF file store could be expanded to provide the capacities required for research data sets. This service should be relocated from the ACF and additional tiers of storage added

to provide faster write performance and a low-cost bulk storage tier. The provision of tape as the lowest storage tier should be considered.
This infrastructure has been designed to be:
   •   scalable - able to start modest in scale but grow to petabytes
   •   low cost – both in terms of capital to grow and staff resource to manage
   •   flexible – in access methods and integration with other services
This would provide a tiered storage environment able to hold archive and infrequently accessed data securely at low cost. It already provides cross-platform support via CIFS, NFSv3 and SSHFS, and data publication via HTTP(S) with EASE authentication if required. Further access methods can be deployed as required.
This infrastructure is similar to IBM's SONAS product, but deployed in an open and cost-effective way. Work is under way to build a common approach to such delivery with the Universities of Bristol and Southampton and with UCL, and there is ongoing discussion with a storage integrator to provide commercial support for such a service.
At this time extra-institutional collaboration is supported via creation of EASE friend or visitor accounts, and current research collaborations are catered for in this way (e.g. the SAGES alliance). If a more open method were required then AFS should be considered, as should the output from Project Moonshot.

5 Costs
The capital cost of reprovisioning the existing ECDF NAS service to provide a general research file store and the initial infrastructure for an archive service would be approximately £325,000 for a ½ petabyte filestore. Additional costs will be incurred if a fully automated, dual-site resilient solution is required for a portion of this data. A deployment of this scale will also require further investment in backup technologies.
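As a rough sanity check on the headline figure, the capital cost per terabyte can be derived directly, assuming ½ petabyte here means 512 TB (binary units; this interpretation is an assumption):

```python
# Back-of-envelope cost per terabyte for the proposed filestore,
# using the £325,000 estimate for a 1/2 petabyte (assumed 512 TB).
capital_cost = 325000   # pounds, from the estimate above
capacity_tb = 512       # 1/2 petabyte, assuming binary units

print(round(capital_cost / capacity_tb))  # roughly 635 pounds per TB
```

This excludes the staffing, backup, and dual-site resilience costs noted above, so it is a floor rather than a full cost of ownership.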
A service of this scale would require appropriate staffing: it is currently estimated that two FTEs would be needed to develop and administer such a service.
Due to the performance and resilience requirements of the storage needed to deliver the core service applications, it is currently estimated that the refresh of this environment would approach £500,000. There is an ongoing review of the infrastructure required for delivering core database applications, which account for the main use of this environment. This is expected to complete early in 2011, after which the costs of this refresh will be firmer.

6 Next Steps
To progress this storage implementation, the following steps should be taken:


   − incorporate the results of consultation on the Research Data Storage group's output
   − conclude technical discussion with school representatives on data access mechanisms to a
     research filestore
   − review requirements for home directory storage and possibility for incorporating within a
     research file store
   − confirm specification for the archive service
   − join with digital library on specification of storage requirements for research archive
   − conclude requirements for core/corporate service data storage and progress
   − finalise implementation and costing for companion backup infrastructure for above
   − test market with suitable procurements

Tony Weir
3rd November 2010
