Towards the Australian Data Commons


A proposal for an
Australian National Data Service




The ANDS Technical Working Group
October 2007




   16-18 Mort Street, Canberra ACT 2601 | GPO Box 9880, Canberra ACT 2601 | Tel: (02) 6240 8111 | www.dest.gov.au |
                                                ABN 51 452 193 160
AeRIC                                       ANDS – Towards an Australian Data Commons                                                                    30/11/07




Contents

OVERVIEW
    THIS DOCUMENT
PART ONE – INTRODUCTION
    BACKGROUND
    WHY DATA, WHY NOW?
    SYSTEMIC APPROACH TO RESEARCH DATA
    THE AUSTRALIAN NATIONAL DATA SERVICE
PART TWO – RATIONALE FOR ANDS
    RESEARCH ACTIVITIES AND DATA
    A RESEARCH DATA COMMONS
    RATIONALE FOR ANDS PROGRAMS
    RATIONALE FOR ANDS SERVICES
    ROLE OF METADATA
PART THREE – IMPLEMENTATION
    THE OVERALL ANDS STRUCTURE
    FRAMEWORKS PROGRAM
    UTILITIES PROGRAM
    REPOSITORIES PROGRAM
    RESEARCHER PRACTICE PROGRAM
ATTACHMENTS
    ATTACHMENT A: PROPOSED ANDS UTILITIES
    ATTACHMENT B: INDICATIVE BUDGET
    ATTACHMENT C: THE ANDS TECHNICAL WORKING GROUP








OVERVIEW
The expression heard most often during the consultation process of developing the investment plan for
the National Collaborative Research Infrastructure Strategy Platforms for Collaboration capability
was, simply, “it’s all about data”.
A variety of proposals and developments that deal with the research data problem are in progress in
every relevant jurisdiction worldwide, and a large number of reports and discussion papers on this
issue have informed this paper.
The development of ANDS is intended to provide the essential meeting place where the Australian
path forward for research data management can evolve and where a vision can be achieved. This
vision will articulate over time policies and guidelines that are readily understood and interpreted
while simultaneously creating exemplars of best practice covering:
    •   research data ownership and the roles and responsibilities associated with ownership;
    •   access to research data collected and maintained with public funding; and
    •   best practice for the curation of experimental, research and published data.
ANDS will have succeeded if, in 10 years, the answers to these questions are agreed and well known
and put into practice in everyday research.
The technical quality of the outcome will be determined by the ease with which research data and
research outputs from all sources can be discovered and reused across disciplines and over time
through an integration of repositories and data centres supporting national and specialist discovery
services.

This Document
This paper is designed to encourage, inform and ultimately summarise the discussions around the
appropriate strategic and technical descriptions of the Australian National Data Service; that is, to
fill in the outline in the Platforms for Collaboration investment plan.
To do so, the paper:
    •   introduces the Australian National Data Service (ANDS) and the driving forces behind its
        creation;
    •   provides a rationale for the services that ANDS will provide, and the programs through which
        the services will be offered; and
    •   describes in detail the ANDS programs.
Part One (Introduction) provides a brief summary of the reasons to focus on data management, as well
as an overview of ANDS, and identifies some issues associated with implementation.
Part Two (Rationale) sets out the systemic issues associated with achieving a research data commons,
and provides the resultant rationale for the services that ANDS will offer and the programs through
which they will be delivered.
Part Three (Implementation) sets out in detail the Aim, Focus, Service
Beneficiaries, Products and Community Engagement activities for each of the ANDS Programs.
Attachment A provides additional detail of the Proposed ANDS Utilities Program.
Attachment B provides a high-level indicative budget for ANDS.
Attachment C lists those who have contributed to the creation of this document.







PART ONE – INTRODUCTION
Background
The Platforms for Collaboration investment plan proposed the establishment of an Australian National
Data Service (ANDS) as a cooperative centre with expertise in research data management.
In doing so, it intended to address the needs raised by the PMSEIC Data for Science Working Group [1],
which discussed at length the idea of a new National Centre for Data for Science. There was
considerable support within the Group for a Centre, and it was felt that such an initiative would be of
benefit and may be a useful mechanism for progressing many of the recommendations the group made
to PMSEIC.
The Platforms for Collaboration ANDS proposal included three mutually reinforcing groups of
services focussing on shared services, stewardship and outreach. The proposal suggested that:
         While it would be possible to focus on any one of these and still provide value, the
         development of a national centre of expertise will be significantly enhanced by embodying
         and bringing together knowledge from all three.
Other activities thought to be most important to ANDS include the identification, installation and
adoption of user-centric tools and the identification, communication and deployment of policies and
technologies that allow research teams, research communities and the general public to gain seamless
access to data collected within multiple institutionally operated repositories.
Thus, the fundamental intention is that ANDS should provide common services in support of research
data collections and provide integration infrastructure that facilitates sharing and reuse of data, so that
researchers can more easily discover, access, use, analyse, and combine digital resources as part of
their activities.
ANDS should also support and advise researchers and research data managers in appropriate digital
preservation strategies.

Why data, why now?
We are in a data deluge. It can only continue and grow in intensity as the number, frequency and
resolution of data sources rises; as information becomes universally ‘born digital’; as the capacity to
process, transform and transfer information expands; and as the dependence on data increases.
         Consequently, increasing effort and therefore funding will necessarily be diverted to data and
         data management over time.
As with other movements towards the digital sphere, technology development is leading. Relative
reductions in the cost of some hard infrastructure mean that the labour component of data management
(including data gathering, curation, analysis and reuse) is the dominant unresolved problem. For
instance, the expertise needed to better design, implement and sustain data management systems is
increasingly hard to source. This is as much the case for local storage and curation solutions as for
broad scale national and international data federations. Such difficulties are exacerbated because the
rectification of issues in data management is enormously labour intensive. So, just as more data
becomes digital, the difficulty in meeting the rising cost of managing that data may lead to the loss of
more data.




[1] FROM DATA TO WISDOM: Pathways to Successful Data Management for Australian Science, Working Group
on Data for Science, Report to PMSEIC, December 2006. See
http://www.dest.gov.au/sectors/science_innovation/publications_resources/profiles/Presentation_Data_for_Science.htm



        Consequently, forms of data must be standardised, the frameworks around retention,
        storage, access and use of data must be simplified, and differences whose resolution
        requires labour must be eliminated, if the ongoing retention and reuse of data is to
        remain affordable.
As more data is born digital and captured and managed electronically, the potential to search, mine
and reuse that data also grows. If it is made accessible, data can be reused and re-applied to answer
questions and support understandings unrelated to the reasons for which it was initially captured,
created or held. In this way, ever-larger questions, such as those related to climate, ecology and health,
can be asked more confidently and more frequently. Essentially, the value of data can grow as it forms
collections and aggregations and becomes more available. Hence, the community interest that has
always existed in the aggregation and reuse of data is growing just as the community investment in
data management (both capital and operational) is also being asked to grow.
        Consequently, greater clarity is needed over control of, and access to, community-funded
        data, and the means of aggregating, federating and accessing such data are increasingly
        important.

Systemic Approach to Research Data
Moving beyond practices developed when data was far less accessible than it now can be, a more
systemic view of the framework for data management could be postulated.
The fundamental agreements that could be built into a national research data management framework
appear to be that:
    •   data is an increasingly important and expensive ingredient of research activities so that
        increasing attention to its efficient and effective management is justified and necessary;
    •   all research sponsored by the public is by definition sufficiently valuable that the creation of
        data management plans for all forms and instances of research data is justified and necessary;
    •   the collective sponsors and funders (both institutions and funding agencies) of data capture
        and ongoing data management have the right to inform data management plans and policies;
        and
    •   the collective sponsors and funders of data capture and its ongoing data management have the
        enduring right to participate in determining accessibility of the data while it is retained.
While more discussion is needed, the making of a statement of principle should be possible. Pending
this, we could agree in the interim that ANDS be guided by the Accessibility Policy framework being
established by DEST under Backing Australia’s Ability.
Hence it might be possible to arrive at a systemic approach – agreed as a collective goal – regarding
retention, management and access to research data.

The Australian National Data Service
As a vision, ANDS sets out to transform the disparate collections of research data around Australia
into a cohesive corpus of research resources. This transformation would assist the connection of
Australian and international data centres, repositories and online collections to enable serendipitous
discovery, cross-disciplinary research, and cross-repository workflows.
Such a vision can only be achieved if ANDS engages the research community and stakeholders
through consultation and outreach activities; operates as an integrative activity, rather than
establishing new or separate infrastructure facilities; and extends and builds-on existing and proposed
capabilities.
Any such vision would need to be framed as a goal for a decadal (or longer) transition. The
developments related to improved research data management are system-wide, and while the solutions
rely on technology deployments, the more challenging issues relate to complex social and behavioural
change.



ANDS will operate in a context where the bulk of the responsibility for data rests with other enduring
institutions and organisations, so the bulk of the activity and funding will be deployed through
their separate decision-making. Therefore the role of ANDS must be to:
    •   provide frameworks and assistance that enable institutions and research communities
        collectively to adopt a systemic approach to research data;
    •   develop and sustain those services required for its collective perpetual implementation; and
    •   ensure that existing strengths in expertise and practice can and do migrate into this future.
Because the challenges related to federating data can best be met through coherent system-wide action,
ANDS could especially contribute through activities which:
    •   sustain the community of interest needed to build a federated solution to research data
        management and to achieve the necessary community agreements;
    •   develop frameworks for bridging from a research data federation to other federated data
        management communities;
    •   identify and provide the national operational services on which data federation depends; and
    •   assist existing data federating research communities to successfully migrate into this more
        systemic federated framework.

Objectives
In support of the roles identified above, the following objectives are proposed for ANDS. These are
founded in the belief that ANDS can contribute most effectively by developing services and activities
that enable stewardship within multiple federations of data management and data user communities.
In ten years' time, ANDS will be successful if:
    A. A national data management environment exists in which Australia’s research data reside in a
       cohesive network of research repositories within an Australian ‘data commons’.
    B. Australian researchers and research data managers are ‘best of breed’ in creating, managing,
       and sharing research data under well formed and maintained data management policies.
    C. Significantly more Australian research data is routinely deposited into stable, accessible and
       sustainable data management and preservation environments.
    D. Significantly more people have relevant expertise in data management across research
       communities and research managing institutions.
    E. Researchers can find and access any relevant data in the Australian ‘data commons’.
    F. Australian researchers are able to discover, exchange, reuse and combine data from other
       researchers and other domains within their own research in new ways.
    G. Australia is able to share data easily and seamlessly to support international and nationally
       distributed multidisciplinary research teams.

Scope of Activities

Framework Program
The Framework Program will aim to establish agreements around policies and responsibilities that
allow a cohesive network of research repositories to be established and to grow into an Australian
‘data commons’ (objectives A, C and E).
Such an activity is valuable in its own right, but ANDS needs the policy framework to be established
so that coherent federating services and data management implementations can be put in place. While
ANDS will need to drive this activity, the framework must be developed in concert with broader data
issues, agendas and interests.




The Framework Program will be best provided through a small, highly skilled team. Co-location is
likely to be necessary to ensure the high level of internal communication the team will need. This is
feasible as external communication is likely to be more formal and managed. The program will need
to be able to bring in relevant expertise on an as-needs basis.

Utilities Program
ANDS will deliver a set of utility services that facilitate discovery of, and access to, research data held
in Australian repositories (objective E). These services will include facilities for the registration and
description of repositories, and the services they support for submission, access and use of the research
data they hold. They will also include facilities for the aggregation of resource descriptions from these
repositories into a national metadata store.
At the heart of the program will be an ANDS discovery service that enables users to find data held in
Australian repositories by searching this metadata store and other key aggregations. The national
metadata store will also be accessible by specialised discovery services, federations and portals
through standard protocols. A number of models will be supported where services may search the
national aggregation directly, harvest metadata for use in community, discipline or project-specific
environments, or provide a meta-search facility across participating repositories.
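Two of the models described above (searching the national aggregation directly, and harvesting metadata for use elsewhere) can be illustrated with a minimal sketch. It is illustrative only: the record fields, the class names `Record` and `MetadataStore`, and the repository names are assumptions made for the example and are not part of the ANDS proposal, which does not specify a schema or protocol here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """A minimal resource description (hypothetical schema, for illustration only)."""
    repository: str
    identifier: str
    title: str
    keywords: tuple = ()


@dataclass
class MetadataStore:
    """Hypothetical stand-in for a national aggregation of resource descriptions."""
    records: dict = field(default_factory=dict)

    def harvest(self, repository, descriptions):
        """Add or refresh the descriptions supplied by one repository."""
        for d in descriptions:
            rec = Record(repository, d["id"], d["title"], tuple(d.get("keywords", ())))
            self.records[(repository, rec.identifier)] = rec

    def search(self, term):
        """Direct search: records whose title or keywords contain the term."""
        t = term.lower()
        hits = (
            r for r in self.records.values()
            if t in r.title.lower() or any(t in k.lower() for k in r.keywords)
        )
        return sorted(hits, key=lambda r: (r.repository, r.identifier))

    def export(self, repository=None):
        """Harvest model: hand all records (or one repository's) to another service."""
        recs = sorted(self.records.values(), key=lambda r: (r.repository, r.identifier))
        return [r for r in recs if repository is None or r.repository == repository]


store = MetadataStore()
store.harvest("repo-a", [{"id": "1", "title": "Ocean temperature 2006", "keywords": ["climate"]}])
store.harvest("repo-b", [
    {"id": "7", "title": "Voter attitudes survey", "keywords": ["politics"]},
    {"id": "8", "title": "Reef survey", "keywords": ["Climate", "ecology"]},
])
hits = store.search("climate")
print([(r.repository, r.identifier) for r in hits])
```

A discipline portal would call `export()` to build its own index, while a general user would call `search()` against the aggregation. The meta-search model, in which queries are forwarded to participating repositories, is not shown.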
The program will also include services enabling repository managers to assign persistent identifiers to
the research data they hold and to manage the persistence of these identifiers when the location of the
data is changed.
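The idea of a persistent identifier that survives a change of location can be sketched as a simple lookup service. This is a minimal illustration under stated assumptions: the class name `IdentifierService`, the `example` prefix, and the example.org URLs are hypothetical, and the proposal does not prescribe any particular identifier scheme.

```python
import uuid


class IdentifierService:
    """Hypothetical persistent identifier service: the identifier stays fixed
    while the location it resolves to can be updated by the repository manager."""

    def __init__(self, prefix="example"):
        self._prefix = prefix
        self._locations = {}

    def mint(self, location):
        """Assign a new identifier to data held at the given location."""
        pid = f"{self._prefix}/{uuid.uuid4().hex}"
        self._locations[pid] = location
        return pid

    def update(self, pid, new_location):
        """Record that the data has moved; the identifier itself is unchanged."""
        if pid not in self._locations:
            raise KeyError(f"unknown identifier: {pid}")
        self._locations[pid] = new_location

    def resolve(self, pid):
        """Return the current location for an identifier."""
        return self._locations[pid]


service = IdentifierService()
pid = service.mint("https://repo-a.example.org/datasets/42")
service.update(pid, "https://repo-b.example.org/archive/42")
print(pid, "->", service.resolve(pid))
```

Citations and metadata records would carry only the identifier, so moving the data requires a single call to `update()` rather than correcting every reference.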
In order to make data reuse more feasible (objective F) the program may need to include facilities for
the registration of metadata schemas. More advanced services to deploy these schemas to map and
search across data will need to be developed within the Researcher Practice Program.
Similarly, services to support repository submission or dissemination workflows and the actual
transfer of objects from one repository to another will need to be developed within the Repositories
Program.
The Utilities Program will require the development and establishment, or procurement, of highly
reliable operational technical services to meet ANDS specifications.

Repositories Program
It is an ANDS objective that Australian research data be routinely deposited into stable and sustainable
data management and preservation environments. To meet this objective, ANDS needs to improve
and/or supplement the available data management options at institutions (objectives A and B), noting
that ANDS is focussed on data federation and is not established to fund data management itself.
This program will need to be delivered by a group of data management professionals who provide
coherent and consistent advice, as well as reference software implementations, to multiple institutions.
Success will depend on the degree of co-operation achieved across those institutions, so regional
distribution of the team is likely to be required.
As the team should operate in accordance with established principles and policies, the requirement for
internal communication (once trained) is not high.

Researcher Practice Program
ANDS will help Australian researchers and research data managers to develop the necessary skills to
create, manage, and share high-quality research data. The aim of this program will be to improve data
management behaviour and support research communities in managing their data using available
federating services (objectives B, E and F).
ANDS may elect to work with NCRIS capabilities and other selected communities in the first few
years to develop the trans-community set of services and understand how they relate to developing and
established institutional repositories.



This program is likely to be best achieved through co-investment with research communities on a
project basis, as well as close engagement with universities and research organisations.

Broad Engagement
ANDS will not itself host research data. This is the role of institutional or discipline-based service
providers and the funding available through ANDS cannot appreciably affect the data retention rate.
Nevertheless ANDS intends to “improve data retention,” “enhance data discovery and access” and
“influence the data management practices” in every institution and research community.
ANDS will play a brokering role to assist researchers and institutions to identify existing data
storage solutions and to identify and highlight gaps in data retention infrastructure.
ANDS will therefore require a formal engagement structure with the wider community of policy
makers as well as the creators, hosts and curators of research data, and should establish forums for that
purpose as follows.
The Data Australia Policy Forum should include policy makers from DEST, ARC, NHMRC, the
universities, CSIRO, government agencies, collecting institutions and others with an interest in data
and data management. The Policy Forum should be responsible for establishing a wider coordinated
policy and funding framework to support better data management practices. It should establish a
broadly accepted outline of the administrative responsibilities for research data.
The Data Australia Service Providers Forum should include data service providers from institutions
and research communities around Australia. These might be IT or Library Directors and archive or
repository managers. This forum should be responsible for informing and implementing data
stewardship and data federation initiatives. This would include helping ANDS to define and
implement a framework for service providers.
The Data Australia Support Network should include technology and service support staff. The aim
of this network is to create cohesion and collaboration at the level of technology and support activities
for data. It will be impossible for ANDS to provide all the support required by every data-intensive
research project around Australia from a central location, so this network will allow ANDS to
strengthen the connections between institutionally-based support staff, provide strategic direction to
the group, and have the grass roots support and feedback required for the successful uptake of ANDS
services. Some ANDS training initiatives should also be informed and implemented through this
network.

Implementation Strategy
One of the fundamental problems ANDS must address is that the communities with which it seeks to
interact have differing degrees of appreciation of, and solutions to, the challenges of data management
and reuse. The members of those communities also have varying levels of readiness to act.
Therefore, the simplifying notion that all participants could move together is unrealistic, so ANDS
will need to be able to provide a spectrum of responses within any service it establishes.
There are two exceptions to this approach: the development of an overarching framework, and the
development and delivery of the utility services needed in any federated solution. They are exceptions
not because all are ready to act, but because to meaningfully provide these functions, ANDS should
provide them for everyone.
Even if a framework were agreed (such as access rights and roles and the nature of the federating
support services) and if the agreed federating support services (such as registries and unique
identifiers) were provided, then some diversity would still remain because:
    •   the state of repository development differs between institutions;
    •   the quality of data management practice by researchers differs from group to group;
    •   the level of acceptance and use of international standards differs between research disciplines;
    •   access tools are highly heterogeneous, often poorly defined and constantly undergoing
        development; and
    •   the necessary funding and expertise are not always available.
Given its limited resources, ANDS will need to schedule its interaction with potential participants and
communities to make progress where it can. The readiness of participants will be a vital consideration
in ANDS resource allocation.
Therefore the strategy will need to include elements as follows.
    A. The early development of an outline framework that allows progress to be made while the
       framework is further refined, and the refinement of the framework over time with broader
       contextual interests.
    B. The early design and implementation followed by ongoing delivery of utility services such as
       a registry of Australian repositories, a national collections and data discovery service and a
       national persistent identifier service, as well as the further development of those services and
       the addition of new services when required.
    C. An assessment of readiness and activities to transition early adopters into the framework and
       onto the ANDS access support services, for repositories and/or collections.
    D. Assistance for institutions to implement repositories that fit within the framework and which
       can support the data federation model.
    E. Assistance for researchers, research projects and research communities to develop or adopt
       relevant data management plans and then obtain the support services that those plans require.
    F. Actions that assist research groups and communities to manage and access research data
       distributed across multiple independently managed repositories.
This implies an operational arrangement that can:
    •   provide national leadership in framework development;
    •   design, implement and operate 24x7 services;
    •   support a high-tech and potentially intensive consulting and support activity anywhere in the
        nation; and
    •   deliver a coherent, sustained and high level of engagement across diverse groups including:
        researchers, data management professionals, organisational management and policy makers.
While simplification is desirable, the obvious simplification of dealing only with publicly accessible
data is not possible. The complexity in managing data for federated access and reuse begins when the
data is collected, where access may rightly be strongly restricted, and continues throughout its
lifecycle including its ultimate publication where some or all of the data is made public.
Therefore, ANDS services and activities must encompass restricted access to data across its entire life
cycle rather than merely considering data after it is released to open access.







PART TWO – RATIONALE FOR ANDS
Research Activities and Data
Research activities use and process data in very many ways:
      •    Ideas with people
      •    Observations with sensors
      •    Experiments with instruments
      •    Models with computing
      •    Analysis and comparisons using many techniques.
Much of that data is rightly private to the activity, but some of the data is sourced from outside the
activity and some data is destined for use beyond the activity, raising the question: from where does
input data come and to where could output data go?
This view has been made as simple as possible, so that the data generated by research activities is
taken to include all three of the kinds of data described in the National Science Board’s (NSB) report,
Long Lived Digital Data Collections [2]:
      •    “Observational data, such as direct observations of ocean temperature on a specific date, the
           attitude of voters before an election, or photographs of a supernova are historical records that
           cannot be recollected.”
      •    “Experimental data such as measurements of patterns of gene expression, chemical reaction
           rates, or engine performance present a more complex picture. Noting that while in principle,
           data from experiments that can be accurately reproduced need not be retained, it may not be
           possible to reproduce precisely all of the experimental conditions, particularly where some
           conditions and experimental variables may not be known and when the costs of reproducing
           the experiment are prohibitive.”
      •    “Computational data, such as the results from executing a computer model or simulation.
           Noting that if comprehensive information about the model (including a full description of the
           hardware, software, and input data) is available, preservation of the data may not be necessary
           because the data can be reproduced.”

A Research Data Commons
The question then is: from where does input data come and to where could output data go? A
potentially appealing answer from a researcher’s perspective is to suppose a ‘data commons’ into
which data output can be assigned and from which data input can be sourced.




2
    See http://www.nsf.gov/pubs/2005/nsb0540/start.jsp



Such a data commons has already been implemented within some research communities, mirroring as
it does the publication process for research results familiar to all researchers.
Reuse of data is, however, much more difficult than building new ideas on the published results of
others, for three reasons: researchers often do not produce data that is re-usable by others; each
discipline has its own needs, so there are no common research-wide standards; and research sponsors
have differing ideas on the circumstances under which data can be shared. For data to be re-usable it
must at least be discoverable, accessible, readable and interpretable, and each of these properties in
turn defines a more detailed set of policy, compliance and technological requirements.
For instance, the NSB report divides the elements of such a ‘data commons’ into three
categories of data collections:
        Research data collections are the products of one or more focused research projects and typically
        contain data that are subject to limited processing or curation. They may or may not conform to
        community standards, such as standards for file formats, metadata structure, and content access
        policies. Quite often, applicable standards may be nonexistent or rudimentary because the data types are
        novel and the size of the user community small. Research collections may vary greatly in size but are
        intended to serve a specific group, often limited to immediate participants. There may be no intention to
        preserve the collection beyond the end of a project. One reason for this is funding. These collections are
        supported by relatively small budgets, often through research grants funding a specific project.
        Resource or community data collections serve a single science or engineering community. These digital
        collections often establish community-level standards either by selecting from among pre-existing
        standards or by bringing the community together to develop new standards where they are absent or
        inadequate. The budgets for resource or community data collections are intermediate in size and
        generally are provided through direct funding from agencies. Because of changes in agency priorities, it
        is often difficult to anticipate how long a resource or community data collection will be maintained.
        Reference data collections are intended to serve large segments of the scientific and education
        community. Characteristic features of this category of digital collections are a broad scope and a diverse
        set of user communities including scientists, students, and educators from a wide variety of disciplinary,
        institutional, and geographical settings. In these circumstances, conformance to robust, well-
        established, and comprehensive standards is essential, and the selection of standards by reference
        collections often has the effect of creating a universal standard. Budgets supporting reference
        collections are often large, reflecting the scope of the collection and breadth of impact. Typically, the
        budgets come from multiple sources and are in the form of direct, long-term support, and the
        expectation is that these collections will be maintained indefinitely.
These three categories can be viewed as particular points along a continuum. As data use moves from
research → community → reference, increasing attention must be paid to long-term curation activity,
and more of the context within which the data was collected or created needs to be retained. This need
arises as the use of the data moves increasingly further away from the creation event and the creators
themselves, and so more needs to be made explicit.
For that context to be made explicit later, it must have been captured at the time of creation; hence
the systemic difficulties in research data management where reuse is the goal.
In many projects there may well be an intention to preserve the data, but the lack of resources
described above, the lack of trusted infrastructure and uncertain community priorities present
insurmountable hurdles to many researchers.

Role of institutionally supported collections
The NSB report goes on to say:
        Note that digital collections in each of these three categories can be housed in a single physical location
        or they may be virtual, housed in a set of physical locations and linked together electronically to create
        a single, coherent collection. The distinction between centralized and distributed collections can have
        important implications for developing policy for funding and for ensuring their persistence and
        longevity. Data collections may also differ because of the unique policies, goals, and structure of their
        funding agencies.




This notion that the collection is a view, possibly virtual and superimposed across a range of
institutional data retention systems, is appealing if research institutions already have their own
business drivers for supporting the better capture of the data outputs of research activities. The concept
of a federated ‘data commons’ is also appealing from financial and operational perspectives if it can
be layered on pre-existing data management infrastructure. Such a federation would also mean that the
entities that authorise, support and promote research activities also remain responsible for the
collection and dissemination of the research outputs.
Institutions could be seen to have an interest in supporting effective means of data capture and
retention for research data outputs, for a range of purposes, including:
    •   as an evidentiary record and/or archive holding representing the research activity funded and
        pursued under the organisation’s brand and with its imprimatur;
    •   to ensure evidence exists in support of published research claims, including the provenance of
        that data, so that important research can be repeated and verified when required;
    •   to contribute to the creation and expansion of national data assets;
    •   to provide certainty of access and ease of use in subsequent or further research by the
        originators or such assignees as may be determined or agreed (e.g. in support of
        commercialisation);
    •   to ensure ease of reuse by research collaborations in which the researchers or the institution
        wishes to participate, particularly where cross-disciplinary;
    •   as part of the development of profile enhancing collections; and
    •   as part of an efficient process for contributing research data outputs into general access
        collections, for discovery and reuse in serendipitous modes.
In this approach the concept of the ‘data commons’ could be achieved by providing suitable access
methods into the institutional research data stores, requiring only minimal additional physical
infrastructure.




Unfortunately, the greatest value from collected data is often derived when it is physically
co-located and amenable to deep aggregate analysis. The ‘data commons’ will therefore need:
    •   some ex-institutional resources;
    •   the ability for data to move between repositories and computing, analysis and visualisation
        resources;
    •   more effective means for synthesis and analysis across distributed holdings; and
    •   the possibility for institutions to host each other’s data.
Regardless of these nuances, it is certainly the case that much research data could be discovered and
accessed as part of the ‘data commons’ while remaining resident in institutional research data stores,
particularly if metadata describing this institutionally-hosted data is aggregated for discovery
purposes.
This accords with the summary from the Association of Research Libraries’ report to the National
Science Foundation (NSF), To Stand the Test of Time: Long-term Stewardship of Digital Data Sets in
Science and Engineering3, which includes in its findings:
           The scale of the challenge regarding the stewardship of digital data requires that
           responsibilities be distributed across multiple entities and partnerships that engage institutions.
And that:
           Responsibility for the stewardship of digital information should be vested in distributed
           collections and repositories that recognize the heterogeneity of the data while ensuring the
           potential for federation and interoperability.
            Stewardship of digital resources involves both preservation and curation. Preservation entails
           standards-based, active management practices that guide data throughout the research life
           cycle, as well as ensure the long-term usability of these digital resources. Curation involves
           ways of organizing, displaying, and repurposing preserved data.
This is still a simplified view. The idea that data originates in many research activities and finds its
way over time into the ‘data commons’ describes only one pathway for data. Large reference data sets
are sometimes created through community level investments that transcend individual research
projects, and much data originates outside of the research field altogether as depicted below.




3
    See http://www.arl.org/bm~doc/digdatarpt.pdf




In particular, it is the case that:
       •   many government agencies collect data sets as a consequence of their business, including
           environmental data, health informatics, social and economic data;
       •   some large research investments are structured to provide community reference sets directly
           into the general access domain, usually based around large research facilities that measure or
           interact with the natural environment (particle accelerators, space exploration, remote sensing
           are some examples of these); and
       •   some government agencies have legislative power to gather economic and social data from
           business and the community, but also data on some aspects of the physical environment.
Research projects that wish to draw on pre-existing data sources would need a discipline-specific lens
through which to view, aggregate and potentially process data from such a ‘data commons’ as inputs
to their specific research activities.
The diagram above depicts the ‘data commons’ as a holistic entity, whereas it can only come into
being by building inter-operation between a large variety of virtual access and physical data copies
that various sections of the data generating and using communities agree to support.
Importantly, many communities, including Astronomy and High Energy Physics, have
well-developed processes and international standards that implement examples of the Community
Research Investments depicted on the left side of the diagram.
NCRIS is undertaking additional investments in that component of the space in areas such as
biosecurity, ecology and marine sciences.
The goal of ANDS is to assist institutions and research communities, including large-scale data
acquisition investments and existing collections, to join the federation/framework and to ensure they
can operate together as a ‘data commons’.
Therefore ANDS will also need to assist those communities which have already developed separate,
discipline specific ‘data commons’ to engage within this broader framework, so that the collective
national investment in research data is made in ways that ensure the data can be more widely
discovered and reused. This would involve registering the various repositories where data resides and
providing access to the aggregated metadata so that data sources relevant to a discipline, community
or project can be discovered, viewed, processed and reused.
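As an illustration only, the registration and aggregation described above might be sketched as follows. The sketch is in Python; the class names, field names, repository names and URLs are assumptions made for this example and do not describe any actual ANDS service. Only the metadata is copied into the registry, while the data itself stays in the repository that holds it.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataRecord:
    """A descriptive record harvested from one repository (hypothetical fields)."""
    identifier: str
    title: str
    discipline: str
    repository: str


@dataclass
class Registry:
    """Registers repositories and aggregates their metadata for discovery."""
    repositories: dict = field(default_factory=dict)  # repository name -> access URL
    records: list = field(default_factory=list)       # aggregated metadata only

    def register(self, name: str, access_url: str) -> None:
        self.repositories[name] = access_url

    def harvest(self, name: str, records: list) -> None:
        # Only metadata is copied; the data stays in the source repository.
        if name not in self.repositories:
            raise KeyError(f"repository {name!r} is not registered")
        for rec in records:
            self.records.append(MetadataRecord(repository=name, **rec))

    def discover(self, discipline: str) -> list:
        """A very simple 'lens': records relevant to one discipline."""
        return [r for r in self.records if r.discipline == discipline]

    def locate(self, record: MetadataRecord) -> str:
        """Return where the described data can be accessed."""
        return f"{self.repositories[record.repository]}/{record.identifier}"


# Example use with two made-up repositories.
registry = Registry()
registry.register("uni-a", "https://data.uni-a.example")
registry.register("agency-b", "https://archive.agency-b.example")
registry.harvest("uni-a", [
    {"identifier": "reef-2006", "title": "Reef temperature survey", "discipline": "marine"},
])
registry.harvest("agency-b", [
    {"identifier": "rain-1990", "title": "Rainfall records", "discipline": "climate"},
])
marine = registry.discover("marine")
print([registry.locate(r) for r in marine])
```

A real discovery service would need richer metadata, access controls and a harvesting protocol; the sketch shows only the division of labour between the registry and the repositories.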
The complexity of inputs to the data commons and the number of jurisdictions that could be involved
were recognised in the report of the UK Research Information Network, Stewardship of digital
research data: a framework of principles and guidelines4, which commented:
        What kind of policy framework do we need?
        7. Given the wide range of contexts in which research is conducted, the need is for a broad and shared
        framework to provide a basis on which key agents can develop approaches adapted to their needs. Such
        a framework must meet a number of requirements.
        Roles and responsibilities
        8. The first requirement is that the framework should be developed collaboratively and help to make
        explicit the roles and responsibilities of the key players in the research and research communications
        processes: researchers themselves, funders, the institutions in which researchers work, those who access
        and use data, and organisations such as libraries and archives that take responsibility for access and
        long-term preservation.
        Sensitivity to the needs of researchers and the contexts in which research is conducted
        9. The institutional, disciplinary, ethical, and funding contexts in which researchers gather and use data
        vary considerably, and policy and practice must be sensitive to those differences. One set of
        arrangements will not be appropriate for all circumstances. It is thus essential that researchers
        themselves, in their various disciplinary and institutional settings, should be fully engaged in adapting
        and refining the framework, so that it takes full account of their needs and aspirations.
        Sensitivity to the requirements of different kinds of research data
        10. The framework must be sensitive to the different kinds of digital research data - from texts and
        numbers to audio and video streams - and to how they were generated
        o    for different purposes and through different processes
                  ▪ scientific experiments, which may in principle be reproduced, although it may in practice
                    prove difficult, or not cost-effective, to do so.
                  ▪ models or simulations, where it may be more important to preserve the model and
                    associated metadata than the computational data arising from the model.
                  ▪ observations of specific phenomena at a specific time and location, where the data will
                    usually constitute a unique and irreplaceable record.
                  ▪ derived data, resulting from processing or combining “raw” or other data
                  ▪ canonical or reference data relating, for example, to gene sequences, chemical structures,
                    or literary texts.
        o    by different groups of people and organisations
                  ▪ from the research community itself in the course of research
                  ▪ from a variety of bodies in the public, private and voluntary sectors for a wide range of
                    purposes, where the data may nevertheless be of value for research.
        o    collected together for different reasons
                  ▪ for the benefit of those engaged in a specific project, where some or all of the data may or
                    may not retain a value beyond the life of the project
                  ▪ for the benefit of a wider group within a discipline, or across disciplines, to provide
                    reference information, or a basis for evidence-based policy-making.
        11. The scholarly communications process itself, of course, may lead to the modification of digital
        research data, or to the generation of new data in the course of selecting, processing,
        disseminating and preserving original data created by research.

4
    See http://www.rin.ac.uk/data-principles

The Problem of the Lens
The concept of looking at the ‘data commons’ through a ‘discipline specific lens’ goes directly to the
most fundamental problems in research data management, namely:
        The context in which data is captured or created defines the processing that can
        be usefully applied to it, and as a corollary:
             •   unless that context is captured and retained explicitly with the data, the
                 value of the data itself is greatly reduced to those for whom the context is
                 unknown; and
             •   a feedback loop exists as the context that must be captured with the data
                 may need to change if the nature of further processing applied to the data
                 changes.
Because of this feedback loop, a ‘community lens’ is most often developed hand-in-hand with the
methods for capturing the relevant contextual information around the actual observations, experiments
or computations, over which the ‘lens’ is to be applied.
This feedback loop also determines the origin of the private data in the opening diagram. Research
activities collect data in whatever way best supports their evolving and dynamic analysis processes
and the non-standard nature of the data management can limit its value for those not involved.
For data intended or able to be used again, the need to capture context leads research teams and
communities to develop and use data and metadata standards, and to collect data according to those
standards, so that they can derive value from accessing their shared results. Such standards currently
differ significantly based on the nature of the data collected and the context needed to support the
analysis proposed.
Thus the ‘data commons’ is naturally composed of layers of information representing better defined
(and potentially overlapping) subsets of the overall data, more valuable because further analysis can be
more easily applied to them.
Once the evolution of standards and differences across disciplines are admitted, the problem of
transforming existing data holdings also arises. Where the necessary metadata were not captured, it
may not be possible to transform the data, leading to legacy data sets and access methods, and leaving
behind ‘islands of information.’
Consequently, for the foreseeable future, the ‘data commons’ must be composed of:
      •    a federation of data realms with separate governance, access rights and decision-making
           processes (of fixed complexity and with the potential for simplifying agreements);
      •    a federation of systems each supporting a part of the multitude of data sets and their associated
           access methods and tools (of rapidly growing complexity); and
      •    a federation of standards for content and its associated contextual information (of growing
           complexity but with the potential for convergence in parts over time).
Given this complexity, it is unsurprising that the vision of a ‘data commons’ depends on significant
technological progress and is difficult to realise using today’s technologies.
All of this is well understood, as summarised in the report by the UK Office of Science and Innovation,
Developing the UK’s e-infrastructure for science and innovation5:
      …it is clear from current trends that the creation of new knowledge by synthesis of data from existing or
      ongoing experiments will become increasingly important in the next decades. However, behind this vision,
      lies a requirement on the data management technology for far greater ease in data collection, interoperation,
      aggregation and access. Technology and tooling must be developed to meet this requirement both in the
      manifestation of the data itself and the software that manages it. Critical to this will be the collection,
      management and propagation of metadata along with the data itself. Technology and tools must be
      developed which facilitate:
           1.   Metadata schema creation with optimal reuse of existing schema hence maximising the potential
                for data interoperation,
           2.   Metadata creation as early as possible and as automatically as possible, and
           3.   Metadata propagation, so that data and metadata are managed together by applications hence
                enabling additional types of analysis of the data at each stage in its evolution (provenance, audit,
                etc).
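The second and third points of the quoted list, creating metadata early and automatically and then propagating it with the data, can be illustrated with a small sketch. It is written in Python as an example only: the schema name, the metadata fields and the function names are hypothetical and are not drawn from the OSI report or from any ANDS design.

```python
import hashlib
from datetime import datetime, timezone


def create_dataset(values, instrument, schema="example-schema/1.0"):
    """Bundle data with metadata captured automatically at creation time."""
    payload = repr(list(values)).encode("utf-8")
    metadata = {
        "schema": schema,          # hypothetical name of a reused, agreed schema
        "created": datetime.now(timezone.utc).isoformat(),
        "instrument": instrument,
        "checksum": hashlib.sha256(payload).hexdigest(),
        "provenance": [],          # grows as the data is processed
    }
    return {"data": list(values), "metadata": metadata}


def derive(dataset, func, description):
    """Apply a processing step and carry the metadata forward to the result."""
    new_values = [func(v) for v in dataset["data"]]
    metadata = dict(dataset["metadata"])
    # Record which step was applied and which input it was applied to.
    metadata["provenance"] = dataset["metadata"]["provenance"] + [
        {"step": description, "source_checksum": dataset["metadata"]["checksum"]}
    ]
    metadata["checksum"] = hashlib.sha256(repr(new_values).encode("utf-8")).hexdigest()
    return {"data": new_values, "metadata": metadata}


# Example use: a made-up sensor reading converted from Celsius to Kelvin.
raw = create_dataset([20.1, 20.4], instrument="buoy-7")
kelvin = derive(raw, lambda c: c + 273.15, "Celsius to Kelvin")
print(kelvin["metadata"]["provenance"])
```

Because the provenance travels with the derived data, a later user can trace the result back to the original observation without contacting its creators.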
Therefore where broader data sharing is important, research groups, research communities and more
systemic data management activities must pursue one or more of the following options:
      •    reduce the overall fragmentation by adopting common standards for common data issues,
           (including metadata, vocabularies, file format and encodings), raising the base level of
           interoperability, or conceptually coalescing the specialist collections wherever possible,
           essentially allowing each related ‘lens’ to see more data;
      •    develop interoperation applications on the user side of the lenses (workflows and functions
           above the specialist tools), treating the data commons as a source of disparate information to
           be processed into more useful associated forms through those applications; and
      •    create translation or content interoperation solutions between the information layers by using
           semantic-mapping and ontology techniques to build smarter lenses, which might be depicted
           as tools that bridge several information islands.
In all three options, progress is both possible and regularly achieved in specific cases.
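The simplest form of the translation described in the third option is a field-level crosswalk between two metadata schemas. The Python sketch below is illustrative only and is far simpler than the semantic-mapping and ontology techniques mentioned above; all schema and field names are invented for the example.

```python
# Hypothetical field names for two discipline-specific metadata schemas,
# each mapped onto a small set of common field names.
MARINE_TO_COMMON = {"cruise_title": "title", "chief_scientist": "creator", "sample_date": "date"}
ASTRO_TO_COMMON = {"obs_name": "title", "observer": "creator", "obs_date": "date"}


def translate(record, crosswalk):
    """Map a discipline record onto common field names.

    Fields with no mapping are kept under "extra" rather than discarded,
    so that no context is lost in translation.
    """
    common, unmapped = {}, {}
    for key, value in record.items():
        if key in crosswalk:
            common[crosswalk[key]] = value
        else:
            unmapped[key] = value
    common["extra"] = unmapped
    return common


# Example use with made-up records from each discipline.
marine = translate(
    {"cruise_title": "Reef survey", "chief_scientist": "A. Researcher",
     "sample_date": "2006-03-01", "depth_m": 12},
    MARINE_TO_COMMON,
)
astro = translate(
    {"obs_name": "Supernova follow-up", "observer": "B. Observer", "obs_date": "2007-01-15"},
    ASTRO_TO_COMMON,
)
print(sorted(marine) == sorted(astro))
```

Once both records share field names, a single generic lens can browse them together, while the discipline-specific detail remains available for specialist tools.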


5
    See http://www.nesc.ac.uk/documents/OSI/index.html



However, even if common standards could be imposed, only that which is understood and agreed at
any moment can be standardised. Therefore the first option may be limited to the standards needed to
support well-understood and generic access and data management functions.
Interoperation on the user side of the ‘lens’ is in fact a programmed special case of the third option,
but is likely to be permanently useful as it represents transformations of data in the data commons into
new collections of enhanced value, which could be provided back to the commons.
The third option, although not a short-term ANDS deliverable, provides the pathway to the longer-
term solution by directly addressing the ability to represent data semantically, and base data
transformation and analysis on that semantic representation.
This current state of technology readiness therefore suggests the following principles for developing
ANDS plans:
    •   focus on services that assist the establishment and enduring operation of a virtual data
        commons across institutional and reference repositories;
    •   focus on activities that increase the value of the content of the data commons by (in order):
            o improving the quantity and quality of research data accessible in the commons;
            o facilitating discovery and access over the commons;
            o bringing quality data from other jurisdictions into the research data commons;
    •   apply content translation or user side interoperation to yield added value data; and
    •   focus on the universal adoption of frameworks and standards where they enable ‘lenses’ that
        most researchers would require, such as browsing the collections in the commons.







Rationale for ANDS Programs
The previous section has outlined the concept of a data commons, which may well represent a kind of
‘data nirvana’. However, it leads to a discussion that takes ANDS towards specific functions within
the overall ecosystem for data and suggests that an appropriate overall goal for ANDS could be:
           to deliver greater access to Australia’s research data assets in forms that support easier and
           more effective data use and reuse.
A very important problem to address in reaching towards that goal concerns the various roles and
responsibilities that might be attributed to the participants in the ecosystem, and especially ANDS.
The following table from UKOLN’s report, Dealing with Data: Roles, Rights, Responsibilities and
Relationships6, lays out responsibilities in research data management as follows.
Role & Rights                         Responsibilities                               Relationships

Scientist                             Manage data for life of project.               With institution as employee.
creation and use of data              Meet standards for good practice.              With subject community.
    • Of first use.                   Comply with funder/institutional data          With data centre.
    • To be acknowledged.               policies and respect IPR of others.          With funder of work.
    • To expect IPR to be honored.    Work up data for use by others.
    • To receive data training
      and advice.

Institution                           Set internal data management policy.           With scientist as employer.
curation of and access to data        Manage data in the short term.                 With data centre through expert staff.
    • To be offered a copy of data.   Meet standards for good practice.
                                      Provide training and advice to scientists.
                                      Promote the repository service.

Data centre                           Manage data for the long-term.                 With scientist as “client”
curation of and access to data        Meet standards for good practice.              With user communities.
    • To be offered a copy of data.   Provide training for deposit.                  With institution through expert staff.
    • To select data of long-term     Promote the repository service.                With funder of service.
      value.                          Protect rights of data contributors.
                                      Provide tools for reuse of data.

User                                  Abide by license conditions.                   With data centre as supplier.
use of 3rd party data                 Acknowledge data creators / curators.          With institution as supplier.
    • To reuse data (non-exclusive    Manage derived data effectively.
      license).
    • To access quality metadata
      to inform usability.

Funder                                Consider wider public-policy perspective       With scientist as funder.
set/react to public policy drivers      & stakeholder needs.                         With institution.
    • To implement data policies.     Participate in strategy co-ordination.         With data centre as funder.
    • To require those they fund      Develop policies with stakeholders.            With other funders.
      to meet policy obligations.     Participate in policy co-ordination, joint     With other stakeholders as policy-maker
                                        planning & fund service delivery.              and funder of services.
                                      Monitor and enforce data policies.
                                      Resource post-project long-term data
                                        management.
                                      Act as advocate for data curation & fund
                                        expert advisory service(s).
                                      Support workforce capacity development
                                        of data curators.

Publisher                             Engage stakeholders in development of          With scientist as creator, author and reader.
maintain integrity of the               publication standards.                       With data centres and institutions as suppliers.
scientific record                     Link to data to support publication
    • To expect data are available      standards.
      to support publication.         Monitor & enforce public. standards.
    • To request pre-publication
      data deposit in long-term
      repository.

6
    See http://www.jisc.ac.uk/whatwedo/programmes/programme_digital_repositories/project_dealing_with_data.aspx

It is self-evident that a ‘data commons’ can only be constructed with the participation of the holders of
data, a role that is explicitly provided for in the table above under the heading Data centre.
In the preceding discussion this role has been implicitly assigned to institutions.
Identifying this primary stewardship role of existing research institutions is a key distinction and
decision point for ANDS. It sets the agenda for developing an Australian research data commons,
which essentially defines a model that leaves the data where it resides and provides access methods to
that data within those repositories as a virtual data commons.
Programs of activities within ANDS will therefore need to contribute towards this end in a series of
interlocking and mutually reinforcing ways.
Given the way responsibilities are shared across the participants in the research ecology, those
programs will also have to have a suitable impact on researchers, research administrators, research
support staff, research institutions, research funding agencies and government departments, if the goal
is to be achieved.
Key issues that will therefore need to be addressed through the detailed programs of ANDS include:
    •    deciding the standards that need to be adopted, and by whom and how, and the translation
         systems that ANDS might operate where different standards are unavoidable;
    •    determining which interfaces into what repositories are needed for a data commons to exist –
         and how to achieve that, and what new repositories are needed – and how to help them into
         existence;
    •    designing and providing utilities that support: registration of the data making up the data
         commons; a generic discovery lens; and access of found objects through persistent identifiers;
         and supporting the on-going development and sustainable operation of these utilities;
    •    identifying communities to work with to co-develop more specialised lenses that can assist
         those communities to make use of and rely on the data commons, and to understand the further
         implications of such developments;
    •    looking beyond the lenses, developing and simplifying the relationship and intellectual
         property agreements needed for data to be reused once seen.
In order to do this, another responsibility could be added to the table above, that of aggregator:

Role & Rights                          Responsibilities                       Relationships
Enable federated discovery and         Engage stakeholders in building a      With data centre as
access                                 federated metadata repository          contributor
To be offered metadata                 Maintain a registry of contributors    With user as primary target
describing data held in data           Promote the discovery service          audience
centres                                Enable harvesting of or access to      With scientist as developer
To enable the development of           subsets                                of specialised lens
more specialised lenses





The particular structure of activities proposed for ANDS is easily made apparent as an overlay on the
data commons:




A particular advantage of this separation is that the delivery and implementation modes of each
program are distinct, as detailed below.
The specific proposition is that ANDS should pursue four programs of activity based on the key parts
of the research data commons where change can assist ANDS to achieve its goal.
    1. A Frameworks Program to establish community consensus on broad issues such as roles,
       responsibilities, policies, ownership, and licences for reuse. The result will be an overall
       framework that allows ANDS goals to be achieved, and provides researchers and institutions
       with guidelines and policies that better support data reuse.
    2. A Utilities Program to provide the common utilities infrastructure needed to aggregate data
       federations, repositories, and services into a research data commons. The result will be an
       inter-connection of institutionally-supported data repositories that provides seamless
       discovery and access across a virtual data commons.
    3. A Repositories Program to increase the number of reliable and sustainable research
       repositories able to hold reusable data and participate in the commons. The result will be
       institutionally-supported research data repositories that help researchers and research
       communities adopt the guidelines and policies developed by the frameworks program.
    4. A Researcher Practice Program to build the capacity and capability to create and reuse high-
       quality data within research communities seeking to do so. The result will be enhanced
       practice by researchers in managing data, the lodgement of increased volumes of high quality
       data into the commons, and the wider deployment of tools and practices that support effective
       data reuse.







Rationale for ANDS Services
ANDS is a real-world initiative that requires decisions to be made around the part it will play in the
national research data commons and the functional elements it will provide as services.
The services that could be required within any data commons can be broadly visualised as follows.




The diagram provides an idealised framework to describe key services needed for a data commons.
Importantly, there is no assumption of a single national or global commons, but rather a series of
agreement-based groupings, aligned with projects, disciplines, or communities, and which may exist
for short periods or in perpetuity. The diagram draws together common or foundational services that
would underpin such a federation of federations.
At the top, users access materials in the collections through ‘lenses’ such as portals, applications,
browsers, etc. Discovery may occur through the ANDS generic lens or through a specific lens. Users
may sit in more than one federation at any given time, especially as trans-disciplinary research
becomes more common. The applications are likely to be ‘federation-aware’ since they need to share
standard interfaces appropriate to a federation, but users should be ‘federation-agnostic’, within the
rights they hold to access a federation.
The diagram identifies services specific to a domain, a discipline, or a community. Some domains may
not need all of the services, some may need additional services, and many will require services to be
specifically tuned to their discipline requirements. The main federation-specific element at the centre
of the diagram indicates services in a very generic fashion, i.e. they might be instances derived from a
common code base, but then adapted, populated and operated by discipline-specific interests.
The Human Resource and Meta-Services identify a set of services that have cross-domain purposes.
They could be provided as a national platform; some may appear globally, and some might be provided
in other countries that researchers link with.



The services to be offered within ANDS programs can be based on these functional elements and
established as a priority from an ANDS perspective if they are:
    •   fundamental to making a national data commons emerge; or
    •   highly relevant to the current needs of research data federating communities in Australia.

Component Services

Security and VO Management Framework (A)
Access to the enabling services and the content within a federation must be mediated through an
appropriate security, trust and virtual organisation framework. This access framework ideally would
leverage the frameworks for security and VO management in other application areas, thus facilitating
common access control for applications involving for example data access, high-performance
computing, visualisation and collaboration.
As the rules for access to some data may be complex, and as use of the data commons depends on
trust, it may be appropriate to provide a registry of licensing and rights descriptions, noting that access
controls apply and change throughout the lifecycle of any data collection. Restrictions may apply to
content reuse as well as editing, curation, repositioning and other low-level mechanisms.
As indicated in the diagram, these functions need to be implemented through all the services provided
by the data commons, and so no specific separate services are envisaged.

Federated Data Access Services (B)
Federated access to the multiple data sources is enabled by a suite of “commons-level” infrastructural
elements. These ‘utilities’ aggregate information about participating repositories and data centres and
create the basic substrate of cohesion on which a federation rests.
These services are fundamental for discovery, access and use of content stored in the data commons.
Examples here are federation-level registries and discovery services (at both collection and item
level), which allow users to search and discover objects within a federation of repositories.
Standardised descriptions and a collections registry are enabling elements for these services.
A federation-level service registry aggregates information about the web services or other machine-to-
machine interfaces available within a federation of repositories and data centres. It brokers access to
content discovered in the federation by listing the protocols available for accessing the data.
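As an illustration only, the collection registry, discovery service and service registry described above might be sketched as follows. This is a minimal in-memory sketch, not a proposed ANDS design: the class names, field names, repository names and the protocol labels (including the use of OAI-PMH as an example) are assumptions introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class CollectionRecord:
    """Standardised description of one data collection (hypothetical fields)."""
    identifier: str
    title: str
    repository: str
    keywords: set[str] = field(default_factory=set)


class FederationRegistry:
    """In-memory stand-in for federation-level collection and service registries."""

    def __init__(self) -> None:
        self._collections: dict[str, CollectionRecord] = {}
        # repository name -> access protocols that repository advertises
        self._services: dict[str, list[str]] = {}

    def register_collection(self, record: CollectionRecord) -> None:
        self._collections[record.identifier] = record

    def register_service(self, repository: str, protocol: str) -> None:
        self._services.setdefault(repository, []).append(protocol)

    def discover(self, keyword: str) -> list[CollectionRecord]:
        """Return collections whose keywords include the search term."""
        term = keyword.lower()
        return [c for c in self._collections.values()
                if term in {k.lower() for k in c.keywords}]

    def access_protocols(self, identifier: str) -> list[str]:
        """Broker access by listing protocols offered by the holding repository."""
        record = self._collections[identifier]
        return list(self._services.get(record.repository, []))


registry = FederationRegistry()
registry.register_collection(CollectionRecord(
    "coll-1", "Reef temperature logs", "repo-a", {"marine", "temperature"}))
registry.register_collection(CollectionRecord(
    "coll-2", "Survey responses", "repo-b", {"social"}))
registry.register_service("repo-a", "OAI-PMH")
registry.register_service("repo-a", "HTTP download")

found = registry.discover("Marine")
protocols = registry.access_protocols(found[0].identifier)
```

The registry holds only descriptions and access information; the data stays in the contributing repositories, in line with the virtual data commons model.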
A federation-level statistics service aggregates the statistics of participating repositories and data
centres. This allows users to measure usage, activity, holdings, etc. across the federation.
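A federation-level statistics service of this kind could, in its simplest form, sum the figures reported by each participating repository. The sketch below assumes hypothetical repository names and measure names; how repositories would actually report statistics is not specified in this document.

```python
from collections import Counter

# Hypothetical per-repository reports; the measure names are illustrative only.
reports = {
    "repo-a": {"holdings": 120, "downloads": 4500, "deposits": 12},
    "repo-b": {"holdings": 40, "downloads": 900, "deposits": 3},
}


def aggregate_statistics(repository_reports: dict[str, dict[str, int]]) -> dict[str, int]:
    """Sum each reported measure across all participating repositories."""
    totals = Counter()
    for report in repository_reports.values():
        totals.update(report)  # adds the counts for each measure
    return dict(totals)


federation_totals = aggregate_statistics(reports)
```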
The priorities for ANDS in this area are vital, as a data commons can only emerge at a national level if
certain fundamental elements are provided at a national scale to support functions such as:
    •   registration of standardised descriptions of data collections;
    •   registration of data collection access methods, services and policies; and
    •   aggregated/ federated discovery.
The establishment of these elements will provide the minimum base environment for the development
of a data commons, and each of these elements can leverage prior Systemic Infrastructure Initiative
(SII) projects or pilot programs.
A second order of priority for ANDS would be a statistics service at the data commons level.
Although not fundamental, this service would provide measurable feedback on usage, holdings, and
activity across the Data Australia federation. It would be fairly straightforward to implement and can
also build off existing SII work.
As a broad concept, a workflow engine service would provide capabilities to applications to allow
them to make use of a federation of repositories in a more structured and standardised fashion. These
become highly discipline-specific in instantiation, but the underlying technologies to manage
sequences of activities do support some level of standardisation and common development.

Repository Interface Services (C)
Some services for digital collections and data repositories have a generic aspect. For example, data
submission services can lower the cost of participation for all users by simplifying the process by
which data is accepted into a repository, even to the extent of interfacing data sources (such as
research activities, software or instruments).
Dissemination services on the other hand transform data items from federated sources into more useful
forms. For example, complex transformations such as visualisations are sometimes produced at a
central facility using data held in many different repositories. Analysis, mining, and fusion services
function in very similar ways. Data re-positioning services move data between repositories. Other
services here may also provide interfaces for preservation and quality assurance mechanisms.
These services directly interface to repositories, and may in fact run inside them. They add value to the
repositories themselves, and enable richer management of the federated data. The number of services
capable of being delivered to data collections in this mode is unlimited.
Priorities are therefore required, and these may be established as follows.
    •   A service that assists submission of complex data would materially assist the transfer of more
        quality data into the commons, and hence must be of high priority. The same applies to any
        service that assists in disseminating and re-using data. Development in this area could
        leverage the SII-funded Repository Interoperability Framework (APSR-ARROW-NLA), but
        would need some scoping and development (with NEAT) to focus more directly on the
        submission and dissemination requirements of key stakeholder communities.
    •   In the area of data repositioning ANDS can allow ICI and NCI to provide services for re-
        positioning data for performance. Re-positioning for protection however should be a key
        capability for the ANDS Data Repositories Program and the ANDS Framework Programs.
    •   ANDS could offer a useful commons-level service for the anonymisation of public records,
        although scoping and development would be required.
    •   An obsolescence notification service for data formats relies on a registry of data formats and
        obsolescence information, which most communities are ill prepared to support. Obsolescence
        notification will be more important for institutional repositories and data centres, as the
        diversity of their content rises. For the moment reliance is placed on the fact that data
        collections in general retain strong links between the creators and the curators.
    •   Local quality assurance services exist at collections and data centres. A generic service would
        need to be scoped through NEAT. Metadata validation and quality checking is a service that
        ANDS could quite usefully provide centrally. Independent feedback on metadata quality
        would provide considerable value to the data commons as a whole. Some kind of light-weight
        registry of metadata schema would be a component of this service.
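To make the last point concrete, a central metadata validation service backed by a light-weight schema registry might look like the following sketch. The schema name, the required fields and the function name are all assumptions made for illustration; a real service would validate against the schemas actually used by the contributing communities.

```python
# Hypothetical light-weight schema registry: schema name -> required fields.
SCHEMA_REGISTRY = {
    "collection-basic": {"identifier", "title", "creator", "access_policy"},
}


def validate_metadata(record: dict[str, str], schema_name: str) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    if schema_name not in SCHEMA_REGISTRY:
        return [f"unknown schema: {schema_name}"]
    problems = []
    for field_name in sorted(SCHEMA_REGISTRY[schema_name]):
        value = record.get(field_name, "")
        if not value.strip():
            problems.append(f"missing or empty field: {field_name}")
    return problems


record = {"identifier": "coll-1", "title": "Reef temperature logs", "creator": " "}
issues = validate_metadata(record, "collection-basic")
```

Returning the full list of problems, rather than stopping at the first, is what allows the service to give the independent feedback on metadata quality mentioned above.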

Meta-Registries and Services (D)
Continued, precise, and automated manipulation of data in the repositories will be enabled by a global
network of “meta” registries and services, which include identifier services, format and schema
registries, as well as registries to enhance cross-disciplinary views such as ontology and federation
registries. Future services here may also provide references to discipline-specific standards and
frameworks to allow users to discover them for areas they are less familiar with.
Identifier services have the highest priority. They are an essential part of a global infrastructure for
assigning and resolving persistent identifiers to significant digital objects. These identifiers are an
integral part of the “information mesh” of repositories, as they support the unambiguous and ongoing
referencing of digital objects and services on the network allowing citation and cross-referencing of
materials over time and automated workflows between people, collections and services.
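The essential behaviour of an identifier service, assigning an identifier once and resolving it to wherever the object currently lives, can be sketched as below. This is a toy in-memory model under stated assumptions: the class name, the `example` prefix and the URLs are hypothetical, and it does not represent any particular identifier scheme.

```python
import uuid


class IdentifierService:
    """Toy persistent-identifier service: the identifier is stable, the location is not."""

    def __init__(self, prefix: str = "example") -> None:
        self._prefix = prefix
        self._locations: dict[str, str] = {}

    def mint(self, location: str) -> str:
        """Assign a new identifier to an object held at the given location."""
        identifier = f"{self._prefix}:{uuid.uuid4()}"
        self._locations[identifier] = location
        return identifier

    def update(self, identifier: str, new_location: str) -> None:
        """Record that the object has moved; the identifier itself never changes."""
        if identifier not in self._locations:
            raise KeyError(identifier)
        self._locations[identifier] = new_location

    def resolve(self, identifier: str) -> str:
        """Return the current location for an identifier."""
        return self._locations[identifier]


service = IdentifierService()
pid = service.mint("https://repo-a.example.org/objects/42")
service.update(pid, "https://repo-b.example.org/archive/42")
current = service.resolve(pid)
```

A citation that records `pid` continues to resolve after the object is re-positioned, which is the property that supports citation and cross-referencing over time.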



In terms of other ANDS priorities, a national workflow service has not yet surfaced as a high priority
in PfC consultations, though domain specific federations may create their own. The situation could be
reviewed periodically. As well, registries of metadata schema, formats, workflows, and ontologies
may not yet be part of current practice and may be over-ambitious if pursued too early.
ANDS is nevertheless motivated to maintain registries that are fundamental to any services it wishes
to run over the commons. For example a registry of NCRIS metadata schemata will be immediately
useful if ANDS runs a validation service to support quality assurance activities.
Rather than viewing these as ANDS services in the ANDS Services and Utilities Program, it may be
better (for the moment) to pursue work in these areas under the ANDS Researcher Practice Program
and shift the focus of activity towards maturing community capability in all these areas. Less-formal
documentation and registers of community activity could be maintained on the ANDS web site with a
view to moving towards registries (globally or nationally) as demand dictates. Consultancy,
documentation, and tools development could be some more achievable deliverables.

Human Resources (E)
A data commons relies on human capabilities that are ensured by education, training, expertise,
advice, policy and awareness and sometimes involvement in international standards processes.
Therefore education, training and consultancy programs are required to build the capability of data
creators, data stewards, and data consumers to participate in the commons.
A community consensus is required regarding the expected behaviour of all players involved in the
data commons. Bodies such as federal funding and policy agencies, and the research institutions can
influence behaviour by the conditions associated with funding or employment. A coordinated policy
framework is needed to provide clarity, support and incentives for positive data management practices.
ANDS will be active in all elements of the Human and Organisational Resources layer. As described
above, ANDS consultancy, education, and training programs will target repository service providers,
researcher practice, and existing or embryonic domain-specific research data federations.
The ANDS Framework Program will work with policy makers and domain experts to establish a
framework of policies that support the stewardship and innovative reuse of research data.
All ANDS programs will support the development, awareness, and take-up of international and
community supported standards. ANDS aims to make data frameworks, repositories, federations and
researcher practice all more standards-based. However, the actual development of particular data
standards and protocols will need to be pursued by those with a special interest.

Repositories, Data Centres (F)
The data commons model does not require ANDS to take direct responsibility for providing data
repository facilities; this responsibility is distributed among research and governmental agencies.
Nevertheless a healthy data commons requires stable data facilities with public interfaces for
exchanging information with federation-level services and other data facilities.
Content in a research data collection is often housed in multiple repositories and data centres, and each
of these may support many federations, giving a many-to-many mapping. Therefore repositories can
only tolerate limited changes to their internal management processes and technology platforms when
providing interfaces to common federation-level services. Individual repositories and data centres also
usually add value to their own data collections by offering a number of local services such as direct
discovery and access, usage statistics, item presentation, data visualisation, analysis, curation, etc.
The Data Repositories Program needs to assist participating repositories to create the capacity to:
    •   dynamically maintain standardised descriptions and support metadata harvesting;
    •   support and document access protocols;
    •   support and document access policies;
    •   provide unique identification and support persistence; and
    •   participate in a data-mirroring cooperative.
This might translate into a “Data Australia Core Set” of capabilities and administrative and descriptive
information that the ANDS Repository Program would help to develop and deploy. The aim for ANDS
would be to catalyse an efficient and cohesive content layer nationally to enhance efficiencies for
researchers, applications developers, content service providers, IT managers, and policy makers.
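One way to picture such a core set is as a checklist against which a repository's readiness can be assessed. The sketch below is an assumption-laden illustration: the attribute names paraphrase the capabilities listed above and are not a defined ANDS specification.

```python
from dataclasses import dataclass, fields


@dataclass
class RepositoryCapabilities:
    """One flag per capability listed above; the attribute names are hypothetical."""
    standardised_descriptions: bool = False
    metadata_harvesting: bool = False
    documented_access_protocols: bool = False
    documented_access_policies: bool = False
    persistent_identification: bool = False
    data_mirroring: bool = False


def missing_capabilities(caps: RepositoryCapabilities) -> list[str]:
    """List the capabilities a repository has yet to provide, in declaration order."""
    return [f.name for f in fields(caps) if not getattr(caps, f.name)]


repo = RepositoryCapabilities(standardised_descriptions=True,
                              metadata_harvesting=True,
                              persistent_identification=True)
gaps = missing_capabilities(repo)
```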
The ANDS Repository and Framework programs should also contribute frameworks and best-practice
guidelines for data facility service providers, along with tools, resources, consultancy, and training and
education programs to improve data stewardship throughout the community.
When appropriate, these activities could lead ANDS to various ways of specifying, measuring, and
auditing data facility service provision such as service level agreements, star ratings, certification, etc.

User Portals and Applications
Data producers and data users (researchers) will contribute, access, and manipulate data in the
commons through the lens of domain specific portals and applications. Again the data commons
reference model does not require a national body like ANDS to provide all these lenses; this
responsibility is distributed among the various research communities and projects. Providing the
capability for these lenses to interact seamlessly with the data (anywhere) in the commons is a role for
ANDS, which would require standards, interfaces and tools and consultancy support to assist
researcher communities in developing this capacity.

Role of Metadata
Metadata (data about the data) comes in many different forms, and while there are many metadata
typologies, three distinctions are relevant to this discussion:
    •   descriptive (this describes and identifies information resources, and is often used for
        discovery);
    •   structural (this describes the relationships between resources, and is sometimes used to
        facilitate navigation and presentation of electronic resources); and
    •   administrative (this can include provenance, access control, and preservation information, and
        is used to facilitate both short-term and long-term management and processing of resources).
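The three distinctions can be pictured as separate parts of a single record for a resource. The following sketch is illustrative only; the class and field names are assumptions and do not correspond to any particular metadata standard.

```python
from dataclasses import dataclass, field


@dataclass
class DescriptiveMetadata:
    """Describes and identifies the resource; typically used for discovery."""
    title: str
    keywords: list[str] = field(default_factory=list)


@dataclass
class StructuralMetadata:
    """Relationships between resources, e.g. the parts making up a collection."""
    parts: list[str] = field(default_factory=list)


@dataclass
class AdministrativeMetadata:
    """Provenance, access control and preservation information."""
    provenance: str = ""
    access_control: str = "restricted"
    preservation_note: str = ""


@dataclass
class ResourceMetadata:
    descriptive: DescriptiveMetadata
    structural: StructuralMetadata = field(default_factory=StructuralMetadata)
    administrative: AdministrativeMetadata = field(default_factory=AdministrativeMetadata)


resource = ResourceMetadata(
    descriptive=DescriptiveMetadata("Reef temperature logs", ["marine"]),
    structural=StructuralMetadata(parts=["logs-2005", "logs-2006"]),
)
```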
ANDS will need to deal with different elements of the metadata problem in two contexts and support
in each an evolutionary approach that tracks the understanding and concerns of relevant disciplines.
The first context is the descriptive and structural metadata developed by research communities to
capture the complexities of data within different research fields. Each instance of this metadata will be
quite discipline specific, and often determined through international fora. ANDS could provide most
assistance by providing advice on how best to capture additional metadata, assisting inter-community
standards development, and by working with communities that have not yet developed standard
approaches.
A second context is the descriptive and administrative metadata required for the functioning of the
data commons, including metadata to help with access control, interoperability, and preservation.
ANDS could help here by providing advice on best practice and by producing tools for trans-discipline
and inter-domain searching.







PART THREE – IMPLEMENTATION
The Overall ANDS Structure
The four ANDS programs are depicted below:




Overall, NCRIS seeks to engage a single entity under an NCRIS contract to take responsibility for the
implementation of ANDS and these four programs.
The following arrangements are expected to apply.
    •   The Frameworks Program is expected to be implemented through a small number of centres
        of expertise and will need to engage strongly with federal government agencies and national
        institutions.
    •   The Repositories Program will need to provide staff able to consult with and deliver on-site
        support to all higher education institutions and significant research organizations, though the
        function may be hosted at a small number of institutions with existing expertise.
    •   The Utilities Program is likely to define services to be outsourced to suitable suppliers, which
        may be suppliers of similar national services such as for the Australian Access Federation.
    •   The Researcher Practice Program is expected to strongly engage with research communities in
        activities mixing expertise between data management and research skills, and will require
        project implementation capabilities and consulting support in all regions, such as would be
        available through the Australian Research Collaboration Service (formerly Interoperation and
        Collaboration Infrastructure).







Frameworks Program
Aims
    •   To influence relevant national policies, including undertaking the development of policies
        where appropriate.
    •   To build a common understanding of data management issues and solutions across
        government departments (commonwealth and state), research funding agencies, and research
        intensive organizations.

Rationale
A range of agreements around policies and responsibilities are needed if a cohesive network of
research repositories is to be established and if they are to grow into an Australian ‘data commons’.
While ANDS will need to motivate and energise this activity for its own purposes, the framework can
only be developed in concert with representation from a broader community of data interests.
Of particular interest for ANDS will be the detailed elaboration of the framework to ensure that
coherent utilities and data management implementations can be established.

Implementation
This program is likely to be best provided through a small highly skilled team with a high degree of
internal communication, managing a range of formal interactions and able to bring in relevant
expertise on a needs basis.
Development of frameworks with a more specific technical focus will be provided for in the utilities
and repositories programs.

Focus
In recognition of the complexity involved in a data commons, four primary areas of focus have been
identified for this program:
    •   Governments and their policy settings.
    •   Research funding agencies.
    •   Research Institutions and Organizations.
    •   Legal arrangements.

Service Beneficiaries
The frameworks program will reduce the complexity that holders of long-term data encounter in managing
and sharing data over time, and will provide greater certainty over rights and obligations for all
participants interested in research data use and reuse.

Community Engagement
Primarily, the frameworks program will have responsibility for developing and sustaining the “Data
Australia Policy Forum” described earlier; specifically within the four areas of focus.

Government
    •   Work with government departments that fund research to ensure that the data management
        and availability obligations for researchers on receiving funding exist and are consistent with
        ANDS objectives.
    •   Work with government departments that undertake research to ensure that they manage and
        make available this research in a way that supports ANDS objectives.


Research funding agencies
    •   Work with agencies that fund research to ensure that the data management and availability
        obligations for researchers on receiving funding exist and are consistent with ANDS
        objectives.

Institutions
    •   Engage with Universities Australia and its members, and other research-intensive
        organizations, to assist them to develop data management policies and guidelines that are
        consistent with ANDS objectives.

Legal frameworks
    •   Review available legal opinion and legislation in the research data management area and
        identify gaps in what is available to assist government departments, funding agencies,
        institutions and researchers.
    •   Commission work to fill the identified gaps.

Inter-Program Connections
While defining the space, this program will need to remain informed about practicalities encountered
in the other programs and map out an evolutionary strategy towards its longer term goals.

Services
    •   Ongoing development of suitable terms for funding arrangements between institutions that
        share data responsibilities or service solutions.
    •   Publications including data management and data reuse guidelines and best practices.
    •   Recommendations on legal and policy changes for adoption across the sector.







Utilities Program
Aims
    •   To ensure necessary technical and 24x7 operational services are provided so that repositories
        can be aggregated into federations to underpin the development of a research data commons
    •   To ensure that services develop and evolve to meet changing data reuse requirements

Rationale
The functionality of the proposed ‘data commons’ depends on the existence of specific services.
ANDS will need to ensure that suitable registration, discovery, and access services are operated by
appropriate providers to support data discovery and access across the data commons.
ANDS may also need to ensure that more advanced services developed within the researcher practice
program aimed at making data reuse more feasible can also be suitably deployed.

Implementation
This program will require the development and establishment, or procurement, of highly reliable
operational technical services to meet ANDS specification.

Focus
This program will develop the architecture and provide the utility infrastructure to facilitate a number
of domain-specific, national and global data federations. A particular focus will be services that allow
Australian data collections to be registered, discovered and accessed in common ways.

Service Beneficiaries
    •   Providers of content into the institutional repositories and the data commons.
    •   Users of content from institutional repositories and the data commons.
    •   Research communities building and operating (new or already existing) data federations.

Community Engagement
This program will engage formally with the wider community through the Data Australia Service
Providers Forum and Support Network.

Inter-Program Connections
The utilities program will implement at a technical and architectural level some of the community
consensus, policy, and vision formulated by the frameworks program, particularly related to metadata.
It will maintain and provide descriptions of interfaces, protocols, and schema used by ANDS registry,
discovery, access, and identifier utilities to the broader community, and will assist the frameworks and
repositories programs to develop the notion of a standard content environment for research data.
The utilities program will work with the researcher practice program to specify interfaces amongst
ANDS utilities, the standard content environment, and domain-specific access tools or portals.

Services
    •   Utilities Design and Development: execution of development plans and strategies that bring
        in required expertise.
    •   Utilities Delivery: Definition of service levels; implementation of services, quality control
        and usage reporting.





Repositories Program
Aims
    •   To improve and standardise capacity and capability at institutionally supported repositories.

Rationale
The development of an Australian research data commons is framed around managing access to data
stored in a network of repositories, sustained at an institutional level. Its success will depend on
Australian research data being routinely deposited into stable and sustainable data management and
preservation environments.
Therefore ANDS will need to improve, supplement, and integrate the available repositories at
institutions. A program of activity aimed at this objective is needed for three reasons.
    •   Existing institutional repositories are primarily focussed on managing a relatively small
        number and range of digital objects. The data commons will require management of a much
        larger number of larger and more widely varied objects. This may be a possibility for some
        existing repositories (with additional configuration and customisation work), but for others
        will require replacement or augmentation with software better suited to the expanded data
        requirements. The growing range of large, data-driven science activities creates a particular
        need that must be met.
    •   Existing repositories are mostly focussed on open access documents, with the demands of the
        research quality framework largely being treated as a one-off exception. The data commons
        will require a more sophisticated authorisation framework to support the different levels of
        access described in the data lens discussion earlier in this document, which will mean that
        repository implementations will need to adapt to those requirements.
    •   The ANDS utilities program will provide a range of new services that will require integration
        of existing and new repository infrastructure to meet the needs of the data commons.
The development of institutionally supported data repositories, integrated with the Australian Access
Federation, and supporting flexible role-based access control will act as a powerful stimulus to
innovation in eResearch techniques and applications.

Implementation
The repositories program will need to be constructed around a group of data management
professionals who can provide coherent and consistent advice to multiple institutions, and support the
creation and adoption of reference implementations. The nature of the advice provided, and the
effectiveness of how it is delivered, will also need to be monitored on an ongoing basis.
Success will depend on the degree of cooperation achieved with those institutions, so regional
distribution of the team is likely to be required.
As the team would be expected to operate in accordance with well-established principles and policies,
the requirement for internal team communication is not as high as in the frameworks program.

Focus
The repositories program will mostly provide guidance to research institutions about how to
implement best practice data management planning and policy in support of research communities.
The intention will be to enhance institutionally supported repositories by improving:
    •   the ability of repositories to manage and expose research data content in a sustainable manner;
    •   the capacity of data repositories to support a range of content types and community specific
        access services and analysis tools, and to participate fully in the data commons;
    •   the ability of data repositories to support cross-disciplinary collaborative research through
        appropriate meta-data support.

Service Beneficiaries
The Repositories program will target its services at data facility service providers and their systems,
and content management services and systems. It will provide training, consultancy, and targeted
technical assistance.

Community Engagement
This program will engage its stakeholders directly, through existing repository manager networks
(ARROW Community, APSR ORCA Network) and by establishing a data service providers forum.
Development of services and resources should be co-ordinated with the Digital Curation Centre
(DCC), the NLA and NAA, the UK Research Information Network, and the NSF, wherever possible.

Inter-Program Connections
This program will need to interoperate closely with the Frameworks program (to ensure that it meets
its needs), the Utilities program (to ensure close integration of repository solutions with the new
services as they become available), Researcher Practice (to ensure that the barriers to contribution are
as low as possible and that it meets the special requirements of all target disciplines), and the AAF (to
ensure that AAF attributes meet the needs for authorisation and that ANDS solutions integrate fully
with the AAF).

Services
    •   Resource Development & Training: institutional repository data management guidelines and
        training; data management environment development guidelines and training.
    •   Repository Reference System: development of a reference data management system for
        green-field sites.
    •   Repository Consultancy: Targeted advice, consultancy and technical assistance in data
        management and key capacity development; on-site and remote systems integration and
        customisation assistance.
    •   Repository Rating: Definition of service levels; implementation of a star rating scheme to
        rate data management capability and capacity to participate in Data Australia.







Researcher Practice Program
Aims
    •   To assist researchers to align their data management practices with the needs and outputs of
        the other programs.
    •   To assist communities to develop and deploy data management tools and practices that
        increase the quantity and quality of data available to the data commons.

Rationale
This program operates at the ‘researchface’ where data is generated, shared and reused.
Through the researcher practice program, ANDS will help Australian researchers, research data
managers, and institutions to develop the necessary skills, and to deploy tools, systems and services, to
create, manage, and share high quality research data.
This will improve data management practice at all stages of the data lifecycle and provide the best
possible basis for communities and data sharing services that rely on this data.

Implementation
In this program, ANDS will work directly with NCRIS and other researcher communities to develop
and provide standards, tools and services that federate data for research reuse.
While the specific tools and services required by each community will be the focus of projects defined
by NeAT, a longer-term goal is to develop trans-disciplinary policies, standards and tools that are
common across research data needs, and to understand how they are best supported by
institutional mechanisms.
Engagement with researchers will need to have reference to their institutions (who provide them with
a range of support services and who define data management policies) but will primarily focus on
enhancing data sharing and reuse across and between discipline communities.
It will also be important to coordinate activities with those undertaken by relevant national and
international support agencies.
This program is likely to be achieved best through co-investment with research communities on a
project basis.

Focus
The researcher practice program will improve data management by researchers, research communities
and their supporting institutions and service providers in their research settings (laboratories,
fieldwork, etc).
The program will give priority to data federating communities where significant improvements to
practice and the consequential quantity and quality of accessible and re-usable research data can be
achieved.

Service Beneficiaries
This program will assist researchers and research communities seeking to reuse data within and across
disciplines through practice improvements, support for search and access tools, meta-data capture
tools and work on implementations that support community agreed frameworks and standards.





Community Engagement

Researchers
    •   Embed data management professionals in targeted research groups to identify improvements
        in data management practice and workflows, and to test new approaches.
    •   Pilot guidelines and tools with selected researchers.

Research Communities
    •   Identify data management best practice.
    •   Work with discipline communities to identify their priority data management issues, and
        develop approaches to address these, consistent with ANDS goals, and in conjunction with
        NeAT.
    •   Work with discipline communities to ensure that their data management practices are
        consistent with international best practice and ANDS objectives.

Institutions
    •   Provide consulting services, based on best practice, to research data managers to assist them
        with data management policies and planning.
    •   Develop data retention and disposal guidelines, consistent with the outputs of the frameworks
        program, for implementation in institutions.
    •   Develop and promulgate best practice data management policy and plans.

Inter-Program Connections
This program will need to ensure that its developments are aligned with the outputs of the frameworks
program and assist the adoption of those outputs wherever possible.
It will also need to ensure that the tools and services developed through NeAT projects are consistent
with any requirements imposed by their planned migration to the ANDS utilities or some other service
provider.
Finally, the program may need to assist institutional data management systems to integrate with
researcher and discipline support requirements.

Services
    •   Integration projects that bring together researchers, repositories, and utilities to create new
        components of the Australian data commons.
    •   Development of improved tools and guidelines to support data management practice.
    •   Outreach documentation and consulting offerings, including discipline-specific data
        management approaches and advisories for members of disciplines.
    •   Recommendations on changes in practice, and published feedback on applicability of
        approaches.







ATTACHMENTS
Attachment A: Proposed ANDS Utilities
Prioritisation & Scheduling
The ANDS Utilities Program will prioritise some elements of the data commons reference model for
instantiation. These considerations are informed by the previously described fundamental objectives of
ANDS, the current reality of Australian higher education research, and the constraints of a real-world
budget.


2008 Service Initialisation
•   Identifier Service
•   Collections Service Registry
•   Discovery Services: federated open search portal
•   Access Services:
    •   Policy-Rights-Licensing Registry
    •   Authorisation Fabric
•   Federation Registry

2008 Service Development & Pilot; 2009 Service Initialisation
•   Discovery Service: federated authenticated search
•   Aggregated Statistics Service
•   Quality Assurance Services (Metadata Validity, Service Availability, etc.)
•   Referred to NeAT & the Repositories Program:
    •   Submission Services
    •   Re-positioning Services

2008 Watching Brief
•   Obsolescence Notification & NCRIS Format Registry
•   Service Registry (UDDI-style)
•   Visualisation Services (e.g., Australian Maps, Australian Gazetteer)
•   Merging/Fusion Services
•   Curation Services (data transformation)
•   NCRIS Metadata Schema Registry
•   NCRIS Ontology Database Registry
•   ANDS Utilities Program Service Descriptions

Identifier Service
•   Common Handles server
•   Identifier migration services
•   Identifier integration services
•   Consultancy, guidelines, best practice, etc to assist data providers in implementing PID
    infrastructure (this will be implemented via the Repositories program)
•   Administration of Handles licenses
To lay the foundations for a broader Australian persistent identifier infrastructure, ANDS will also
provide:
•   Global handles mirror
•   Australian handles proxy
•   Administration of the Australian NCRIS handles prefix (102)





Collection/ Service Registry
The ANDS Utilities Program will run a national collection/ service registry, and will be pro-active in
federating with existing and future research community registries. It will also contribute to an
embryonic international federation of research registries. The service will aim to comprehensively
cover Australian higher education research collections and will cooperate with data registries from
other domains and jurisdictions (museum, cultural, government, etc).
The ANDS registry will be a machine-readable registry of descriptions of Australian research data
collections as well as descriptions of the data access methods (services) available for each collection.
The registry will enable data producers and stewards to register their data for public use and will also
enable researchers to search and access this previously inaccessible area of the grey internet. The
registry of data access services should enable automated data workflows to be established through
machine-to-machine communication.
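As an illustration only, a machine-readable registry entry of this kind might be sketched in Python as follows. The field names (collection_id, steward, protocol, endpoint), the sample collection and the URL are assumptions made for this example; they are not taken from the ORCA schema or from any ANDS specification.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class AccessService:
    """One way of reaching a collection's data (hypothetical fields)."""
    protocol: str   # e.g. "OAI-PMH"
    endpoint: str   # URL that a client program would call


@dataclass
class CollectionRecord:
    """Description of a research data collection (hypothetical fields)."""
    collection_id: str
    title: str
    steward: str
    services: List[AccessService] = field(default_factory=list)


def to_machine_readable(record: CollectionRecord) -> str:
    """Serialise a record as JSON so that other software can consume it."""
    return json.dumps(asdict(record), sort_keys=True)


def services_for(records: List[CollectionRecord], protocol: str) -> List[str]:
    """Return the endpoints of every registered service that uses `protocol`."""
    return [s.endpoint for r in records for s in r.services if s.protocol == protocol]


# Example entry; every value here is invented for illustration.
example = CollectionRecord(
    collection_id="example-collection-1",
    title="Example coastal temperature observations",
    steward="Example University",
    services=[AccessService("OAI-PMH", "https://repository.example.edu/oai")],
)
registry = [example]
```

A workflow tool could call services_for(registry, "OAI-PMH") to find every endpoint it knows how to harvest, which is the kind of machine-to-machine use the registry is intended to enable.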
The registry infrastructure should also provide the required information for standardised
acknowledgment and citation of research data collections.
The ANDS Central Registry Service will build upon the work of the SII-funded Online Research
Collections Australia (ORCA) registry project. It will transition this experimental pilot into a
production-level piece of the Australian information infrastructure.
This transition will include: implementation of feedback from the current pilot and specific
NCRIS-domain-area needs; implementation of federated search and query protocols; schema
optimisation; an enterprise/production-level software framework; data provider modules for (NCRIS)
content repositories; documentation for content repositories; and participation in global federated
registries initiatives.
This service relies heavily upon the ANDS Repositories Program providing consultancies, support,
training and targeted intervention for content repositories. The capacity to support standardised
collection description and data provision needs to be a key component of the “standard content
environment” to be specified and promulgated by the Repositories Program.




Figure: Conceptual View of a Collections/ Service Registry



Discovery Services: federated open search portal
ANDS will build a central discovery service that over time will allow access to an increasing range of
resources. This discovery service will in principle allow individuals to find and access resources of all
types (data, documents and annotations) that are held in Australian research repositories. These
different resource types need to be included because of the need to support their interconnections
(annotations on data, links from documents to data) and cross-discovery.
Access to resources will be constrained by the attributes made available by the AAF for the users and
any access constraints on the resources themselves. Access restrictions on metadata records
(separately to resources) will be supported where appropriate. Users who choose not to login will only
see open access metadata records in result sets.
The discovery service will use a mix of metadata aggregation and federated search as appropriate,
based on the underlying resource platforms and technology decisions. At its heart will be a central
metadata store containing resource descriptions harvested from Australian research repositories. This
store will build on the metadata already being harvested for the ARROW Discovery service by
including data collection service descriptions contributed to the ANDS registry (and the data in
federated registries from other domains or jurisdictions).
The service will also provide access to other aggregations such as those supported by the National
Library of Australia, including the holdings of Australian library collections and digitised and born
digital newspapers, journals and cultural heritage materials.
Activities for development over the ANDS timeline include implementation of authenticated federated
search building on existing work in the AAF, and using established standards such as the NISO
Metasearch Initiative standards, Z39.50, and SRU/SRW.

Access Services: policy registry
The ANDS Utilities program will provide a national registry covering issues such as data access
policies, usage rights, and licensing requirements associated with data access. The registry will
provide a location for collecting the various policies in use across ANDS, as well as providing
template policies that can be adapted for discipline-specific needs.
An important element of this service will be the development and dissemination of a set of simple
policies for access to secure datasets (ie, datasets that are not publicly available as open access). These
policies will be developed in conjunction with the AAF, based on widely available AAF-specified
attributes. The goal is to encourage data providers wherever possible to select an appropriate access
policy from a small number of well-understood access policies. These policies will be machine
readable, and hence allow automated access to data based on the interaction of the AAF and secure
data repositories. Sample restrictions might include: anyone logged into the AAF; any staff member
logged into the AAF; any staff member from a designated group/Virtual Organisation in the AAF; any
pre-selected, named individual logged into the AAF; and any pre-selected, named individual logged in
to the AAF using strong authentication. The development of these policies will draw on existing work
of the ARROW, APSR and MAMS projects.
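The five sample restrictions listed above could be expressed as machine-readable policies along the following lines. This is a minimal sketch, not a proposed design: the attribute names (is_staff, groups, strong_auth) and the policy kinds are hypothetical and are not actual AAF attributes.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class User:
    """Attributes a federation might release about a user (hypothetical names)."""
    user_id: str
    is_staff: bool = False
    groups: FrozenSet[str] = frozenset()
    strong_auth: bool = False


@dataclass(frozen=True)
class Policy:
    """One of a small set of well-understood access policies (hypothetical kinds)."""
    kind: str
    group: Optional[str] = None
    named_users: FrozenSet[str] = frozenset()


def is_permitted(policy: Policy, user: Optional[User]) -> bool:
    """Decide access from user attributes; `None` means the user is not logged in."""
    if user is None:
        return False
    if policy.kind == "any_user":
        return True
    if policy.kind == "any_staff":
        return user.is_staff
    if policy.kind == "staff_in_group":
        return user.is_staff and policy.group in user.groups
    if policy.kind == "named_user":
        return user.user_id in policy.named_users
    if policy.kind == "named_user_strong_auth":
        return user.user_id in policy.named_users and user.strong_auth
    raise ValueError(f"unknown policy kind: {policy.kind}")


# Invented users for illustration.
alice = User("alice", is_staff=True, groups=frozenset({"marine-vo"}))
bob = User("bob")
```

Because each policy is plain data, a repository could select one by name and leave the decision itself to shared code, which is what allows access to be automated.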

Access Services: authorisation fabric
The ANDS Utilities program will provide tools to support flexible authorisation across the whole fabric
of ANDS technologies. These tools will be based on open standards (eg, XACML, SAML, GSI), and
provide ways of specifying authorisation via flexible machine-readable policies which can be
independent of the underlying resources or software.
Identity information for authorisation will be provided via integration with the AAF. AAF identity
information can be delivered in a number of forms, including SAML assertions via Shibboleth, short-
lived Grid certificates generated from SAML assertions, and traditional PKI. The authorisation fabric
will need to support both human-to-machine and machine-to-machine interfaces.





The authorisation fabric can be used to:
    •   control access to requests for creation of new identifiers;
    •   filter results for the authenticated federated search/discovery service (both metadata and
        resources);
    •   provide an authorisation layer over existing data repositories where secure access is required;
    •   assist with data repositioning decisions which involve access control issues around data
        location;
    •   help deal with difficult privacy requirements, including user and data anonymity and de-
        linking;
    •   assist with data collections that require regular changes to access rules (e.g., indigenous
        collections);
    •   control access to registry information where records cannot be made public;
    •   control access to data processing and abstraction services in an environment where human
        access to the underlying data is restricted.
A longer-term goal of the authorisation fabric is to provide an end-to-end identity and access
management infrastructure for secure data collections that can be based on user attributes (and hence
be privacy preserving where required) rather than only on rigid identity methods (such as X.509). The
authorisation fabric will build on the existing XACML work of the RAMP project, as well as lessons
learned from the DART and ARCHER projects.

Federation Registry
The ANDS Utilities program will include a federation registry. This is a relatively minor (but
fundamental) service to record the existence of federations of data. The ANDS data commons notion
is in reality a number of overlapping domain-specific federations.
At its most basic, this federation registry is a machine-readable list of the registries and interface
services of these other federations, and as such this functionality will be developed through an
extension of the ANDS collections registry.
This registry will evolve over time as the commons design becomes clearer, and the requirements for
cross-federation and cross-repository discovery processes are developed.

Quality Assurance Services
Improving research and repository practices requires an understanding of, and feedback on, current
practices. Ultimately this will lead to enhanced levels of trust in ANDS services and content in
the commons, and a stronger valuation of data publication.
ANDS Quality Assurance services will support automated processes to measure various concepts of
quality around the infrastructure and the content. For the infrastructure this will include service
availability of the repositories themselves and measures of performance of the access and query
services being provided. For the content this will include object and metadata access, effective
identifier resolution, adherence to schemas, simple sanity checks of metadata fields, checksums and
other machine-readable metrics.
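Two of the content checks named above, checksums and simple sanity checks of metadata fields, might be automated roughly as follows. This is a sketch under stated assumptions: the choice of SHA-256, the required field names and the date rule are all illustrative and are not drawn from any ANDS schema.

```python
import hashlib
from typing import Dict, List

# Assumed minimal set of required fields; a real schema would define its own.
REQUIRED_FIELDS = ("identifier", "title", "date")


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a lowercase hex string."""
    return hashlib.sha256(data).hexdigest()


def checksum_matches(data: bytes, recorded_checksum: str) -> bool:
    """Check that stored content still matches the checksum recorded at deposit."""
    return sha256_hex(data) == recorded_checksum.lower()


def metadata_problems(record: Dict[str, str]) -> List[str]:
    """Return a list of simple sanity-check failures for one metadata record."""
    problems = []
    for name in REQUIRED_FIELDS:
        if not record.get(name, "").strip():
            problems.append(f"missing field: {name}")
    date = record.get("date", "")
    if date and not (len(date) >= 4 and date[:4].isdigit()):
        problems.append("date does not start with a four-digit year")
    return problems
```

An empty list from metadata_problems means the record passed these checks; it says nothing about whether the description is accurate, which remains a matter for curation.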

Aggregated Statistics Service
ANDS will develop and pilot an aggregated statistics service in 2008 aiming for a production service
in 2009.
This service will allow researchers and administrators to view and report upon data usage, holdings,
and activity across the wider data commons. Standardised statistics across participating data
repositories in the data commons will allow:
    •   researchers to make more confident assertions about the relative impact of their data sets and
        collections;
    •   repository managers to benchmark against other facilities; and
    •   funding agencies to gauge the broader effect of policy and funding.
With the cooperation of NCRIS domain areas and ANDS data repository partners, this service will
propose and trial a 'benchmark set' of statistical measures and filtering approaches appropriate to
research data repositories.
The service will build upon the SII-funded DEST repository projects (APSR, ARROW) and use
familiar OAI-PMH-like harvesting approaches for statistics aggregation.
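Once per-repository figures have been harvested, the aggregation step could be as simple as the sketch below. The harvesting itself is omitted, and the repository names and measure names (downloads, datasets) are hypothetical placeholders for whatever 'benchmark set' is eventually agreed.

```python
from collections import Counter
from typing import Dict, Mapping


def aggregate(reports: Mapping[str, Mapping[str, int]]) -> Dict[str, int]:
    """Sum each statistical measure across all participating repositories."""
    totals: Counter = Counter()
    for measures in reports.values():
        for measure, value in measures.items():
            totals[measure] += value
    return dict(totals)


def share_of_total(reports: Mapping[str, Mapping[str, int]], measure: str) -> Dict[str, float]:
    """Return each repository's fraction of the commons-wide total for one measure."""
    total = sum(m.get(measure, 0) for m in reports.values())
    if total == 0:
        return {name: 0.0 for name in reports}
    return {name: m.get(measure, 0) / total for name, m in reports.items()}


# Invented figures standing in for harvested reports.
reports = {
    "repo_a": {"downloads": 30, "datasets": 2},
    "repo_b": {"downloads": 10},
}
```

The second function illustrates the benchmarking use: a repository manager can see what share of activity across the commons their facility accounts for.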
By providing standardised information across a broad range of data centres, collections, and
repositories, this service will be able to inform a number of policy, planning, and funding decisions
within the overall scholarly communications cycle.
This service relies heavily upon the ANDS Repositories Program to provide consultancies, support,
and targeted intervention for content repositories. The capacity to support standardised statistical
measures and data provision needs to be a key component of the “standard content environment” to be
specified and promulgated by the Repositories Program.







Attachment B: Indicative Budget
All figures $million
Year          ANDS  Framework  Utilities  Repositories  Practice   NeAT   Total   NCRIS   Other
2007/08       0.25       0.75       2.00          0.75      0.75   2.00    6.50    3.93    2.58
2008/09       0.50       1.00       2.50          1.50      1.50   4.00   11.00    6.45    4.55
2009/10       0.50       1.00       2.50          2.50      1.50   5.00   13.00    7.45    5.55
2010/11       0.50       1.00       2.50          3.00      1.50   3.00   11.50    6.70    4.80
TOTALS        1.75       3.75       9.50          7.75      5.25  14.00   42.00   24.53   17.48

NCRIS share   100%        70%        70%           50%       50%    50%
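The NCRIS and Other columns can be reproduced from the program figures and the percentage row beneath the table. The sketch below assumes that each percentage is the NCRIS-funded share of that program and that the published figures were rounded half up to two decimal places; both assumptions are inferred from the numbers rather than stated in the proposal.

```python
from decimal import Decimal, ROUND_HALF_UP

PROGRAMS = ("ANDS", "Framework", "Utilities", "Repositories", "Practice", "NeAT")

# Assumed meaning of the percentage row: fraction of each program funded by NCRIS.
NCRIS_SHARE = dict(zip(PROGRAMS, map(Decimal, ("1.00", "0.70", "0.70", "0.50", "0.50", "0.50"))))

# Program budgets in $million, copied from the table.
BUDGET = {
    "2007/08": ("0.25", "0.75", "2.00", "0.75", "0.75", "2.00"),
    "2008/09": ("0.50", "1.00", "2.50", "1.50", "1.50", "4.00"),
    "2009/10": ("0.50", "1.00", "2.50", "2.50", "1.50", "5.00"),
    "2010/11": ("0.50", "1.00", "2.50", "3.00", "1.50", "3.00"),
}


def split(year: str):
    """Return (total, NCRIS contribution, other contribution) for one year, unrounded."""
    amounts = [Decimal(a) for a in BUDGET[year]]
    total = sum(amounts, Decimal("0"))
    ncris = sum((a * NCRIS_SHARE[p] for p, a in zip(PROGRAMS, amounts)), Decimal("0"))
    return total, ncris, total - ncris


def cents(value: Decimal) -> Decimal:
    """Round to two decimal places, half up (assumed rounding rule)."""
    return value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


for year in BUDGET:
    total, ncris, other = split(year)
    print(year, cents(total), cents(ncris), cents(other))
```

For 2007/08, for example, the NCRIS contribution is 0.25 + 0.7 × 0.75 + 0.7 × 2.00 + 0.5 × (0.75 + 0.75 + 2.00) = 3.925, which matches the tabulated 3.93 after rounding.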

Explanations
ANDS
•   The leadership of ANDS is costed at a full time executive director and two support positions.
•   Fully funded by NCRIS.
Framework Program
•   Composed of 4-5 full time staff and some contracted support.
•   A contribution to this activity is expected from sponsors of research data reuse.
Utilities Program
•   Further work is required to fully define the services that will be created.
•   Co-development of shared utilities is expected from relevant data federating interests.
Repositories Program
•   Assistance to identified communities involved in populating the data commons with their content,
    starting at 10-12 people and rising to 20 or more.
•   Half funding is proposed.
Researcher Practice Program
•   10-12 people distributed around the country with the objective of assisting institutions to develop
    and adopt best practice in repository systems and research policy frameworks.
•   A contribution from the institutions is expected.
NeAT
•   Projects in the range of $500k per annum to $2M per annum and of durations ranging between 1
    and 3 years.
•   About 6 projects operating simultaneously at any one time.
•   NeAT to fund no more than half the cost of each project.







Attachment C: The ANDS Technical Working Group
The concept of the Australian National Data Service was proposed as a key initiative within the
investment plan for Platforms for Collaboration as part of the National Collaborative Research
Infrastructure Strategy. It had been scoped within that plan as a result of extensive consultations
across the sector.
The ANDS Technical Working Group was established by DEST to develop a more detailed scoping
and implementation proposal for ANDS following the acceptance of the ANDS investment concept by
NCRIS (in April 2007) and the convening of a sector wide representative forum to consider progress
with ANDS.
That ANDS forum, held on 2 May 2007, confirmed all major aspects of the ANDS concept and agreed
that a technical working group should be formed to develop the more detailed proposal.
Participants in the ANDS forum were:
    •   Evan Arthur, Group Manager, Innovation and Research Systems Group, Department of
        Education, Science and Training (DEST)
    •   Tony Boston, Assistant Director-General, Resource Sharing, National Library of Australia
        (NLA)
    •   Michael Briers, Chief Executive Officer, Securities Industry Research Centre of Asia-Pacific
        (SIRCA)
    •   Markus Buchhorn, Director, ICT Environments, Division of Information, Australian National
        University (ANU)
    •   Adrian Burton, Project Leader, Australian Partnership for Sustainable Repositories (APSR)
    •   John Byron, Executive Director, Australian Academy of the Humanities
    •   Ross Coleman, Director, Sydney eScholarship, University of Sydney Library
    •   James Dalziel, Director, Macquarie E-Learning Centre Of Excellence (MELCOE)
    •   Paul Davis, Executive Director, Victorian eResearch Strategic Initiative (VeRSI)
    •   Graeme Dudgeon, Manager, Information and Communication Technology, Department of
        Primary Industries (DPI) Victoria
    •   Ben Evans, Manager, APAC National Facility and Acting Head of the ANU Supercomputer
        Facility (ANUSF)
    •   Anne Fitzgerald, OAKLaw project, Queensland University of Technology (QUT)
    •   Rhys Francis, Executive Director, Australian e-Research Infrastructure Council (AeRIC)
    •   William Wright, Head, Data Management Section, Bureau of Meteorology (BoM)
    •   Andrea Grosvenor, Manager, Research and Advanced Infrastructure, Department of
        Communications, Information Technology and the Arts (DCITA)
    •   Cathrine Harboe-Ree, University Librarian, Monash University and Deputy President,
        Council of Australian University Librarians (CAUL)
    •   Sophie Holloway, Data Archive Manager, Australian Social Science Data Archive, ANU
    •   Craig Johnson, Leader, BlueNet and eMarine Information Infrastructure (eMII) Projects,
        University of Tasmania
    •   Anne-Marie Lansdown, Branch Manager, Innovation and Research Branch, DEST
    •   Carey Lonsdale, A/g Executive Director, Investment Branch, National Health & Medical
        Research Council (NHMRC)
    •   Jonathan Manton, Executive Director, Mathematics, Information and Communication
        Sciences, Australian Research Council (ARC)
    •   Steve Matheson, Head, National Statistical Service Leadership Branch, Australian Bureau of
        Statistics (ABS)
    •   Gavan McCarthy, Director, eScholarship Research Centre, The University of Melbourne
    •   Clare McLaughlin, Innovation and Research Branch, DEST
    •   Alan McMeekin, Chair, Council of Australian University Directors of Information
        Technology (CAUDIT)
    •   Deborah Mitchell, Director, ACSPRI Centre for Social Research, ANU
    •   John Morrissey, Executive Manager, Technology Planning, CSIRO
    •   Peter Nicholson, Innovation and Research Branch, DEST
    •   Linda O’Brien, Vice-Principal Information, The University of Melbourne
    •   Bernard Pailthorpe, CEO, Queensland Cyber Infrastructure Foundation (QCIF)
    •   Alex Reid, Director, E-Research & Middleware, AARNet
    •   Mick Reid, Principal, Michael Reid & Associates
    •   Andrew Rohl, CEO, iVEC – The hub of advanced computing in Western Australia
    •   Stuart Ross, Director, Solutions Development, Information Services Branch, Geoscience
        Australia
    •   Tony Rothnie, Innovation and Research Branch, DEST
    •   Nick Tate, Director, Information Technology Services and AusCERT, University of
        Queensland
    •   Andrew Treloar, Director and Chief Architect, ARCHER Project, Monash University
    •   Nigel Ward, Technical Director, Australian Advanced Distributed Learning (ADL) Initiative
        Partnership Laboratory
    •   Andrew Wilson, Assistant Director, Information Policy, National Archives of Australia
        (NAA)
    •   Tony Williams, Director, South Australian Partnership for Advanced Computing (SAPAC)
    •   Robert Woodcock, Research Group Leader, CSIRO Exploration & Mining

The members of the ANDS TWG, listed below, developed this document in response to their brief,
with assistance from the Innovation and Research Branch of DEST.
    • Markus Buchhorn, Director, ICT Environments, Division of Information, Australian National
        University (ANU)
    • Adrian Burton, Project Leader, Australian Partnership for Sustainable Repositories (APSR)
    • James Dalziel, Director, Macquarie E-Learning Centre Of Excellence (MELCOE)
    • Rhys Francis, Executive Director, Australian e-Research Infrastructure Council (AeRIC)
    • Clare McLaughlin, Director, NCRIS Platforms for Collaboration, Innovation and Research
        Branch, DEST
    • John Morrissey, Executive Manager, Technology Planning, CSIRO
    • Ray Norris, Australia Telescope National Facility (ATNF)
    • Judith Pearce, Assistant Director, Innovation, National Library of Australia
    • Wayne Richards, Principal Architect, National Data Network, Australian Bureau of Statistics
    • Nick Tate, Director, Information Technology Services and AusCERT, University of
        Queensland
    • Andrew Treloar, Director and Chief Architect, ARCHER Project, Monash University
    • Andrew Wilson, Assistant Director, Information Policy, National Archives of Australia
    • Lesley Wyborn, Group Leader, Interoperability, Information Services and Technology
        Branch, Geoscience Australia



