DC3A: Melbourne Neuropsychiatry Centre (MNC)
Bioinformatics Development Project
1 Project Purpose and Activities
The MNC has one of the largest databases of brain scans and associated neuropsychiatric research data
in the world. It has national and international collaborators using and contributing to the database.
This project will:
Build workflow for automating documentation of dataset segments used in individual studies
and publications. This will include researchers, datasets, associated projects and publications.
Build workflow for automating creation of citable persistent identifiers for unique studies and
linking with publications.
Build software to automate capture of public facing metadata to University of Melbourne
Registry which will deliver collections metadata to the ARDC.
MNC has 270+ publications resulting from datasets stored in the MNC database. Completing the
work above will result in approximately 100 dataset descriptions in the ARDC by June 2011, with an
expected 25+ being registered each year after that.
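As a rough illustration of the first workflow item above, the documentation for one study could be held as a single record linking the dataset segment to its researchers, project and publications. The sketch below is in Python; the class name, field names, sample values and the "citable" rule are all hypothetical and are not taken from the MNC database.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StudyRecord:
    """Hypothetical record documenting the dataset segment used by one study."""
    study_key: str                 # local key; not a real MNC identifier
    dataset_segment: str           # free-text description of the scans used
    project: str
    researchers: List[str] = field(default_factory=list)
    publications: List[str] = field(default_factory=list)

    def is_citable(self) -> bool:
        # Assumption: a study is ready for a persistent identifier once it
        # names at least one researcher and at least one publication.
        return bool(self.researchers) and bool(self.publications)


record = StudyRecord(
    study_key="study-001",
    dataset_segment="placeholder segment description",
    project="placeholder project",
    researchers=["A. Researcher"],
)
record.publications.append("placeholder citation")
```

Keeping the four linked elements in one record is only one possible design; it makes it straightforward to decide when a study has enough information to be given a citable identifier.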
2 Deliverables
Deliverable
D1 Project plan agreed by ANDS.
D2 Five sample Collection records in ARDC, with associated Party and Activity records,
of agreed standard.
D3 High level design documents:
a. Metadata mappings between each pair of metadata formats where
mappings are required, including to RIF-CS.
b. Process descriptions for capturing MNC dataset metadata and storing it in
VITRO registry.
c. Process descriptions of integration between pre-print register, dataset
metadata and IP information.
d. Design document of overall system.
D4 Deployed system that:
a. Extracts metadata from datasets in the MNC database.
b. Automatically enriches dataset with pre-print register metadata, and
copyright/IP metadata, by connecting with those systems.
c. Allows extracted metadata to be enriched by user input.
d. Generates RIF-CS Collection, Party and Activity records from metadata.
e. Allows authorised external users to access datasets from MNC database,
using a query builder.
f. Stores datasets in VITRO registry.
g. Allows users to develop advanced queries to find datasets.
h. Deposits RIF-CS metadata in VITRO registry for ARDC harvest, including
Service descriptions.
i. Automatically assigns persistent identifiers to datasets where required.
D5 Collection descriptions for 100 datasets, with associated Party, Service and Activity
descriptions produced by deployed system and made visible to ARDC.
a. As many descriptions as possible should contain links to, or access
information for, immediately shareable data.
D6 Source code for all developed software, with developer’s manuals (to facilitate
reuse) deposited in agreed open-source repository.
D7 Deployed, permanent, operational feed of Collection, Party, Service and Activity
description records to ARDC, with output of agreed quality.
DC3B: Longitudinal qualitative and quantitative
survey data capture and re-use, Youth Research
Centre
1 Project Purpose and Activities
The Youth Research Centre’s Life Patterns Research Program maintains an extensive qualitative and
quantitative database on a cohort of 2000 young Australians who left secondary school in 1991 and on a
second cohort of 3000 who left school in 2005. With ARC funding through to 2014 for annual
quantitative and qualitative data capture for the second cohort (Gen Y) and biannual data capture for
the first cohort (Gen X), this activity aims to enable wider access and use of the data by developing the
infrastructure to:
a) make sets of the existing data available for re-use,
b) streamline capture of new data so that it is more readily available for re-use, and
c) build the capacity to efficiently respond to future requests for derived data sets.
Appropriate structures for the capture of relevant metadata (compliant with DDI2, DDI3 and RIF-CS
schemas) and tools to extract this metadata from workflows will be developed.
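Aim (c) above, responding to requests for derived data sets, can be pictured as selecting one cohort and a requested set of variables while leaving identifying columns behind. The following is a minimal sketch in Python under stated assumptions: the column names, the sample rows and the `derive_dataset` helper are hypothetical and do not describe the Life Patterns data.

```python
import csv
import io
from typing import Iterable, List


def derive_dataset(source_csv: str, keep: Iterable[str], cohort: str) -> List[dict]:
    """Return rows for one cohort, restricted to the requested variables."""
    keep = list(keep)
    reader = csv.DictReader(io.StringIO(source_csv))
    return [
        {name: row[name] for name in keep}
        for row in reader
        if row["cohort"] == cohort
    ]


# Hypothetical survey extract; column names are illustrative only.
SAMPLE = """respondent_id,cohort,year,employment_status
r1,1,2009,full-time
r2,2,2009,studying
r3,2,2009,part-time
"""

# Only the requested variables are copied, so respondent_id is not released.
subset = derive_dataset(SAMPLE, keep=["year", "employment_status"], cohort="2")
```

Copying only an explicit list of variables, rather than deleting known identifiers, means a newly added sensitive column is excluded by default.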
2 Deliverables
Deliverable
D1 Project plan agreed by ANDS, in ANDS standard template format.
D2 Statement of ethical issues policy, indicating how future datasets will be released.
D3 Five sample collection descriptions (with associated activity, service and party
descriptions) in ARDC, including one for the Life Patterns Research Program project.
D4 High level design descriptions:
a. Process descriptions for deriving, describing and publishing re-usable data
sets
b. High level software design document, showing data flows and links
between components.
D5 Deployed system that:
a. Automates the deriving and publishing of re-usable data sets
b. Automatically captures and extracts metadata in quantitative and
qualitative data capture workflows
c. Allows extracted metadata to be enriched by human input.
d. Allows authorised external users to access datasets.
e. Automatically assigns persistent identifiers where required.
D6 Twenty Collection descriptions, with linked Party, Service and Activity descriptions,
produced by deployed system and available for harvest by ARDC. These describe
linked, comparative longitudinal case studies on young people’s life trajectories
from each of cohort 1 and cohort 2, illustrative of the underlying quantitative
and qualitative data.
a. As many descriptions as possible should contain links to, or access
information for, immediately shareable data.
D7 Source code for all developed software deposited in agreed open-source repository,
with developer’s manuals to facilitate reuse.
D8 Deployed, permanent, operational feed of collection, party and activity records to
ARDC, with output of agreed quality.
DC3C: Optimising Metadata Capture, Data Sharing
Procedures and Long-term Re-use of Video data in
the Social Sciences
3 Project Purpose and Activities
The University of Melbourne has an especially rich humanities and social science research community
that utilises video as its primary form of data capture. The increasing use of video as a research tool
poses particular challenges for aggregated data storage initiatives. This project will integrate metadata
capture facilities at selected sites within the University of Melbourne as part of facilitating sharing and
re-use. The project will address current metadata issues associated with large-scale audio-visual
repositories and workflows to enable efficient generation of metadata, ensuring that stored video data
is accessible and searchable through the ARDC. The project will:
Develop software to automate the capture of metadata from existing mature video storage
systems developed by the ICCR (International Centre for Classroom Research),
Develop and, where possible, utilise existing infrastructure to identify generic workflow tools
that will enable rich knowledge of data sets, access services and parties to the research to be
systematically (RIF-CS) captured from the researchers,
Develop standards-compliant video data and metadata deposit services.
These are generic goals which are broadly applicable to activities elsewhere within the university, for
example in the Faculty of Architecture, Building and Planning and the Faculty of the VCA and Music.
4 Deliverables
Deliverable
D1 Project plan agreed by ANDS, in ANDS standard template format.
D2 High level design documents:
a. Business process description and high level design for publication of video
dataset descriptions, including case-specific protocols and ethics
considerations for data access locally, nationally and internationally.
b. Metadata schema for video dataset descriptions
c. Mapping of schema to RIF-CS
d. Web services design documents and validation against existing research
projects.
D3 Five sample Collection descriptions, with linked Activity, Party and Service
descriptions, representing the selected active video intensive projects across the
University in ARDC.
D4 Deployed system to be used by research staff and data librarians that:
a. Allows video data to be deposited
b. Automatically extracts metadata from video data
c. Allows extracted metadata to be enriched by user input
d. Generates RIF-CS Collection, Party and Activity descriptions from metadata.
e. Ingests metadata into the University of Melbourne VITRO registry.
f. Automatically assigns persistent identifiers where required.
D5 Operational automatic feeds of Collection descriptions and associated Party and
Activity information to ARDC
D6 Agreed number (to be specified in project plan) of Collection descriptions, with
associated Party, Service and Activity descriptions, produced by deployed system
and available to ARDC.
a. As many descriptions as possible should contain links to, or access
information for, immediately shareable data.
D7 Deposit of all developed software in agreed open source repository, accompanied
by developer manuals to facilitate reuse.
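Deliverable D4 above pairs automatic metadata extraction with enrichment by user input. One way to combine the two, shown here as a sketch in Python rather than as the project's design, is to overlay user-supplied values on the extracted ones without letting blank input erase anything. The `enrich` function and all field names and values are hypothetical.

```python
from typing import Dict, Optional

Metadata = Dict[str, Optional[str]]


def enrich(extracted: Metadata, user_input: Metadata) -> Metadata:
    """Overlay user-supplied values on automatically extracted metadata.

    Empty or missing user values never erase an extracted value.
    """
    merged = dict(extracted)
    for key, value in user_input.items():
        if value not in (None, ""):
            merged[key] = value
    return merged


# Hypothetical field names and values, for illustration only.
extracted = {"duration_seconds": "3120", "format": "video/mp4", "title": None}
user_input = {"title": "Placeholder lesson recording", "format": ""}
merged = enrich(extracted, user_input)
```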
DC3D: Human and mouse neuroimaging collections
in the national data commons
5 Project Purpose and Activities
DaRIS is a raw data management system based on the Mediaflux digital asset management platform and
has been in operation for the last 3 years at the Neuroimaging Computational and Data Management
Facility (CDMF). There it has been used to routinely receive MR images from researchers and organise
them into a subject-centric data model, ready for access by project members. It hosts over 70 mouse
and human projects, each with many tens of subjects and some with time-dependent data.
The project will:
Map DaRIS project metadata to the ANDS schema,
Write a DaRIS service to populate ANDS-compliant metadata,
Develop an adapter to harvest the ANDS-compliant metadata from DaRIS,
Connect identifiers within DaRIS to ANDS persistent identifiers (PIDs).
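The first two activities above, mapping project metadata and populating ANDS-compliant metadata, might look roughly like the following Python sketch. It assumes a project is available as a plain dictionary; the dictionary keys, the `project_to_rifcs` function, the group name and the source URL are hypothetical and are not DaRIS or Mediaflux APIs. The output is a minimal RIF-CS collection record and has not been validated against the RIF-CS schema.

```python
import xml.etree.ElementTree as ET

RIFCS_NS = "http://ands.org.au/standards/rif-cs/registryObjects"


def project_to_rifcs(project: dict, group: str, source: str) -> str:
    """Build a minimal RIF-CS collection record from a project dictionary."""
    ET.register_namespace("", RIFCS_NS)  # serialise with a default namespace

    def q(tag: str) -> str:
        return f"{{{RIFCS_NS}}}{tag}"

    root = ET.Element(q("registryObjects"))
    obj = ET.SubElement(root, q("registryObject"), {"group": group})
    ET.SubElement(obj, q("key")).text = project["id"]
    ET.SubElement(obj, q("originatingSource")).text = source
    coll = ET.SubElement(obj, q("collection"), {"type": "dataset"})
    name = ET.SubElement(coll, q("name"), {"type": "primary"})
    ET.SubElement(name, q("namePart")).text = project["name"]
    desc = ET.SubElement(coll, q("description"), {"type": "brief"})
    desc.text = project["description"]
    return ET.tostring(root, encoding="unicode")


# Hypothetical project metadata; the keys are assumptions for this sketch.
xml_text = project_to_rifcs(
    {
        "id": "example-project-1",
        "name": "Placeholder imaging project",
        "description": "Placeholder description of the collection.",
    },
    group="Example Group",
    source="https://example.org/daris",
)
```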
Relationship to the National Imaging Facility (NIF) ANDS proposal
The DaRIS system has been selected by the NCRIS NIF to provide its data management capability. The
NIF will manage collections of data from a range of domains which are primarily but not only
neuroimaging (e.g. plant imaging, microscopy, etc.). This is possible because the general DaRIS
framework can be tailored to a number of domains. Nonetheless, each domain requires different
metadata definition design, data capture protocols and workflows; therefore the metadata capture
process is inherently different for each domain.
The University of Melbourne ANDS proposal focuses on neuroimaging metadata exposure (with
collections managed by DaRIS held at UoM) whereas the separate NIF ANDS proposal focuses on
operationalising the DaRIS system to multiple nodes of the NIF as well as exposing NIF collections with
DaRIS. There is thus no dependency between these two proposals.
6 Deliverables
Deliverable
D1 Project plan agreed by ANDS, in ANDS standard template format.
D2 Mapping of DaRIS project metadata to RIF-CS.
D3 High level design of:
a. DaRIS RIF-CS generation service.
b. DaRIS-ARDC OAI-PMH feed.
c. Integration with ANDS Persistent Identifier Service.
d. Automated extraction of metadata from datasets.
D4 Ethics policy for re-use of data collections. (If suitable number of collections cannot
be shared with other researchers on an ongoing basis, project cannot proceed.)
D5 Five sample Collection descriptions, accompanied by Party, Service and Activity
descriptions, representing a range of different dataset types, in ARDC.
D6 Deployed, tested, documented system that:
a. Extracts metadata from datasets.
b. Allows users to enhance metadata for datasets.
c. Generates ANDS-compliant RIF-CS from datasets.
d. Exposes RIF-CS as OAI-PMH feed, with controls to prevent harvesting of
non-shareable collections.
e. Provides direct download access to datasets with appropriate
authentication and authorisation controls.
f. Automatically assigns persistent identifiers where needed.
D7 Collection descriptions for 100 datasets, including Service, Activity and Party
description links, produced by deployed system and available for harvest by ARDC.
a. As many descriptions as possible should contain links to, or access
information for, immediately shareable data.
D8 Deposit of all developed software in agreed open source repository, accompanied
by developer manuals to facilitate reuse.
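Deliverable D6(d) above calls for controls that prevent non-shareable collections from being harvested. A minimal sketch of such a control, in Python, is a default-deny filter applied before any record header is written to the feed. The record dictionaries, the `shareable` flag and the function names are assumptions for illustration; the XML produced is only a `ListIdentifiers` fragment, not a complete OAI-PMH response.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List

OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def harvestable(records: Iterable[dict]) -> List[dict]:
    """Keep only records explicitly flagged as shareable (default deny)."""
    return [r for r in records if r.get("shareable") is True]


def list_identifiers(records: Iterable[dict]) -> str:
    """Serialise headers of harvestable records as a ListIdentifiers fragment."""
    ET.register_namespace("", OAI_NS)
    root = ET.Element(f"{{{OAI_NS}}}ListIdentifiers")
    for rec in harvestable(records):
        header = ET.SubElement(root, f"{{{OAI_NS}}}header")
        ET.SubElement(header, f"{{{OAI_NS}}}identifier").text = rec["identifier"]
        ET.SubElement(header, f"{{{OAI_NS}}}datestamp").text = rec["datestamp"]
    return ET.tostring(root, encoding="unicode")


# Hypothetical records; the third has no flag and is therefore withheld.
RECORDS = [
    {"identifier": "oai:example.org:coll-1", "datestamp": "2011-01-01", "shareable": True},
    {"identifier": "oai:example.org:coll-2", "datestamp": "2011-01-02", "shareable": False},
    {"identifier": "oai:example.org:coll-3", "datestamp": "2011-01-03"},
]
fragment = list_identifiers(RECORDS)
```

Treating a missing flag as "not shareable" means a collection is exposed only after someone has made an explicit decision about it.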
DC3E: Humanities and Social Science Data at the
University of Melbourne
7 Project Purpose and Activities
The University of Melbourne has one of the richest and most diverse humanities and social science (HASS)
research communities in Australia and is well ranked internationally. HASS researchers at Melbourne
generate and hold valuable data sets and associated materials that are currently not easily discoverable,
accessible or configured for further research purposes. This project will build infrastructure (tools and
services) to connect this diverse community with the UoM Registry (Vitro) which will in turn
communicate the relevant metadata to the ARDC. The project will:
Develop and utilise existing (OHRM-based) infrastructure to identify generic workflow tools that
will enable rich knowledge of data sets and related materials, access services and parties to the
research to be systematically (RIF-CS) captured from the researchers.
Develop a generic web services-based data capture tool to be used by research staff, data
librarians and other staff in the data management fabric. This will be based on the ‘pre-register’
work done for the Australian Women’s Register in 2009.
Develop standards-compliant ‘access service’ descriptions.
Ensure project, data, party and service descriptions conform to Data Documentation Initiative
(v2 and v3) requirements.
Inform the development and utilisation of digital and analogue archival preservation,
curation and access systems for the University.
8 Deliverables
Deliverable
D1 Project plan agreed by ANDS, in ANDS standard template format.
D2 Sample descriptions in ARDC as follows:
a. Collection, Activity, Party and Service descriptions representing five
selected active HASS projects across the University in ARDC.
b. Five sample Collection descriptions, with associated Party, Service and
Activity records, drawn from those projects made available in the ARDC.
D3 Mapping of one (or more, if applicable) dataset formats held in OHRM, to RIF-CS.
D4 Design document for web services to be built.
D5 Design document for a generic web service-based data entry, ingest and metadata
management tool (henceforth “web data capture tool”) to be used by
research staff, data librarians and other staff.
D6 Deployed, tested, documented system that:
a. Allows data related to humanities projects to be input, managed, browsed
and searched.
b. Provides a RIF-CS feed into the University of Melbourne’s VITRO registry.
c. Is integrated with data from a number of existing OHRM databases, to be
specified in the project plan.
d. Can be controlled through web services, including the bulk retrieval of data.
D7 Agreed number of Collection descriptions (as determined and specified in project
plan), with associated Service, Party and Activity descriptions, produced by
deployed system and available for harvest by ARDC.
a. As many descriptions as possible should contain links to, or access
information for, immediately shareable data.
D8 Deposit of all developed software in agreed open source repository, accompanied
by developer manuals to facilitate reuse.
D9 All descriptions available for harvest from VITRO.
DC3F: Capturing multi-modal data to support
research in cardiovascular and neurological
medicine
9 Project Purpose and Activities
Complex physiological data is routinely collected on patients as part of clinical care (echocardiography,
intravascular ultrasound, x-ray angiography, optical computerised tomography, patient clinical data,
etc.). However, this rich multi-modal data is not usually subjected to subsequent analysis, nor is it made
available to researchers from other disciplines for novel analysis. Making this multi-modal data available
along with patient outcomes such as morbidities will provide the opportunity for collaborative groups to
employ novel strategies to develop assessments and models based on this data. This project will form
the necessary basis for making multi-modal data collections available, enabling the establishment of new
links between biomedical research groups in engineering, physics and bioinformatics. This project will
occur in collaboration with BioGrid Australia, where it will use the access, de-identification and privacy
protection protocols already established there. The major activities will be:
Map BioGrid metadata to the ANDS schema,
Write a service to populate ANDS-compliant metadata,
Develop a service to harvest ANDS-compliant metadata from multiple BioGrid data sets which
form a single study,
Enable the assignment of globally unique identifiers that link to the source of multi-modal
datasets.
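The last two activities above, gathering the data sets that form a single study and giving each study a globally unique identifier, can be sketched as follows in Python. This is an illustration under stated assumptions, not the BioGrid design: the record layout, the `study` key, the namespace URL and both function names are hypothetical, and name-based UUIDs stand in for whichever persistent identifier service the project adopts.

```python
import uuid
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical namespace for minted identifiers (an assumption of this sketch).
STUDY_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/studies/")


def group_by_study(datasets: Iterable[dict]) -> Dict[str, List[dict]]:
    """Collect dataset records under the study they belong to."""
    grouped = defaultdict(list)
    for ds in datasets:
        grouped[ds["study"]].append(ds)
    return dict(grouped)


def study_identifier(study: str) -> str:
    """Deterministic UUID: the same study name always yields the same value."""
    return str(uuid.uuid5(STUDY_NAMESPACE, study))


# Hypothetical dataset records covering two modalities of one study.
DATASETS = [
    {"study": "study-a", "modality": "echocardiography"},
    {"study": "study-a", "modality": "x-ray angiography"},
    {"study": "study-b", "modality": "intravascular ultrasound"},
]
studies = group_by_study(DATASETS)
identifiers = {name: study_identifier(name) for name in studies}
```

A name-based identifier is repeatable across harvest runs, which avoids minting a second identifier for a study that has already been described.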
10 Deliverables
Deliverable
D1 Project plan agreed by ANDS, in ANDS standard template format.
D2 Ethics policy quantifying which datasets will be shareable and under which
conditions, and which will never be shareable.
D3 High level design documents:
a. Mapping of BioGrid datasets to RIF-CS.
b. Design of service to generate RIF-CS.
c. Process descriptions, including ethics approvals, ETL processes,
deidentification and metadata annotation.
D4 Five sample Collection descriptions, with associated Service, Party and Activity
descriptions, in ARDC.
D5 Deployed, tested, documented system that:
a. Extracts metadata from datasets in BioGrid.
b. Allocates persistent identifiers where needed.
c. Allows that metadata to be enriched by user input.
d. Provides access for authorised external users to the datasets.
e. Provides an OAI-PMH feed of ANDS-compliant RIF-CS.
f. Allows datasets of different levels of shareability to be managed.
D6 Collection descriptions, with associated Service, Party and Activity descriptions, for
ten multi-modal datasets, all produced by deployed system and available
for harvest by ARDC.
a. As many descriptions as possible should contain links to, or access
information for, immediately shareable data.
D7 Deposit of all developed software in agreed open source repository, accompanied
by developer manuals to facilitate reuse.
DC3G: Founders and Survivors Project
11 Project Purpose and Activities
The Founders and Survivors Project (http://www.foundersandsurvivors.org/) has brought together a
number of research data sets created from records relating to the 73,000 convicts transported to
Tasmania in the 19th century and their descendants to create a population database of national and
international significance for historical, demographic and population health researchers.
This project will:
Develop a toolkit based around the project’s XML/TEI workflow for further relevant record sets
to be systematically ingested into the population database,
Build the infrastructure to enable persistent identification and descriptions of derived data sets
produced on request from the population database to be made available to the ARDC.
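As an illustration of the first item, a toolkit step built around an XML/TEI workflow could read person entries from a TEI file and turn them into rows ready for ingestion. The Python sketch below assumes the records use the standard TEI `person` and `persName` elements with `xml:id` attributes; the sample document, the names in it and the `extract_people` function are hypothetical and do not reflect the project's actual encoding.

```python
import xml.etree.ElementTree as ET
from typing import List

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical TEI fragment with placeholder names.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <listPerson>
      <person xml:id="p1"><persName>Example Person One</persName></person>
      <person xml:id="p2"><persName>Example Person Two</persName></person>
    </listPerson>
  </body></text>
</TEI>"""


def extract_people(tei_xml: str) -> List[dict]:
    """Return one row per TEI person element, ready for database ingestion."""
    root = ET.fromstring(tei_xml)
    people = []
    for person in root.iterfind(".//tei:person", TEI_NS):
        name = person.find("tei:persName", TEI_NS)
        people.append({
            "id": person.get(XML_ID),
            "name": name.text if name is not None else None,
        })
    return people


people = extract_people(SAMPLE_TEI)
```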
12 Deliverables
Deliverable
D1 Project plan agreed by ANDS, in ANDS standard template format.
D2 High level design documents:
a. Description of automated processes for deriving, describing and publishing
ingest data sets.
b. Description of automated processes for deriving, describing and publishing
derived data sets.
c. Mapping of data set metadata to RIF-CS.
D3 Five sample Collection descriptions, with associated Party and Activity descriptions,
representing both ingest and derived datasets in ARDC.
D4 Deployed system that:
a. Generates RIF-CS Collection, Party and Activity descriptions from derived
and ingest datasets.
b. Includes an extraction and ingestion toolkit for researchers to incorporate in
their data collection workflows to facilitate the production of ingest data
sets for the population database.
D5 User documentation for all types of users.
D6 Collection, party and activity descriptions for 20 ingest data sets that meet ANDS
requirements, produced by the deployed system, and available for harvest by the
ARDC.
a. As many descriptions as possible should contain links to, or access
information for, immediately shareable data.
D7 Source code for all developed software published to open source repository, with
developer documentation to facilitate reuse.