Managing Scientific Data
Holly Miller
MBLWHOI Library, Marine Biological
Laboratory
MBLWHOI Library
Marine Biological Laboratory
Woods Hole Oceanographic Institution
•Boston University Marine
Program
•MIT/WHOI Joint Program
•Sea Education Association
•NOAA National Marine
Fisheries Service
•United States Geological
Survey
Data
Infrastructure
• 65 Virtual Servers over 17 Physical Servers
• Combined totals:
–312GB RAM
–272 Processor cores
–196TB of storage
We’ve Got Data
What to do with it?
[Diagram labels: Unstructured Data Acquisition; Natural Language Processing; Structured Data Acquisition; Data Warehouse; XML; REST]
Data Considerations
•Accessible
•Easy to find and retrieve
•Quality
•Analysis tools
•Visualization tools
•Potential for reuse
Three
Examples:
LigerCat
Cell Image Library
WHOAS repository
LigerCat
Search PubMed Database
20 million citations, biomedical literature
Medical Subject Headings ~= Key words
What can you search for?
• Concepts ‐ Alzheimer disease, vitamin D, mitochondria
• Author names ‐ Borisy GG
• Organism names ‐ Mus musculus
• Institutions ‐ Marine Biological Laboratory
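The four kinds of search above map naturally onto fielded PubMed queries. A minimal sketch follows, assuming NCBI's E-utilities `esearch` endpoint is used; the slides do not say how LigerCat itself queries PubMed, and the helper name `build_pubmed_query` is hypothetical. The code only builds the request URL and makes no network call.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumed here; not stated in the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(term, field=None, retmax=20):
    """Return an esearch URL for `term`, optionally limited to a PubMed field."""
    query = f"{term}[{field}]" if field else term
    params = {"db": "pubmed", "term": query, "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"


# One query per search type listed on the slide.
examples = [
    build_pubmed_query("mitochondria", "MeSH Terms"),
    build_pubmed_query("Borisy GG", "Author"),
    build_pubmed_query("Mus musculus", "MeSH Terms"),
    build_pubmed_query("Marine Biological Laboratory", "Affiliation"),
]
for url in examples:
    print(url)
```

Leaving `field` unset sends the term unqualified, which lets PubMed apply its own term mapping.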
LigerCat: Literature and Genomics Research Catalogue ligercat.ubio.org
Results are displayed in a ‘tag’ cloud
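A tag cloud of this kind can be sketched as a frequency count over the MeSH terms attached to each returned article, with each term's font size scaled by its count. This is an illustration under assumptions, not LigerCat's actual code: the function name `tag_cloud`, the point sizes, and the sample term lists are all hypothetical.

```python
from collections import Counter


def tag_cloud(mesh_lists, min_pt=10, max_pt=36):
    """Map each MeSH term to a font size that grows linearly with its frequency."""
    counts = Counter(term for terms in mesh_lists for term in terms)
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid division by zero when all counts are equal
    return {
        term: round(min_pt + (n - lo) * (max_pt - min_pt) / span)
        for term, n in counts.items()
    }


# Hypothetical MeSH term lists, one list per article in a result set.
articles = [
    ["Mitochondria", "Alzheimer Disease"],
    ["Mitochondria", "Vitamin D"],
    ["Mitochondria"],
]
sizes = tag_cloud(articles)
for term, pt in sorted(sizes.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {pt}pt")
```

Linear scaling is the simplest choice; a logarithmic scale may read better when a few terms dominate the result set.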
Histogram of publication date
MeSH clouds in EOL
• 1,360,665 total species processed
• 65,630 species returned PubMed articles
• 5,544,635 articles were analyzed
Cell Image Library
• Collaboration (American Society for Cell
Biology, Harvard, and others)
• Resource for cell image data
• Metadata added by experts
• Ontology terms used for annotation
Workflow
Woods Hole Data
Repository
for Data Supporting Published
Articles
Archiving data associated with
scientific journal articles
MBLWHOI Library mblwhoilibrary.org
Scientific Article
Workshops
Woods Hole, April 2009
Paris, April 2010
• Stakeholders included scientists, data
managers and librarians
• Data must be discoverable, citeable and
available on the internet
• Resources, standards and workflows
must be defined to support the publisher
and funding agency mandates
• Action item - Library develop process to
Metadata Schema
• Dublin Core alone not enough
• Also need Darwin Core and
“Woods Hole Core”
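To illustrate why Dublin Core alone falls short, the sketch below builds one record that mixes Dublin Core elements (title, creator) with a Darwin Core term (scientificName) for the organism. It rests on assumptions: the `record` wrapper element, the function name `build_record`, and the sample values are hypothetical, and the "Woods Hole Core" fields are omitted because the slides do not define them.

```python
import xml.etree.ElementTree as ET

# Published namespace URIs for Dublin Core elements and Darwin Core terms.
DC = "http://purl.org/dc/elements/1.1/"
DWC = "http://rs.tdwg.org/dwc/terms/"
ET.register_namespace("dc", DC)
ET.register_namespace("dwc", DWC)


def build_record(title, creator, scientific_name):
    """Serialize a small metadata record combining both vocabularies."""
    record = ET.Element("record")  # hypothetical wrapper element
    ET.SubElement(record, f"{{{DC}}}title").text = title
    ET.SubElement(record, f"{{{DC}}}creator").text = creator
    # Dublin Core has no dedicated taxon field; Darwin Core supplies one.
    ET.SubElement(record, f"{{{DWC}}}scientificName").text = scientific_name
    return ET.tostring(record, encoding="unicode")


# Placeholder values for illustration only.
xml_text = build_record(
    "Example dataset supporting a published article",
    "Example, A.",
    "Mus musculus",
)
print(xml_text)
```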
Persistent Identifiers
Digital Object Identifiers
(DOIs)
• Library has an existing relationship with
CrossRef to assign DOIs
• Provides link from figure in article to
data
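The figure-to-data link can be pictured as a lookup from a figure label to the DOI of its dataset, resolved through the public doi.org resolver. This is a minimal sketch under assumptions: the helper name `doi_url`, the figure labels, and the DOIs are placeholders and do not refer to real datasets.

```python
from urllib.parse import quote


def doi_url(doi):
    """Turn a DOI such as '10.5555/abc' into a resolvable https://doi.org/ link."""
    prefix, sep, suffix = doi.partition("/")
    if not (prefix.startswith("10.") and sep and suffix):
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode the suffix so unusual characters survive in a URL.
    return f"https://doi.org/{prefix}/{quote(suffix, safe='/')}"


# Hypothetical mapping from figures in an article to dataset DOIs.
figure_data = {
    "Figure 1": "10.5555/example.data.1",
    "Figure 2": "10.5555/example.data.2",
}
for figure, doi in figure_data.items():
    print(f"{figure} -> {doi_url(doi)}")
```

Because the link goes through the resolver rather than a repository address, it stays valid if the data files move.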
Cultural Shift
• Early discussions often ended with
"my data is too complicated," "the files are
too large," etc.
• Growing recognition that
transparency is important, which
means making data available
• Just do it!
Ongoing Challenges
•Researcher concerns regarding reuse and
misuse of data
•Proprietary file types
•Responsibility for quality control of data
•Additional work for authors
Summary
• Scientific data is heterogeneous
• Data management is complicated
Data Considerations
•Accessible
•Easy to find and retrieve
•Quality
•Analysis tools
•Visualization tools
•Potential for reuse
Keys for Success
• Sustainability ‐ Always in
development
• Infrastructure ‐ Strong foundation
• User Experience ‐ Easy and
beautiful
Lessons Learned
• Modular, reusable architectures
• Robust, flexible infrastructures
• Data standards compliance
• Structured processes
• Clear communication
Thank you!
Lisa Raymond
Ann Devenish
Ryan Schenk
John Hufnagle
Anthony Goddard
Funding:
George F. Jewett Foundation
Ellison Medical Foundation
NIH
NLM