How to Build an Ontology
Barry Smith http://ontology.buffalo.edu/smith
1
Ontology
A classification of entities and the relations between them.
An ontology is a list of types structured by relations.
It is defined by a scientific field's vocabulary and by the canonical formulations of its theories.
Scientific theories consist of generalizations.
What I will not be talking about: XML, OWL, ..., data(types), information models, file formats ...
2
Top-Level
GO OBO, OBO Core NCBO FMA NCBC Roadmap Centers NCI EVS NECTAR (National Electronic Clinical Trials and Research) Network
3
Instances are not included in an ontology
It is the generalizations that are important
(but instances must still be taken into account)
4
A | B | C
515287 | 521683 | 521682
DC3300 Dust Collector Fan | Gilmer Belt | Motor Drive Belt
5
Ontology
Types
Instances
6
Ontology = A Representation of Types
7
Each node of an ontology consists of:
• preferred term (aka term)
• term identifier (TUI, aka CUI)
• synonyms
• definition, glosses, comments
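The node structure listed above can be sketched as a small record type. This is only an illustrative sketch: the class name `Node` and its field names are assumptions, not part of any ontology format, and the example values are taken from the osteoblast entry quoted later in these slides.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of an ontology: a representation of a single type."""
    identifier: str        # term identifier
    preferred_term: str    # preferred term
    synonyms: list[str] = field(default_factory=list)
    definition: str = ""   # textual definition
    comments: list[str] = field(default_factory=list)  # glosses, comments


# Hypothetical usage with values from the osteoblast entry.
osteoblast = Node(
    identifier="CL:0000062",
    preferred_term="osteoblast",
    definition="A bone-forming cell which secretes an extracellular matrix.",
)
```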
Ontology = A Representation of Types
8
Nodes in an ontology are connected by relations, primarily is_a (= is subtype of) and part_of.
These relations are designed to support search, reasoning and annotation.
Ontology = A Representation of Types
9
Rules for formatting terms
• Terms are names of types: if you prefix a term with "the type ___", the term should still make sense
• Hence: terms should be in the singular
• Terms should be lower case
• Avoid abbreviations even when it is clear in context what they mean ('breast' for 'breast tumor')
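The formatting rules above lend themselves to a simple automated check. The following is a rough sketch under stated assumptions: `check_term` is a hypothetical helper, and the plural and abbreviation tests are crude heuristics that will produce false positives and misses, so a curator would still need to review the results.

```python
import re


def check_term(term: str) -> list[str]:
    """Return a list of possible violations of the term formatting rules."""
    problems = []
    # Rule: terms should be lower case.
    if term != term.lower():
        problems.append("not lower case")
    # Heuristic: a run of two or more capitals is probably an abbreviation.
    if re.search(r"\b[A-Z]{2,}\b", term):
        problems.append("possible abbreviation")
    # Heuristic: a final word ending in 's' (but not 'ss', 'us', 'is')
    # is probably a plural.
    words = term.lower().split()
    last = words[-1] if words else ""
    if last.endswith("s") and not last.endswith(("ss", "us", "is")):
        problems.append("possibly plural")
    return problems
```

For example, `check_term("breast tumor")` reports nothing, while `check_term("Breast Tumors")` flags both the capitalization and the likely plural.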
10
Motivation: to capture reality
Inferences and decisions we make are based upon what we know of reality. An ontology is a computable representation of this underlying bio(techno)logical reality. Enables a computer to reason over the data in (some of) the ways that we do.
11
Biomedical ontology integration / interoperability
Will never be achieved through integration of meanings or concepts.
The problem is precisely that different user communities use different concepts.
What's really needed is to have well-defined, commonly used relationships.
12
Concepts
Biomedical ontology integration will never be achieved through integration of meanings or concepts
The problem is precisely that different user communities use different concepts
13
Concepts
Concepts are in your head and will change as our understanding changes.
Ontologies represent types: not concepts, meanings, ideas ...
Types exist, with their instances, in objective reality – including types of experimental process, design, method, ...
14
Most ontologies are execrable. But some good ontologies do already exist
• as far as possible don't reinvent
• use the power of combination and collaboration
• ontologies are like telephones: they are valuable only to the degree that they are used and networked with other ontologies
15
Why do we need rules/standards for good ontology?
Ontologies must be intelligible both to humans (for annotation) and to machines (for reasoning and error-checking):
Unintuitive rules for classification lead to errors.
Intuitive rules facilitate training of curators and annotators.
Common rules allow alignment with other ontologies.
Logically coherent rules enhance harvesting of content through automatic reasoning systems.
16
Rules on types
Don't confuse types with concepts
Don't confuse types with ways of getting to know types
Don't confuse types with ways of talking about types
Don't confuse types with data about types
17
First Rule: Univocity
Terms (including those describing relations) should have the same meanings on every occasion of use. In other words, they should refer to the same types in reality
18
Second Rule: Positivity
There are no negative types.
Terms such as 'non-mammal' or 'non-membrane' do not designate genuine types.
(There are also no conjunctive and disjunctive types: rabbit and nailfile; rabbit or nosewipe)
19
Third Rule: Objectivity
Which types exist is not a function of our biological knowledge.
Terms such as 'unknown' or 'unclassified' or 'unlocalized' do not designate biological natural kinds.
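The positivity and objectivity rules can be screened for mechanically. The sketch below is an assumption-laden illustration: `rule_violations` is a hypothetical helper, the word list is limited to the three examples given on the slide, and the 'non' prefix test is a heuristic that will also flag harmless words.

```python
import re

# Words named on the slide as reflecting our state of knowledge, not reality.
EPISTEMIC_WORDS = {"unknown", "unclassified", "unlocalized"}


def rule_violations(term: str) -> list[str]:
    """Flag terms that look negative, conjunctive/disjunctive, or epistemic."""
    words = term.lower().split()
    violations = []
    # Heuristic: any word starting with 'non' or 'non-' is treated as negative.
    if any(re.match(r"non-?\w", w) for w in words):
        violations.append("positivity: negative term")
    if "and" in words or "or" in words:
        violations.append("positivity: conjunctive or disjunctive term")
    if EPISTEMIC_WORDS & set(words):
        violations.append("objectivity: term reflects state of knowledge")
    return violations
```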
20
Fourth Rule: Single Inheritance
No type in a classificatory hierarchy should have more than one is_a parent on the immediate higher level
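A check for this rule is straightforward once is_a links are available as (child, parent) pairs. The following is a minimal sketch; the function name `multiple_parents` and the edge representation are assumptions made for illustration, and the example edges use placeholder letters rather than real terms.

```python
from collections import defaultdict


def multiple_parents(is_a_edges):
    """Return every type that has more than one is_a parent."""
    parents = defaultdict(set)
    for child, parent in is_a_edges:
        parents[child].add(parent)
    return {child: sorted(ps) for child, ps in parents.items() if len(ps) > 1}


# A diamond: A has two parents, B and C, which share the parent D.
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
violations = multiple_parents(edges)
```

Here `violations` contains only `A`, with parents `B` and `C`.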
21
Rule of Single Inheritance
no diamonds:
[Diagram: A with two is_a parents, B (via is_a1) and C (via is_a2)]
22
Problems with multiple inheritance
[Diagram: A is_a1 B and A is_a2 C]
'is_a' no longer univocal
23
'is_a' is pressed into service to mean a variety of different things
Shortfalls from single inheritance are often clues to incorrect entry of terms and relations.
The resulting ambiguities make the rules for correct entry difficult to communicate to human curators.
24
is_a Overloading
Serves as an obstacle to integration with neighboring ontologies.
The success of ontology alignment demands that ontological relations (is_a, part_of, ...) have the same meanings in the different ontologies to be aligned.
25
To the degree that the above rules are not satisfied, error checking and ontology alignment will be achievable, at best, only with human intervention and via force majeure
26
Current Best Practice:
The Foundational Model of Anatomy
27
[Figure: fragment of the FMA hierarchy. Node labels: Anatomical Space, Anatomical Structure, Organ Cavity Subdivision, Organ Cavity, Organ, Organ Part, Serous Sac Cavity Subdivision, Serous Sac Cavity, Serous Sac, Organ Component, Organ Subdivision, Tissue, Pleural Cavity, Pleural Sac, Parietal Pleura, Pleura (Wall of Sac), Interlobar recess, Visceral Pleura, Mesothelium of Pleura, Mediastinal Pleura]
28
Current Best Practice:
The Foundational Model of Anatomy
Follows formal rules for definitions laid down by Aristotle.
When A is_a B, the definition of 'A' takes the form:
an A =def. a B which ...
a human being =def. an animal which is rational
29
FMA Example
Cell =def. an anatomical structure which consists of cytoplasm surrounded by a plasma membrane with or without a cell nucleus
Plasma membrane =def. a cell part that surrounds the cytoplasm
30
The FMA regimentation
Brings the advantage that each definition reflects the position in the hierarchy to which a defined term belongs.
The position of a term within the hierarchy enriches its own definition by incorporating automatically the definitions of all the terms above it.
The entire information content of the FMA's term hierarchy can be translated very cleanly into a computer representation.
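The way a genus-plus-differentia definition inherits content from the terms above it can be sketched in a few lines. This is an illustrative sketch only, not the FMA's own representation: the `DEFINITIONS` table and helper names are assumptions, and the entry for 'animal' is a hypothetical placeholder added so that the example has two levels.

```python
# term -> (genus, differentia), following the form "an A =def. a B which ..."
DEFINITIONS = {
    "human being": ("animal", "is rational"),
    # Hypothetical entry, for illustration only (not from the slides).
    "animal": ("organism", "is capable of sensation"),
    "cell": (
        "anatomical structure",
        "consists of cytoplasm surrounded by a plasma membrane "
        "with or without a cell nucleus",
    ),
}


def article(noun: str) -> str:
    """Pick 'a' or 'an' by the first letter (a crude rule)."""
    return "an" if noun[0] in "aeiou" else "a"


def definition(term: str) -> str:
    """Render the Aristotelian definition of a term."""
    genus, differentia = DEFINITIONS[term]
    return f"{article(term)} {term} =def. {article(genus)} {genus} which {differentia}"


def inherited_differentiae(term: str) -> list[str]:
    """Collect the differentiae of the term and of every defined ancestor."""
    collected = []
    while term in DEFINITIONS:
        genus, differentia = DEFINITIONS[term]
        collected.append(differentia)
        term = genus
    return collected
```

Calling `definition("human being")` reproduces the Aristotelian form shown earlier, and `inherited_differentiae` shows how the definition of a term accumulates the differentiae of the terms above it.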
31
GO is now adopting structured definitions that contain both genus and differentiae
Essence = Genus + Differentiae
neuron cell differentiation =
Genus: differentiation (processes whereby a relatively unspecialized cell acquires the specialized features of ..)
Differentiae: acquires features of a neuron
32
Ontology alignment
One of the current goals of GO is to align cell types in GO with cell types in the Cell Ontology:
GO term | Cell Ontology term
cone cell fate commitment | retinal_cone_cell
keratinocyte differentiation | keratinocyte
adipocyte differentiation | fat_cell
dendritic cell activation | dendritic_cell
lymphocyte proliferation | lymphocyte
T-cell homeostasis | T_lymphocyte
garland cell differentiation | garland_cell
heterocyst cell differentiation | heterocyst
33
Alignment of the two ontologies will permit the generation of consistent and complete definitions
GO term + Cell type = New Definition

Cell type:
id: CL:0000062
name: osteoblast
def: "A bone-forming cell which secretes an extracellular matrix. Hydroxyapatite crystals are then deposited into the matrix to form bone." [MESH:A.11.329.629]
is_a: CL:0000055
relationship: develops_from CL:0000008
relationship: develops_from CL:0000375

New definition:
Osteoblast differentiation: Processes whereby an osteoprogenitor cell or a cranial neural crest cell acquires the specialized features of an osteoblast, a bone-forming cell which secretes extracellular matrix.
34
Other Ontologies to be aligned with GO
Chemical ontologies
– 3,4-dihydroxy-2-butanone-4-phosphate synthase activity
Anatomy ontologies
– metanephros development
GO itself
– mitochondrial inner membrane peptidase activity
OBO core
35
eventually to comprehend all of OBO
36
Top Level OBO-UBO
continuants: objects, characteristics, spatial regions
occurrents: processes, temporal regions, spatio-temporal regions
37
Definitions should be intelligible to both machines and humans
Machines can cope with the full formal representation.
Humans need modularity.
38
Fifth Rule: Terms and relations should have clear definitions
These tell us how the ontology relates to the world of biological instances, meaning the actual particulars in reality:
– actual cells, actual portions of cytoplasm, and so on
39
But
Some terms are primitive (cannot be defined)
AVOID CIRCULAR DEFINITIONS!
Avoid definitions of the forms:
An A is an A which is B (person = person with identity documents)
An A is the B of an A (heptolysis = the causes of heptolysis)
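Both circular forms above share one detectable symptom: the defined term reappears in its own definition. A minimal sketch of such a check follows; `is_circular` is a hypothetical helper, and it catches only this direct textual circularity, not circles that run through several definitions. It can also misfire when the term occurs inside a longer, unrelated phrase.

```python
import re


def is_circular(term: str, definition: str) -> bool:
    """True if the term occurs as a whole word or phrase in its own definition."""
    pattern = r"\b" + re.escape(term.lower()) + r"\b"
    return re.search(pattern, definition.lower()) is not None
```

For example, `is_circular("person", "person with identity documents")` is true, while the plasma membrane definition quoted earlier passes the check.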
40
[Figure: a hierarchy of types (substance, organism, animal, mammal, cat, siamese, frog), with a leaf type marked, drawn above the level of instances]
41
Benefits of well-defined relationships
If the relations in an ontology are well-defined, then reasoning can cascade from one relational assertion (A R1 B) to the next (B R2 C).
"Find all DNA binding proteins" should also find all transcription factor proteins, because transcription factor is_a DNA binding protein.
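The DNA binding protein query can be sketched as a search that first expands the requested type to all of its subtypes. This is an illustrative sketch: the function names and the `ANNOTATIONS` table are assumptions, and the item names `protein_1` to `protein_3` are hypothetical.

```python
def subtypes(term, is_a_edges):
    """Return the term plus every type below it in the is_a hierarchy."""
    result = {term}
    changed = True
    while changed:
        changed = False
        for child, parent in is_a_edges:
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result


def find_annotated(term, is_a_edges, annotations):
    """Find items annotated with the term or with any of its subtypes."""
    wanted = subtypes(term, is_a_edges)
    return {item for item, typ in annotations.items() if typ in wanted}


IS_A = [("transcription factor", "DNA binding protein")]
# Hypothetical annotated items.
ANNOTATIONS = {
    "protein_1": "DNA binding protein",
    "protein_2": "transcription factor",
    "protein_3": "plasma membrane",
}
hits = find_annotated("DNA binding protein", IS_A, ANNOTATIONS)
```

The search returns `protein_2` as well as `protein_1`, even though `protein_2` was annotated only as a transcription factor.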
42
What happens when an ontology has no clear definition of A is_a B:
cancer documentation is_a cancer
disease prevention is_a disease
living subject is_a information object representing an animal or complex organism
individual allele is_a act of observation
43
[Figure repeated: the same fragment of the FMA hierarchy (Anatomical Space, Anatomical Structure, Organ, Serous Sac, Pleural Sac, Pleura and related nodes)]
44
How to define A is_a B
A is_a B =def. all instances of A are, as a matter of biological science, also instances of B
here A and B are names of types in reality
45
How to define A is_a B
A is_a B =def. for all a if a instance_of A, then a instance_of B
46
Kinds of relations
Between types:
– is_a, part_of, ...
Between an instance and a type
– this explosion instance_of the type explosion
Between instances:
– Mary‟s heart part_of Mary
47
Part_of as a relation between types is more problematic than is standardly supposed
heart part_of human being ?
human heart part_of human being ?
human being has_part human testis ?
testis part_of human being ?
48
Definition of part_of as a relation between types
A part_of B =def. all instances of A are instance-level parts of some instance of B
human testis part_of adult human being
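The all-some definition above can be tested against sample instance data. The sketch below is only a data-level check under stated assumptions: the function names, the instance names and the sample data are hypothetical, and passing the check on a sample does not establish that the relation holds as a matter of biological science. Note also that the check is vacuously true when A has no instances in the data.

```python
def type_part_of(a_type, b_type, instance_of, instance_part_of):
    """Every instance of a_type is an instance-level part of some instance of b_type."""
    a_instances = [x for x, types in instance_of.items() if a_type in types]
    b_instances = {x for x, types in instance_of.items() if b_type in types}
    return all(
        any((a, b) in instance_part_of for b in b_instances)
        for a in a_instances
    )


def type_has_part(b_type, a_type, instance_of, instance_part_of):
    """Every instance of b_type has some instance of a_type as an instance-level part."""
    a_instances = {x for x, types in instance_of.items() if a_type in types}
    b_instances = [x for x, types in instance_of.items() if b_type in types]
    return all(
        any((a, b) in instance_part_of for a in a_instances)
        for b in b_instances
    )


# Hypothetical sample data: two people, one testis, one heart.
INSTANCE_OF = {
    "john": {"adult human being"},
    "mary": {"adult human being"},
    "john's testis": {"human testis"},
    "mary's heart": {"human heart"},
}
INSTANCE_PART_OF = {("john's testis", "john"), ("mary's heart", "mary")}
```

On this sample, human testis part_of adult human being holds, but adult human being has_part human testis does not, which is why the two type-level relations have to be defined separately.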
49
Instance level
this nucleus is adjacent to this cytoplasm
implies: this cytoplasm is adjacent to this nucleus
Type level
nucleus adjacent_to cytoplasm
Not: cytoplasm adjacent_to nucleus
seminal vesicle adjacent_to urinary bladder
Not: urinary bladder adjacent_to seminal vesicle
50
Definitions of the all-some form
allow cascading inferences
If A R1 B and B R2 C, then we know that every A stands in R1 to some B, but we know also that, whichever B this is, it can be plugged into the R2 relation.
51
transformation_of
[Figure: along a time axis, the same instance c is an instance of type C at time t and of type C1 at time t1. Examples: pre-RNA and mature RNA; child and adult]
52
transformation_of
A transformation_of B =def. every instance of A was at some earlier time an instance of B
adult transformation_of child
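Because transformation_of refers to earlier times, checking it requires time-indexed instance data. The following is a minimal sketch under stated assumptions: the `history` structure, the function name and the numeric time stamps are all hypothetical, and the check is vacuously true when no instance of A appears in the data.

```python
def transformation_of(a_type, b_type, history):
    """Every instance of a_type was, at some earlier time, an instance of b_type.

    history maps each particular to a list of (time, type) records.
    """
    for records in history.values():
        for t, typ in records:
            if typ != a_type:
                continue
            was_b_earlier = any(t0 < t and typ0 == b_type for t0, typ0 in records)
            if not was_b_earlier:
                return False
    return True


# Hypothetical data: the same particular is a child at time 0 and an adult at time 20.
HISTORY = {"mary": [(0, "child"), (20, "adult")]}
```

On this data, adult transformation_of child holds, while child transformation_of adult does not.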
53
embryological development
[Figure: the same instance c is an instance of C at t and of C1 at t1]
54
tumor development
[Figure: the same instance c is an instance of C at t and of C1 at t1]
55
derives_from
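[Figure: along a time axis, instance c of type C and instance c' of type C' exist at t, and a distinct instance c1 of type C1 exists at t1. Example: zygote derives_from ovum and sperm]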
56
One main obstacle to integrating biological and experiment-generated data
Most ontologies have no facility for dealing with time and instances
57
EXPO: Experiment Ontology
58
representational style part_of experimental hypothesis
experimental actions part_of experimental design
59
tool part_of experimental design (confuses object with specification)
60
hypothesis driven is_a Galilean
61
physical is_a scientific experiment (avoid abbreviations)
62
admin info about experiment is_a scientific experiment
63
where is the top level? objects, processes, characteristics
64
is_a and part_of never cross categorial divides (cf. tripartite organization of GO)
if A is_a B, then:
A is an object type iff B is an object type
A is a process type iff B is a process type
A is a characteristic type iff B is a characteristic type
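This constraint can be enforced by tagging every type with its top-level category and flagging links whose two ends differ. The sketch below is illustrative: the function name and data layout are assumptions, and the category assigned to each term is my own reading for the sake of the example, not a classification given in the slides.

```python
# Assumed top-level category of each type (illustrative assignments only).
CATEGORY = {
    "transcription factor": "object",
    "DNA binding protein": "object",
    "tool": "object",
    "experimental design": "characteristic",
    "experimental actions": "process",
}

EDGES = [
    ("transcription factor", "is_a", "DNA binding protein"),
    ("tool", "part_of", "experimental design"),
    ("experimental actions", "part_of", "experimental design"),
]


def crossing_edges(edges, category):
    """Return is_a / part_of links whose ends lie in different categories."""
    return [(a, rel, b) for a, rel, b in edges if category[a] != category[b]]


bad = crossing_edges(EDGES, CATEGORY)
```

Under these assumed categories, both part_of links into experimental design are flagged, while the is_a link between the two protein types passes.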
65
Some thoughts on time
continuants vs. occurrents
objects, characteristics vs. processes
time, timeline, day, daytime, menstrual cycle, high tide
66
What is time?
67
Top Level OBO-UBO
continuants: objects, characteristics, spatial regions
occurrents: processes, temporal regions, spatio-temporal regions
Space = the largest spatial region
Time = the largest temporal region
68
Relative time, subjective time
terms describing (regions of) time in special (qualitative, perspective-dependent, landmark-dependent) ways
tomorrow, yesterday
uptown, downtown
phase A trial
Wednesday
69
Characteristics are continuants
many characteristics have realizations, applications or executions, which are processes
plan, design, method, menstrual cycle, function
70
GlaxoSmithKline*
What we need is “industrial-strength” ontologies with a consistent and rich representation formalism that are amenable for use as an integration framework, and support reasoning capabilities. We anticipate that pharma's need to bring together mountains of data and information and to properly analyse that information all depend on having a stable, well-developed semantic framework that links information/data and that allows reasoning systems to perform some of our more "mundane" analysis work.
*Robin McEntire
71
OBO Relation Ontology
“Relations in Biomedical Ontologies”, Genome Biology, Apr. 2005
relations for continuants behave differently from relations for processes
72
part_of
for component types is time-indexed
A part_of B =def. given any particular a and any time t, if a is an instance of A at t, then there is some instance b of B such that a is an instance-level part_of b at t
73
part_of
for process types is not time-indexed
A part_of B =def. given any particular a, if a is an instance of A, then there is some instance b of B such that a is an instance-level part_of b
74
Main Upper Level Ontologies
CYC: Cycorp (Austin, TX)
  human being = partially tangible thing
SUO (Suggested Upper Ontology): IEEE
  monkey, body covering
DOLCE (Descriptive Ontology for Linguistic and Cognitive Engineering)
BFO (Basic Formal Ontology)
75
SUO top level
Entity
– Physical
  • Object
    – SelfConnectedObject
      » Substance
      » CorpuscularObject
      » Food
    – Region
    – Collection
    – Agent
  • Process
– Abstract
  • SetOrClass
  • Relation
  • Quantity
    – Number
    – PhysicalQuantity
  • Attribute
  • Proposition
76
MIGS Specification Top Levels
Organism Phenotype Environment Sample Process Data Process
77