How to Build an Ontology
Barry Smith http://ontology.buffalo.edu/smith
1
Mission of the NCBO
To create software and support services for science-based ontology development and use in the biomedical domain
• Science-based = ontologies for support of scientific research (taken as encompassing evidence-based medicine)
• Science-based = using the scientific method as part of the process of ontology development and testing
2
Scientific ontologies have special features
Every term in a scientific ontology must be such that the developers of the ontology believe it to refer to some entity on the basis of the best current evidence
5
For scientific ontologies
• reusability is crucial
• compatibility with neighboring scientific ontologies is crucial
• it should not be too easy to add new terms to an ontology
we want to introduce these features in clinical medicine ...
6
An Ontological Square
Upper-level integrating ontologies
Domain ontologies
10
An Ontological Square
Upper-level integrating ontologies
Ontologies in support of science
Administrative ontologies
Domain ontologies
11
An Ontological Square
Upper-level integrating ontologies
Ontologies in support of science Administrative ontologies (for ecommerce, etc.)
Domain ontologies
Examples: BFO (Basic Formal Ontology), DOLCE; SNOMED, SwissProt, FMA; FOAF top level (person, topic, document, primary topic ...); Amazon.com ontology, Library of Congress Catalog
12
Problem of ensuring sensible cooperation in a massively interdisciplinary community
concept type instance model representation data
13
from Handbook of Ontology (Semantic Web approach)
RetailPrice hasA Denomination InstanceOf Dollar (p. 101)
SI-Unit instanceof System-of-Units (p. 40)
14
from: Ontological Engineering (Semantic Web approach)
location =def. a spatial point identified by a name (p. 12)
arrivalPlace =def. a journey ends at a location (p. 13)
facet =def. ternary relation that holds between a frame, a slot, and the facet (p. 51)
15
Entity =def
anything which exists, including things and processes, functions and qualities, beliefs and actions, documents and software (Levels 1, 2 and 3)
16
First basic distinction
universal vs. instance (science text vs. diary) (man vs. Maximilian)
17
Instances databases
For scientific ontologies
it is generalizations that are important = universals, types, kinds, species
18
Catalog vs. inventory
A B C
515287  DC3300 Dust Collector Fan
521683  Gilmer Belt
521682  Motor Drive Belt
19
Catalog vs. inventory
20
Catalog of Universals/Types
Ontology
Universals
Instances
22
Ontology = A Representation of Universals
23
Each node of an ontology consists of:
• preferred term (aka term)
• term identifier (TUI, aka CUI)
• synonyms
• definition, glosses, comments
Ontology = A representation of universals
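The four parts of a node listed above can be pictured as a small record type. The following is only a sketch: the class name `OntologyNode`, its field names, the identifier `EX:0000001` and the synonym are assumptions made for illustration, not part of any particular ontology format.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyNode:
    """One node of an ontology: identifier, preferred term, synonyms, definition."""
    identifier: str                      # term identifier (hypothetical format)
    preferred_term: str                  # the preferred term for the universal
    synonyms: list = field(default_factory=list)
    definition: str = ""                 # natural-language definition, if any
    comments: list = field(default_factory=list)

    def labels(self):
        """All strings under which this node can be looked up."""
        return [self.preferred_term, *self.synonyms]


# Illustrative node; the identifier and synonym are invented for the example.
cat = OntologyNode(
    identifier="EX:0000001",
    preferred_term="cat",
    synonyms=["domestic cat"],
)
print(cat.labels())
```

Keeping the identifier separate from the preferred term means the term can be reworded later without breaking references to the node.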
24
An ontology is a representation of universals
We learn about universals in reality from looking at the results of scientific experiments in the form of scientific theories – which describe not what is particular in reality but what is general
25
[Figure: an is_a tree of universals (substance, organism, animal, mammal, cat, siamese, frog), with siamese marked as a leaf class, drawn above a layer of instances]
Domain =def
a portion of reality that forms the subject matter of a single science or technology or mode of study or administrative practice ...;
proteomics HIV epidemiology
27
Representation =def
an image, idea, map, picture, name or description ... of some entity or entities.
28
Ontologies are representational artifacts
comparable to science texts
29
The Periodic Table
Periodic Table
33
Ontologies are here
34
or here
35
What do ontologies represent?
36
Ontologies do not represent concepts in people’s heads
37
They represent universals in reality
38
“leg” is not the name of a concept
concepts do not stand in part_of, connectedness, causes, treats ... relations to each other
39
instances
A B C
515287  DC3300 Dust Collector Fan
521683  Gilmer Belt
521682  Motor Drive Belt
universals
Inventory vs. Catalog: Two kinds of composite representational artifacts
Databases represent instances
Ontologies represent universals
41
How do we know which general terms designate universals?
Roughly: terms used by scientists to designate entities about which we have a plurality of different kinds of testable proposition
(cell, electron ...)
42
Problem: fiat demarcations
male over 30 years of age with family history of diabetes
abnormal curvature of spine
participant in trial #2030
43
Problem: roles
fist
patient
FDA-approved drug
44
Administrative ontologies often need to go beyond universals
Fall on stairs or ladders in water transport injuring occupant of small boat, unpowered
Railway accident involving collision with rolling stock and injuring pedal cyclist
Nontraffic accident involving motor-driven snow vehicle injuring pedestrian
45
universals vs. classes
universals
{a,b,c,...}
classes
46
Class =def
a maximal collection of particulars determined by a general term (‘cell’, ‘electron’, ‘restaurant in Palo Alto’, ‘Italian’)
the class A = the collection of all particulars x for which ‘x is A’ is true
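The definition above can be read as a filter over particulars: the class determined by a general term is everything of which the term is true. A minimal sketch, assuming a toy list of particulars; the `Particular` record, the sample entries and the helper `class_of` are all invented for illustration.

```python
from typing import NamedTuple


class Particular(NamedTuple):
    name: str
    kind: str
    city: str


def class_of(particulars, is_true_of):
    """Return the collection of all particulars x for which 'x is A' is true."""
    return frozenset(x for x in particulars if is_true_of(x))


# Hypothetical particulars, invented for the example.
world = [
    Particular("r1", "restaurant", "Palo Alto"),
    Particular("r2", "restaurant", "Buffalo"),
    Particular("r3", "restaurant", "Palo Alto"),
    Particular("c1", "cell", "Palo Alto"),
]

# The general term 'restaurant in Palo Alto' expressed as a predicate.
restaurants_in_palo_alto = class_of(
    world, lambda x: x.kind == "restaurant" and x.city == "Palo Alto"
)
names = {x.name for x in restaurants_in_palo_alto}
print(sorted(names))
```

The class is maximal in the sense that no particular satisfying the term is left out; it changes whenever the particulars change, which is one way in which a class differs from a universal.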
47
Problem
The same general term can be used to refer both to universals and to collections of particulars. Consider:
HIV is an infectious retrovirus
HIV is spreading very rapidly through Asia
48
universals vs. classes
universals
{c,d,e,...}
classes
49
Extension =def
The extension of a universal A is the class of all instances of the universal A (it is the class of A’s instances) (the class of all entities to which the term ‘A’ applies)
50
universals vs. classes
universals
defined classes
51
universals vs. classes
universals
populations, ...
52
Defined class =def
a class defined by a general term which does not designate a universal
Example: the class of all diabetic patients in Leipzig on 4 June 1952
53
OWL is a good representation of defined classes
• sibling of Finnish spy
• member of Abba aged > 50 years
54
Terminology =def.
a representational artifact whose representational units are natural language terms (with IDs, synonyms, comments, etc.) which are intended to designate universals together with defined classes.
55
universals, classes, concepts
universals
defined classes
‘concepts’
56
universals < defined classes < ‘concepts’
‘concepts’ which do not correspond to defined classes:
‘Surgical or other procedure not carried out because of patient's decision’
‘Absent nipple’
57
(Scientific) Ontology =def.
a representational artifact whose representational units (which may be drawn from a natural or from some formalized language) are intended to represent
1. universals in reality
2. those relations between these universals which obtain universally (= for all instances)
lung is_a anatomical structure
lobe of lung part_of lung
58
Part II
How to Build an Ontology
59
How to build an ontology
• work with scientists to create an initial top-level classification
• find the ~50 most commonly used terms corresponding to universals in reality
• arrange these terms into an informal is_a hierarchy according to this Universality principle: A is_a B = every instance of A is an instance of B
• fill in missing terms to give a complete hierarchy
(leave it to domain scientists to populate the lower levels of the hierarchy)
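The Universality principle in the steps above can be sketched in a few lines: if A is_a B, then anything asserted to be an instance of A also counts as an instance of B and of every ancestor of B. The toy hierarchy below reuses terms from the earlier animal example; the dictionary `IS_A` and the helper names are assumptions made for illustration.

```python
# Informal is_a hierarchy: each term maps to its single is_a parent.
IS_A = {
    "siamese": "cat",
    "cat": "mammal",
    "mammal": "animal",
    "frog": "animal",
    "animal": "organism",
}


def ancestors(term):
    """Walk up the is_a links and return all ancestors, nearest first."""
    found = []
    while term in IS_A and IS_A[term] not in found:  # guard against cycles
        term = IS_A[term]
        found.append(term)
    return found


def is_instance_of(asserted_type, universal):
    """Every instance of A is an instance of B whenever A is_a B (transitively)."""
    return universal == asserted_type or universal in ancestors(asserted_type)


def roots():
    """Terms with no is_a parent; more than one suggests missing terms."""
    return sorted(set(IS_A.values()) - set(IS_A))


print(ancestors("siamese"))
print(is_instance_of("siamese", "mammal"), is_instance_of("frog", "mammal"))
print(roots())
```

A hierarchy with several roots is a hint that terms still need to be filled in before the classification is complete.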
60
Principle of Low Hanging Fruit
Include even absolutely trivial assertions (assertions you know to be universally true)
pneumococcal virus is_a virus
Computers need to be led by the hand
61
MeSH
MeSH Descriptors
    Index Medicus Descriptor
        Anthropology, Education, Sociology and Social Phenomena (MeSH Category)
            Social Sciences
                Political Systems
                    National Socialism
National Socialism is_a Political Systems
National Socialism is_a Anthropology ...
62
Principle
Use singular nouns
Terms in ontologies represent universals
63
Goal: Each term in an ontology represents exactly one universal
there are universals also of collectivities:
population
complex of cells
64
the use-mention confusion
Conceptual Entities =Def. An organizational header for concepts representing mostly abstract entities.
swimming is healthy and has eight letters
65
Principle
Avoid confusing words with things
Avoid confusing concepts in our minds with entities in reality
Recommendation: avoid the word ‘concept’ entirely
66
Trialbank
‘information’ = def. ‘a written or spoken designation of a concept’
67
Trialbank
‘Heparin therapy’ is an instance of ‘written or spoken designation of a concept’
What are the problems here?
1. misuse of quotation marks
2. confusion of instances and universals
3. confusion of concept and reality
68
Plant Ontology
cell = def. plant cell, consisting of protoplast and cell wall; ...
69
Principle
For the sake of interoperability with other ontologies, do not give special meanings to terms with established general meanings
(Don’t use ‘cell’ when you mean ‘plant cell’)
70
ICNP: International Classification for Nursing Practice
water =def. a type of Nursing Phenomenon of Physical Environment with the specific characteristics: clear liquid compound of hydrogen and oxygen that is essential for most plant and animal life influencing life and development of human beings.
71
Principle
Supply definitions wherever possible (both human-understandable natural language definitions, and equivalent formal definitions)
72
Principle
Each term should have at most one definition*
*which may have both natural-language and formal versions
73
The Problem of Circularity
A Person = def. A person with an identity document
cell = def. plant cell, consisting of protoplast and cell wall; ...
74
Principle
Avoid circular definitions
(The term defined should not appear in its own definition)
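The simplest form of this check, whether the term defined appears in its own definition, is easy to automate. A minimal sketch using the definitions quoted in this talk; the function name `is_circular` is an assumption, and indirect circles (A defined via B, B defined via A) are outside what it detects.

```python
import re


def is_circular(term, definition):
    """True if the defined term occurs as a whole word in its own definition."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, definition, flags=re.IGNORECASE) is not None


# Definitions quoted earlier in the talk.
definitions = {
    "person": "a person with an identity document",
    "cell": "plant cell, consisting of protoplast and cell wall",
    "location": "a spatial point identified by a name",
}

circular_terms = sorted(t for t, d in definitions.items() if is_circular(t, d))
print(circular_terms)
```

Matching on whole words keeps a term such as 'cell' from being flagged merely because a definition mentions 'cellular'.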
75
HL7
‘stopping a medication’ = def. change of state in the record of a Substance Administration Act from Active to Aborted
76
Principle
A definition should use terms which are easier to understand than the term defined
(HL7 creates a topsy-turvy world, in which simple things are made difficult)
77
Principle
Use Aristotelian definitions
An A is a B which C’s.
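The pattern 'An A is a B which C’s' has three slots: the term defined, its parent (the genus B) and what marks it off from its siblings (the differentia C). A sketch of a record that enforces the pattern; the class name, the field names and the sample definition of 'plant cell' are assumptions made for illustration.

```python
from dataclasses import dataclass


def article(word):
    """Pick 'a' or 'an' by first letter (a rough heuristic)."""
    return "an" if word[0].lower() in "aeiou" else "a"


@dataclass
class AristotelianDefinition:
    term: str          # A: the term being defined
    genus: str         # B: the is_a parent of A
    differentia: str   # C: what distinguishes A's from other B's

    def is_well_formed(self):
        """All slots filled, and the genus differs from the term defined."""
        return bool(self.term and self.genus and self.differentia) \
            and self.genus != self.term

    def render(self):
        text = (f"{article(self.term)} {self.term} is "
                f"{article(self.genus)} {self.genus} which {self.differentia}")
        return text[0].upper() + text[1:]


# Illustrative definition, loosely based on the Plant Ontology example.
plant_cell = AristotelianDefinition(
    term="plant cell",
    genus="cell",
    differentia="has a protoplast and a cell wall",
)
print(plant_cell.render())
```

Because the genus is the term's is_a parent, definitions written this way stay in step with the hierarchy.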
78
Principle
Do not seek to define everything
79
In every ontology
some terms and some relations are primitive = they cannot be defined (on pain of infinite regress)
Examples of primitive relations:
• identity
• instance_of
80
Rules for formatting terms
• Avoid abbreviations even when it is clear in context what they mean (‘breast’ for ‘breast tumor’)
• Avoid acronyms
• Avoid mass terms (‘tissue’, ‘brain mapping’, ‘clinical research’ ...)
• Treat each term ‘A’ in an ontology as shorthand for a term of the form ‘the universal A’
83
Univocity
Terms should have the same meanings on every occasion of use. (They should refer to the same universals)
Basic ontological relations such as is_a and part_of should be used in the same way by all ontologies
84
Universality
Ontologies should include only those relational assertions which hold universally
pneumococcal virus causes pneumonia
85
Universality
Often, order will matter:
We can assert
adult transformation_of child
but not
child transforms_into adult
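One way to see why order matters is to read 'A R B' as: every instance of A stands in R to some instance of B. Under that reading the check can be run over instance data. The sketch below uses invented toy data in which each identifier names an individual at a life stage (a simplification made only for this illustration); the function name `holds_universally` is likewise an assumption.

```python
def holds_universally(a, relation, b, instance_of):
    """True if every instance of `a` is related to some instance of `b`."""
    a_instances = [x for x, kind in instance_of.items() if kind == a]
    b_instances = {y for y, kind in instance_of.items() if kind == b}
    return bool(a_instances) and all(
        any((x, y) in relation for y in b_instances) for x in a_instances
    )


# Toy instance data (hypothetical): two adults who were once children,
# and a third child who has not become an adult.
instance_of = {
    "adult_1": "adult", "adult_2": "adult",
    "child_1": "child", "child_2": "child", "child_3": "child",
}
transformation_of = {("adult_1", "child_1"), ("adult_2", "child_2")}
transforms_into = {(y, x) for (x, y) in transformation_of}

print(holds_universally("adult", transformation_of, "child", instance_of))
print(holds_universally("child", transforms_into, "adult", instance_of))
```

Every adult was a child, so the first assertion holds for all instances; not every child becomes an adult, so the converse does not belong in the ontology.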
86
Universality
viral pneumonia caused by virus
but not
virus causes pneumonia
pneumococcal virus causes pneumonia
87
Universality
protocol-design earlier_than results analysis
but not
results analysis later_than protocol-design
88
Positivity
Complements of universals are not themselves universals.
Terms such as
• non-mammal
• non-membrane
• other metalworker in New Zealand
do not designate universals in reality
89
Ontology of universals logic of terms
There are no conjunctive and disjunctive universals:
• anatomic structure, system, or substance
• musculoskeletal and connective tissue disorder
• rheumatism, excluding the back
90
Objectivity
Which universals exist in reality is not a function of our knowledge.
Terms such as
• unknown
• unclassified
• unlocalized
• arthropathies not otherwise specified
do not designate universals in reality.
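Terms that break the Positivity and Objectivity principles tend to carry tell-tale words, so a first-pass screen can be automated. The sketch below is a heuristic only: the pattern lists are assumptions drawn from the examples in this talk, and a flagged term still needs a curator's judgment.

```python
import re

# Hypothetical screening patterns, taken from the examples quoted above.
SUSPECT_PATTERNS = {
    "Positivity": re.compile(r"\bnon-|\bother\b", re.IGNORECASE),
    "Objectivity": re.compile(
        r"\b(unknown|unclassified|unlocalized|not otherwise specified)\b",
        re.IGNORECASE,
    ),
}


def lint(term):
    """Return the names of the principles a term appears to violate."""
    return [name for name, pattern in SUSPECT_PATTERNS.items()
            if pattern.search(term)]


for term in ["non-mammal", "other metalworker in New Zealand",
             "arthropathies not otherwise specified", "lung"]:
    print(term, "->", lint(term))
```

Matching 'other' as a whole word avoids a false Positivity hit on 'otherwise'.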
91
Keep Epistemology Separate from Ontology
If you want to say that
We do not know where A’s are located
do not invent a new class of A’s with unknown locations
(A well-constructed ontology should grow linearly; it should not need to delete classes or relations because of increases in knowledge)
92
Keep Sentences Separate from Terms
If you want to say
I surmise that this is a case of pneumonia
do not invent a new class of surmised pneumonias
93
Single Inheritance
No kind in a classificatory hierarchy should have more than one is_a parent on the immediate higher level
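The single-inheritance rule can be checked mechanically over a list of is_a assertions. A minimal sketch, assuming the assertions are available as (child, parent) pairs; the function name `multiple_parents` is invented for illustration, and the sample pairs follow the blue car example used in this talk.

```python
from collections import defaultdict


def multiple_parents(is_a_edges):
    """Map each term with more than one is_a parent to its sorted parents."""
    parents = defaultdict(set)
    for child, parent in is_a_edges:
        parents[child].add(parent)
    return {child: sorted(ps) for child, ps in parents.items() if len(ps) > 1}


# is_a assertions as (child, parent) pairs.
edges = [
    ("blue thing", "thing"),
    ("car", "thing"),
    ("blue car", "blue thing"),
    ("blue car", "car"),
]

violations = multiple_parents(edges)
print(violations)
```

Each reported term is a place where one of the is_a links may be standing in for a different relation.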
94
Multiple Inheritance
[Diagram: thing at the top, with blue thing and car beneath it; blue car is_a blue thing and blue car is_a car]
95
Multiple Inheritance
• is a source of errors
• encourages laziness
• serves as an obstacle to integration with neighboring ontologies
• hampers use of Aristotelian methodology for defining terms
• hampers use of statistical search tools
96
Multiple Inheritance
[Diagram: the same hierarchy with the two links distinguished; blue car is_a1 blue thing and blue car is_a2 car]
97
is_a Overloading
The success of ontology alignment demands that ontological relations (is_a, part_of, ...) have the same meanings in the different ontologies to be aligned.
98
Compositionality
The meanings of compound terms should be determined
1. by the meanings of component terms together with
2. the rules governing syntax
99
Why do we need rules/standards for good ontology?
Ontologies must be intelligible both to humans (for annotation and curation) and to machines (for reasoning and error-checking):
• the lack of rules for classification leads to human error and blocks automatic reasoning and error-checking
• intuitive rules facilitate training of curators and annotators
• common rules allow alignment with other ontologies
100
ontologies are legends for cartoons
Randomized controlled trials
http://rctbank.ucsf.edu/ontology/outline/index.htm
102
Top-Level Class Hierarchy for RCT
Root
Secondary-study
Trial-details
Trial
Concept
• Generic-concept
• Population-concept
• Protocol-concept
• Design-concept
• Outcome-concept
• Administrative-concept
• Intervention-concept
103
Trial Details
Root
Secondary-study
Trial-details
• Erratum
• Publication-details
• Trial-entry-details
• Administrative-details
    – Secondary-administrative-details
    – Primary-administrative-details
        » Executed-administrative-details
        » Intended-administrative-details
• Conclusion-details
• Background-details
    – Intended-background-details
    – Executed-background-details
• Stopping-details
• Retraction-details
• Correction-details
• Fraud-details
104
Top-Most Class Hierarchy for RCT
Root
Secondary-study
Trial-details
Trial
Concept
• Generic-concept
• Population-concept
• Protocol-concept
• Design-concept
• Outcome-concept
• Administrative-concept
• Intervention-concept
105
Concept
• Generic-concept
    – Term-information
    – Time-entity
    – Rule-concept
    – Situation
• Population-concept
    – Subgroup
    – Recruitment-flowchart
    – Population
    – Recruitment
    – Site-enrollment
• Protocol-concept
    – Follow-up-compliance
    – Follow-up-activity
    – Follow-up
    – Protocol-change
    – Treatment-assignment
    – Protocol
    – Reason
    – Outcomes-followup
    – Secondary-study-protocol
106
Concept
• Design-concept
    – Survival-analysis-and-results
    – Statistical-analysis-and-results
    – Sample-size-calculation
    – Trial-design
    – Hypothesis-concept
    – Study-objective
    – Study-monitoring
    – Regression-analysis-and-results
    – Stopping-rule
    – Special-variable-information
    – Outcome-assessment
    – Miscellaneous-outcome-entity
    – Result-entity
    – Outcome-value-entity
    – Outcome
107
• Outcome-concept
Concept
• Administrative-concept
    – Publication-concept
    – Study-site
    – Person
    – Ethics
    – Study-committee
    – Funder
    – Institution
    – Registry-id
• Intervention-concept
    – Blinding-concept
    – Compliance-details
    – Intervention-step
    – Intervention-arm
    – Co-intervention
    – Intervention
    – Compliance-result
    – Intervention-logic
108
Top-Level Class Hierarchy for RCT
Root
Secondary-study
Trial-details
Trial
Concept
• Generic-concept
• Population-concept
• Protocol-concept
• Design-concept
• Outcome-concept
• Administrative-concept
• Intervention-concept
109
What the top level should look like
110
Two kinds of entities
occurrents (processes, events, happenings)
continuants (objects, qualities, states...)
111
Continuants (aka endurants)
• have continuous existence in time
• preserve their identity through change
• exist in toto whenever they exist at all
Occurrents (aka processes)
• have temporal parts
• unfold themselves in successive phases
• exist only in their phases
112
You are a continuant
Your life is an occurrent
You are 3-dimensional
Your life is 4-dimensional
113
Dependent entities
require independent continuants as their bearers
There is no run without a runner
There is no grin without a cat
114
Dependent vs. independent continuants
Independent continuants (organisms, buildings, environments)
Dependent continuants (quality, shape, role, propensity, function, status, power, right)
115
All occurrents are dependent entities
They are dependent on those independent continuants which are their participants (agents, patients, media ...)
116
BFO Top-Level Ontology
Continuant
Occurrent (always dependent on one or more independent continuants)
Independent Continuant
Dependent Continuant
117
= A representation of top-level types
Continuant
• Independent Continuant: cell component
• Dependent Continuant: molecular function
Occurrent: biological process
118
Top-Level Ontology
Continuant
Occurrent
Independent Continuant
Dependent Continuant
Side-Effect, Stochastic Process, ... Functioning
Function
119
Top-Level Ontology
Continuant
Occurrent
Independent Continuant
Dependent Continuant
Functioning
Side-Effect, Stochastic Process, ...
Function
120
Top-Level Ontology
Continuant
Occurrent
Independent Continuant
Dependent Continuant
Functioning
Side-Effect, Stochastic Process, ...
Quality
Function
Spatial Region
instances (in space and time)
121
CTO will be part of OBI
Ontology for Biomedical Investigations (http://obi.sourceforge.net),
which is in turn part of the OBO Foundry (http://obofoundry.org)
124
Top-Level Class Hierarchy for RCT
Root
Secondary-study
Trial-details
Trial
Concept
• Generic-concept
• Population-concept
• Protocol-concept
• Design-concept
• Outcome-concept
• Administrative-concept
• Intervention-concept
132
Amended Top-Level Class Hierarchy for RCT
Entity
    Continuant
        Population
        Protocol
        Design
    Occurrent
        Trial
        Secondary-study
        Intervention
    ?? Trial-details
    ?? Outcome-concept
    ?? Administrative-concept
133
Concept
• Generic-concept
    – Term-information
    – Time-entity
    – Rule-concept
        » Clinical-rule
            Exclusion-rule
            Inclusion-rule
        » Rule-entity
            Recursive-rule
            Base-rule
        » Ethnicity-language-rule
        » Age-gender-rule
    – Situation
134
Concept
• Protocol-concept
    – Follow-up-compliance
    – Follow-up-activity
    – Follow-up
    – Protocol-change
    – Treatment-assignment
    – Protocol
    – Reason
    – Outcomes-followup
    – Secondary-study-protocol
137
Amended Top-Level Class Hierarchy for RCT
Entity
    Continuant
        Protocol
            • Secondary-study-protocol
        Reason
    Occurrent
            • Treatment-assignment
            • Follow-up
                – Follow-up-activity
                – Outcomes-follow-up
            • Protocol-change
138
Concept
• Population-concept
    – Subgroup
    – Recruitment-flowchart
    – Population
    – Recruitment
    – Site-enrollment
139
Amended Top-Level Class Hierarchy for RCT
Entity
    Continuant
        Protocol
            • Secondary-study-protocol
        Recruitment-flowchart
        Reason
        Population
            • Subgroup
    Occurrent
            • Priors
                – Recruitment
                – Site-enrollment
                – Treatment-assignment
            • Follow-up
                – Follow-up-activity
                – Outcomes-follow-up
            • Protocol-change
140
Concept
• Administrative-concept
    – Publication-concept
    – Study-site
    – Person
    – Ethics
    – Study-committee
    – Funder
    – Institution
    – Registry-id
141
Continuant
• Information object
    – Publication
    – Registry-ID
• Study-site
• Person
• Institution
    – Study-committee
    – Funder
??? Ethics
142
Concept
• Intervention-concept
    – Blinding-concept
    – Compliance-details
    – Intervention-step
    – Intervention-arm
    – Co-intervention
    – Intervention
    – Compliance-result
    – Intervention-logic
143
Occurrent
• Intervention
    – Blinding
    – Intervention-step
    – Intervention-arm
    – Co-intervention
• ??? Intervention-logic
• ??? Compliance-result
• ??? Compliance-details
144
END
167