Representing nested semantic information in a linear string of text
Michael Krauthammer, MD (1), Stephen B. Johnson, PhD, George Hripcsak, MD MS (1), David A. Campbell (1) and Carol Friedman (1,2)
(1) Department of Medical Informatics, Columbia University, New York; (2) Queens College CUNY, New York
XML has been widely adopted as an important data interchange language. The structure of XML enables sharing of data elements with variable degrees of nesting as long as the elements are grouped in a strict tree-like fashion. This requirement potentially restricts the usefulness of XML for marking up written text, which often includes features that do not properly nest within other features. We encountered this problem while marking up medical text with structured semantic information from a Natural Language Processor. Traditional approaches to this problem separate the structured information from the actual text markup. This paper introduces an alternative solution, which tightly integrates the semantic structure with the text. The resulting XML markup preserves the linearity of the medical texts and can therefore be easily expanded with additional types of information.

Extensible Markup Language (XML) is a subset of Standard Generalized Markup Language (SGML) geared for data exchange and processing over the Web. XML is strongly supported by the industry and data standards such as HL7, which has adopted XML for its upcoming RIM Version 3. Recently there has been increasing interest in XML databases and the ability of XML to store semi-structured data [4, 5]. Semi-structured data is data that is self-describing and can be processed and stored without an explicit data schema. We are collecting such kinds of data for our data mining activities while parsing free medical text reports with various types of parsers and taggers. The parsing results in different kinds of text annotations: structured semantic information for diseases and their body locations, for example, while tagging may identify syntactic part-of-speech information. As each data mining project 'produces' its own kind of text annotations, we would prefer to store the cumulative annotations of a medical report in a single XML file. Schematically, we would like to be able to mark up a sample sentence such as
the heart is enlarged.
with different kinds of annotation tags as seen in Fig. 1. Here, the p and d elements stand for part-of-speech and disease, representing an example of a successive syntactic and semantic markup, respectively. Attributes of each element record either part-of-speech information, such as NN (short for noun), or the type of disease, such as cardiomegaly. Ideally, we would envision a growing number of tags added to the existing markup as new research projects produce different kinds of annotations. For example, a future project may be concerned with marking up sensitive data (Fig. 1). The growing string of XML tags would therefore be a valuable resource for researchers who want to take advantage of previous analysis of the medical reports. In a related paper, we discuss in detail how such complex XML annotations can be stored and queried for a particular set of tags or views.

Fig. 1. Successive markup of medical text reports. [Figure: left, the XML markup of the sentence "the heart is enlarged" after each step: the plain sentence element S, part of speech tagging with p elements (v="DE", v="NN", ...), disease tagging with a d element (v="cardiomegaly") enclosing the p elements, and markup of sensitive data; right, the corresponding tree structures.]

As XML requires documents to conform to a strict structural composition, in particular in regard to how XML elements nest within other elements (Fig. 1, right hand side), this cumulative storing of annotation may potentially be difficult to achieve. The linearity of the written report may enforce an overlapping markup of the text, which is problematic in XML. To give an example, consider marking up the sentence
the auscultation revealed no murmur or gallop.
Semantically, we could mark up the three main concepts featured in the sentence as follows:
the <pr>auscultation</pr> revealed no <f>murmur</f> or <f>gallop</f>
Here, the elements pr and f stand for procedure and finding, respectively. Expanding the semantic markup to include negation, the markup is not problematic if we consider the negation for the concept murmur only:
the <pr>auscultation</pr> revealed <f><n>no</n> murmur</f> or <f>gallop</f>.
Here, the element n stands for negation. Given the meaning of the sentence, the concept gallop should also be negated, which results in the following
[…]<f><f><n>no</n> murmur</f> or gallop</f>.
In this markup, the two f elements are overlapping, yielding an incorrect result where the finding no murmur is nested in the finding no gallop (see Fig. 2a). There are two problems with this representation: First, the nesting implies an invalid part-of relation between the two findings, and second, querying the XML tree becomes more difficult, as not all findings are situated on the same tree level.

Fig. 2. XML tree structures. [Figure: (a) nested markup, with the finding no murmur nested inside the finding no gallop; (b) flat milestone markup, with fs and fe elements marking the start and end of no murmur and no gallop; (c) linearized markup, with the negation n (id=4) of no murmur referenced by an empty n (idref=4) in no gallop; (d) linearized markup of the findings erosion antrum and bleeding fundus, with body location elements bl (antrum id=6, fundus id=7) referenced by idref.]

Table 1. TEI strategies for overlapping markup
1. Boundary marking with milestone elements
2. Reconstitution of virtual elements
3. Fragmentation of elements
4. Multiple encoding of the same elements

The SGML community has been recognizing this issue for some time and has been proposing various solutions that deal with overlapping text markup. The Text Encoding Initiative (TEI) discusses different markup possibilities without proposing a single best strategy. These options are summarized in Table 1.

Milestone elements are empty elements, which signify the start and end position of a markup. Because they are empty (i.e. have no content), they do not cause nesting problems in case of overlapping markup. In case of the example above
[…]<fs/><fs/><n>no</n> murmur <fe/> or gallop<fe/>.
the elements fs and fe signify the start and end position of the annotation, resulting in a flat markup (see Fig. 2b). While an XML parser would be able to parse the above markup, the actual annotation would not be recognized as a nested tree. A postprocessing step would be necessary to convert the milestone representation into a regular tree structure.

Reconstitution of virtual elements is a commonly used strategy to deal with overlapping markup. At our institution, we use a natural language processor called MedLEE [8-10], which transforms different kinds of medical text reports (x-ray reports, discharge summaries) into structured XML. The representation creates numbered text elements, which are linked to semantic elements (virtual elements) in a separate part of the document. This is shown schematically in Fig. 3. The semantic information constitutes the procedure auscultation, and the findings no murmur and no gallop, which are linked to numbered words in the text. This linking can be represented in XML using identity attributes.
In case of MedLEE, the text elements are called phr (for phrase) and are located in the text section of the document (Fig. 4). The phr elements are linked via the id/idref attribute (usually unique identifiers rather than numbers as shown here for illustration purposes) to the semantic findings in the structured section of the document.
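The id/idref linking described above can be illustrated with a small sketch. It is not MedLEE code: the sample document is only modelled on the example sentence of this paper, the enclosing report element is an assumption added so that the fragment has a single root, and resolve_links is a hypothetical helper name. The sketch reads the phr elements into a lookup table and lists, for each semantic element, the text elements it points to.

```python
import xml.etree.ElementTree as ET

# Hypothetical split document modelled on the example sentence.
# The <report> wrapper is an assumption; it only provides a single root.
SPLIT_DOC = """<report>
<structured>
<procedure v="auscultation" idref="2"/>
<finding v="murmur" idref="5"><certainty v="no" idref="4"/></finding>
<finding v="gallop" idref="7"><certainty v="no" idref="4"/></finding>
</structured>
<text>the <phr id="2">auscultation</phr> revealed <phr id="4">no</phr> <phr id="5">murmur</phr> or <phr id="7">gallop</phr>.</text>
</report>"""


def resolve_links(xml_source):
    """Return (phrases, concepts) for a split document.

    phrases maps each phr id to its text; concepts lists, per semantic
    element, its tag, its v attribute and the sorted ids it points to.
    """
    root = ET.fromstring(xml_source)
    phrases = {phr.get("id"): phr.text for phr in root.find("text").iter("phr")}
    concepts = []
    for element in root.find("structured"):
        # iter() visits the element itself and then its descendants,
        # so modifiers such as certainty contribute their idref too.
        refs = [node.get("idref") for node in element.iter() if node.get("idref")]
        concepts.append((element.tag, element.get("v"), sorted(refs, key=int)))
    return phrases, concepts


phrases, concepts = resolve_links(SPLIT_DOC)
for tag, value, refs in concepts:
    print(tag, value, [phrases[ref] for ref in refs])
```

Both findings resolve to text element 4 (the word no), which is exactly the sharing that cannot be expressed by simply nesting the findings around the text.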
Fig. 3. Semantic information linked to words in the text. [Figure: the concepts auscultation, no murmur and no gallop linked to the numbered words of the sentence: the (1) auscultation (2) revealed (3) no (4) murmur (5) or (6) gallop (7).]

Fig. 4. MedLEE format
<structured>
<procedure v="auscultation" idref="2" />
<finding v="murmur" idref="5"><certainty v="no" idref="4"/></finding>
<finding v="gallop" idref="7"><certainty v="no" idref="4"/></finding>
</structured>
<text>the <phr id="2">auscultation</phr> revealed <phr id="4">no</phr> <phr id="5">murmur</phr> or <phr id="7">gallop</phr>.
</text>

As can be easily seen, this representation correctly handles the negation of the elements murmur and gallop. As with the milestone elements, a postprocessor is needed to transform this split representation into a nested tree, which includes semantic and textual information together.

Two further TEI strategies, fragmentation of existing elements as well as multiple encoding of the same elements, are not really suited for our problem because they either do not code all necessary elements or increase the complexity of the markup. Beyond TEI, there are other solutions dealing with concurrent markup. For example, "standoff annotation" introduces two separate documents for the markup and the actual text corpus. Other solutions use XPath expressions to encode overlapping hierarchies, or use advanced markup grammars supporting concurrent markup.

We propose an alternative representation, which combines ideas from both the milestone and virtual elements strategy. Our main goal is to keep the text and annotation together while minimizing the need for empty elements and link attributes. In this representation, empty elements are not used as starting and end points of a markup, but rather as semantic elements, which link to remote text elements. The resulting representation should be linear and maintain the word order of the original text, and thus be suitable for inclusion into a semi-structured database. The linear structure guarantees the ability to add new annotation where necessary as well as fast query performance through maintaining a single main XML tree.

METHODS AND RESULTS
The problem can be generalized as follows: Given a string of words and a set of annotations which refer to any combination of those words in the string, assign annotations with no overlap and with minimal empty elements and link attributes. Given our example sentence above, a possible markup which satisfies these requirements would look as follows:
the <pr>auscultation</pr> revealed <f><n id="4">no</n> murmur</f> or <f>gallop<n idref="4"/></f>.
This representation (which we will call linearized during the remainder of this paper) includes a single link (the negation of the finding gallop is linked to the actual word no in the text) as well as a single empty element (the negation of the finding gallop). This markup produces a properly nested structure as can be seen in Fig. 2c.

Given the current MedLEE XML markup format discussed above, which uses virtual elements linked to numbered text elements, we wanted to explore whether we could generate the linearized representation automatically. We developed an algorithm suited for this task. The main idea is to represent each semantic concept (finding, procedure etc.) as an ordered set consisting of the element numbers of the text. The algorithm decides which text elements will be included in the final markup of each concept. By marking up one semantic concept after another, the sets dynamically change their content. Given the above example sentence, the sets would initially look as follows (see also Fig. 3):

Semantic concept   Sets
Auscultation       (2)
No murmur          (4,5)
No gallop          (4,7)

The sets correspond to the numbers of the text elements in the text:
the <2>auscultation</2> revealed <4>no</4> <5>murmur</5> or <7>gallop</7>.
The algorithm sorts the sets by listing the set containing the lowest element numbers on top. During the first iteration, the algorithm would then mark up the semantic concept corresponding to the set on top, in this case auscultation. The algorithm decides which text elements in the set are included in the markup. The rule is as follows: Include every element whose number is equal to or smaller than the element numbers of the next concept, in this case no murmur. As 2 is smaller than any number in the next set, the text element number 2 is included in the markup of auscultation.
the <pr>auscultation</pr> revealed <4>no</4> <5>murmur</5> or <7>gallop</7>.
The number 2 is removed from the sets, and the remaining sets are again sorted. Concept no murmur
is now on top, corresponding to a set with text element numbers 4 and 5.

Semantic concept   Sets
No murmur          (4,5)
No gallop          (4,7)

The first number, 4, is equal to the first number of the set corresponding to no gallop and is included in the markup.

Semantic concept   Sets   Included   Not included
No murmur          (5)    4
No gallop          (7)               4

At this point, the table stores the number 4 as being included (no murmur) as well as excluded (no gallop) in the text markup. The two remaining sets contain a single element number, and the algorithm assigns number 5 to the concept on top, no murmur. Marking up the latter concept demands a link attribute at position 4, which corresponds to the word no, and serves as a reference for concept no gallop.
the <pr>auscultation</pr> revealed <f><n id="4">no</n> murmur</f> or <7>gallop</7>.
The markup of the last concept, no gallop, is a straightforward process.

Semantic concept   Sets   Included   Not included
No gallop          (7)               4

The fact that text element 4 was excluded from the markup necessitates an empty 'negation' element with a link attribute. The resulting markup looks as follows (Fig. 2c):
the <pr>auscultation</pr> revealed <f><n id="4">no</n> murmur</f> or <f>gallop<n idref="4"/></f>.

Looking at a different example, it turns out that seemingly short sentences may contain surprisingly many semantic concepts. Consider a sentence
erosions and bleeding in the antrum and fundus.
MedLEE recognizes 4 different medical concepts in this sentence: erosion antrum, erosion fundus, bleeding antrum and bleeding fundus. Marking up such a sentence is problematic. For example, the finding bleeding separates the finding erosion and the body location antrum, causing potential nesting problems.
The numbered sentence looks as follows:
<1>erosions</1> and <3>bleeding</3> in the <6>antrum</6> and <7>fundus</7>.
The table of sets reveals a more complex situation than above.

Semantic concept   Sets
Erosion antrum     (1,6)
Erosion fundus     (1,7)
Bleeding antrum    (3,6)
Bleeding fundus    (3,7)

In this case, the algorithm would perform the same iterations as before, backtracking after step 2 to resolve a situation where two concepts correspond to the same set of text elements. The final markup would look as follows (ids/idrefs not shown, see also Fig. 2d):
<f>erosions<bl/></f> and <f>bleeding<bl/></f> in the <f><bl>antrum</bl></f> and <f><bl>fundus</bl></f>.
The element bl stands for body location, such as antrum or fundus. As can be seen, although there was considerable overlap in the original semantic markup, obtaining a linearized representation is feasible.
This algorithm has been run on more complex sentences, as well as on complete discharge summaries parsed by MedLEE. We are planning an automatic validation of the algorithm by first automatically linearizing the original split MedLEE format (see above), and subsequently transforming the linearized representation back to the split format. If we can reconstruct the original format, we could demonstrate that the algorithm is working and conserving all data in a linearized fashion.

DISCUSSION AND CONCLUSION
There is growing interest in so-called semi-structured XML databases [4, 5], which are very flexible in storing incomplete and changing types of data content. We are encountering such types of data in our data mining activities on medical free text reports. Different projects generate various kinds of text annotations, which we would like to store conveniently in a single XML file. Unavoidably, some of these annotations will be overlapping, generating an invalid XML structure. The significance of this paper lies in the introduction of an alternative approach to resolving overlapping text markup that preserves the linearity of the original text.
Marking up medical texts with semantic information by Natural Language Processing (NLP) inherently produces overlapping annotations. We observed this problem especially with conjunctions (and) or disjunctions (or) in the text. Consider a sentence
No bleeding in the antrum and fundus.
We use conjunction as above in our daily language to conveniently communicate several facts in the shortest form possible. Expanding the above sentence reveals two different facts
No bleeding in the antrum.
No bleeding in the fundus.
Marking up the main constituents of each fact,
[no bleeding antrum] and [no bleeding fundus], generates an overlapping annotation. Annotating such overlapping information with XML is difficult without generating an invalid nesting of elements. Traditional approaches to this problem either separate the annotation from the text by creating so-called virtual elements or use empty elements to indicate the start and end positions of a specific annotation. Both approaches have their disadvantages: virtual elements separate the document into two parts (annotation and text), and empty elements must be converted to regular elements in order to represent a tree structure. The linearized representation of XML presented in the paper keeps the XML tags and the text together while minimally relying on link attributes as well as empty elements. We think this representation is a valid alternative to other markup solutions.

This work was supported by National Library of Medicine grants R01-LM06910 and R01-LM06274

1. W3C, Extensible Markup Language (XML).
2. W3C, Standard Generalized Markup Language (SGML). 1999.
3. HL7, Reference Information Model. 2002.
4. Deutsch, A., et al., Querying XML Data. IEEE Data Engineering Bulletin, 1999.
5. Buneman, P. Semistructured data. in PODS-97. 1997. Tucson, Arizona.
6. Johnson, S.B., et al., Using semistructured data for clinical data mining. Symp. AMIA 2002 (submitted), 2002.
7. Sperberg-McQueen, C. and L. Burnard, TEI Guidelines for Electronic Text Encoding and Interchange (P3). 1994.
8. Friedman, C., et al., A general natural-language text processor for clinical radiology. J Am Med Inform Assoc, 1994. 1(2): p. 161-74.
9. Friedman, C., Towards a comprehensive medical language processing system: methods and issues. Proc AMIA Annu Fall Symp, 1997: p. 595-9.
10. Friedman, C., A broad-coverage natural language processing system. Proc AMIA Symp, 2000: p. 270-4.
11. Friedman, C., et al., Representing information in patient reports using natural language processing and the extensible markup language. J Am Med Inform Assoc, 1999. 6(1): p. 76-87.
12. Thompson, H. and M. D. Hyperlink semantics for standoff markup of read-only documents. in SGML Europe '97. 1997. Barcelona.
13. W3C, XML Path Language (XPath) Version 1.0. 1999.
14. Durusau, P. and M. O'Donnell. Implementing Concurrent Markup in XML. in Extreme Markup Languages 2001. 2001. Montreal.
15. Sperberg-McQueen, C. and C. Huitfeldt. GODDAG: A Data Structure for Overlapping Hierarchies. in ACH-ALLC'99. 1999. Charlottesville, Virginia.
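
As a supplementary illustration of the linearized representation discussed in METHODS AND RESULTS, the following sketch reproduces the markup of the auscultation example. It is a simplified stand-in and not the algorithm of this paper. It assumes that each word is kept by the first concept (in sorted order) that mentions it, and that later concepts refer to that word through an empty element with an idref attribute. It does not implement backtracking, so it refuses inputs such as the erosions and bleeding sentence. The names Concept and linearize, and the mapping of word numbers to inner element names, are hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Concept:
    tag: str     # element name of the concept, e.g. "f" for finding
    words: dict  # word number -> inner element name, or None for plain text


def linearize(tokens, concepts):
    """Greedy linearization sketch; tokens are numbered from 1."""
    # Sort so that the set with the lowest element numbers comes first.
    ordered = sorted(concepts, key=lambda c: sorted(c.words))
    # Assumption: a word is kept by the first concept that mentions it.
    owner = {}
    for index, concept in enumerate(ordered):
        for number in concept.words:
            owner.setdefault(number, index)
    owned = {i: sorted(n for n in c.words if owner[n] == i)
             for i, c in enumerate(ordered)}
    # Words that some other concept has to reference through idref.
    linked = {n for i, c in enumerate(ordered) for n in c.words if owner[n] != i}
    for numbers in owned.values():
        # Simplification: no backtracking and no gaps inside an element.
        if not numbers or numbers != list(range(numbers[0], numbers[-1] + 1)):
            raise ValueError("concept needs backtracking, not handled here")

    pieces = []
    for number, token in enumerate(tokens, start=1):
        piece = token
        if number in owner:
            index = owner[number]
            concept = ordered[index]
            inner = concept.words[number]
            if inner:
                attr = f' id="{number}"' if number in linked else ""
                piece = f"<{inner}{attr}>{token}</{inner}>"
            if number == owned[index][0]:
                piece = f"<{concept.tag}>" + piece
            if number == owned[index][-1]:
                # Words kept elsewhere become empty elements with idref.
                for other in sorted(concept.words):
                    if owner[other] != index:
                        name = concept.words[other]
                        if name is None:
                            raise ValueError("a linked word needs an element name")
                        piece += f'<{name} idref="{other}"/>'
                piece += f"</{concept.tag}>"
        pieces.append(piece)
    return " ".join(pieces)


tokens = "the auscultation revealed no murmur or gallop".split()
concepts = [
    Concept("pr", {2: None}),
    Concept("f", {4: "n", 5: None}),
    Concept("f", {4: "n", 7: None}),
]
markup = linearize(tokens, concepts)
print(markup)
ET.fromstring(f"<S>{markup}</S>")  # raises ParseError if not well formed
```

For this input the sketch yields one pr element and two sibling f elements under the sentence element, with the word no marked up once and referenced once. Handling sentences in which a concept keeps none of its words, or keeps words that are not adjacent, would require the backtracking step mentioned in the text.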