A special case for XML markup Representing complex semantic

Representing nested semantic information in a linear string of text using XML

Michael Krauthammer, MD (1), Stephen B. Johnson, PhD, George Hripcsak, MD MS (1), David A. Campbell (1) and Carol Friedman (1,2)

(1) Department of Medical Informatics, Columbia University, New York
(2) Queens College CUNY, New York

XML has been widely adopted as an important data interchange language. The structure of XML enables sharing of data elements with variable degrees of nesting as long as the elements are grouped in a strict tree-like fashion. This requirement potentially restricts the usefulness of XML for marking up written text, which often includes features that do not properly nest within other features. We encountered this problem while marking up medical text with structured semantic information from a Natural Language Processor. Traditional approaches to this problem separate the structured information from the actual text markup. This paper introduces an alternative solution, which tightly integrates the semantic structure with the text. The resulting XML markup preserves the linearity of the medical texts and can therefore be easily expanded with additional types of information.

INTRODUCTION

Extensible Markup Language (XML)[1] is a subset of Standard Generalized Markup Language (SGML)[2] geared for data exchange and processing over the Web. XML is strongly supported by the industry and data standards such as HL7, which has adopted XML for its upcoming RIM Version 3[3]. Recently there has been increasing interest in XML databases and the ability of XML to store semi-structured data[4, 5]. Semi-structured data is data that is self-describing and can be processed and stored without an explicit data schema. We are collecting such kinds of data for our data mining activities while parsing free medical text reports with various types of parsers and taggers.

The parsing results in different kinds of text annotations: semantic parsing yields structured semantic information on diseases and their body locations, while tagging may identify syntactic part-of-speech information. As each data mining project "produces" its own kind of text annotations, we would prefer to store the cumulative annotations of a medical report in a single XML file. Schematically, we would like to be able to mark up a sample sentence such as

    the heart is enlarged.

with different kinds of annotation tags as seen in Fig. 1. Here, the p and d elements stand for part-of-speech and disease, representing an example of a successive syntactic and semantic markup, respectively. Attributes of each element record either part-of-speech information, such as NN (short for noun), or the type of disease, such as cardiomegaly. Ideally, we would envision a growing number of tags added to the existing markup as new research projects produce different kinds of annotations. For example, a future project may be concerned with marking up sensitive data (Fig. 1). The growing string of XML tags would therefore be a valuable resource for researchers who want to take advantage of previous analysis of the medical reports. In a related paper, we discuss in detail how such complex XML annotations can be stored and queried for a particular set of tags or views [6].

Fig. 1. Successive markup of medical text reports (XML markup on the left, corresponding tree structures on the right; tree diagrams not reproduced)

    sentence:
        <S>the heart is enlarged</S>

    part of speech tagging:
        <p v="DE">the</p>
        <p v="NN">heart</p>
        <p v="V">is</p>
        <p v="V">enlarged</p>

    semantic parsing:
        <S>
            <d v="cardiomegaly">
                <p>the</p>
                <p>heart</p>
                <p>enlarged</p>
            </d>

    markup of sensitive data:
        (not shown)

As XML requires documents to conform to a strict structural composition, in particular in regard to how XML elements nest within other elements (Fig. 1, right-hand side), this cumulative storing of annotations may be difficult to achieve. The linearity of the written report may enforce an overlapping markup of the text, which is problematic in XML. To give an example, consider marking up the sentence

    the auscultation revealed no murmur or gallop.

Semantically, we could mark up the three main concepts featured in the sentence as follows:
    the <pr>auscultation</pr> revealed no <f>murmur</f> or <f>gallop</f>

Here, the elements pr and f stand for procedure and finding, respectively. Expanding the semantic markup to include negation, the markup is not problematic if we consider the negation for the concept murmur only:

    the <pr>auscultation</pr> revealed <f><n>no</n> murmur</f> or <f>gallop</f>.

Here, the element n stands for negation. Given the meaning of the sentence, the concept gallop should also be negated, which results in the following

    […]<f><f><n>no</n> murmur</f> or gallop</f>.

In this markup, the two f elements are overlapping, yielding an incorrect result where the finding no murmur is nested in the finding no gallop (see Fig. 2a). There are two problems with this representation: First, the nesting implies an invalid part-of relation between the two findings, and second, querying the XML tree becomes more difficult, as not all findings are situated on the same tree level.

Fig. 2. XML tree structures (tree diagrams not reproduced): (a) nested markup, with f (no murmur) nested inside f (no gallop); (b) milestone markup, a flat sequence with fs/fe start and end markers; (c) linearized markup, with n (negation id=4) under f (no murmur) and an empty n (negation idref=4) under f (no gallop); (d) linearized markup of the erosions and bleeding sentence, with empty bl elements (idref=6, idref=7) linking to bl (antrum id=6) and bl (fundus id=7).

Table 1. TEI strategies for overlapping markup
1. Boundary marking with milestone elements
2. Reconstitution of virtual elements
3. Fragmentation of elements
4. Multiple encoding of the same elements

The SGML community has recognized this issue for some time and has proposed various solutions that deal with overlapping text markup. The Text Encoding Initiative (TEI)[7] discusses different markup possibilities without proposing a single best strategy. These options are summarized in Table 1.

Milestone elements are empty elements, which signify the start and end position of a markup. Because they are empty (i.e. have no content), they do not cause nesting problems in case of overlapping markup. For the example above,

    […]<fs/><fs/><n>no</n> murmur <fe/> or gallop<fe/>.

the elements fs and fe signify the start and end position of the annotation, resulting in a flat markup (see Fig. 2b). While an XML parser would be able to parse the above markup, the actual annotation would not be recognized as a nested tree. A postprocessing step would be necessary to convert the milestone representation into a regular tree structure.

Reconstitution of virtual elements is a commonly used strategy to deal with overlapping markup. At our institution, we use a natural language processor called MedLEE[8-10], which transforms different kinds of medical text reports (x-ray reports, discharge summaries) into structured XML [11]. The representation creates numbered text elements, which are linked to semantic elements (virtual elements) in a separate part of the document. This is shown schematically in Fig. 3. The semantic information constitutes the procedure auscultation, and the findings no murmur and no gallop, which are linked to numbered words in the text. This linking can be represented in XML using identity attributes. In case of MedLEE, the text elements are called phr (for phrase) and are located in the text section of the document (Fig. 4). The phr elements are linked via the id/idref attribute (usually unique identifiers rather than numbers as shown here for illustration purposes) to the semantic findings in the structured section of the document.
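
The id/idref linking described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the sample document is a well-formed variant of the split format with element names taken from Fig. 4, while the enclosing report element and the function name resolve_links are hypothetical and not part of MedLEE.

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the split format; the <report> root is an assumption
# added only so that the sample parses as a single XML document.
SAMPLE = """<report>
<structured>
<procedure v="auscultation" idref="2"/>
<finding v="murmur" idref="5"><certainty v="no" idref="4"/></finding>
<finding v="gallop" idref="7"><certainty v="no" idref="4"/></finding>
</structured>
<text>the <phr id="2">auscultation</phr> revealed <phr id="4">no</phr> <phr id="5">murmur</phr> or <phr id="7">gallop</phr>.</text>
</report>"""


def resolve_links(xml_string):
    """Return (element name, value, linked words in text order) per concept."""
    root = ET.fromstring(xml_string)
    # Map each numbered text element to its word.
    phrases = {p.get("id"): p.text for p in root.iter("phr")}
    resolved = []
    for concept in root.find("structured"):
        # Collect the idref of the concept itself and of its modifiers.
        ids = [concept.get("idref")] + [m.get("idref") for m in concept]
        words = [phrases[i] for i in sorted(ids, key=int)]
        resolved.append((concept.tag, concept.get("v"), words))
    return resolved


for tag, value, words in resolve_links(SAMPLE):
    print(tag, value, words)
```

Both findings resolve to the same word no (text element 4), which is exactly the sharing that makes a single nested tree difficult to build.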
Fig. 3. Semantic information linked to words in the text

    Concepts:   auscultation    no murmur    no gallop

    Words:      the   auscultation   revealed   no   murmur   or   gallop
    Numbers:    1     2              3          4    5        6    7

Fig. 4. MedLEE format

    <structured>
    <procedure v="auscultation" idref="2"/>
    <finding v="murmur" idref="5"><certainty v="no" idref="4"/></finding>
    <finding v="gallop" idref="7"><certainty v="no" idref="4"/></finding>
    </structured>
    <text>the <phr id="2">auscultation</phr> revealed <phr id="4">no</phr> <phr id="5">murmur</phr> or <phr id="7">gallop</phr>.
    </text>

As can be easily seen, this representation correctly handles the negation of the elements murmur and gallop. As with the milestone elements, a postprocessor is needed to transform this split representation into a nested tree, which includes semantic and textual information together.

Two further TEI strategies, fragmentation of existing elements as well as multiple encoding of the same elements, are not really suited for our problem because they either do not code all necessary elements or increase the complexity of the markup. Beyond TEI, there are other solutions dealing with concurrent markup. For example, "standoff annotation" introduces two separate documents for the markup and the actual text corpus [12]. Other solutions use XPath expressions [13] to encode overlapping hierarchies [14], or use advanced markup grammars supporting concurrent markup [15].

We propose an alternative representation, which combines ideas from both the milestone and virtual elements strategy. Our main goal is to keep the text and annotation together while minimizing the need for empty elements and link attributes. In this representation, empty elements are not used as starting and end points of a markup, but rather as semantic elements, which link to remote text elements. The resulting representation should be linear and maintain the word order of the original text, and thus be suitable for inclusion into a semi-structured database. The linear structure guarantees the ability to add new annotation where necessary as well as fast query performance through maintaining a single main XML tree.

METHODS AND RESULTS

The problem can be generalized as follows: Given a string of words and a set of annotations which refer to any combination of those words in the string, assign annotations with no overlap and with minimal empty elements and link attributes. Given our example sentence above, a possible markup which satisfies these requirements would look as follows:

    the <pr>auscultation</pr> revealed <f><n id="4">no</n> murmur</f> or <f>gallop<n idref="4"/></f>.

This representation (which we will call linearized during the remainder of this paper) includes a single link (the negation of the finding gallop is linked to the actual word no in the text) as well as a single empty element (the negation of the finding gallop). This markup produces a properly nested structure as can be seen in Fig. 2c.

Given the current MedLEE XML markup format discussed above, which uses virtual elements linked to numbered text elements, we wanted to explore whether we could generate the linearized representation automatically. We developed an algorithm suited for this task. The main idea is to represent each semantic concept (finding, procedure etc.) as an ordered set consisting of the element numbers of the text. The algorithm decides which text elements will be included in the final markup of each concept. By marking up one semantic concept after another, the sets dynamically change their content. Given the above example sentence, the sets would initially look as follows (see also Fig. 3):

    Semantic concept    Sets
    Auscultation        (2)
    No murmur           (4,5)
    No gallop           (4,7)

The sets correspond to the numbers of the text elements in the text:

    the <2>auscultation</2> revealed <4>no</4> <5>murmur</5> or <7>gallop</7>.

The algorithm sorts the sets by listing the set containing the lowest element numbers on top. During the first iteration, the algorithm would then mark up the semantic concept corresponding to the set on top, in this case auscultation. The algorithm decides which text elements in the set are included in the markup. The rule is as follows: Include every element whose number is equal to or smaller than the element numbers of the next concept, in this case no murmur. As 2 is smaller than any number in the next set, the text element number 2 is included in the markup of auscultation.

    the <pr>auscultation</pr> revealed <4>no</4> <5>murmur</5> or <7>gallop</7>.

The number 2 is removed from the sets, and the remaining sets are again sorted. Concept no murmur
is now on top, corresponding to a set with text element numbers 4 and 5.

    Semantic concept    Sets
    No murmur           (4,5)
    No gallop           (4,7)

The first number, 4, is equal to the first number of the set corresponding to no gallop and is included in the markup.

    Semantic concept    Sets    Included    Not included
    No murmur           (5)     4
    No gallop           (7)                 4

At this point, the table stores the number 4 as being included (no murmur) as well as excluded (no gallop) in the text markup. The two remaining sets contain a single element number, and the algorithm assigns number 5 to the concept on top, no murmur. Marking up the latter concept demands a link attribute at position 4, which corresponds to the word no, and serves as a reference for concept no gallop.

    the <pr>auscultation</pr> revealed <f><n id="4">no</n> murmur</f> or <7>gallop</7>.

The markup of the last concept, no gallop, is a straightforward process.

    Semantic concept    Sets    Included    Not included
    No gallop           (7)                 4

The fact that text element 4 was excluded from the markup necessitates an empty "negation" element with a link attribute. The resulting markup looks as follows (Fig. 2c):

    the <pr>auscultation</pr> revealed <f><n id="4">no</n> murmur</f> or <f>gallop<n idref="4"/></f>.

Looking at a different example, it turns out that seemingly short sentences may contain surprisingly many semantic concepts. Consider the sentence

    erosions and bleeding in the antrum and fundus.

MedLEE recognizes 4 different medical concepts in this sentence: erosion antrum, erosion fundus, bleeding antrum and bleeding fundus. Marking up such a sentence is problematic. For example, the finding bleeding separates the finding erosion and the body location antrum, causing potential nesting problems. The numbered sentence looks as follows:

    <1>erosions</1> and <3>bleeding</3> in the <6>antrum</6> and <7>fundus</7>.

The table of sets reveals a more complex situation than above.

    Semantic concept    Sets
    Erosion antrum      (1,6)
    Erosion fundus      (1,7)
    Bleeding antrum     (3,6)
    Bleeding fundus     (3,7)

In this case, the algorithm would perform the same iterations as before, backtracking after step 2 to resolve a situation where two concepts correspond to the same set of text elements. The final markup would look as follows (ids/idrefs not shown, see also Fig. 2d):

    <f>erosions<bl/></f> and <f>bleeding<bl/></f> in the <f><bl>antrum</bl></f> and <f><bl>fundus</bl></f>.

The element bl stands for body location, such as antrum or fundus. As can be seen, although there was considerable overlap in the original semantic markup, obtaining a linearized representation is feasible.

This algorithm has been run on more complex sentences, as well as on complete discharge summaries parsed by MedLEE. We are planning an automatic validation of the algorithm by first automatically linearizing the original split MedLEE format (see above), and subsequently transforming the linearized representation back to the split format. If we can reconstruct the original format, we could demonstrate that the algorithm is working and conserving all data in a linearized fashion.

DISCUSSION AND CONCLUSION

There is growing interest in so-called semi-structured XML databases [4, 5], which are very flexible in storing incomplete and changing types of data content. We are encountering such types of data in our data mining activities on medical free text reports. Different projects generate various kinds of text annotations, which we would like to store conveniently in a single XML file. Unavoidably, some of these annotations will be overlapping, generating an invalid XML structure. The significance of this paper lies in the introduction of an alternative approach to resolving overlapping text markup that preserves the linearity of the original text.

Marking up medical texts with semantic information by Natural Language Processing (NLP) inherently produces overlapping annotations. We observed this problem especially with conjunctions (and) or disjunctions (or) in the text. Consider the sentence

    No bleeding in the antrum and fundus.

We use conjunctions as above in our daily language to conveniently communicate several facts in the shortest form possible. Expanding the above sentence reveals two different facts:

    No bleeding in the antrum.
    No bleeding in the fundus.

Marking up the main constituents of each fact,
[no bleeding antrum] and [no bleeding fundus], generates an overlapping annotation. Annotating such overlapping information with XML is difficult without generating an invalid nesting of elements. Traditional approaches to this problem either separate the annotation from the text by creating so-called virtual elements or use empty elements to indicate the start and end positions of a specific annotation. Both approaches have their disadvantages: virtual elements separate the document into two parts (annotation and text), and empty elements must be converted to regular elements in order to represent a tree structure. The linearized representation of XML presented in this paper keeps the XML tags and the text together while minimally relying on link attributes as well as empty elements. We think this representation is a valid alternative to other markup solutions.

This work was supported by National Library of Medicine grants R01-LM06910 and R01-LM06274.

1.  W3C, Extensible Markup Language (XML).
2.  W3C, Standard Generalized Markup Language (SGML). 1999.
3.  HL7, Reference Information Model. 2002.
4.  Deutsch, A., et al., Querying XML Data. IEEE Data Engineering Bulletin, 1999.
5.  Buneman, P. Semistructured data. in PODS-97. 1997. Tucson, Arizona.
6.  Johnson, S.B., et al., Using semistructured data for clinical data mining. Symp. AMIA 2002 (submitted), 2002.
7.  Sperberg-McQueen, C. and L. Burnard, TEI Guidelines for Electronic Text Encoding and Interchange (P3). 1994.
8.  Friedman, C., et al., A general natural-language text processor for clinical radiology. J Am Med Inform Assoc, 1994. 1(2): p. 161-74.
9.  Friedman, C., Towards a comprehensive medical language processing system: methods and issues. Proc AMIA Annu Fall Symp, 1997: p. 595-9.
10. Friedman, C., A broad-coverage natural language processing system. Proc AMIA Symp, 2000: p. 270-4.
11. Friedman, C., et al., Representing information in patient reports using natural language processing and the extensible markup language. J Am Med Inform Assoc, 1999. 6(1): p. 76-87.
12. Thompson, H. and M. D. Hyperlink semantics for standoff markup of read-only documents. in SGML Europe '97. 1997. Barcelona.
13. W3C, XML Path Language (XPath) Version 1.0. 1999.
14. Durusau, P. and M. O'Donnell. Implementing Concurrent Markup in XML. in Extreme Markup Languages 2001. 2001. Montreal.
15. Sperberg-McQueen, C. and C. Huitfeldt. GODDAG: A Data Structure for Overlapping Hierarchies. in ACH-ALLC'99. 1999. Charlottesville, Virginia.
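
As a supplementary illustration of the set-based procedure described under METHODS AND RESULTS, the following Python sketch linearizes the example sentence the auscultation revealed no murmur or gallop. It is a simplified reading of that procedure and not the authors' implementation: a shared word is kept by the first concept in sorted order, every later concept receives an empty element with an idref, and the backtracking step mentioned in the text is omitted. The names Concept and linearize, and the representation of a concept as a mapping from word numbers to inner element names, are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Concept:
    tag: str       # element name of the concept, e.g. "pr" or "f"
    members: dict  # word number -> inner element name, or None for plain text


def linearize(words, concepts, end="."):
    """Simplified linearization: no backtracking, concept spans must not overlap."""
    # Sort concepts so that the set with the lowest word numbers comes first.
    ordered = sorted(concepts, key=lambda c: sorted(c.members))
    # The first concept (in sorted order) containing a word keeps it in its markup.
    owner = {}
    for c in ordered:
        for n in c.members:
            owner.setdefault(n, c)
    # Words used by more than one concept need an id so that others can link to them.
    shared = {n for n in owner if sum(n in c.members for c in concepts) > 1}

    opening, closing, last_end = {}, {}, 0
    for c in ordered:
        own = sorted(n for n in c.members if owner[n] is c)
        linked = [n for n in sorted(c.members) if owner[n] is not c]
        if not own or own[0] <= last_end:
            raise ValueError("needs backtracking, which this sketch omits")
        if any(c.members[n] is None for n in linked):
            raise ValueError("a linked word needs an inner element name")
        last_end = own[-1]
        # Words kept by an earlier concept become empty elements with a link.
        links = "".join(f'<{c.members[n]} idref="{n}"/>' for n in linked)
        opening[own[0]] = f"<{c.tag}>"
        closing[own[-1]] = f"{links}</{c.tag}>"

    pieces = []
    for n in sorted(words):
        piece = words[n]
        inner = owner[n].members[n] if n in owner else None
        if inner:
            attr = f' id="{n}"' if n in shared else ""
            piece = f"<{inner}{attr}>{piece}</{inner}>"
        pieces.append(opening.get(n, "") + piece + closing.get(n, ""))
    return " ".join(pieces) + end


WORDS = {1: "the", 2: "auscultation", 3: "revealed", 4: "no",
         5: "murmur", 6: "or", 7: "gallop"}
CONCEPTS = [
    Concept("pr", {2: None}),         # auscultation   (2)
    Concept("f", {4: "n", 5: None}),  # no murmur      (4,5)
    Concept("f", {4: "n", 7: None}),  # no gallop      (4,7)
]
result = linearize(WORDS, CONCEPTS)
print(result)
# the <pr>auscultation</pr> revealed <f><n id="4">no</n> murmur</f> or <f>gallop<n idref="4"/></f>.
```

For the sentence erosions and bleeding in the antrum and fundus, this sketch raises ValueError instead of producing markup, because that case depends on the backtracking step, which the text does not specify in enough detail to reproduce here.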