Representing nested semantic information in a linear string of text using XML

Michael Krauthammer, MD1, Stephen B. Johnson, PhD, George Hripcsak, MD MS1, David A. Campbell1 and Carol Friedman1,2
1 Department of Medical Informatics, Columbia University, New York
2 Queens College CUNY, New York

XML has been widely adopted as an important data interchange language. The structure of XML enables sharing of data elements with variable degrees of nesting as long as the elements are grouped in a strict tree-like fashion. This requirement potentially restricts the usefulness of XML for marking up written text, which often includes features that do not properly nest within other features. We encountered this problem while marking up medical text with structured semantic information from a Natural Language Processor. Traditional approaches to this problem separate the structured information from the actual text markup. This paper introduces an alternative solution, which tightly integrates the semantic structure with the text. The resulting XML markup preserves the linearity of the medical texts and can therefore be easily expanded with additional types of information.

INTRODUCTION

Extensible Markup Language (XML) is a subset of Standard Generalized Markup Language (SGML) geared for data exchange and processing over the Web. XML is strongly supported by the industry and by data standards such as HL7, which has adopted XML for its upcoming RIM Version 3. Recently there has been increasing interest in XML databases and the ability of XML to store semi-structured data [4, 5]. Semi-structured data is data that is self-describing and can be processed and stored without an explicit data schema. We are collecting such kinds of data for our data mining activities while parsing free medical text reports with various types of parsers and taggers. The parsing results in different kinds of text annotations, such as structured semantic information for diseases and their body location, and tagging may identify syntactic part-of-speech information. As each data mining project 'produces' its own kind of text annotations, we would prefer to store the cumulative annotations of a medical report in a single XML file. Schematically, we would like to be able to mark up a sample sentence such as

    the heart is enlarged.

with different kinds of annotation tags as seen in Fig. 1. Here, the p and d elements stand for part-of-speech and disease, representing an example of a successive syntactic and semantic markup, respectively. Attributes of each element record either part-of-speech information, such as NN (short for noun), or the type of disease, such as cardiomegaly. Ideally, we would envision a growing number of tags added to the existing markup as new research projects produce different kinds of annotations. For example, a future project may be concerned with marking up sensitive data (Fig. 1). The growing string of XML tags would therefore be a valuable resource for researchers who want to take advantage of previous analysis of the medical reports. In a related paper, we discuss in detail how such complex XML annotations can be stored and queried for a particular set of tags or views.

Fig. 1. Successive markup of medical text reports
[Figure: each stage shows the XML tree (left) and the corresponding XML markup (right).]

  Initial sentence:
    <S>the heart is enlarged</S>

  Part of speech tagging (tree labels: p (DE), p (NN), p (VBZ), p (VBZ)):
    <S>
      <p v="DE">the</p>
      <p v="NN">heart</p>
      <p v="V">is</p>
      <p v="V">enlarged</p>
    </S>

  Semantic parsing (tree labels: d (cardiomegaly) above p (DE), p (NN), p (VBZ), p (VBZ)):
    <S>
      <d v="cardiomegaly">
        <p>the</p>
        <p>heart</p>
        <p>is</p>
        <p>enlarged</p>
      </d>
    </S>

  Markup of sensitive data

As XML requires documents to conform to a strict structural composition, in particular in regard to how XML elements nest within other elements (Fig. 1, right hand side), this cumulative storing of annotation may potentially be difficult to achieve. The linearity of the written report may enforce an overlapping markup of the text, which is problematic in XML. To give an example, consider marking up the sentence

    the auscultation revealed no murmur or gallop.

Semantically, we could mark up the three main concepts featured in the sentence as follows:

    the <pr>auscultation</pr> revealed no <f>murmur</f> or <f>gallop</f>

Here, the elements pr and f stand for procedure and finding, respectively. Expanding the semantic markup to include negation, the markup is not problematic if we consider the negation for the concept murmur only:

    the <pr>auscultation</pr> revealed <f><n>no</n> murmur</f> or <f>gallop</f>.

Here, the element n stands for negation. Given the meaning of the sentence, the concept gallop should also be negated, which results in the following markup:

    […]<f><f><n>no</n> murmur</f> or gallop</f>.

In this markup, the two f elements are overlapping, yielding an incorrect result where the finding no murmur is nested in the finding no gallop (see Fig. 2a). There are two problems with this representation: First, the nesting implies an invalid part-of relation between the two findings, and second, querying the XML tree becomes more difficult, as not all findings are situated on the same tree level.

Fig. 2. XML tree structures
[Figure: four trees rooted at S.]
  (a) Nested markup of "the auscultation revealed no murmur or gallop": pr (auscultation); f (no gallop) containing f (no murmur), which contains n (negation).
  (b) Milestone markup: pr (auscultation) and n (negation) over the flat string "the auscultation revealed fs fs no murmur fe or gallop fe", where the fs and fe elements mark start no murmur, start no gallop, end no murmur and end no gallop.
  (c) Linearized markup: pr (auscultation); f (no murmur) containing n (negation id=4); f (no gallop) containing n (negation idref=4).
  (d) Linearized markup of "erosions and bleeding in the antrum and fundus": four f elements, with bl (antrum id=6) and bl (fundus id=7) in the text and empty elements bl (antrum idref=6) and bl (fundus idref=7) attached to erosion and bleeding.

Table 1. TEI strategies for overlapping markup
  1. Boundary marking with milestone elements
  2. Reconstitution of virtual elements
  3. Fragmentation of elements
  4. Multiple encoding of the same elements

The SGML community has recognized this issue for some time and has proposed various solutions that deal with overlapping text markup. The Text Encoding Initiative (TEI) discusses different markup possibilities without proposing a single best strategy. These options are summarized in Table 1.

Milestone elements are empty elements, which signify the start and end position of a markup. Because they are empty (i.e. have no content), they do not cause nesting problems in case of overlapping markup. In case of the example above

    […]<fs/><fs/><n>no</n> murmur <fe/> or gallop<fe/>.

the elements fs and fe signify the start and end position of the annotation, resulting in a flat markup (see Fig. 2b). While an XML parser would be able to parse the above markup, the actual annotation would not be recognized as a nested tree. A postprocessing step would be necessary to convert the milestone representation into a regular tree structure.

Reconstitution of virtual elements is a commonly used strategy to deal with overlapping markup. At our institution, we use a natural language processor called MedLEE [8-10], which transforms different kinds of medical text reports (x-ray reports, discharge summaries) into structured XML. The representation creates numbered text elements, which are linked to semantic elements (virtual elements) in a separate part of the document. This is shown schematically in Fig. 3. The semantic information constitutes the procedure auscultation and the findings no murmur and no gallop, which are linked to numbered words in the text. This linking can be represented in XML using identity attributes. In case of MedLEE, the text elements are called phr (for phrase) and are located in the text section of the document (Fig. 4). The phr elements are linked via the id/idref attribute (usually unique identifiers rather than numbers as shown here for illustration purposes) to the semantic findings in the structured section of the document.

Fig. 3. Semantic information linked to words in the text
[Figure: the concepts auscultation, no murmur and no gallop are linked to the numbered words of the sentence.]

    the  auscultation  revealed  no  murmur  or  gallop
    1    2             3         4   5       6   7

Fig. 4. MedLEE format

    <structured>
      <procedure v="auscultation" idref="2" />
      <finding v="murmur" idref="5"><certainty v="no" idref="4"/></finding>
      <finding v="gallop" idref="7"><certainty v="no" idref="4"/></finding>
    </structured>
    <text>the <phr id="2">auscultation</phr> revealed <phr id="4">no</phr> <phr id="5">murmur</phr> or <phr id="7">gallop</phr>.
    </text>

As can be easily seen, this representation correctly handles the negation of the elements murmur and gallop. As with the milestone elements, a postprocessor is needed to transform this split representation into a nested tree, which includes semantic and textual information together.

Two further TEI strategies, fragmentation of existing elements as well as multiple encoding of the same elements, are not really suited for our problem because they either do not code all necessary elements or increase the complexity of the markup. Beyond TEI, there are other solutions dealing with concurrent markup. For example, "standoff annotation" introduces two separate documents for the markup and the actual text corpus. Other solutions use XPath expressions to encode overlapping hierarchies, or use advanced markup grammars supporting concurrent markup.

We propose an alternative representation, which combines ideas from both the milestone and the virtual elements strategy. Our main goal is to keep the text and annotation together while minimizing the need for empty elements and link attributes. In this representation, empty elements are not used as start and end points of a markup, but rather as semantic elements, which link to remote text elements. The resulting representation should be linear and maintain the word order of the original text, and thus be suitable for inclusion into a semi-structured database. The linear structure guarantees the ability to add new annotation where necessary as well as fast query performance through maintaining a single main XML tree.

METHODS AND RESULTS

The problem can be generalized as follows: Given a string of words and a set of annotations which refer to any combination of those words in the string, assign annotations with no overlap and with minimal empty elements and link attributes. Given our example sentence above, a possible markup which satisfies these requirements would look as follows:

    the <pr>auscultation</pr> revealed <f><n id="4">no</n> murmur</f> or <f>gallop<n idref="4"/></f>.

This representation (which we will call linearized during the remainder of this paper) includes a single link (the negation of the finding gallop is linked to the actual word no in the text) as well as a single empty element (the negation of the finding gallop). This markup produces a properly nested structure, as can be seen in Fig. 2c.

Given the current MedLEE XML markup format discussed above, which uses virtual elements linked to numbered text elements, we wanted to explore whether we could generate the linearized representation automatically. We developed an algorithm suited for this task. The main idea is to represent each semantic concept (finding, procedure etc.) as an ordered set consisting of the element numbers of the text. The algorithm decides which text elements will be included in the final markup of each concept. By marking up one semantic concept after another, the sets dynamically change their content. Given the above example sentence, the sets would initially look as follows (see also Fig. 3):

    Semantic concept    Sets
    Auscultation        (2)
    No murmur           (4,5)
    No gallop           (4,7)

The sets correspond to the numbers of the text elements in the text:

    the <2>auscultation</2> revealed <4>no</4> <5>murmur</5> or <7>gallop</7>.

The algorithm sorts the sets by listing the set containing the lowest element numbers on top. During the first iteration, the algorithm would then mark up the semantic concept corresponding to the set on top, in this case auscultation. The algorithm decides which text elements in the set are included in the markup. The rule is as follows: Include every element whose number is equal to or smaller than the element numbers of the next concept, in this case no murmur. As 2 is smaller than any number in the next set, the text element number 2 is included in the markup of auscultation.

    the <pr>auscultation</pr> revealed <4>no</4> <5>murmur</5> or <7>gallop</7>.

The number 2 is removed from the sets, and the remaining sets are again sorted. Concept no murmur is now on top, corresponding to a set with text element numbers 4 and 5.

    Semantic concept    Sets
    No murmur           (4,5)
    No gallop           (4,7)

The first number, 4, is equal to the first number of the set corresponding to no gallop and is included in the markup.

    Semantic concept    Sets    Included    Not included
    No murmur           (5)     4
    No gallop           (7)                 4

At this point, the table stores the number 4 as being included (no murmur) as well as excluded (no gallop) in the text markup. The two remaining sets contain a single element number, and the algorithm assigns number 5 to the concept on top, no murmur. Marking up the latter concept demands a link attribute at position 4, which corresponds to the word no, and serves as a reference for concept no gallop.

    the <pr>auscultation</pr> revealed <f><n id="4">no</n> murmur</f> or <7>gallop</7>.

The markup of the last concept, no gallop, is a straightforward process.

    Semantic concept    Sets    Included    Not included
    No gallop           (7)                 4

The fact that text element 4 was excluded from the markup necessitates an empty 'negation' element with a link attribute. The resulting markup looks as follows (Fig. 2c):

    the <pr>auscultation</pr> revealed <f><n id="4">no</n> murmur</f> or <f>gallop<n idref="4"/></f>.

Looking at a different example, it turns out that seemingly short sentences may contain surprisingly many semantic concepts. Consider the sentence

    erosions and bleeding in the antrum and fundus.

MedLEE recognizes 4 different medical concepts in this sentence: erosion antrum, erosion fundus, bleeding antrum and bleeding fundus. Marking up such a sentence is problematic. For example, the finding bleeding separates the finding erosion and the body location antrum, causing potential nesting problems. The numbered sentence looks as follows:

    <1>erosions</1> and <3>bleeding</3> in the <6>antrum</6> and <7>fundus</7>.

The table of sets reveals a more complex situation than above.

    Semantic concept    Sets
    Erosion antrum      (1,6)
    Erosion fundus      (1,7)
    Bleeding antrum     (3,6)
    Bleeding fundus     (3,7)

In this case, the algorithm would perform the same iterations as before, backtracking after step 2 to resolve a situation where two concepts correspond to the same set of text elements. The final markup would look as follows (ids/idrefs not shown, see also Fig. 2d):

    <f>erosions<bl/></f> and <f>bleeding<bl/></f> in the <f><bl>antrum</bl></f> and <f><bl>fundus</bl></f>.

The element bl stands for body location, such as antrum or fundus. As can be seen, although there was considerable overlap in the original semantic markup, obtaining a linearized representation is feasible.

This algorithm has been run on more complex sentences, as well as on complete discharge summaries parsed by MedLEE. We are planning an automatic validation of the algorithm by first automatically linearizing the original split MedLEE format (see above), and subsequently transforming the linearized representation back to the split format. If we can reconstruct the original format, we could demonstrate that the algorithm is working and conserving all data in a linearized fashion.

DISCUSSION AND CONCLUSION

There is growing interest in so-called semi-structured XML databases [4, 5], which are very flexible in storing incomplete and changing types of data content. We are encountering such types of data in our data mining activities on medical free text reports. Different projects generate various kinds of text annotations, which we would like to store conveniently in a single XML file. Unavoidably, some of these annotations will be overlapping, generating an invalid XML structure. The significance of this paper lies in the introduction of an alternative approach to resolving overlapping text markup that preserves the linearity of the original text.

Marking up medical texts with semantic information by Natural Language Processing (NLP) inherently produces overlapping annotations. We observed this problem especially with conjunctions (and) or disjunctions (or) in the text. Consider the sentence

    No bleeding in the antrum and fundus.

We use conjunctions as above in our daily language to conveniently communicate several facts in the shortest form possible. Expanding the above sentence reveals two different facts:

    No bleeding in the antrum.
    No bleeding in the fundus.

Marking up the main constituents of each fact, [no bleeding antrum] and [no bleeding fundus], generates an overlapping annotation. Annotating such overlapping information with XML is difficult without generating an invalid nesting of elements. Traditional approaches to this problem either separate the annotation from the text by creating so-called virtual elements or use empty elements to indicate the start and end positions of a specific annotation. Both approaches have their disadvantages: virtual elements separate the document into two parts (annotation and text), and empty elements must be converted to regular elements in order to represent a tree structure. The linearized representation of XML presented in this paper keeps the XML tags and the text together while minimally relying on link attributes and empty elements. We think this representation is a valid alternative to other markup solutions.

This work was supported by National Library of Medicine grants R01-LM06910 and R01-LM06274.

REFERENCES

1. W3C, Extensible Markup Language (XML). 2000.
2. W3C, Standard Generalized Markup Language (SGML). 1999.
3. HL7, Reference Information Model. 2002.
4. Deutsch, A., et al., Querying XML Data. IEEE Data Engineering Bulletin, 1999. 22(3).
5. Buneman, P. Semistructured data. in PODS-97. 1997. Tucson, Arizona.
6. Johnson, S.B., et al., Using semistructured data for clinical data mining. Symp. AMIA 2002 (submitted), 2002.
7. Sperberg-McQueen, C. and L. Burnard, TEI Guidelines for Electronic Text Encoding and Interchange (P3). 1994.
8. Friedman, C., et al., A general natural-language text processor for clinical radiology. J Am Med Inform Assoc, 1994. 1(2): p. 161-74.
9. Friedman, C., Towards a comprehensive medical language processing system: methods and issues. Proc AMIA Annu Fall Symp, 1997: p. 595-9.
10. Friedman, C., A broad-coverage natural language processing system. Proc AMIA Symp, 2000: p. 270-4.
11. Friedman, C., et al., Representing information in patient reports using natural language processing and the extensible markup language. J Am Med Inform Assoc, 1999. 6(1): p. 76-87.
12. Thompson, H. and M. D. Hyperlink semantics for standoff markup of read-only documents. in SGML Europe '97. 1997. Barcelona.
13. W3C, XML Path Language (XPath) Version 1.0. 1999.
14. Durusau, P. and M. O'Donnell. Implementing Concurrent Markup in XML. in Extreme Markup Languages 2001. 2001. Montreal.
15. Sperberg-McQueen, C. and C. Huitfeldt. GODDAG: A Data Structure for Overlapping Hierarchies. in ACH-ALLC'99. 1999. Charlottesville, Virginia.
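The planned validation step, transforming the linearized representation back into a split format, can be illustrated with a minimal sketch. It is not the MedLEE implementation. It assumes the linearized example sentence is wrapped in an S element (as in Fig. 1) so that it parses as a well-formed document, that the concept elements are pr and f, and that an empty element carrying an idref stands for the text of the element with the matching id. The function names words_of and to_split_format are hypothetical.

```python
import xml.etree.ElementTree as ET

# Linearized markup of the example sentence. The <S> wrapper is an
# assumption made here so that the string has a single root element.
LINEARIZED = (
    '<S>the <pr>auscultation</pr> revealed '
    '<f><n id="4">no</n> murmur</f> or '
    '<f>gallop<n idref="4"/></f>.</S>'
)


def words_of(element, by_id):
    """Return the words covered by an element, following idref links."""
    ref = element.get("idref")
    if ref is not None:
        # Empty link element: borrow the text of the element it points to.
        return "".join(by_id[ref].itertext()).split()
    words = (element.text or "").split()
    for child in element:
        words += words_of(child, by_id)
        # Text that follows the child but still lies inside `element`.
        words += (child.tail or "").split()
    return words


def to_split_format(xml_string, concept_tags=("pr", "f")):
    """List each concept element with the words it refers to."""
    root = ET.fromstring(xml_string)
    by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}
    return [
        (el.tag, words_of(el, by_id))
        for el in root.iter()
        if el.tag in concept_tags
    ]


concepts = to_split_format(LINEARIZED)
for tag, words in concepts:
    print(tag, words)

# The plain sentence is recovered by dropping all tags.
plain_text = "".join(ET.fromstring(LINEARIZED).itertext())
print(plain_text)
```

In this sketch both findings resolve to the word no, one through the n element it physically contains and one through the idref link, while the word order of the sentence is left untouched. Comparing such output with the original structured section is one way the round trip described above could be checked.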