

Li, a Medication Event Extraction System

Lancet: a High Precision Medication Event Extraction System for Clinical Text

Zuofeng Li, MD, PhD,1 Feifan Liu, PhD,1 Lamont Antieau, PhD,1 Yonggang Cao, PhD,1
Hong Yu, PhD,1,2,*

1 College of Health Sciences, University of Wisconsin – Milwaukee, Wisconsin, USA

2 College of Engineering, University of Wisconsin – Milwaukee, Wisconsin, USA

* To whom correspondence should be addressed:
Hong Yu, PhD
2400 E Hartford Ave
Milwaukee WI 53211, USA
Email: hongyu@uwm.edu
Phone: (414) 229-3344
Fax: (414) 229-5100


                      Originality Declaration and License Statements

    I, as corresponding author, promise that I and all persons listed as coauthors on this
submitted work have read and understand the "Originality of Manuscripts" statement regarding
submissions to JAMIA (available at
http://jamia.bmj.com/site/about/originalityofmanuscripts.xhtml) and confirm that this
submission is a new, original work that has not been previously submitted, published in whole
or in part, or simultaneously submitted for publication in another journal. Also, in accordance
with the aforementioned policy, we have included as part of the submission any previously
published materials that overlap in content with this new original manuscript.

    The Corresponding Author has the right to grant on behalf of all authors and does grant on
behalf of all authors, an exclusive license (or non-exclusive for government employees) on a
worldwide basis to BOTH The American Medical Informatics Association and its publisher for
JAMIA, the BMJ Publishing Group Ltd and its Licensees to permit this article (if accepted) to
be published in Journal of the American Medical Informatics Association and any other
BMJPGL products to exploit all subsidiary rights, as set out in our license.



Objective: We present Lancet, a supervised machine-learning system that automatically extracts
medication events consisting of medication names and information pertaining to their prescribed
use (dosage, mode, frequency, duration and reason) from lists or narrative text in medical
discharge summaries.

Design: The Lancet system incorporates three supervised machine-learning models: a
conditional random fields (CRF) model for tagging individual medication names and associated
fields, an AdaBoost model with decision stump algorithm for determining which medication
names and fields belong to a single medication event, and a support vector machines (SVM)
disambiguation model for identifying the context style (narrative or list).

Measurements: We participated in the third i2b2 shared task on challenges in natural language
processing for clinical data: the medication extraction challenge. Using the performance metrics
provided by the i2b2 challenge, we report micro-averaged F1 (precision/recall) scores at both the
horizontal and vertical levels.

Results: Among the top ten teams, the Lancet system achieved the highest precision at 90.4%
with an overall F1 score of 76.4% (horizontal system level with exact match), a gain of 11.2%
and 12%, respectively, compared to the rule-based baseline system jMerki. By combining the
two systems, the hybrid system further increased the F1 score by 3.4% from 76.4% to 79.0%.

Conclusions: We conclude that supervised machine-learning systems with minimal external
knowledge resources can achieve a high precision with a competitive overall F1 score. Our
Lancet system based on this learning framework does not rely on expensive manually-curated
rules. The system is available online at http://code.google.com/p/lancet/.



Medication is an important part of a patient’s medical treatment, and nearly all patient records
incorporate a significant amount of medication information. The administration of medication at
a specific time-point during the patient’s medical diagnosis, treatment, or prevention of disease is
referred to as a medication event,[1-3] and the written representation of these events typically
comprises the name of the medication and any of its associated fields, including but not limited
to dosage, mode, frequency, etc.[4] Accurately capturing medication events from patient records
is an important step towards large scale data mining and knowledge discovery,[5] medication
surveillance and clinical decision support,[6] and medication reconciliation.[7-10]

Despite its importance, medication event information (e.g., treatment outcomes, medication
reactions and allergy information) is often difficult to extract, as clinical records exhibit a range
of different styles and grammatical structures for recording such information.[4] Thus,
Informatics for Integrating Biology & the Bedside (i2b2) recognized automatic medication event
extraction with natural language processing (NLP) approaches as one of the great challenges in
medical informatics. As one of 20 groups that participated in the i2b2 medication extraction
challenge, we report in this study on the Lancet system, which we developed for medication
event extraction.


Over the past two decades, several approaches and systems have been developed to extract information
from clinical narratives. Earlier work mapped terms appearing in clinical narratives to concepts
in external clinical terminologies (e.g., SNOMED).[11] Later systems explored syntactic and
semantic parsing and pattern matching (e.g., MedLEE and others) for deeper information
extraction.[12,13] Recently, supervised machine-learning approaches have been explored,
including those for finding temporal order in discharge summaries and others for identifying the
smoking status of patients from medical discharge records.[14,15]

Systems for medication event extraction have been reported previously. Gold et al. [1] developed
a rule-based system called MERKI to extract medication names and the corresponding attributes
from structured and narrative clinical texts. Cimino et al. [16] explored the MedLEE system to
extract medication information from clinical narratives; medication names and three states of
medication events, namely, initiation, change and discontinuation, were extracted for the purpose
of medication reconciliation in their study. Recently, Xu et al. [4] built an automatic medication
extraction system (MedEx) on discharge summaries by leveraging semantic rules and parsing
techniques, achieving promising results for extracting medication and related fields.

There are also some commercial systems designed to extract medication information from
medical records, including LifeCode, A-Life Medical, FreePharma, etc. Jagannathan et al. [17]
evaluated the performance of four commercial NLP tools to extract medication information from
discharge summaries and family practice notes. Their analysis reported that these tools
performed well in recognizing medication names but poorly on recognizing related information
such as dosage, route and frequency.

Although the existence of such NLP systems is evidence of the progress that has been made in
this area, most of these systems are not publicly available. Furthermore, different systems have
been developed for different purposes and have been evaluated against different gold standards.
This makes comparing these approaches to one another a challenging task. Therefore, the i2b2
project attempts to provide a common purpose and gold standard to different NLP systems.[15]


A. Medication Event Extraction

The i2b2 challenge defines a medication event as an event incorporating a medication name and
any of the following associated fields: dosage, frequency, mode, duration and reason. Table 1
shows the definition released by the i2b2 organizers and shows that the i2b2 medication event
definition largely follows from previous work, particularly [13] and [1]. As an example, Figure 1
shows a clinical narrative/list excerpt released by the i2b2 organizers in which medication events
were annotated based on the i2b2 annotation guidelines.


[Table 1 about here]

While the challenge was to extract all medication events from both lists and narrative context,
the challenge's main interest was in the extraction of medication information from the narrative
medical records, as illustrated in Figure 1.

[Figure 1 about here]

B. Training dataset and annotation

A dataset of 696 un-annotated, de-identified patient discharge summaries from Partners
HealthCare (1990-2007) was released by the i2b2 organizers about ten weeks before the
competition.[18] The dataset is available at the i2b2 web site.[19] At the same time, the organizers also released the
first version of the annotation guideline and 17 discharge summaries (a subset of the 696
discharge summaries) that were annotated by the organizers. Over the next ten weeks, all groups
participating in the challenge took part in a discussion over the guideline, which was iteratively
refined and the annotation of the 17 discharge summaries updated according to the discussion
and guideline refinements. Towards the end, the final annotation of the 17 discharge summaries
was considered “ground truth” by the i2b2 organizers.

Throughout this process, two of the authors (ZFL and LA) manually and independently
annotated 75 and 72 discharge summaries, respectively, randomly selected from the 696 patient
discharge summaries. Each summary was annotated by only one annotator. This collection of 147
summaries incorporated the 17 “ground truth” summaries. Our annotations of the 17 summaries
were then measured against the “ground truth” to determine annotation agreement, the results of
which are discussed in the error analysis section. In addition, after the competition, 10 summaries
were re-annotated to explore the agreement between the two annotators.


Our 147 manually annotated summaries incorporated a total of 5,184 medication entries (2,175
narratives and 3,009 lists); 2,742 instances of dosage; 2,042 instances of mode; 2,583 instances
of frequency; 223 instances of duration; and 709 instances of reason.


In this section, we describe Lancet, a supervised machine-learning system for medication event
extraction. For the performance comparison, we also implemented a rule-based system as a
baseline and a hybrid system.

A. The Lancet system

The overall Lancet system is shown in Figure 2. Lancet incorporated three supervised machine-
learning (ML) models: 1) a conditional random fields (CRF) model for identifying instances of a
medication name (m) and its associated fields: dosage (do), mode (mo), frequency (f), duration
(du) and reason (r); 2) a medication relationship model, an AdaBoost classifier with decision
stumps for associating a medication name with its corresponding fields; and 3) a list/narrative
model, a support vector machines (SVM) classifier for distinguishing lists from narratives.

In the following, we will first describe data pre-processing, and we will then describe each of the
three ML models and how they are integrated for the final Lancet system.

[Figure 2 about here]

1. Pre-processing

Our pre-processor first converts the text in each discharge summary into lower case. It then
applies manually curated pattern-matching rules to recognize discharge summary sub-sections,
including history, medication, physical examination, follow up, diagnosis, allergy, family
history, etc. For instance, the following regular expressions were used to detect medication-
related subsections:

'(discharge|transfer|home|admi\w+|new)\s+(medication|med)s?', '(prn\s+)?med(ication)?s'

We applied Splitta for sentence boundary detection.[20] The sentence boundary information was
used for the list/narrative classification and association between the medication name and its
medication fields.

2. A CRF model for medication named entity recognition

Using the 147 annotated discharge summaries, we trained a conditional random field (CRF)
model to recognize the medication name and five fields (do, mo, f, du and r). The model was
trained using ABNER, an open-source biomedical named entity recognizer.[21] We applied
ABNER's default feature set, which consists of standard bag-of-words, morphology, and n-gram features.
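CRF taggers such as ABNER are typically trained over token-level BIO labels. The following sketch shows how an annotated phrase could be encoded using the paper's field abbreviations; the encoding function is our own illustration, not ABNER's API:

```python
def to_bio(tokens, spans):
    """Encode entity spans as BIO labels for CRF training.

    tokens: list of token strings
    spans:  list of (start_idx, end_idx_exclusive, tag) tuples
    """
    labels = ['O'] * len(tokens)
    for start, end, tag in spans:
        labels[start] = f'B-{tag}'          # beginning of an entity
        for i in range(start + 1, end):
            labels[i] = f'I-{tag}'          # inside the same entity
    return labels

# Hypothetical example: "lasix 40 mg po qd" = name, dosage, mode, frequency
tokens = ['lasix', '40', 'mg', 'po', 'qd']
spans = [(0, 1, 'm'), (1, 3, 'do'), (3, 4, 'mo'), (4, 5, 'f')]
# to_bio(tokens, spans) -> ['B-m', 'B-do', 'I-do', 'B-mo', 'B-f']
```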

3. An AdaBoost model for associating a medication name with its corresponding fields

We built a supervised machine-learning classifier to associate a medication with its fields. This
two-way classifier attempted to determine whether a given medication field was associated with a
given medication name. As the number of potential medication-name-field pairs can be large, we
followed a heuristic rule suggested by the i2b2 organizers in which any medication name and
field within the distance of two lines (+/- two lines) was considered to be a candidate medication-
name-field pair. The features used to train the model are displayed in Table 2.
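The +/- two-line candidate heuristic can be sketched as follows (the mention representation and function name are assumptions for illustration):

```python
def candidate_pairs(med_names, fields, window=2):
    """Pair every medication name with every field within +/- `window` lines.

    med_names, fields: lists of (mention_text, line_number) tuples.
    Returns the candidate (name, field) pairs fed to the AdaBoost classifier.
    """
    return [
        (name, field)
        for name, n_line in med_names
        for field, f_line in fields
        if abs(n_line - f_line) <= window
    ]

# Hypothetical mentions with their line numbers:
meds = [('lisinopril', 10), ('aspirin', 20)]
flds = [('20 mg', 10), ('daily', 11), ('81 mg', 20)]
# -> [('lisinopril', '20 mg'), ('lisinopril', 'daily'), ('aspirin', '81 mg')]
```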

For implementation, we used AdaBoost.M1 with decision stumps in the Weka toolkit, a
well-known algorithm that is less susceptible to over-fitting.[22]

[Table 2 about here]

4. A support vector machines (SVM) classifier for distinguishing lists from narrative text


One of the i2b2 competition requirements was to determine whether the text describing a
medication is in a list or a narrative format. Using the 147 annotated discharge summaries as the
training data, we built an SVM classifier (Weka toolkit [22]) to determine the format of each
candidate sentence.

We used bag-of-words, bi-gram, tri-gram and subsection features. The hypothesis behind the
subsection feature is that medication events in the medication subsection (recognized in Section
IV.A.1) are more likely to appear in list format.
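A minimal sketch of this feature extraction (the function name and the subsection-flag encoding are our own assumptions):

```python
def ngram_features(sentence, in_med_subsection):
    """Bag-of-words, bi-gram, tri-gram and subsection features for the
    list/narrative SVM classifier, as described in the text."""
    tokens = sentence.lower().split()
    feats = set(tokens)                                                 # bag of words
    feats |= {' '.join(tokens[i:i + 2]) for i in range(len(tokens) - 1)}  # bi-grams
    feats |= {' '.join(tokens[i:i + 3]) for i in range(len(tokens) - 2)}  # tri-grams
    if in_med_subsection:
        feats.add('IN_MED_SUBSECTION')   # subsection feature
    return feats
```

For instance, `ngram_features("Lasix 40 mg PO daily", True)` contains the token `'lasix'`, the bi-gram `'40 mg'`, the tri-gram `'lasix 40 mg'`, and the subsection flag.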

5. The integration

We integrated all three models into the Lancet system. Lancet first detects medication names and
fields with the CRF model, and then applies the AdaBoost model to determine whether a
medication field belongs to a medication name. Finally, an SVM classifier separates lists from
narratives.

B. jMerki−A rule-based baseline system

jMerki was a rule-based system implemented in Java. It integrated the rules of the MERKI
system,[1] including rules for dosage, frequency, time and PRN. We added rules for the
i2b2 medication detection task, including regular expressions to detect subheadings in
discharge summaries. The system performed dictionary lookup and regular expression
matching to identify related fields. We built a medication name dictionary from two external
knowledge resources, RxNorm and DrugBank.[1,23] Because this baseline system cannot
distinguish list from narrative form, the Lancet SVM classifier was employed for its evaluation.

C. The hybrid system

As a post hoc experiment, we built a hybrid system to increase both recall and precision.
Specifically, we aligned and matched the outputs of the jMerki and Lancet systems. If both jMerki and
Lancet detected the same medication name but differed in other content (e.g., dosage), the
Lancet’s output was chosen because it has a higher precision than jMerki. If jMerki and Lancet
did not agree on a medication name, the hybrid system kept both medication entries
detected by the two systems. This step increases recall.
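The alignment rule can be sketched as follows; keying entries by medication name is a simplification of the actual offset-based alignment, and the function name is our own:

```python
def merge_outputs(lancet_entries, jmerki_entries):
    """Hybrid merge: when both systems found the same medication name,
    prefer Lancet's (higher-precision) entry; otherwise keep both.
    Entries are dicts keyed by medication name (a simplification)."""
    merged = dict(jmerki_entries)     # start from jMerki's entries
    merged.update(lancet_entries)     # Lancet overrides on shared names
    return merged

# Hypothetical outputs: Lancet and jMerki agree on "lasix" but disagree on
# its dosage; only jMerki found "aspirin".
lancet = {'lasix': {'do': '40 mg'}}
jmerki = {'lasix': {'do': '40'}, 'aspirin': {'do': '81 mg'}}
# merge_outputs(lancet, jmerki) keeps Lancet's "lasix" and jMerki's "aspirin"
```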


A. Metrics

The i2b2 organizers used two sets of evaluation metrics: strict evaluation (exact match) and
relaxed evaluation (inexact match), which are adapted from the evaluations of the question
answering track in TREC.[24] For each medication entry, exact match calculates the precision
and recall of the instance, whereas inexact match calculates the proportion of system-returned
tokens that overlap with the ground truth. Given the aligned system output, two types of
evaluation measures were computed: horizontal level (focusing on medication events) and vertical
level (focusing on individual medication names and fields).

The organizers also performed evaluation at two different levels of granularity: (a) patient record
level, which was the micro-average over all the entries in a single record and then the macro-
average over all the records in the system output; and (b) system level, which is the micro-
average over all entries in the system output.

The primary evaluation metric of this competition is the system-level horizontal evaluation. To
calculate the precision and recall at the horizontal level for system entries against ground truth
entries, the following formulas were used; for details, please refer to [25].

Precision = (Matches in terms of offset and field type) / (Total number of fields in the system output)

Recall = (Matches in terms of offset and field type) / (Total number of fields in the ground truth)

Similar to the TREC evaluation, the F1 score was reported, which is the harmonic mean of
instance precision (IP) and instance recall (IR): F1 = 2(IP*IR)/(IP+IR). IP is the number of
correctly identified instances divided by the total number of identified instances; IR is the
number of correctly identified instances divided by the total number in the ground truth list.
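As a worked example, the system-level exact-match numbers reported in the Results section can be checked against this definition:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Lancet, horizontal system level, exact match (see Results):
# precision 90.4%, recall 66.1% -> F1 of about 76.4%
score = f1(0.904, 0.661)
```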

B. Gold Standard

The gold standard used for the i2b2 evaluation was built as a community effort.[25] The whole
dataset incorporated 8,942 instances of medication entries (3,936 narrative and 5,006 list); 4,460
instances of dosage; 3,387 instances of mode; 4,039 instances of frequency; 553 instances of
duration; and 1,637 instances of reason. We found that the gold standard medication names
belonged to 295 categories, representing 50.4% of all drug categories in DrugBank. These
results suggest that the coverage of drugs in the i2b2 challenge task was reasonably broad.


A. Evaluation of Lancet in the i2b2 challenge 2009

Although we report the results of three systems in this study, Lancet was the only system of the
three that competed in the i2b2 challenge. Among the top ten systems, Lancet achieved the
highest precision at the system-level horizontal evaluation: 90.4% in exact matching and 94.0%
in inexact matching (Figure 3A and 3B). The corresponding F1 values were 76.4% and 76.5%.
For the F1 score, Lancet ranked 10th in exact matching and 9th in inexact matching. For lists, the
Lancet system achieved the highest precision, 93.1%, with an F1 of 66.0%, on exact match at
the system-level horizontal evaluation (Figure 3C). For narratives, Lancet achieved a precision of
36.6% with an F1 of 38.4%. Lancet ranked 10th for both narratives and lists.[25]

[Figure 3 about here]

B. Comparison of the three systems

We described earlier the three systems we developed: the Lancet system, the rule-based jMerki,
and the hybrid system. Table 3 shows the results of all three systems. On the horizontal level
evaluation with exact matching, Lancet outperformed jMerki by 12.0% (system) and 10.4%
(patient). The hybrid system further improved the performance by 3.4% (system) and 4.6%
(patient), yielding the highest F1 scores of 79% (system) and 77.6% (patient). In terms of recall,
both Lancet (66.1%) and jMerki (58.7%) were relatively low, whereas the hybrid system
increased recall to 74%.

Similarly, on the vertical level evaluation, Lancet outperformed jMerki, and the hybrid system
outperformed both. The hybrid system achieved good performance (F1 80%─85%) in the
dosage, medication, mode, and frequency fields, but poor performance (F1 2.4%─21.2%) in the
duration and reason fields. In addition, the results show that system-level performance was
consistently better than patient-level performance for both horizontal and vertical evaluations.

[Table 3 about here]

C. Error analysis
We first examined annotation inconsistency and then manually analyzed the system output. We
found that errors were attributable to data sparseness, multiple medication entries, grammatical
errors in clinical notes, and negated events.

1. The challenges in annotation

As described earlier, we annotated the 17 “ground truth” summaries and measured the
agreement between our annotation and the annotation by the i2b2 organizers. With the exact
match evaluation metrics, the agreement between our annotation and “ground truth” was an
81.5% F1 score (system level, horizontal). On the vertical level, our annotations showed a high
agreement in medication (88% F1 score), dosage (85% F1 score), frequency (86% F1 score) and
mode (89% F1 score) but low agreement for duration (36% F1 score) and reason (33% F1 score).

We manually examined inconsistent annotations and found instances of ambiguity that gave rise
to annotation inconsistency. These included:


          (1) Boundary ambiguity. Example: “Ofloxacin 2000 mg p.o. b.i.d. (both antibiotics to
continue for an additional two week course ).”
In this example, we annotated “two week course” as duration instead of “for an additional two
week course” in the gold standard, both of which are semantically correct.
          (2) Semantic ambiguity. Example: “NITROGLYCERIN 1/150 (0.4 MG) 1 TAB SL Q5MIN
X 3 doses PRN Chest Pain HOLD IF: SBP less than 100.”
In this example, “X 3 doses” performs two functions. In the i2b2 ground truth annotations, it was
annotated as “dosage” because it states the number of doses; however, we annotated it as
“duration” because, at the same time that it states the number of prescribed doses, it also tells the
duration over which this particular medication should be taken. Similarly, in “She was found to
have two 95% stenosis in a long segment of the left SFA and the left distal SFA and anterior
tibial vein graft was completely thrombosed. She was successfully treated with stent placement
and received heparin and urokinase in the Intensive Care Unit overnight with a turn-over
pulses of the left leg Doppler.” “anterior tibial vein graft” was annotated as the reason for
“heparin and urokinase” in the gold standard, while we considered “stent placement” to be the
reason.

In addition, we found the ground truth to be imperfect. We correctly annotated “overnight” in the
above example as the duration of “heparin and urokinase,” while it was missed in the ground
truth. In another example below, we can see that the rule of “+/- 2 lines” led to improper
annotation in the ground truth:

          “CC: Hypotension after dialysis
          HPI: 56 yo male with h/o ESRD , CAD , CHF ( EF 20-25% ) admitted for
          hypotension after HD. He was in his USOH until 2 days PTA when he
          developed stomach upset , diarrhea , dry heaves , and a dry cough. He
          denied recent travels , and had remote Abx use. At Stodun Hospital ,
          he had 5.5 liters removed and afterwards his BP was 66/30. 1 liter of
          NS was given and his BP rose to 73/40.”


Here the reason for “NS” (normal saline) is obviously “Hypotension after dialysis,” but due to
the “+/- 2 lines” limitation, “his BP” was annotated as the reason, which, strictly speaking, is
incorrect and confusing.

All data was annotated by two of the authors: ZFL is a domain expert and LA is a linguist.
During the i2b2 competition, each discharge summary was annotated by one person only. A post
hoc annotation of 10 discharge summaries (by ZFL) showed inter-annotator agreement of
0.85─0.95 on medication name, dosage, mode and frequency. The agreement on duration and
reason was lower: 0.71─0.89 for duration and 0.24─0.42 for reason. When limited to narrative
entries only, the agreement on all fields was 0.12─0.67.

2. Data sparseness
One advantage of supervised machine-learning systems is that they can predict the correct
label even when test instances do not appear in the training data. Such robustness is due to the
systems’ ability to capture contextual information. As described earlier, we annotated a total of
147 records to be used as the training data. This collection of annotated data is in no way
complete. Nevertheless, Lancet detected the medication “persantine” from the text
“PERSANTINE ( DIPYRIDAMOLE ) 50 MG PO BID” even though “persantine” did not appear
in the training data. The reason is that Lancet learned the contextual patterns "<m> <do> <mo>
<f>" from the training data. On the other hand, Lancet failed to detect "Persantine and viability
cardiac PET scan 5/19/04" because no such contextual pattern appeared in the training data. As
a result, data sparseness hurts the recall of Lancet even though it is a supervised machine-
learning system.

On the other hand, we found that the jMerki lexicon missed 17% of medication names. The missing
medication names included general drug names, drug name abbreviations (vanco for vancomycin;
kcl for potassium chloride), drug category names (beta-blocker, beta blocker, home medications
or hypoglycemics) and drug name combinations (calcium+vim d). These results suggest that at
best jMerki could achieve 83% recall. The supervised ML system Lancet, on the other hand,
could recover some of the medications that would otherwise be missed by the jMerki system.


3. Multiple medication entries
As described earlier, the Lancet system assigned a unique instance of each field to its
corresponding medication name. As a result, Lancet always missed multiple medication
entries. An example is shown below:


In the above example, Lancet correctly detected one entry, “2 UNITS QAM,” and associated it
with the medication "NPH HUMULIN INSULIN." On the other hand, the system missed
three entries: “3 UNITS QPM SC”, “2 UNITS QAM” and “3 UNITS QPM.” Therefore Lancet
suffered in recall. To estimate how much recall Lancet could lose, we examined our gold
standard data and found a total of 449 such multiple medication entries out of the total 8,942
medication entries, a ~5% decrease in recall.

4. Medication name misspelling
Clinical texts are typically noisy, with significant grammatical errors. We found that such errors
hurt Lancet's performance. For example, the Lancet system failed to detect "Flagy" in the
"cholangitis ) Ampicillin and Flagy started 0/16 for ?early cholangitis. 3. CV: h/o htn ,
hyperlipidemia , CE set A B neg , " because "Flagy" was incorrectly spelled. When we manually
corrected the spelling as "Flagyl," we found that Lancet was able to correctly extract the
medication event.

Misspelled medication names have also led Lancet to fail to detect other correctly spelled
medications. For example, Lancet failed to detect both "levofloxacin" and "flagy" in "on daily
levofloxacin and flagy , will complete a 14 day course." After we corrected the misspelling of
"flagy", the Lancet system was able to detect both medication events.

5. Negation
Negation occurs in clinical notes. We found that negative medication events generally fall into
one of two categories: the medication allergies of patients and medications mentioned in the text
but not actually taken by the patient. Two examples are shown below:



"The patient was placed on heparin instead of Coumadin for Chronicle device lead thrombus
with a PPT goal of 60-80."

Currently, the Lancet system does not incorporate negation and scope detection, and as a result,
it incorrectly extracted "ACE inhibitors" and "Coumadin," respectively.

D. Follow-up experiments

Based on the results of our error analyses, we performed further post hoc experiments,
exploring negation detection, external medication name dictionaries, and other features, to
improve medication event extraction with the Lancet system. The results are shown in Table 4.
Different features and models were explored: “NegPlus” added negative medication features that
we manually annotated for training; “Digital normalization” replaced all digits in the text
with placeholders; “Affix” used the nomenclature rules recommended by the World Health
Organization (WHO); “Dictionaries” combined five resources in the model training, namely the
WHO nomenclature rules, the CORE Problem List Subset of SNOMED CT® released by the
National Library of Medicine, RxNorm, DrugBank and a modified common English word
dictionary, Linux Word.[26] Finally, “Single-line” expanded the sequence scope from one line to
the whole article.
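Two of these feature variants can be sketched as follows. The placeholder symbol and the stem list are illustrative assumptions; the WHO INN stem list actually used for the "Affix" features is far larger:

```python
import re

def normalize_digits(text):
    """'Digital normalization': replace every digit run with a placeholder."""
    return re.sub(r'\d+', '#', text)

# Illustrative subset of WHO INN nomenclature stems (assumed; the real
# list is much longer).
WHO_STEMS = ('-cillin', '-olol', '-pril', '-statin', '-azepam')

def has_who_affix(token):
    """'Affix' feature: does the token end with a known drug-name stem?"""
    return any(token.lower().endswith(stem.lstrip('-')) for stem in WHO_STEMS)

normalize_digits('coumadin 10 mg x 2')   # -> 'coumadin # mg x #'
has_who_affix('amoxicillin')             # -> True
```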

[Table 4 about here]

We noticed that adding negative medication information and affix information increased the
precision of the system; in particular, affix features yielded a precision of 91.2% compared with
Lancet's 90.4%. However, neither achieved an overall gain because recall degraded.
Applying digital normalization increased recall slightly to 66.5%, but degraded precision
limited the overall gain in the F1 score. We found that combining more dictionary resources led
to a marginal improvement in the performance of the Lancet system in both recall and precision,
increasing the F1 score by 2.75%, from 76.4% to 78.5%. Similarly, collapsing the multi-line
sequence of one article into a single line also increased both recall and precision, yielding an F1
score of 77.8% compared to the 76.4% of the Lancet system, with the best precision of 92.5%
compared to 90.4%.


The agreement between our annotation and the annotation by the i2b2 organizers showed an
F1 score of 81.5%, which is much lower than the annotation agreement reported by the i2b2
organizers (an F1 score of 89.7% from comparing two organizers' annotations against the ground
truth). We annotated the 147 patient records throughout the 10 weeks during which the
annotation guideline was iteratively updated. We therefore speculate that the inconsistency was
at least partially introduced by the guideline refinement process. Since the Lancet system
was trained on these 147 records, the annotation inconsistency contributed to errors in the
system's performance. In addition, the study of community annotation agreement by the i2b2
organizers reported a mean system-level F-measure of 82.4% (exact) on 251 records, which
suggests that “ground truth” annotation for this task is itself very challenging.

The evaluation results (Table 3) consistently show that all our systems performed better at
the system level than at the patient level, indicating that Lancet performed relatively
better on discharge summaries that incorporate more medication events. Figure 4B shows that
most discharge summaries (67%) incorporate 10-50 medication events and that the number of
discharge summaries decreases as the number of medication events increases. In addition, Figure
4C shows that the system achieved its highest average F1 score on discharge summaries
containing 90-100 medication events. As shown in Figures 4B and 4C, the more medication
events in a discharge summary, the better Lancet performed. We speculate that neighboring
medication names in discharge summaries are useful features for Lancet.

[Figure 4 about here]

The results for list and narrative entries (Section VI-A) showed that the Lancet system performed
consistently better on lists than on narratives, a result that it shares with all the participating

                              Li, a Medication Event Extraction System

systems. The results are not surprising because in the list format, a medication and its related
fields are highly structured. In contrast, narratives incorporate complex syntactic and semantic
structures that pose a challenge for detecting medication events.

We note that the lower performance on narrative entries does not imply that the system is
restricted to structured text. The concepts of list and narrative were not clearly defined by
this i2b2 challenge, and we observed that many medication entries annotated as "list" in the
gold standard incorporated text belonging to "narrative" in a broad sense. For example,
"COUMADIN with target inr of 2.0 , last target 1.6 , then received 10 MG in evening x 2."
is annotated as "list", despite being clearly more difficult than other structured cases.

Table 3 shows that the Lancet system significantly outperformed the rule-based system
jMerki, increasing precision from 81.3% to 90.4% and recall from 58.7% to 66.1%
(horizontal system level with exact match). This suggests that machine-learning-based
methods hold an advantage over the rule-based jMerki system in capturing patterns
automatically and accurately on this task. In addition, our Lancet system, built on this
learning framework, does not rely on expensive manually curated rules.

Our error analysis provides evidence of the challenges posed by data sparseness, multiple
medication entries, misspelling, and negation, which partially explain the relatively low recall
of our system.

Data sparseness is a common problem for any supervised machine-learning system. Although
such systems can be robust because they learn from multiple features, our results clearly
demonstrate that data sparseness contributed to errors in our system's performance. One of
our post-hoc experiments shows that performance increased when we incorporated
dictionaries as additional features (from 76.4% to 78.5%, as shown in Table 4, p < 0.005).
Data sparseness can additionally explain why the hybrid system improved performance
(from 76.4% to 79%, as shown in Table 3, p < 0.005). As more unlabeled data become
available, semi-supervised conditional random fields learning could be used to improve
the performance. [27]
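As an illustration, the dictionary-feature idea can be sketched as follows; the lexicon entries and feature names here are hypothetical examples, not the actual Lancet feature set:

```python
# Sketch of adding dictionary membership as an extra token feature for a CRF
# tagger. The lexicon below is a hypothetical illustration; a real system would
# load a full drug dictionary.
DRUG_LEXICON = {"coumadin", "lasix", "aspirin", "metoprolol"}

def token_features(tokens, i):
    """Return a feature dict for token i, including a dictionary-lookup feature."""
    tok = tokens[i]
    return {
        "word": tok.lower(),
        "is_digit": tok.isdigit(),
        # The additional dictionary feature: is this token a known drug name?
        "in_drug_dict": tok.lower() in DRUG_LEXICON,
    }

feats = token_features(["Start", "COUMADIN", "5", "MG"], 1)
```

A lexicon match fires even for tokens unseen in the training data, which is why such features can mitigate data sparseness.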


Multiple medication entries, which we showed earlier account for 5% of medication events,
were an additional source of errors. Addressing them would require removing the heuristic
rule currently used in the Lancet system and allowing multiple instances for each medication
field of an event. However, this would also introduce new noise in the form of false positives,
and some filtering strategy would need to be employed for our system to benefit from the
change. In addition, we speculate that linguistic and rule-based approaches could be explored
to improve the detection of multiple medication events, but these must remain for future research.

Our error analysis concluded that some errors were caused by misspelled medication
names. In future work, we will explore an automatic misspelling detection and correction tool,
for example, the Aspell system (http://aspell.net/). In several cases we verified that Lancet's
output was corrected once the misspelling was fixed.
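A minimal sketch of such dictionary-based correction, assuming a small hypothetical drug lexicon (a tool like Aspell would serve a similar role with a full dictionary):

```python
import difflib

# Hypothetical drug lexicon for illustration; a real correction step would use
# a comprehensive medication dictionary.
DRUG_LEXICON = ["coumadin", "lasix", "metoprolol", "lisinopril"]

def correct_spelling(token, lexicon=DRUG_LEXICON, cutoff=0.8):
    """Return the closest lexicon entry if the token is a near-miss, else the token."""
    if token.lower() in lexicon:
        return token.lower()
    matches = difflib.get_close_matches(token.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else token

print(correct_spelling("coumadn"))  # a dropped letter recovers "coumadin"
```

The `cutoff` threshold trades correction coverage against the risk of mangling legitimate out-of-lexicon tokens.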

Our error analysis also showed that negation contributed to errors. However, in our post-hoc
experiments we found that when we manually labeled negated medication events, precision
improved but the F1 score degraded. We speculate that this paradox can be explained by the
fact that many negated events lack useful contexts for learning and that negated medications
vary from patient to patient, which can confuse the learning model, particularly on a small
training set. In addition, the negation detection system we built had not yet been evaluated,
and in future work we will explore state-of-the-art negation systems, including NegEx,[28]
to improve negation detection.
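For reference, the core NegEx idea can be sketched as a trigger-in-window check; the trigger list and window size below are simplified illustrations, not the full NegEx algorithm:

```python
import re

# Simplified NegEx-style check: a medication mention is flagged as negated when
# a negation trigger appears within a small window before it. The triggers and
# window size here are illustrative; full NegEx is considerably more elaborate.
NEG_TRIGGERS = re.compile(r"\b(no|not|without|denies|discontinued?)\b", re.I)

def is_negated(sentence, med_name, window=5):
    """Return True if med_name appears in the sentence preceded by a negation trigger."""
    tokens = sentence.split()
    lowered = [t.lower().strip(",.") for t in tokens]
    if med_name.lower() not in lowered:
        return False
    idx = lowered.index(med_name.lower())
    context = " ".join(tokens[max(0, idx - window):idx])
    return bool(NEG_TRIGGERS.search(context))
```

For example, `is_negated("The patient denies taking coumadin at home", "coumadin")` flags the mention, while an affirmative order such as "Continue coumadin 5 mg daily" does not.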

As shown in Table 3, the error sources discussed above had a more severe influence on the
performance of the duration and reason fields. Our annotation achieved the lowest agreement
on those two fields (F1 scores of 36% and 33%, as discussed in the error analysis section).
Even the annotation by the i2b2 organizers had much lower agreement on duration
(F1 score 61%) and reason (F1 score 68%) than on other fields. We thus speculate that the
poor performance on these two categories is partially due to the lower agreement that can be
obtained in annotation. Another reason might be the smaller number of annotated instances
for these fields (only 223 and 209 instances vs. over 2,000 instances for other fields). Because
the Lancet system was built upon a supervised machine-learning method, its performance is
more sensitive to and dependent on the consistency


and coverage of the annotated training data. Furthermore, we observed that compared to other
fields, "reason" is more flexible within the medication event: it may contain multiple instances,
involve different writing styles, and occur anywhere in the proximity of the medication name,
which also makes automatic recognition more challenging.

Despite the sources of errors discussed above, the Lancet system achieved the highest
precision among the top ten teams. We found that most of the top ten systems [29-37]
incorporated extensive manually curated patterns and external dictionaries. In contrast, the
Lancet system was trained only on the annotated dataset and applied few manually curated
rules and no external knowledge resources. We therefore speculate that noise introduced by
external resources or rules may hurt precision. On the other hand, we found in our
experiments that a high-quality external dictionary increased both recall and precision. More
investigation is needed as the different approaches and systems of other teams become
available in the future.

Our post-hoc experiments showed that affix features based on WHO nomenclature rules
increased precision but did not improve the F1 score. We speculate that although
medication-name-related affixes provide useful evidence for better precision, they can also
introduce noise when the affixes are shared by common words. Digit normalization, in
contrast, slightly improved system performance (from 76.4% to 76.6%, as shown in
Table 4, p < 0.005), which indicates that normalizing digits can reduce data sparseness to
some extent, although data sparseness due to digits is not dominant. In another experiment,
instead of treating each line as a training sequence, we converted the whole document into a
single sequence, which also brought a performance gain (from 76.4% to 77.8%, as shown in
Table 4, p = 0.11). This may be because article-level sequences capture more useful
dependency information for learning.
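The digit-normalization step can be sketched in a few lines; the token examples are illustrative:

```python
import re

# Digit normalization: map every digit to a single placeholder so that, e.g.,
# "40 mg" and "80 mg" yield the same feature string, reducing data sparseness.
def normalize_digits(token):
    return re.sub(r"\d", "D", token)

[normalize_digits(t) for t in ["Lasix", "40", "q8h"]]  # → ["Lasix", "DD", "qDh"]
```

After normalization, dosages and frequencies that differ only in their numeric values collapse into shared feature patterns.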


We have presented three systems for medication event extraction from patient discharge
summaries: the supervised machine-learning system Lancet, the rule-based system jMerki, and
the hybrid system. We applied Lancet to the i2b2 medication event extraction challenge, and the


evaluation results showed that it performed with the highest precision (90.4% in exact match
and 94.0% in inexact match) among the top ten teams.

Our post-hoc experiments show that Lancet and jMerki have different strengths and that the
hybrid system performs best, yielding a 79% F1 score (85% precision and 74%
recall). Our error analysis showed that errors were introduced in part by annotation
inconsistency and data sparseness, and we therefore speculate that a larger scale of
high-quality annotated data may further improve the Lancet system's performance. Another
line of future work is to explore semi-supervised conditional random fields learning, with the
hope of making full use of a large amount of unlabeled data to further boost the system's performance.

Our current Lancet system incorporates minimal parsing and few external knowledge
resources, yet it achieved the best precision among the top ten teams. The automatic learning
framework also offers great potential for generalization given an appropriate amount of
training data. We speculate that deeper syntactic and semantic parsing may help further
improve the performance.



We acknowledge the following grant support: 5R01LM009836, 5R21RR024933, and
5U54DA021519. We also thank Qing Zhang and Shashank Agarwal for valuable discussions.



1. Gold S, Elhadad N, Zhu X, Cimino JJ, Hripcsak G: Extracting structured medication event
information from discharge summaries. AMIA Annu Symp Proc 2008, :237-41.

2. Diaz E, Levine HB, Sullivan MC, Sernyak MJ, Hawkins KA, Cramer JA, Woods SW: Use of
the Medication Event Monitoring System to estimate medication compliance in patients with
schizophrenia. J Psychiatry Neurosci 2001, 26:325-329.

3. de Klerk E, van der Heijde D, Landewé R, van der Tempel H, van der Linden S: The
compliance-questionnaire-rheumatology compared with electronic medication event monitoring:
a validation study. The Journal of Rheumatology 2003, 30:2469-2475.

4. Xu H, Stenner SP, Doan S, Johnson KB, Waitman LR, Denny JC: MedEx: a medication
information extraction system for clinical narratives. J Am Med Inform Assoc 2010, 17:19-24.

5. Mullins IM, Siadaty MS, Lyman J, Scully K, Garrett CT, Miller WG, Muller R, Robson B,
Apte C, Weiss S, Rigoutsos I, Platt D, Cohen S, Knaus WA: Data mining and clinical data
repositories: Insights from a 667,000 patient data set. Comput. Biol. Med 2006, 36:1351-

6. Kuperman GJ, Bobb A, Payne TH, Avery AJ, Gandhi TK, Burns G, Classen DC, Bates DW:
Medication-related clinical decision support in computerized provider order entry systems: a
review. J Am Med Inform Assoc 2007, 14:29-40. doi:10.1197/jamia.M2170.

7. Bates DW, Cohen M, Leape LL, Overhage JM, Shabot MM, Sheridan T: Reducing the
frequency of errors in medicine using information technology. J Am Med Inform Assoc 2001,

8. Anderson JG, Jay SJ, Anderson M, Hunt TJ: Evaluating the Impact of Information
Technology on Medication Errors: A Simulation. J Am Med Inform Assoc 2003, 10:292-

9. Jha AK, Kuperman GJ, Teich JM, Leape L, Shea B, Rittenberg E, Burdick E, Seger DL,
Vander Vliet M, Bates DW: Identifying adverse drug events: development of a computer-based
monitor and comparison with chart review and stimulated voluntary report. J Am Med Inform
Assoc 1998, 5:305-314.

10. Pronovost P, Weast B, Schwarz M, Wyskiel RM, Prow D, Milanovich SN, Berenholtz S,
Dorman T, Lipsett P: Medication reconciliation: a practical tool to reduce the risk of medication
errors. J Crit Care 2003, 18:201-5.

11. Sager N, Lyman M, Nhan NT, Tick LJ: Medical language processing: applications to patient
data representation and automatic encoding. Methods of information in medicine 1995, 34:140.


12. Friedman C, Alderson PO, Austin JH, Cimino JJ, Johnson SB: A general natural-language
text processor for clinical radiology. J Am Med Inform Assoc 1994, 1:161-174.

13. Evans DA, Brownlow ND, Hersh WR, Campbell EM: Automating concept identification in
the electronic medical record: an experiment in extracting dosage information. Proc AMIA Annu
Fall Symp 1996, :388-92.

14. Bramsen P, Deshpande P, Lee YK, Barzilay R: Finding temporal order in discharge
summaries. AMIA Annu Symp Proc 2006, :81-85.

15. Uzuner O, Goldstein I, Luo Y, Kohane I: Identifying patient smoking status from medical
discharge records. J Am Med Inform Assoc 2008, 15:14-24. doi:10.1197/jamia.M2408.

16. Cimino JJ, Bright TJ, Li J: Medication reconciliation using natural language processing and
controlled terminologies. Stud Health Technol Inform 2007, 129:679-83.

17. Jagannathan V, Mullett CJ, Arbogast JG, Halbritter KA, Yellapragada D, Regulapati S,
Bandaru P: Assessment of commercial NLP engines for medication information extraction from
dictated clinical notes. Int J Med Inform 2009, 78:284-91.

18. Uzuner Ö, Luo Y, Szolovits P: Evaluating the State-of-the-Art in Automatic De-
identification. Journal of the American Medical Informatics Association 2007, 14:550-563.

19. i2b2 NLP Research Data Sets [https://www.i2b2.org/NLP/DataSets/Main.php. Accessed]

20. Gillick D: Sentence Boundary Detection and the Problem with the US. In Proceedings of
Human Language Technologies: The 2009 Annual Conference of the North American Chapter of
the Association for Computational Linguistics, Companion Volume: Short Papers 2009:241–

21. Settles B: ABNER: an open source tool for automatically tagging genes, proteins and other
entity names in text. Bioinformatics 2005, 21:3191-2.

22. Frank E, Hall M, Trigg L, Holmes G, Witten IH: Data mining in bioinformatics using Weka.
Bioinformatics 2004, 20:2479-2481. doi:10.1093/bioinformatics/bth261.

23. Wishart DS, Knox C, Guo AC, Cheng D, Shrivastava S, Tzur D, Gautam B, Hassanali M:
DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res 2008,

24. Hersh WR, Bhupatiraju RT, Ross L, Roberts P, Cohen AM, Kraemer DF: Enhancing access
to the Bibliome: the TREC 2004 Genomics Track. J Biomed Discov Collab 2006,


25. Uzuner Ö, Solti I, Cadag E: Extracting Medication Information from Clinical Text.
Journal of the American Medical Informatics Association, in current issue.

26. Linux.words [http://www.ibiblio.org/pub/linux/libs/linux.words.2.lsm. Accessed 02/12/2010]

27. Dietterich TG, Hao G, Ashenfelter A: Gradient tree boosting for training conditional random
fields. Journal of Machine Learning Research 2008, 9:2113–2139.

28. Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG: A simple algorithm for
identifying negated findings and diseases in discharge summaries. J Biomed Inform 2001,

29. Doan S, Bastarache L, Klimkowski S, Denny JC, Xu H: Vanderbilt’s System for Medication
Extraction. In Third i2b2 Shared-Task Workshop Proceedings 2009.

30. Grouin C, Deleger L, Zweigenbaum P: A Simple Rule-based Medication Extraction System.
In Third i2b2 Shared-Task Workshop Proceedings 2009.

31. Hamon T, Grabar N: Concurrent linguistic annotations for identifying medication names and
the related information in discharge summaries. In Third i2b2 Shared-Task Workshop
Proceedings 2009.

32. Meystre SM, Thibault J, Shen S, Hurdle JF, South BR: Description of the Textractor System
for Medications and Reason for their Prescription Extraction from Clinical Narrative Text
Documents. In Third i2b2 Shared-Task Workshop Proceedings 2009.

33. Patrick J, Li M: A Cascade Approach to Extract Medication Event (i2b2 challenge 2009). In
Third i2b2 Shared-Task Workshop Proceedings 2009.

34. Shooshan SE, Aronson AR, Mork JG, Bodenreider O, Demner-Fushman D, Dogan RI, Lang
F, Lu Z, Neveol A, Peters L: NLM’s I2b2 Tool System Description. In Third i2b2 Shared-Task
Workshop Proceedings 2009.

35. Solt I, Tikk D: Yet another rule-based approach for extracting medication information from
discharge summaries. In Third i2b2 Shared-Task Workshop Proceedings 2009.

36. Spasic I, Sarafraz F, Keane JA, Nenadic G: Medication Information Extraction with
Linguistic Pattern Matching and Semantic Rules. In Third i2b2 Shared-Task Workshop
Proceedings 2009.

37. Yang H: A Linguistic Approach for Medication Extraction from Medical Discharge
Summaries. In Third i2b2 Shared-Task Workshop Proceedings 2009.


Table 1 Definitions of medication name and associated fields

Fields             Definition
Medication         Substances for which the patient is the experiencer, excluding food,
                   water, diet, tobacco, alcohol, illicit drugs, and allergic-reaction-related drugs.
Dosage             The amount of a single medication used in each administration.
Mode/route         Expressions describing the method for administering the medication.
Frequency          Terms, phrases, or abbreviations that describe how often each dose
                   of the medication should be taken.
Duration           Expressions that indicate for how long the medication is to be taken.
Reason             The medical reason for which the medication is stated to be given.


Table 2 Features for the medication relationship model

Feature Name                                        Meaning
Same sentence          Whether the medication and field are both in the same sentence,
                       as determined by Splitta.
Same subsection        Whether both elements in a medication field pair are located in
                       the same subsection of the discharge summary.
Numeral                Whether the value of the medication field contains numerals.
Distance               The number of tokens between a medication name and
                       medication field.
Position               Whether the medication field appears before or after the
                       medication name.
Field type             The type of field, such as duration, reason, etc.
Medication between     The number of other medication names between the pair.


Table 3 Three systems’ comparison results (F1 score, exact match). Significant outperformance
     is indicated by * (p< 0.05, Wilcoxon rank sum test).

 Two Levels      Granularity              Tags              jMerki        Lancet*   Hybrid*
  Horizontal       System           Medication event         68.2%        76.4%      79.0%
                   Patient          Medication event         67.2%        74.2%      77.6%
                   System           Medication name          77.2%        80.2%      83.4%
                   Patient          Medication name          76.6%        79.1%      82.9%
                   System                Dosage              67.9%        80.2%      81.8%
                   Patient               Dosage              66.0%        78.3%      80.6%
                   System                 Mode               70.8%        82.1%      85.0%
   Vertical        Patient                Mode               68.2%        74.0%      81.9%
                   System              Frequency             66.3%        81.3%      82.4%
                   Patient             Frequency             63.0%        78.8%      80.0%
                   System               Duration              8.9%        18.0%      21.2%
                   Patient              Duration              5.6%        14.0%      16.5%
                   System                Reason                0†          3.0%       2.9%
                   Patient               Reason                0†          2.4%       2.4%
†, caused by a programming bug.


Table 4 Post-hoc experimental results (Horizontal system level, exact match) Significant
     outperformance is indicated by * (p< 0.05, Wilcoxon rank sum test) compared with Lancet.

                      Precision         Recall           F1
     Lancet            90.4%            66.1%          76.4%
    NegPlus*           90.5%            61.8%          73.4%
      Digit            90.2%            66.5%          76.6%
      Affix            91.2%            62.6%          74.2%
   Dictionaries        91.0%            69.1%          78.5%
   Single-line         92.5%            67.1%          77.8%



Figure 1 Illustration of medication events in both a narrative and a list. As shown here, each
event includes a medication name and any of its related medication fields. Medication-field
associations are indicated by a dotted line with an arrow. Different font styles indicate different
fields: bold plus underline for medication name; italic for dosage; underline for mode/route;
italic plus bold for frequency; bold for duration; and italic plus underline for reason. The bracket
pair “[ ]” shows the narrative/list attribute.

Figure 2 Flow chart of the Lancet system

Figure 3 Precision for system-level horizontal evaluation of top ten systems: A) Strict evaluation
with exact match; B) Relaxed evaluation with inexact match; C) Strict evaluation with exact
match on list entries only. The dashed line indicates the average of the top ten systems.

Figure 4 Analysis of performance variance among discharge summaries

