i2b2 FIRST SHARED-TASK ON NATURAL LANGUAGE
CHALLENGES FOR CLINICAL DATA
Ozlem Uzuner¹, Peter Szolovits², Isaac Kohane³
1. Department of Information Studies, University at Albany, SUNY; 2. MIT CSAIL; 3. Partners Healthcare System
ABSTRACT

This shared-task and workshop aim to bring together computational linguists and medical informaticians interested in automatic linguistic processing of clinical records such as medical discharge summaries and radiology reports. Lack of a publicly available and standardized data set has been one of the barriers to systematic progress of Natural Language Processing techniques for clinical data.

Within the framework of the i2b2 project, we have generated and released a set of fully de-identified medical discharge summaries to the research community. We prepared two grand challenge questions around this data:
• What are some methods for automatic de-identification of clinical records and how well do they perform?
• Can we automatically identify the smoking status of patients based on their clinical records?

We have prepared the gold standard for both of these grand challenge questions. We made the data set available to interested researchers and invited them to participate in the grand challenge. At the time of writing, 18 teams had committed to participate.

The workshop on Challenges in Natural Language Processing for Clinical Data will meet with the Fall Symposium of the American Medical Informatics Association in November. This workshop will provide a venue for the participants of the grand challenge to present their papers and demonstrate their systems.

This project is supported by grant U54LM008748.

Overview and Goals

Clinical records contain significant medical information which can complement laboratory data in many ways. These records provide evidence for hypothesized situations and reveal the similarities and correlations among different health problems, medications, treatments, etc. However, the information included in these documents is in the form of unstructured, ungrammatical, fragmented English text. This makes the linguistic processing, search, and retrieval of these records very challenging; currently, there are very few tools for automatic linguistic processing of these records. Existing technologies for processing structured information such as databases, and grammatical documents such as journal or news articles, have limited utility for processing clinical records.

One barrier to the development of natural language processing technologies specific to clinical records is the difficulty of obtaining these records. In the absence of a standardized, publicly-available gold standard, efforts to build appropriate technologies have been limited and fragmented. The lack of a standardized, publicly-available gold standard limits the progress of the state-of-the-art in automatic linguistic processing of clinical records, limits the development of technologies available for search and retrieval of clinical text, and as a result limits the ability to make use of the information contained in these records.

As a part of the i2b2 (Informatics for Integrating Biology to the Bedside) project, we have organized a shared-task and workshop which will bring together medical informaticians, natural language processing researchers, and medical and clinical researchers. Our ultimate goal is to foster the symbiotic relationship between these research communities so that through their interactions, they can gain a deeper understanding of possible collaborations that can push the state-of-the-art forward.

To address the problem of limited availability of data that lies at the root of uncoordinated efforts to develop technologies for automatic linguistic processing of clinical data, we have released a set of de-identified clinical records. We have designed this data and developed the gold-standard to evaluate two particular grand challenge questions:
1. What is the state-of-the-art in automatic de-identification of clinical data?
2. How accurately can automatic methods evaluate the smoking status of patients based on their medical records?

We seek to evaluate various approaches to answering these questions on our standardized, public data set. We have created a training set that could be used for developing systems. We will compare the performance of submitted systems on a held-out test set.

We are in the process of organizing a workshop that will provide a venue for the grand challenge participants to demonstrate their systems and to present a short paper or poster discussing the scientific contributions behind their systems. The best performing systems will also be announced during the workshop.

Natural Language Processing for Data Preparation

The data for these challenges were fully de-identified. This process was conducted in two stages.

In the first stage, an automatic de-identification system was used [1]. Most approaches to de-identification rely heavily on dictionaries and heuristic rules; these approaches fail to remove most personal health information (PHI) that cannot be found in dictionaries. They also can fail to remove PHI that is ambiguous between PHI and non-PHI. Our approach showed that we can de-identify medical discharge summaries using support vector machines that rely on a statistical representation of local context. Comparing our approach with three different systems, we showed that a statistical representation of local context contributes more to de-identification than dictionaries and hand-tailored heuristics, and that when the language of documents is fragmented, local context (captured by the words immediately surrounding the target word) contributes more to de-identification than global context (captured by the information contained in the complete sentence).

In the second stage, the output of the automatic de-identifier was validated manually. Three manual passes were made over each record. Finally, the identified PHI was replaced with realistic surrogates.

Data for the smoking status evaluation challenge was hand annotated by pulmonologists.

[1] Sibanda, T., and Uzuner, O. Role of Local Context in Automatic Deidentification of Ungrammatical, Fragmented Text. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), 2006.

De-identification Challenge Data

We replaced PHI in the clinical records with realistic surrogates in order to preserve the real challenges present for the automatic de-identification systems. Authentic data contains some uncustomary entries for various PHI, e.g., “J Street” can be a hospital, “011406” can be a date. While generating surrogate PHI, we kept such peculiar cases in the data as much as possible.

Our surrogate generation approach permutes the syllables of existing names obtained from the U.S. Census bureau but conforms to the exact format of the authentic PHI. This approach to generating surrogate PHI usually produces entries such as “Valtawnprinceel Community Memorial Hospital” or “GIRRESNET, DIEDREO A”; note that each of these entries follows the exact format of the PHI it was generated to replace. However, this approach can sometimes generate entries such as “Black” and “John” which themselves can be found in the U.S. Census bureau name lists. We make no effort to eliminate such surrogates from the corpus.

Throughout the surrogate generation process, we’ve tried to protect the integrity of the data as much as possible. For example, dates in the same record have the same offset and references to the same named entity are replaced with the same surrogate. However, the methods employed are fairly basic; therefore, the results are not perfect.

[Figure: Distribution of PHI Types in Records. Bar chart of Number of Phrases (y-axis, up to 8000) by PHI type: patient, doctor, location, hospital, date, age, phone, id.]

Ambiguity of PHI

To make the de-identification task challenging, during the surrogate generation process, we introduced some ambiguity in PHI. In other words, we generated surrogate PHI that coincide with medical terms such as diseases, treatments, and medical test names. Roughly 30-40% of all surrogate PHI in our corpus are ambiguous with some disease, treatment, or test name.

Implications

Because of the randomly generated surrogate names of people, institutions, and places, dictionary-based approaches will be less successful with this data set than with real data. Note that, in reality, many foreign names (with no dictionary matches) exist and are used in discharge summaries.

Because dictionary-based lookup is often used as one source of information even in approaches much more sophisticated than simple dictionary matching, this dataset is particularly challenging.

Most identifying words appearing in the corpus that are in the dictionary are also disease names or other medical terms that were introduced to enhance ambiguity.

Smoking Challenge Data

The data for the smoking challenge were annotated by pulmonologists. The pulmonologists were asked to classify patient records into five possible categories based on the information contained in the records and based on their medical intuitions. Two pulmonologists annotated each record; in the case of disagreements, judgments from another pulmonologist were obtained.

PAST SMOKER: A patient whose discharge summary asserts either that they are a past smoker or that they were a smoker a year or more ago but who have not smoked for at least one year.
CURRENT SMOKER: A patient whose discharge summary asserts that they are a current smoker (or that they smoked without indicating that they stopped more than a year ago) or that they were a smoker within the past year.
SMOKER: A patient who is either a CURRENT or a PAST smoker but whose medical record does not provide enough information to classify the patient as either a CURRENT or a PAST smoker.
NON-SMOKER: A patient whose discharge summary indicates that they have never smoked.
UNKNOWN: The patient’s discharge summary does not mention anything about smoking.

Second hand smokers are considered NON-SMOKERs for the purposes of this study, unless there is evidence in their record that they actively smoked. Similarly, as we are only concerned with tobacco smoking, marijuana smoking should not affect the patients’ smoking status.

The doctors annotated a total of 1000 records. Agreement between them on the complete set of records was around 60% (as measured by Kappa). The records that the pulmonologists did not agree on were omitted from the challenge (unless a majority vote could identify a clear label for these records).

Implications

The low agreement among the doctors indicates that identification of the smoking status of patients based on the clinical records is challenging even for educated human annotators. Note that evaluation of smoking status based on the medical intuitions of the doctors is even harder, not only because the judgments do not directly rely on the records but also because the agreement between the doctors in this case is even lower (Kappa=~0.5).
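
The Kappa figures quoted for the smoking annotations are chance-corrected agreement scores. The poster does not say which Kappa variant was used or how ties were resolved, so the following is only a minimal sketch, assuming Cohen's kappa for two annotators and a simple majority vote once a third judgment is available. The function names `cohen_kappa` and `adjudicate` and the toy labels are hypothetical and are not taken from the challenge data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same records.

    Assumption: this is the Kappa variant meant in the text; the poster
    does not specify it.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of records with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def adjudicate(label_a, label_b, label_c=None):
    """Return the majority label, or None when no label has two votes.

    Hypothetical helper: records that come back as None would be the
    ones omitted from the challenge.
    """
    votes = Counter(l for l in (label_a, label_b, label_c) if l is not None)
    label, count = votes.most_common(1)[0]
    return label if count >= 2 else None


# Toy example with made-up labels (not challenge data).
annotator_1 = ["PAST SMOKER", "CURRENT SMOKER", "NON-SMOKER", "UNKNOWN"]
annotator_2 = ["PAST SMOKER", "NON-SMOKER", "NON-SMOKER", "UNKNOWN"]
kappa = cohen_kappa(annotator_1, annotator_2)
resolved = adjudicate("CURRENT SMOKER", "NON-SMOKER", "CURRENT SMOKER")
```

Because kappa subtracts the agreement expected by chance, it is lower than raw percent agreement whenever a few categories dominate the label distribution.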