					                                                      (IJCSIS) International Journal of Computer Science and Information Security,
                                                      Vol. 9, No. 6, June 2011




Recognizing The Electronic Medical Record Data
 From Unstructured Medical Data Using Visual
            Text Mining Techniques
       Prof. Hussain Bushinak                         Dr. Sayed AbdelGaber                             Mr. Fahad Kamal AlSharif
         Faculty of Medicine                  Faculty of Computers and Information                    College of Computer Science
        Ain Shams University                            Helwan University                                  Modern Academy
             Cairo, Egypt                                  Cairo, Egypt                                       Cairo, Egypt

Abstract: Computer systems and communication technologies have established a strong and
influential presence in the different fields of medicine. The cornerstone of a functional
medical information system is the Electronic Health Record (EHR) management system. EHR
implementation and adoption face different barriers that slow down deployment in different
organizations. This research focuses on resolving the most common barriers, which are data
entry, unstructured clinical data, and modification of the physician workflow. This research
proposes a solution that uses text mining and natural language processing techniques. The
solution was tested and verified in four real-world clinical organizations. The suggested
solution demonstrated correctness and precision of 91.88%.

Keywords: Electronic Health Record, Text Mining, Unstructured Medical Data, Medical Data
Entry, Health Information Technology.

I. INTRODUCTION

The paper-based medical record is woefully inadequate for meeting the needs of modern
medicine. It arose in the 19th century as a highly personalized "lab notebook" that
clinicians could use to record their observations and plans so that they could be reminded
of pertinent details when they next saw that same patient. There were no bureaucratic
requirements, no assumptions that the record would be used to support communication among
varied providers of care, and remarkably few data or test results to fill up the record's
pages. The record that met the needs of clinicians a century ago has struggled mightily to
adjust over the decades to accommodate new requirements as health care and medicine have
changed, which has led to the emergence of Health Information Technology (HIT) [1].

HIT allows comprehensive management of medical knowledge and its secure exchange among
health care consumers and providers. Broad use of HIT will:

1. Help to eliminate the manual tasks of extracting data from charts or filling out
   specialized datasheets.
2. Help to derive data directly from the electronic record, making research-data collection
   a by-product of routine clinical record keeping.
3. Help to move from a paper-based health care system to secure electronic medical records,
   which will save lives and reduce health care costs.
4. Help in early detection of infectious disease through advanced data collection, fusion,
   and processing techniques, which would be at the forefront in spotting the emergence of
   new diseases and crucial to tracking the spread of known diseases [2].

II. ELECTRONIC HEALTH RECORD, DEFINITION AND MODELS

The EHR is defined as a longitudinal electronic record of patients' health information
generated by one or more encounters in any care delivery setting. This information includes,
but is not limited to, patient demographics, progress notes, examination details such as
symptoms and findings, medications, vital signs, past medical history, immunizations,
laboratory data, and radiology reports. The EHR automates and streamlines the clinician's
workflow. The EHR has the ability to generate a complete record of a clinical patient
encounter, as well as supporting other activities directly or indirectly related to care
via interface, including evidence-based decision support, quality management, and outcomes
reporting. The EHR is a repository of patient data in digital form, stored and exchanged
securely and accessible by multiple authorized users [2][3][4].

There are many EHR architectural models in use around the world. The two most popular EHR
models are:

1. Central Repository Model

The center of this EHR model is the repository, which is fed by the existing applications in
different care locations such as hospitals, clinics, and family physician practices. The
feed from these applications is message-based, following pre-agreed standards. The
messaging needs to be based on well-defined standards, for




                                                                 25                               http://sites.google.com/site/ijcsis/
                                                                                                  ISSN 1947-5500




example the HL7 Reference Information Model (RIM), for which XML could be used as the
recommended Implementation Technology Specification (ITS) [5].

             Figure 1. EHR Central Repository Model

The event-driven messages that need to be sent and stored in the repository will
essentially be event-based summaries, as shown in Figure 2. The event-based summaries stored
in the repository can be queried and retrieved by different clinicians who are treating the
patients in different scenarios and clinical settings. Retrieval and access of data from the
repository are subject to establishing that the clinicians legitimately access the data to
treat only those patients who are in their care. Retrieval is done through messaging, which
can be either synchronous or asynchronous depending on the urgency, complexity, and
importance of the data being retrieved [5].

             Figure 2. EHR Message Events

2. Managed Services Model

The managed services model is based on hosting applications for different care providers and
care settings in a data center run by a consortium, which may consist of a group of
infrastructure providers, system integrators, and application providers. The hosted
applications can be used to provide an effective EHR by building a common repository using a
shared database, or by providing a common user interface to all hosted applications and
extracting data from these systems using a portal whose authentication and authorization
mechanism can also be controlled at the data center level, as shown in Figure 3 [5].

             Figure 3. Shared Services Model

III. BARRIERS OF THE ELECTRONIC HEALTH RECORD IMPLEMENTATION

Implementation of EHR faces different barriers, and these barriers vary from one environment
to another. Hereafter, the main focus will be on the general barriers that exist in most EHR
implementation attempts. These barriers are:

1. Financial Barriers

Financial barriers are divided into the following points:
   - High costs: These costs are divided into two main parts, initial cost and ongoing
     cost [6].
   - Under-developed business case: This barrier arises because of the following:
     uncertainty about the return on investment of EHR; financial benefits are only achieved
     in the long run; and the main objective and benefit of EHR is to provide a high-quality
     medical service for citizens [6].

2. Technological Barriers

Technological barriers are divided into four points [7]:
   - Inadequate technical support
   - Inadequate data exchange
   - Security and privacy
   - Lack of standards

3. Physicians' Attitudinal and Behavioral Barriers in Data Entry

Many health information system projects fail due to attitudes, behaviors, barriers in data
entry, and lack of systematic consideration of human-centered computing issues such as
usability, workflow, organizational change, and process reengineering. There are two major
factors that








lead to sluggish performance of the EHR system: the complexity of the Graphical User
Interface (GUI) and the system response time. This forces clinicians to see fewer patients
and have longer workdays, largely because of the extra time needed to use the system [8].

In 2004, Lisa Pizziferri and others concluded that the benefits of using an EHR system can
be achieved and accepted by physicians only if the physicians do not need to sacrifice their
time with patients or other activities during clinic sessions. Physicians recognize the
quality improvements achieved by EHRs, but their time should be saved by decreasing the time
required for data entry in EHR systems [9].

4. Organizational Change Barriers

This category contains the following points:
   - Design of and alignment with workflow and office integration: According to the American
     Academy of Family Physicians survey results (American Academy of Family Physicians
     2004), 54.2 percent of the 5000 respondents reported that they were worried about
     slower workflow and lower productivity [10].
   - Migration from paper-based systems
   - Staff training

5. The Format of Clinical Data Storage in EHR Systems

Generally speaking, there are two main forms of data storage: structured data and
unstructured data.
   - Structured data: Structured data is data that has a relational data model and enforces
     composition to the atomic data types. Structured data is managed by technology that
     allows for querying and reporting against predetermined data types and understood
     relationships, such as patient demographics, laboratory tests, etc. [11]
   - Unstructured data: Unstructured data consists of any data stored in an unstructured
     format at an atomic level. That is, in unstructured content there is no conceptual
     definition and no data type definition; in textual documents, a word is simply a
     word [11].

Unstructured data consists of two basic categories:
   - Bitmap objects: Inherently non-language based, such as X-rays, radiology, video, or
     audio files.
   - Textual objects: Based on a written or printed language, such as clinical reports,
     nursing notes, and examination sheets [11].

Using unstructured data for storing clinical data has the following limitations:
   - The data is not consumable at a semantic level without a compatible interface or
     application.
   - Technology cannot necessarily gain insight into the context of the information unless
     it can actually read it.

6. Barriers of Using Unstructured Data in the Electronic Health Record

Aggregation of information across all the records in a large repository could bring benefits
for clinical research. When physicians work with structured data, they can receive alerts
about drugs that interact badly with each other, which enables them to enhance the treatment
process and avoid medication errors; this cannot be done with unstructured data [12].

IV. SURVEYING THE SOLUTIONS OF EHR DATA ENTRY BARRIERS

In October 2010, Ergin Soysal, Ilyas Cicekli, and Nazife Baykal designed and developed an
ontology-based information extraction system for radiological reports [15].

The main goal of this technique is to extract and convert the available information in
free-text Turkish radiology reports into a structured information model using manually
created extraction rules and a domain ontology. This technique extracts data from
radiological reports, which are free text written by physicians, and inserts it as
structured data into the EHR [13].

However, this technique has the following drawbacks:
   - It concentrates mainly on abdominal radiology reports.
   - It does not use a large and trusted repository of medical expressions, which may reduce
     the quality of the information extraction process. Consequently, wrong clinical
     information may be recorded.

In September 2010, Adam Wright, Elizabeth S. Chen, and Francine L. Maloney developed a
technique for identifying associations between medications, laboratory results, and
problems. They developed a








knowledge base of medication-problem and laboratory-result-problem associations in an
automated fashion. It was based on two data mining techniques: frequent item set mining and
association rule mining. This technique was able to identify a large number of clinically
accurate associations. A high proportion of high-scoring associations were adjudged
clinically accurate when evaluated against the gold standard (89.2% for medications with the
best-performing statistic, chi square, and 55.6% for laboratory results using interest)
[14]. However, this technique has the following drawbacks:
   - The researchers assumed that patients' data was structured.
   - Building the knowledge base concentrated only on patients' problems, medications, and
     laboratory results, which means that other data, such as the patient's history,
     diagnosis, and procedures, are not taken into account.
   - Data entry is done through a traditional GUI, so this solution did not enhance the
     physician workflow.

In September 2010, a system for handling misspellings in drug information system queries was
developed by Christian Senger, Jens Kaltschmidt, Simon P.W. Schmitt, Markus G. Pruszydlo,
and Walter E. Haefeli. This system attempted to solve the problem of drug data entry in a
Drug Information System (DIS). The researchers evaluated correctly spelled and misspelled
drug names from all queries of the University Hospital of Heidelberg. The results indicated
that DIS search engines should be equipped with error-tolerant search capabilities.
Auto-completion lists might expedite searches but might fail regularly due to the high
frequency of typographic errors already in the initial letters. The system improved DIS data
entry by using spelling correction tools to make the drug information understandable and
available, but it concentrated only on the DIS, without examination, history, and procedure
data [16].

In August 2010, a technique was developed by Yong-gang Cao, James J. Cimino, John Ely, and
Hong Yu for automated identification of diseases and diagnoses in clinical records. This
technique presents an approach for prototyping a diagnosis classifier based on a popular
computational linguistics platform [18]. This technique has the following limitations:
   - It focuses only on extracting disease keywords and ignores other important parts such
     as operations, symptoms, findings, etc.
   - It does not use spelling correction.
   - There is no clear structured data model to store the data extracted from the clinical
     report.
   - It does not use a large and trusted data source for medical expressions such as the
     Unified Medical Language System (UMLS).

In July 2010, another technique for automatically extracting the information needed from
complex clinical questions was developed by Yong-gang Cao, James J. Cimino, John Ely, and
Hong Yu. They built a fully automated system, AskHERMES, which helps clinicians extract and
articulate multimedia information from literature to answer their ad hoc clinical questions.
This system automatically retrieves, extracts, and integrates information from the
literature and other information resources and attempts to formulate this information as
answers in response to ad hoc medical questions posed by clinicians, all of which can be
achieved within a time frame that meets their demands [17]. This technique succeeds in
clinical question answering and in identifying the category of the question, but in the EHR
system adoption process it faces the following limitations:
   - This technique extracts the clinical information to identify the question category but
     not to store this information in the EHR repository.
   - It works only on question answering, not on the data entry process.
   - It does not enhance the physician workflow during the examination process.

Although the previous techniques attempted to solve the EHR data entry barrier, they have
the following limitations:
   - These techniques concentrate on specific parts of the data, such as diseases, and
     leave out other parts.
   - The medical expression repository used does not contain all the expressions or the
     semantic relations between them.
   - Some of these techniques store the EHR data as free text (unstructured data form).
   - The physician workflow is modified, which in turn leads to more physical and mental
     effort and reduces the physician's productivity.

V. BRIDGING THE UNSTRUCTURED DATA TO STRUCTURED EHR

The suggested idea is to convert the unstructured free-text clinical data to structured EHR
data without modifying the workflow of physicians or adding any physical or mental effort
for them. Figure (4) shows the algorithm of the suggested technique.








             Figure 4 Objective Technique Steps

Step 1: Optical Character Recognition (OCR)
  The physician writes his/her diagnoses as usual on a pen-pad, paper, or tablet PC. If the
  clinical report was written on paper, it needs to be scanned. The clinical report data is
  stored as an image of free-hand text that can be processed. This image is passed through
  an OCR tool to convert it to machine-encoded text. The details of this step are
  represented in Figure (5).

             Figure 5 OCR and Handwriting input and output

Step 2: Spelling Corrector
  Machine-encoded text may include spelling errors, which may yield wrong information during
  the extraction process. So, all incorrectly spelled words are corrected before moving to
  the next step. This step requires a medical dictionary that contains most of the medical
  expressions in different forms, such as verbs, adjectives, nouns, etc. Figure (6)
  represents the details of this step.

             Figure 6 Spell Check input and output

Step 3: Text Mining with Natural Language Processing Techniques
  In this step, the resulting data is cleaned and partitioned into statements to be
  classified and coded. Using text mining and NLP, all medical data is classified and coded
  in the form of multiple statements, and unwanted words are removed. This step consists
  of [19]:
     - Text preprocessing
     - Part of speech tagging
     - Statements segmentation
     - Noun phrase extraction
  Each of these components is described below.

  1. Text preprocessing: Also called tokenization or text normalization, it includes the
     following steps [19]:
        - Throw away unwanted material (e.g., unwanted brackets and tags).
        - Word boundaries: white space and punctuation.
        - Stemming (lemmatization): This is optional. English words like 'look' can be
          inflected with morphological suffixes to produce 'looks, looking, looked'. They
          share the same stem 'look'. Often (but not always) it is beneficial to map all
          inflected forms to the stem. This is a complex process since there can be many
          exceptional cases (e.g., department vs. depart, be vs. were). The most commonly
          used stemmer is the Porter Stemmer. However, there are many others.
        - Stop word removal: the most frequent words often do not carry much meaning.
        - Capitalization, case folding: often it is convenient to lower-case every
          character.
  2. Part of speech tagging: A Part-Of-Speech Tagger (POS Tagger) is a piece of software
     that reads text in some language and assigns a part of speech, such as noun, verb, or
     adjective, to each word (and other token) [19].
  3. Statements segmentation: This part divides the clinical text into several
     statements [19].








    4.   Noun phrase extraction: In this part, all noun
         phrases are extracted and the complex noun
         phrase is decomposed into smaller noun phrases.
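
To make Steps 2 and 3 more concrete, the following is a minimal Python sketch of dictionary-based spelling correction, statement segmentation, and text preprocessing. It is an illustration under stated assumptions, not the authors' implementation: `MEDICAL_DICTIONARY` and `STOP_WORDS` are tiny hypothetical stand-ins for a real medical dictionary and stop-word list, the similarity cutoff of 0.8 is an arbitrary choice, and splitting on punctuation is a naive rule that would break on abbreviations or decimal values. Stemming, part of speech tagging, and noun phrase extraction are left out because they need a stemmer and a trained tagger.

```python
import re
from difflib import get_close_matches

# Hypothetical stand-ins for the medical dictionary and the stop-word list.
MEDICAL_DICTIONARY = ["patient", "complains", "severe", "acute", "headache",
                      "fever", "denies", "chest", "pain", "nausea"]
STOP_WORDS = {"the", "a", "an", "of", "and", "for", "to", "is"}


def correct_spelling(word, dictionary=MEDICAL_DICTIONARY, cutoff=0.8):
    """Return the closest dictionary entry, or the word unchanged if none is close."""
    if word in dictionary:
        return word
    matches = get_close_matches(word, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def segment_statements(text):
    """Naively split text into statements at sentence-ending punctuation."""
    parts = re.split(r"[.;!?]+", text)
    return [part.strip() for part in parts if part.strip()]


def preprocess(statement):
    """Drop tags and brackets, tokenize, case-fold, spell-correct, remove stop words."""
    cleaned = re.sub(r"<[^>]*>|[\[\](){}]", " ", statement)  # unwanted tags/brackets
    tokens = re.findall(r"[a-z]+", cleaned.lower())          # word boundaries + case folding
    corrected = [correct_spelling(token) for token in tokens]
    return [token for token in corrected if token not in STOP_WORDS]


def run_pipeline(text):
    """Segment the text into statements, then preprocess each statement."""
    return [preprocess(statement) for statement in segment_statements(text)]
```

For example, `run_pipeline("The patient complains of sevre headach. Denies chest pain")` yields one cleaned token list per statement, which could then be passed to a part of speech tagger and a noun phrase extractor.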




             Figure 7 Text mining and NLP tasks

Step 4: Unified Medical Language System (UMLS) Coding
  To identify the clinical information, a large repository of clinical expressions is needed
  from which the matching clinical expressions can be extracted. The UMLS is used to achieve
  this purpose. The UMLS is a compendium of many controlled vocabularies in the biomedical
  sciences, created in 1986. It provides a mapping structure among these vocabularies and
  allows translation among the various terminology systems. It may be viewed as a
  comprehensive thesaurus and ontology of biomedical concepts [20].

             Figure 8 UMLS expressions coding

  The pseudocode of the UMLS coding algorithm can be:
      For each statement S in Statements    // statements in the physician sheet
      Begin
          For each noun phrase N in S
          Begin
              If N exists in UMLS then
                  Extract N and C           // where C is the UMLS code of N
                  Put N with C as pair <N, C>
              End if
          End
      End

Step 5: Classify EHR Components
  The suggested technique is applied to the physician's examination sheet. The examination
  sheet contains the following classes:
     - History
     - Examination
     - Diagnosis
     - Procedure
                                                                         Each part treated as a class and all coded clinical data
    UMLS consists of the following components: [20]                      that were produced from the previous steps classified
            Metathesaurus, the core database of the                      into one of the previous classes.
            UMLS, a collection of concepts and terms
            from the various controlled vocabularies                     The first step in the classification process is building a
            and their relationships.                                     collective set of features that is typically called a
            Semantic Network, a set of categories and                    dictionary. The UMLS clinical expressions in the
            relationships that are being used to classify                dictionary form represent the base to create a
            and relate the entries in the Metathesaurus.                 spreadsheet of numeric data corresponding to the
            Specialist Lexicon, a database of                            previous defined classes.
            lexicographic information to be used in
                natural language processing.
               A number of supporting software tools.
    Morphologically analyzed words are compared to the
    UMLS entries to find the best matched expression                                TABLE (1): CLASSES DICTIONARY
    according to its Morphological position. Each noun
    phrase which matches a clinical expression entry in
    the UMLS, put as a pair that contains the noun phrase
    with its UMLS’s clinical codes.
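
The UMLS coding pseudo code of Step 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this research: the dictionary `TOY_UMLS` is a hypothetical stand-in for a real UMLS lookup, its codes are placeholders rather than genuine UMLS identifiers, and matching is done by exact lowercase comparison, whereas the technique described here matches morphologically analyzed words.

```python
# Hypothetical stand-in for a UMLS lookup table.
# The codes are placeholders, NOT real UMLS identifiers.
TOY_UMLS = {
    "enlarged prostate": "CODE-001",
    "nocturnal enuresis": "CODE-002",
}


def code_statements(statements, lexicon):
    """Return <noun phrase, code> pairs for phrases found in the lexicon.

    `statements` is a list of statements, each given as the list of
    noun phrases extracted from it in the previous step.
    """
    pairs = []
    for noun_phrase_list in statements:          # for each statement S
        for phrase in noun_phrase_list:          # for each noun phrase N in S
            code = lexicon.get(phrase.lower())   # does N exist in the lexicon?
            if code is not None:
                pairs.append((phrase, code))     # put N with C as pair <N, C>
    return pairs


statements = [
    ["Enlarged prostate", "the bladder"],
    ["nocturnal enuresis"],
]
pairs = code_statements(statements, TOY_UMLS)
print(pairs)
```

Phrases with no entry in the lexicon, such as "the bladder" in this toy example, are simply skipped, as in the pseudo code.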








      Each row defines a class and each column represents a
      UMLS code. Each cell in the spreadsheet holds the
      measurement of the feature corresponding to its column for
      the class corresponding to its row. The dictionary of words
      covers all the possibilities, and the number of words
      corresponds to the number of columns. All cell values range
      between zero and one, depending on whether the words were
      encountered in the class or not. The form of the classes’
      dictionary is shown in table (1).

      The second step is measuring the similarity between the
      extracted expressions and the defined classes, then
      assigning each expression to the most similar class. The
      cosine algorithm was selected to calculate the similarity
      between the extracted clinical phrases and the predefined
      classes. The steps of the cosine similarity algorithm are:
                  Compute the similarity of the new clinical
                  phrase to all classes in the dictionary.
                  Select the class that is most similar to the
                  new clinical phrase.
                  The class which occurs most frequently is
                  the similar one.

Figure 9: Computing similarity scores for New Clinical Phrase

      For cosine similarity, only positive words shared by
      the compared phrases are considered. Frequency of
      word occurrence is also valued. The clinical phrase is
      compared with each class by the following equation: [21]

      \[ \mathrm{Norm}(P) = \sqrt{\sum_{j} W(j)^{2}} \]
      where W(j) is the weight of word phrase j in the class, and
      \[ \mathrm{Cosine}(P_1, P_2) = \frac{\sum_{j} w_{P_1}(j) \, w_{P_2}(j)}{\mathrm{Norm}(P_1) \, \mathrm{Norm}(P_2)} \]
      where w_{P_i}(j) is the weight of word phrase j in class i.

      The cosine similarity of two classes ranges from 0 to 1.
      The angle between two term-frequency vectors cannot be
      greater than 90°; consequently, the closer the cosine value
      is to 1, the more similar the clinical phrase is to the
      compared class.

Step 6: Storing data in EHR Repository
      The classified clinical phrase is stored in its class inside
      the EHR database with its matched UMLS code. For
      example, suppose a physician wrote the following:

      There is enlarged prostate with tender base of the bladder.

      This statement contains two findings. The statement is
      compared with each class: its cosine scores against each
      defined class are calculated according to the previous
      equations, and the winning class is the one with the
      highest score. The data are stored in the winning class
      with their UMLS codes as pairs inside the EHR repository:
                    < enlarged prostate, Finding>
                    < tender base of the bladder, Finding>
      The EHR is thus put in a structured form suitable for
      analysis and data mining operations, or as a resource for a
      decision support system.

             VI.     THE EXPERIMENTAL STUDY
      The aim of the experiment is to demonstrate the success of
      the suggested technique in real-world cases. The hypotheses
      of this experiment are:
                  The physician has little experience in using
                  computers.
                  The physician’s handwriting is readable.
                  The medical abbreviations used are standard.
                  The experiment is applied during the
                  examination session.
      The equipment required to implement the experiment is:
                  An electronic pen pad.
                  A laptop or personal computer.
                  Windows Vista or later
                  SQL Server 2008
                  Microsoft Office 2007 or later (for
                  applying OCR to the pen pad input)
                  .NET Framework 4
                  UMLS database system
                  Medical dictionary (for spelling correction)
      The implementation of the experimental study goes through
      the following steps:

           Step 1: At the nurse’s office, the patient’s
           demographic data are recorded using the following
           screen.








         Figure 10: EHR demographics form

Step 2: The physician uses the pen pad to write
the diagnosis.

          Figure 11: Pen pad to Computer Form

The physician is free to erase, add to, or modify any
part of his/her diagnosis. This step lets him/her work
as usual without any additional effort. The data are
recorded directly on the computer, which helps the
physician retrieve them easily, either in their original
form or as structured data.

Step 3: After the physician finishes handwriting,
he/she presses the OCR button to convert the
diagnosis from image form to machine-coded text,
as shown in the following figure:

          Figure 12: Applying OCR on the diagnosis sheet

Step 4: After the OCR is done, the system checks
and corrects the spelling errors in the examination
data against the installed medical dictionary,
through an interactive session with the physician.

          Figure 13: Applying spell check on the examination text

Step 5: After the spelling correction is done, the
physician presses the “insert into EHR” button to
convert the diagnosis data from unstructured to
structured form. Conversion is done through
the following steps:
        Text preprocessing: All brackets, unwanted
        content, and word boundaries are removed.








        Parts of speech tagging: A part of speech is
        assigned to each word.
        Statements segmentation: The examination text
        is split into multiple statements.
        Phrase tagging: Each phrase is tagged with
        the suitable code to identify all phrases
        contained in the diagnosis sheet.
The output of this step is the examination words
with their parts of speech; this output has the
following format:
(TOP (S (NP (DT A) (ADJP (NP (CD 15) (NNS years)) (JJ old)) (JJ female) (NN patient)) (VP (VBZ complains) (PP (IN from) (NP (JJ nocturnal) (NN enuresis))) (PP (IN since) (NP (NN birth)))) (. . .)))
(TOP (S (NP (NP (JJ Plain) (NN X-ray)) (PP (IN of) (NP (DT the) (NN abdomen)))) (VP (VBD was) (ADJP (JJ free))) (. .)))
(TOP (S (NP (JJ Abdominal) (NN ultra) (NN sonography)) (VP (VBD was) (ADJP (JJ free))) (. .)))
(TOP (S (NP (PRP he)) (VP (VBZ has) (NP (NP (NNP Enuresis)) (SBAR (S (NP (DT The) (NN patient)) (VP (MD should) (VP (VB receive))))) (: :) (NP (NP (NNP R1) (NNP Uipam) (NN tablet)) (NP (NP (CD one) (NN tablet)) (NP (RB twice) (RB daily)) (PP (IN for) (NP (CD three) (NNS months))))))) (. .)))
(TOP (S (PP (IN R2) (NP (NNP Dipripam) (CD 20) (NN mg) (NN capsule))) (NP (NP (CD one) (NN tablet)) (NP (RB twice) (RB daily)) (PP (IN for) (NP (CD three) (NNS months)))) (. .)))
(TOP (S (NP (DT R3) (NNP Depavit) (NNP B12) (NN ampule)) (. .)))

      Figure 14: Output of Text mining technique

       Noun Phrase Extraction:
       All noun phrases are extracted, and compound
       noun phrases are divided into smaller noun
       phrases, such as the following:
            o A 15 years old female patient
            o 15 years
            o Nocturnal enuresis since birth
            o Birth
            o Plain X-ray of the abdomen
            o Plain X-ray
            o The abdomen
            o Abdominal ultra sonography
            o Enuresis
            o The patient
            o R1 Uipam tablet
            o One tablet twice daily for three
               months
            o One tablet
            o Twice daily
            o Three months
            o Dipripam 20 mg capsule
            o One tablet twice daily for three
               months
            o One tablet
            o Twice daily
            o Three months
            o R3 Depavit B12 ampule

Step 7: All noun phrases are coded with UMLS
codes. The output of this step is presented in table
(2).

      TABLE (2): NOUN PHRASES WITH THEIR UMLS CODES.

Each statement gets a score according to its UMLS
codes and the classes’ dictionary declared in
table (1). Table (3) shows the statements and their
scores.

      TABLE (3): STATEMENTS’ SCORE.

Step 8: According to the scores shown in table
(3), the statements are classified into their classes.
The predefined classes are:
       History
       Examination
       Diagnosis
       Procedure
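
The classification in Step 8 can be sketched in Python with the cosine measure described earlier. This is a minimal sketch under stated assumptions rather than the classifier used in the experiment: `CLASS_DICTIONARY` and its placeholder codes are hypothetical (they are not real UMLS codes and do not reproduce table (1)), each statement is represented as a mapping from code to weight, and the functions `norm`, `cosine`, and `classify` are names introduced only for illustration.

```python
import math


def norm(weights):
    """Norm(P): square root of the sum of squared weights."""
    return math.sqrt(sum(w * w for w in weights.values()))


def cosine(p1, p2):
    """Cosine(P1, P2), summing only over codes shared by both vectors."""
    shared = p1.keys() & p2.keys()
    dot = sum(p1[code] * p2[code] for code in shared)
    denominator = norm(p1) * norm(p2)
    return dot / denominator if denominator else 0.0


def classify(statement, class_dictionary):
    """Score a statement against every class and return the winner."""
    scores = {name: cosine(statement, row)
              for name, row in class_dictionary.items()}
    winner = max(scores, key=scores.get)
    return winner, scores


# Hypothetical class dictionary: one row per class, one column per code.
CLASS_DICTIONARY = {
    "History": {"CODE-001": 1.0, "CODE-002": 1.0},
    "Examination": {"CODE-003": 1.0},
    "Diagnosis": {"CODE-002": 1.0, "CODE-004": 1.0},
    "Procedure": {"CODE-005": 1.0},
}

statement = {"CODE-002": 1.0, "CODE-004": 1.0}
winner, scores = classify(statement, CLASS_DICTIONARY)
print(winner, scores)
```

With these toy weights the statement shares both of its codes with the Diagnosis row, so Diagnosis wins; History shares only one code and scores about half as much, and classes with no shared code score zero.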








 The classifier uses the cosine similarity algorithm
 to classify each statement according to the class
 dictionary. Table (4) shows the score of each
 statement relative to the nearest class.

TABLE (4): COS SIMILARITY SCORES FOR EACH CLASS.

 Step 9: After the winning class is determined for
 each statement, each noun phrase is saved with its
 UMLS code inside the EHR, in the winning class, as
 a paired tag. Table (5) shows this format.

      TABLE (5): DATA INSERTED INSIDE THE EHR

 Step 10: The extracted information is compared
 with the physician’s manual results to determine the
 precision of the suggested technique.

     VII.     RESULTS DISCUSSION

 The experimental study was conducted in four
 medical departments. In each department, 10
 diagnosis sheets were tested. The tested departments
 are:
       Surgical Oncology
       Surgery Urology
       Cardiology
       General Surgery

 Table (6) shows the overall precision percentage
 in each tested department.

 TABLE (6): RESULTS OF THE EXPERIMENTAL STUDY.

      Department               Overall Precision
   Surgical Oncology                92.96%
    Surgery Urology                 91.55%
       Cardiology                   92.33%
     General Surgery                88.61%
   Overall precision                91.36%

   Some factors affect the results, such as the quality of the
 physician’s handwriting. The effect of this factor is clear
 in the result of experiment four (General Surgery), which
 has the lowest precision percentage (88.61%). A
 high-precision OCR tool can minimize the effect of this
 factor, but it may be expensive. The results indicate that
 the suggested technique succeeded with a high percentage
 in a real-world experiment, which means that this technique
 can be applied in real life in the future.

             VIII.      CONCLUSION
   The suggested technique succeeded in working as a
 bridge between unstructured and structured medical
 data. The medical data are stored inside the EHR system
 in their right position without any additional physical or
 mental effort by the physician, which in turn satisfies the
 main objective of this research.


                     REFERENCES

[1] Institute of Medicine, “Review of the Adoption and
    Implementation of Health IT Standards by the DHHS
    Office of the National Coordinator for Health
    Information Technology”,
    http://www.iom.edu/Activities/Workforce/HealthITStandards.aspx

[2] Richard Dick, Elaine B. Steen, and Don Detmer, “The
    Computer Based Patient Record: An Essential
    Technology for Health Care”, National Academy
    Press, 1997.

[3] See HIMSS web page for the consensus definition of
    an electronic health record.
    http://www.himss.org/ASP/topics_ehr.asp.

[4] J.H. van Bemmel and M.A. Musen, “Handbook of
    Medical Informatics”, Springer, 1997.








[5] K. Ananda Mohan, “National Electronic Health
    Record Models”, Tata Consultancy Services
    (TCS), 2004.

[6] Miller, R. H. and Sim, Ida, “Physicians’ Use Of
    Electronic Medical Records: Barriers And Solutions”,
    Health Affairs, 2004.

[7] Waegemann, “EHR vs. CPR vs. EMR”, Healthcare
    Informatics, 2003.

[8] Himali Saitwal, Xuan Feng, Muhammad Walji,
    Vimla Patel, Jiajie Zhang, “Assessing performance of
    an Electronic Health Record (EHR) using Cognitive
    Task Analysis”, Elsevierhealth, 2010.

[9] Lisa Pizziferri, Anne F. Kittler, Lynn A. Volk, Melissa
    M. Honour, Sameer Gupta, Samuel Wang, Tiffany
    Wang, Margaret Lippincott, Qi Li and David W.
    Bates, “Primary care physician time utilization before
    and after implementation of an electronic health
    record: A time-motion study”, Elsevierhealth, 2004.

[10] American Academy of Family Physicians, “Family
     Practice Management Monitor”, AAFP pushes for
     affordable EMR system, 2004.

[11] Oleh Hrycko, “Electronic Discovery in Canada: Best
     Practices and Guidelines”, CCH, 2007.

[12] Angus Roberts, Robert Gaizauskas, Mark Hepple,
     George Demetriou, Yikun Guo, Ian Roberts, Andrea
     Setzer, “Building a semantically annotated corpus of
     clinical texts”, Elsevierhealth, 2009.

[13] Hanna M. Seidling, Marilyn D. Paterno, Walter E.
     Haefeli, David W. Bates, “Coded entry versus free-text
     and alert overrides: What you get depends on how
     you ask”, Elsevierhealth, 2010.

[14] Adam Wright, Elizabeth S. Chen and Francine L.
     Maloney, “An automated technique for identifying
     associations between medications, laboratory results
     and problems”, Elsevierhealth, 2010.

[15] Ergin Soysal, Ilyas Cicekli, Nazife Baykal, “An
     ontology based information extraction system for
     radiological reports”, Elsevierhealth, 2010.

[16] Christian Senger, Jens Kaltschmidt, Simon P.W.
     Schmitt, Markus G. Pruszydlo, Walter E.
     Haefeli, “Misspellings in drug information system
     queries: Characteristics of drug name spelling errors
     and strategies for their prevention”, Elsevierhealth,
     2010.

[17] Yong-gang Cao, James J. Cimino, John Ely, Hong Yu,
     “Automatically extracting information needs from
     complex clinical questions”, Elsevierhealth, 2010.

[18] Dina Demner-Fushman, James G. Mork, Sonya E.
     Shooshan, Alan R. Aronson, “UMLS content views
     appropriate for NLP processing of the biomedical
     literature vs. clinical text”, Elsevierhealth, 2009.

[19] Malgorzata Marciniak, Agnieszka Mykowiecka,
     “Aspects of Natural Language Processing”,
     Springer, 2009.

[20] Catherine R. Selden, Betsy L. Humphreys, “Unified
     Medical Language System: Current Bibliographies in
     Medicine”, National Institutes of Health, 1990.

[21] Jiawei Han, Micheline Kamber, “Data mining:
     concepts and techniques”, Diana Cerra, 2006.



