

     GAPSCORE: Finding Gene and Protein
        Names One Word at a Time

     Jeffery T. Chang1, Hinrich Schutze2 &
                 Russ B. Altman1

       Department of Genetics, Stanford Medical
               Enkata Technologies, USA
     (Bioinformatics, Vol.20(2), pp.216-225, 2004)
• GAPSCORE identifies gene and protein
  names in text.
• GAPSCORE scores words based on a
  statistical model of gene names that
  quantifies their appearance, morphology
  and context.
• Evaluated on the Yapex data set, it achieves an
  F-score of 82.5% (83.3% recall, 81.8%
  precision) for partial matches and 57.6%
  (58.5% recall, 56.7% precision) for exact
  matches.
         1. Introduction (1/4)
• Gene and protein name identification
  algorithms use combinations of approaches
  – Dictionary: search from a list of known genes.
  – Appearance: deduce word type based on its
    makeup of characters
  – Syntax: filter words based on POS
  – Context: use nearby words to infer gene and
    protein names
  – Abbreviation: use abbreviation in text to help
    identify names
         1. Introduction (2/4)
• Dictionary
  – Easy to understand and implement.
  – Maintaining dictionaries is difficult.
• Appearance
  – Many gene names ‘look like’ other gene names.
  – Some scientific naming conventions, such as
    those for cell lines or viruses, are similar to
    those of genes.

         1. Introduction (3/4)
• Context
  – A noun phrase (NP) containing a gene name often
    includes related words, such as those that describe
    molecular function or interactions.
  – Such heuristics miss the many occurrences of
    gene names without context clues.
• Morphology
  – The cdk4 and cdk7 genes share the stem ‘cdk’.
  – Morphology is analogous to appearance.

        1. Introduction (4/4)
• GAPSCORE combines syntax, appearance,
  context and morphology using supervised
  machine learning methods: Naïve Bayes
  (NB), Maximum Entropy (ME) and Support
  Vector Machines (SVMs).
• Accessible from the web at

                2. Methods
• GAPSCORE does not distinguish between
  genes and proteins.
• The algorithm consists of five steps
  – Tokenize: Split the document into sentences and
    words.
  – Filter: Remove from consideration any word that
    is clearly not a gene name.
  – Score: Using a machine learning classifier.
  – Extend: Extend each word to the full gene name.
  – Match abbreviation: Score abbreviations of the
    gene names identified.
               2.1 Tokenize
• Sentence boundaries: period, question mark
  or exclamation point followed by a space
  and then a capitalized letter is a sentence
  boundary. (Periods in ‘e.g.’ are exceptions)
• Any space and most punctuation are word
  boundaries.
• They handle dashes separately since many
  gene names contain them (e.g. c-jun, IL-2).
  Dashes are not boundaries when the
  previous token is a single letter, or the next
  token is a number or roman numeral.
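
The tokenization rules above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abbreviation exceptions (only 'e.g.' and 'i.e.'), the punctuation set, and the function names split_sentences and tokenize are assumptions made for the example.

```python
import re

# Sentence boundary: '.', '?' or '!' followed by whitespace and a capital letter.
# The negative lookbehinds skip 'e.g.' and 'i.e.' (an assumed, partial exception list).
SENTENCE_BOUNDARY = re.compile(r"(?<!e\.g\.)(?<!i\.e\.)(?<=[.?!])\s+(?=[A-Z])")
# Word boundary: whitespace and most punctuation, but not dashes (handled below).
WORD_BOUNDARY = re.compile(r"[\s,;:()\[\]\"'.?!]+")
ROMAN = re.compile(r"^[IVX]+$")


def split_sentences(text):
    """Split text into sentences at the assumed boundary pattern."""
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]


def split_dashes(chunk):
    """Split on dashes unless the previous token is a single letter
    or the next token is a number or roman numeral."""
    parts = chunk.split("-")
    tokens = [parts[0]]
    for nxt in parts[1:]:
        prev = tokens[-1]
        keep_dash = (
            (len(prev) == 1 and prev.isalpha())
            or nxt.isdigit()
            or bool(ROMAN.match(nxt))
        )
        if keep_dash:
            tokens[-1] = prev + "-" + nxt
        else:
            tokens.append(nxt)
    return [t for t in tokens if t]


def tokenize(sentence):
    """Split a sentence into words, applying the dash rule to each chunk."""
    words = []
    for chunk in WORD_BOUNDARY.split(sentence):
        if chunk:
            words.extend(split_dashes(chunk))
    return words
```

With these rules, 'c-jun' and 'IL-2' stay whole, while 'dose-dependent' is split into two words.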
                  2.2 Filter
• Apply Brill’s tagger and remove words that
  are not nouns, adjectives, participles, proper
  nouns or foreign words.
• Discard numbers, roman numerals, Greek
  letters, amino acids, seven virus names and
  13 chemical compounds.
• Discard names of organisms found in the
  SWISS-PROT database.
• Discard words from a manually created list
  of 49 regular expression patterns: e.g.
  protein, DNA, peptide, ATP, receptor.
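
A rough sketch of such a filter is shown below. It assumes the words have already been tagged with Penn Treebank part-of-speech tags (the tagger itself is not invoked here), and the Greek-letter set and stop pattern are small illustrative subsets rather than the authors' actual lists. The names is_candidate and filter_candidates are hypothetical.

```python
import re

# Penn Treebank tags for nouns, proper nouns, adjectives, participles and
# foreign words (assumed tag set for the tagger output).
KEEP_TAGS = {"NN", "NNS", "NNP", "NNPS", "JJ", "VBG", "VBN", "FW"}

NUMBER = re.compile(r"^\d+(\.\d+)?$")
ROMAN = re.compile(r"^[IVX]+$")
# Illustrative subset only; the original lists are not reproduced here.
GREEK = {"alpha", "beta", "gamma", "delta", "kappa"}
# One illustrative stop pattern built from the examples named in the text.
STOP_PATTERN = re.compile(r"^(proteins?|dna|peptides?|atp|receptors?)$", re.IGNORECASE)


def is_candidate(word, tag):
    """Return True if the word survives the filter and should be scored."""
    if tag not in KEEP_TAGS:
        return False
    if NUMBER.match(word) or ROMAN.match(word):
        return False
    if word.lower() in GREEK:
        return False
    if STOP_PATTERN.match(word):
        return False
    return True


def filter_candidates(tagged_words):
    """Keep only the words that pass every filter rule."""
    return [word for word, tag in tagged_words if is_candidate(word, tag)]
```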
                      2.3 Score
• They score separately two classes of proteins that
  are common and easy to recognize unambiguously:
  enzyme names and cytochrome P450 proteins.
   – -ase: 327 words that end with ‘-ase’ or ‘-ases’ from
     Webster’s Second International Dictionary. They then
     manually removed gene names from the list and added
     ‘gases’, leaving 196 words that are not gene names.
   – They use 4 regular expression patterns to recognize
     names with the form: ‘cytochrome P450 2D6’, ‘p450
     IID6’, ‘CYP2d6’, ‘CYPs’.
• Most words do not match these two special cases.
  They encode their appearance, morphology, and
  context as a feature vector for a machine learning
  classifier.
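
The two special cases might be approximated as below. The four regular expressions are illustrative guesses that cover the example forms listed above; they are not the authors' patterns. The exclusion list is a small assumed subset standing in for the 196 non-gene words.

```python
import re

# Assumed stand-in for the list of '-ase'/'-ases' words that are not gene names.
NON_GENE_ASE = {"gases", "base", "bases", "case", "cases", "phase", "phases",
                "disease", "diseases", "increase", "decrease", "release"}

# Illustrative patterns covering 'cytochrome P450 2D6', 'p450 IID6',
# 'CYP2d6' and 'CYPs'.
P450_PATTERNS = [
    re.compile(r"^cytochrome\s+p-?450(\s+\w+)?$", re.IGNORECASE),
    re.compile(r"^p-?450(\s+\w+)?$", re.IGNORECASE),
    re.compile(r"^cyp\d+[a-z]\d+$", re.IGNORECASE),
    re.compile(r"^cyps?$", re.IGNORECASE),
]


def is_enzyme_name(word):
    """True for words ending in '-ase' or '-ases' that are not on the exclusion list."""
    lowered = word.lower()
    return lowered.endswith(("ase", "ases")) and lowered not in NON_GENE_ASE


def is_cytochrome_p450(name):
    """True if the name matches any of the illustrative P450 patterns."""
    return any(pattern.match(name) for pattern in P450_PATTERNS)
```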
          2.3.1 Appearance (1/2)
• These features encode a 13-dimension vector
  that describes the appearance of a word.
• For a specific word, the value for each
  feature is 1 if it describes the word and 0
  otherwise.
          2.3.1 Appearance (2/2)
• Recognize gene names that end with ‘-in’
   – Use a generic statistical model that learns variable length
     N-grams to classify phrases.
• To train the N-gram model:
   – Create a training set of words that end with ‘-in’ from
     Medical Subject Headings (MeSH): 708 unique words.
   – A word was a protein if it belonged to one of 15 MeSH
• For words that do not end with ‘-in’, assign score 0.
  Otherwise, use the score from the classifier; those
  scores constitute the final dimension of the
  appearance feature vector.
        2.3.2 Morphology (1/2)

• This table shows variations of gene and protein
  names that they score in a feature vector. Each
  variant is either a prefix or suffix of the word
       2.3.2 Morphology (2/2)
• The value of each morphology feature is
   log max (1/1000, #Vars/#Stems)
• Where #Stems is the number of times a stem
  appears by itself in MEDLINE, and #Vars is
  the total number of times the stem appears
  with a variation.
• Empirically, the ratio of these counts, when
  plotted for all words in MEDLINE, follows
  an exponential distribution. Therefore, to
  improve discrimination in machine learning,
  they take the log of that ratio.
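
The morphology feature can be written as a short function. This sketch assumes the natural logarithm (the base is not stated above) and treats a stem count of zero as an error, since that case is not described. The name morphology_feature is hypothetical.

```python
import math

FLOOR = 1 / 1000  # lower bound on the ratio, as in the formula above


def morphology_feature(var_count, stem_count):
    """Compute log(max(1/1000, #Vars / #Stems)).

    var_count:  times the stem appears with the variation (#Vars)
    stem_count: times the stem appears by itself (#Stems)
    """
    if stem_count <= 0:
        # Assumption: the zero-stem case is not covered by the source.
        raise ValueError("stem_count must be positive")
    return math.log(max(FLOOR, var_count / stem_count))
```

The floor keeps the feature finite when a variation is never observed: a stem with no variants scores log(1/1000), and a stem seen equally often with and without the variation scores 0.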
             2.3.3 Context (1/3)
• Gene names should appear most often next to
  positive signals and least next to negative ones.
• To find the signal words, they created a training set
  of 1,025 words, which include 574 gene names.
   – They randomly chose 500 nouns that appeared in year
     2001 MEDLINE abstracts containing the word ‘gene’ or
   – To increase the prevalence of gene names, they added
     525 more words that appeared before ‘gene’, ‘protein’ or

             2.3.3 Context (2/3)
• A 2x2 contingency table

• (A) contains # of genes from the training set found
  before ‘expression’ anywhere in MEDLINE, (B) is
  # of genes never found before ‘expression’, (C) is #
  of non-genes found before ‘expression’ and (D) is #
  of non-genes never found before ‘expression’.
• If ‘expression’ is a strong signal that the previous
  word is a gene name, then the ratio of genes to non-
  genes would be higher in the 1st column than the
  2nd.
          2.3.3 Context (3/3)
• Calculate the significance of the difference
  in the ratio using a χ² test, which yields 2,567 words.
• Select the most common signal words: 9
  positive and 9 negative signal words.
• Each feature is the number of times that a
  word occurs with each signal word across all
  of MEDLINE. They calculated the
  distribution across signal words by
  normalizing the feature vector to 1.0.
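
A minimal sketch of the two calculations follows. It assumes the standard Pearson χ² statistic for a 2x2 table without continuity correction, and it reads "normalizing to 1.0" as scaling the counts so they sum to 1. Both readings are assumptions, and the function names are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]].

    a: genes found before the signal word     b: genes never found before it
    c: non-genes found before the signal word d: non-genes never found before it
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a degenerate table carries no evidence either way
    return n * (a * d - b * c) ** 2 / denominator


def normalize_counts(counts):
    """Scale co-occurrence counts so that they sum to 1.0."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [count / total for count in counts]
```

A table with identical rows gives a statistic of 0, while a signal word that perfectly separates genes from non-genes gives the maximum value for that sample size.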
             2.3.4 Classifier
• From 634 MEDLINE abstracts cited by a
  review article on pharmacogenomics, they
  manually categorized each word as either a
  gene name or non-gene.
• For a multi-word gene name, they label only
  the core gene-meaning words.
• Include 8,617 words from MeSH.
• 19,952 unique labeled words for training.
• Three types of classifiers: NB, ME and
  SVM.
            2.4 Extend to NP
• Identify multi-word gene names using heuristics
  similar to those of Fukuda et al. (1998)
  – Include the nouns, adjectives and participles
    preceding the putative gene name.
  – Lengthen the name to include the following
    words that are single letters, Greek letters and
    roman numerals.
  – Remove extraneous punctuation at the beginning
    or end of the name, except for open or close
    parenthesis characters required to complete a
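
The first two extension heuristics can be sketched as below. The example assumes Penn Treebank tags on the input and uses a short illustrative set of Greek letters. Punctuation trimming is omitted, and extend_name is a hypothetical name.

```python
import re

# Nouns, adjectives and participles (assumed Penn Treebank tags).
LEFT_TAGS = {"NN", "NNS", "NNP", "NNPS", "JJ", "VBG", "VBN"}
GREEK = {"alpha", "beta", "gamma", "delta", "epsilon", "kappa"}  # illustrative subset
ROMAN = re.compile(r"^[IVX]+$")


def is_right_extension(word):
    """Single letters, Greek letters and roman numerals may follow a gene name."""
    return (
        (len(word) == 1 and word.isalpha())
        or word.lower() in GREEK
        or bool(ROMAN.match(word))
    )


def extend_name(tagged, index):
    """Extend the putative gene word at `index` to a full name.

    tagged: list of (word, tag) pairs for one sentence.
    """
    start = index
    # Grow leftwards over preceding nouns, adjectives and participles.
    while start > 0 and tagged[start - 1][1] in LEFT_TAGS:
        start -= 1
    end = index
    # Grow rightwards over single letters, Greek letters and roman numerals.
    while end + 1 < len(tagged) and is_right_extension(tagged[end + 1][0]):
        end += 1
    return " ".join(word for word, _ in tagged[start:end + 1])
```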
      2.5 Match Abbreviations
• Search for abbreviations in the document
  using the algorithm described in Chang et al.
• If the long form of an abbreviation has a
  higher score, it transfers that score to the
  abbreviation.
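
The score transfer can be illustrated with a small function. It assumes the abbreviation pairs have already been found by a separate abbreviation finder, which is not implemented here. The name transfer_scores and the data layout are hypothetical.

```python
def transfer_scores(scores, abbreviation_pairs):
    """Give each abbreviation the score of its long form when that score is higher.

    scores:             dict mapping a name to its classifier score
    abbreviation_pairs: iterable of (long_form, abbreviation) tuples
    Returns a new dict; the input is left unchanged.
    """
    updated = dict(scores)
    for long_form, abbreviation in abbreviation_pairs:
        long_score = scores.get(long_form)
        if long_score is None:
            continue  # the long form was never scored, so there is nothing to transfer
        if long_score > updated.get(abbreviation, float("-inf")):
            updated[abbreviation] = long_score
    return updated
```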

              2.6 Evaluation
• Yapex test gold standard.
  – Exact match: equivalent to the corresponding
    name in the gold standard.
  – Sloppy match: overlap the name in the gold
    standard.
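
The two matching criteria and the resulting scores can be sketched as follows. The example assumes names are represented as (start, end) character offsets with an exclusive end, and that the F-score is the usual harmonic mean of precision and recall. These representation choices and the function names are assumptions.

```python
def overlaps(span_a, span_b):
    """True if two (start, end) spans share at least one character."""
    return span_a[0] < span_b[1] and span_b[0] < span_a[1]


def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(predicted, gold, sloppy=False):
    """Return (precision, recall, F-score) under exact or sloppy matching."""
    if sloppy:
        matches = overlaps
    else:
        matches = lambda p, g: p == g
    # Predictions that match some gold name count towards precision.
    hit_predictions = sum(any(matches(p, g) for g in gold) for p in predicted)
    # Gold names matched by some prediction count towards recall.
    hit_gold = sum(any(matches(p, g) for p in predicted) for g in gold)
    precision = hit_predictions / len(predicted) if predicted else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    return precision, recall, f_score(precision, recall)
```

A prediction that covers only part of a gold name counts under sloppy matching but not under exact matching, which is why sloppy scores are never lower than exact scores for the same output.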

           3. Results (1/4)
• Using optimal parameters, compare the
  performance of NB, ME and SVM on the Yapex
  training set, scoring with sloppy matches.

              3. Results (2/4)
• With the best classifier parameters, test the
  algorithm with various modules disabled.

                3. Results (3/4)
• Yapex: 75.4% F-score (70.3% recall, 81.4% precision)
• GAPSCORE: 82.5% F-score (83.3% recall, 81.8% precision)

                3. Results (4/4)
• Yapex: 54.3% F-score (50.1% recall, 59.3% precision)
• GAPSCORE: 57.6% F-score (58.5% recall, 56.7% precision)

                6. Conclusion
• GAPSCORE finds gene and protein names by
  combining novel formulations of features in a
  machine learning framework.
• SVMs slightly outperform other popular methods.
• When applied to the Yapex text collection, the
  method achieves high performance due to its
  sophisticated analysis of single words and the high
  prevalence of single word gene names.
• The algorithm produces confidence scores that can
  be adjusted for either high recall or precision.

