HTML Table Interpretation by Sibling Page Comparison in the Molecular Biology Domain
Cui Tao and David W. Embley
Data Extraction Research Group
Department of Computer Science, Brigham Young University, Provo, UT, 84602

    1. Introduction

Huge, evolving number of bio-databases
   Molecular biology database collection
      2004: total 548, 162 more than 2003
      2005: total 719, 171 more than 2004
Different access capabilities
Syntactic heterogeneity
Semantic heterogeneity
Updated at any time by independent authorities

GOALS:

To help biologists cross-search various resources
Examples:
   “Find genes which are longer than 5kbp, whose products
     have at least two helices, and participate in glycolysis” –
     GenBank, PDB, KEGG

   “Find genes newly annotated after Jan. 2003 in the fly and
     worm genomes” – FlyBase, WormBase

Source page understanding – Table interpretation (focus of this poster)
   Table recognition
   Table pattern generalization
   Pattern adjustment
Information extraction & semantic annotation
Source location through semantic indexing
Cross-database query processing

    2. Table Interpretation

Input: an HTML table
Output: a formal table notation (Wang notation)

[Figure: a sample table with its label and value cells marked]

    2.1. Table Recognition

Consider all tagged tables
Unnest
Filter out tables containing no data:
   Sibling table match percentage: max match score / tree size
       > the high threshold: exact match or near exact match
       < the low threshold: false match
       In between: sibling tables

    2.2. Sibling Page Comparison

Sibling pages & sibling tables
Table-Interpretation Steps
   HTML table → DOM tree
   Tree matching → Find sibling tables
   Variable fields ~ values & Fixed fields ~ labels
   Infer pattern

[Figure: DOM trees (table, tr, td nodes) of two sibling tables.
 Cell text recoverable from the first tree: Gene Model; Nucleotides (coding/transcript); Protein; Swissprot; Amino Acids; F47G6.1 1, 2; confirmed by cDNA(s); 1773/7391 bp; DTN1_CAEEL; 590 aa.
 Cell text recoverable from the second tree: Gene Model; Status; Nucleotides (coding/transcript); Amino Acids; F18H3.5a 1, 2; confirmed by cDNA(s); 1029/3051 bp; 342 aa; F18H3.5b 1, 2, 3; partially confirmed by cDNA(s); 1221/1704 bp; WP:CE28918; 406 aa.]

    2.3. Structure Pattern Generation

Pre-defined structure templates
Pattern Combinations
Matches any pre-defined pattern template?
Generates a specific structure pattern for the table
Dynamically adjust the structure pattern

[Figure annotations: an optional label; an optional value; location]

    3. Results

[Figure: sample tables with Labels and Values marked]

EXPERIMENTAL RESULTS:
Test Set: 10 web sites; 100 sibling pages; 862 HTML tables
Table Recognition: correctly eliminated all but 3 non-data tables
Pattern Generation: successfully recognized 28 of 29 patterns
Dynamic Adjustment: 5 location adjustments; 12 structure adjustments, all correct

    4. Conclusions

We can:
   Recognize data tables
   Find labels and values
   Infer table patterns
   Dynamically adjust table patterns
Domain Generality: works for other domains

    Contact Information
Data Extraction Research Group
Department of Computer Science
Brigham Young University
Provo, UT 84602
Cui Tao, ctao@cs.byu.edu
David W. Embley, embley@cs.byu.edu
