Ontology Translation on the Semantic Web by yurtgc548


									      CIS607, Winter 2009
   Data Mining and Data
Integration in Bioinformatics

   Instructor/Organizer: Dejing Dou

          Week 1 (Jan. 9)

              About this Seminar
   Introduction Lectures
    – Week 1: General Introduction and Basic Knowledge (Data Mining, Data
      Integration, Bioinformatics, Semantic Web)
    – Week 2: Introduction to Each Topic

   Paper Presentations and Discussions
    – Week 3 to Week 9: Surveys of Data Mining, Data Integration in
      Bioinformatics. Data Integration, Data Fusion for Biomedical Data, Data
      Mining for Biomedical Data, Ontologies and Semantic Web for
   Week 10, Review and Discussions

   Attendance: 20%
    – However, 2 absences or 4 late arrivals without excuse => Fail

   Paper Reading and Discussion: 30%
    – Summary and Question Preparations (homework)
    – Asking Questions to Paper Presenter or Instructor
   Paper Presentation: 30%
    – 35-40 Minutes Presentation
    – 10-15 Minutes Question Answering
   Final Report: 20%
    – Survey paper or a small implementation

   Data Mining, Knowledge Discovery

   Data Integration, Data Fusion, Data Sharing

   Bioinformatics, Neuroinformatics, Biomedical

   Ontologies and the Semantic Web

          Why are they related to each other?
            What Is Data Mining?
   Data Mining (knowledge discovery from data)
    – Extraction of interesting (non-trivial, implicit, previously
      unknown and potentially useful) patterns or knowledge
      from huge amounts of data
    – Data Mining (a misnomer?) vs. Knowledge Mining vs.
      Gold or Diamond Mining from rocks or sand
   Alternative names
    – Knowledge discovery (mining) in databases (KDD),
      knowledge extraction, data/pattern analysis, data
      archeology, data dredging, information harvesting,
      business intelligence, etc.
                Simple Example
   Market Basket Analysis from Transaction Data
     About 90% of customers always buy certain items together:

            {pen, ink}, {milk, juice}

     Only 10% of customers always buy certain other items together:
              {pen, milk}

     70% of customers always buy chips and ice cream
    together in summer:
                  {chip, ice cream} @ summer
          Simple Example (cont’d)
   Market Basket Analysis from Transaction Data
     Then we can use some “rules”/patterns to express the

           {pen} => {ink} (confidence = 95%)
           {milk} => {juice} (confidence = 85%)
                 {juice} => {milk} (confidence = ? )
    {date is in summer} => {chip} => {ice cream}
                  (confidence = 70%)
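
A minimal sketch of how the confidence of a rule such as {milk} => {juice} could be computed from transaction data. The transactions below are hypothetical toy data (not the lecture's figures), and the helper names `support` and `confidence` are chosen here for illustration. Confidence is taken as the number of transactions containing both sides divided by the number containing the left-hand side, so the rule and its reverse generally have different confidences.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy transactions; these are NOT the numbers from the slides.
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"pen", "ink"}),
    frozenset({"pen", "ink", "milk"}),
    frozenset({"milk", "juice"}),
    frozenset({"milk", "juice", "pen", "ink"}),
    frozenset({"milk"}),
]


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(lhs: Iterable[str], rhs: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Confidence of the rule lhs => rhs: count(lhs and rhs) / count(lhs)."""
    lhs_set = frozenset(lhs)
    both = lhs_set | frozenset(rhs)
    lhs_count = sum(1 for t in transactions if lhs_set <= t)
    if lhs_count == 0:
        raise ValueError("left-hand side never occurs in the transactions")
    both_count = sum(1 for t in transactions if both <= t)
    return both_count / lhs_count


if __name__ == "__main__":
    print(confidence({"milk"}, {"juice"}, TRANSACTIONS))  # 0.5
    print(confidence({"juice"}, {"milk"}, TRANSACTIONS))  # 1.0
```

On this toy data, {milk} => {juice} and {juice} => {milk} get different confidences even though both rules involve the same pair of items, which is why the confidence of a reversed rule has to be computed separately.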

          What Motivated Data Mining
   Data explosion problem

    – The availability of huge amounts of data in industry,

      business, and research labs because of automated data
      collection and database technology.

    – The need for turning such data into useful information
      and knowledge (data with a degree of certainty or
      community) to help data analysis.

                  Bio-Medical Data Mining
       DNA and Protein Sequence Analysis
    –     Sequence comparison, similarity search and pattern finding
       Genome Analysis
    –     Prediction of Gene Structure, Clusters of Zebrafish Genes, Mapping
          between Phenotype and Genotype data

       Pathway Analysis
    –     Build, model and visualize the biological social networks formed by
          processes in a cell and among gene products

       Microarray Analysis
    –     Monitor genome-wide patterns of gene expression

       Pharmacy and Medical Data Mining
    –     E.g. Data Mining Clinical Data to Detect Adverse Drug Events
            Data Mining: A KDD Process

   – Data mining: core of knowledge discovery

   [Figure: the KDD process as a sequence of steps: Databases -> Data Cleaning and
    Data Integration -> Data Warehouse -> Selection -> Task-relevant Data ->
    Data Mining -> Pattern Evaluation]
         Data Integration and Exchange
   Integrate data from distributed resources into a merged (mediated)
    ontology or schema.
                       [Figure: Data in OA and Data in OB are merged into Data in M_A_B]
   Exchange/translate data from one ontology (schema) to another.
                       [Figure: translation between Data in OA and Data in OB]

   There are some commercial Enterprise Information Integration
    systems, but they do not handle semantic heterogeneity well [Halevy et al. 05].
       Approaches for Data Integration

   LAV: local-as-view
    – The mappings define the concepts of local databases/schemas as
      views based on the concepts of the global (target) schema.
   GAV: global-as-view
    – The mappings define the concepts of the global schema as the
      views based on the concepts of local database/schemas.
   GLAV: global-local-as-view
    – The mappings combine both GAV and LAV.
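
The difference between the GAV and LAV mapping styles can be sketched in a few lines. Everything below is a hypothetical illustration: the two local sources `LAB_A` and `LAB_B`, the global relation of (gene name, organism) pairs, and the function names are assumptions made for this sketch and do not come from any particular integration system. In the GAV part, the global relation is written as a query over the local sources. In the LAV part, each local source is described in terms of the global schema, here simplified to a predicate saying which global tuples the source may contain.

```python
from typing import Callable, Dict, List, Tuple

# A tuple of the hypothetical global schema: (gene name, organism).
GlobalGene = Tuple[str, str]

# Hypothetical local sources with different local schemas.
LAB_A = [
    {"symbol": "shha", "species": "zebrafish"},
    {"symbol": "pax6a", "species": "zebrafish"},
]
LAB_B = [("TP53", "human"), ("shha", "zebrafish")]


def global_gene_gav() -> List[GlobalGene]:
    """GAV: the global relation is defined as a view (query) over local sources."""
    from_a = [(row["symbol"], row["species"]) for row in LAB_A]
    from_b = list(LAB_B)
    # Union of both sources, with duplicates removed.
    return sorted(set(from_a + from_b))


# LAV: each local source is described as a view over the global schema.
# The description is simplified to a predicate over global tuples.
LAV_DESCRIPTIONS: Dict[str, Callable[[GlobalGene], bool]] = {
    "LAB_A": lambda gene: gene[1] == "zebrafish",  # LAB_A holds only zebrafish genes
    "LAB_B": lambda gene: True,                    # LAB_B may hold genes of any organism
}


def relevant_sources(organism: str) -> List[str]:
    """Names of sources whose LAV description admits genes of `organism`."""
    probe: GlobalGene = ("any-gene", organism)
    return sorted(name for name, admits in LAV_DESCRIPTIONS.items() if admits(probe))


if __name__ == "__main__":
    print(global_gene_gav())
    print(relevant_sources("human"))
    print(relevant_sources("zebrafish"))
```

In this sketch, adding a new source under GAV means editing the global view definition, while under LAV it only means adding one more source description; real LAV systems additionally need a query-rewriting step that is not shown here.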

                       Data Fusion

   The use of techniques that combine data from multiple
    sources and gather that information in order to reach
    more accurate conclusions than could be reached from
    a single source.

   Data fusion vs. data integration
    – The data resources for integration must be heterogeneous but in
      the same domain (semantic model). The data resources for
      fusion can be homogeneous or in different (disjoint) domains.

                      Data Sharing

   It is a vague concept from a computer science point of view.

   However, it is natural in the biomedical domain. Different
    labs want to share their data in some common framework
    or platform in order to achieve better scientific discovery,
    which may involve data fusion, integration and mining.

   Narrow definition: Bioinformatics is the application of
    information technology to the field of molecular biology.

   The creation and advancement of databases, algorithms,
    computational and statistical techniques, and theory to
    solve formal and practical problems arising from the
    management and analysis of biological data.

   Common activities in Bioinformatics include mapping
    and analyzing DNA and protein sequences, aligning
    different DNA and protein sequences to compare them
    and creating and viewing 3-D models of protein structures.
   Neuroinformatics is a research field that encompasses
    the development of neuroscience data and application of
    computational models and analytical tools.

   Neuroinformatics stands at the intersection of
    neuroscience and information science. Other sciences,
    like genomics, have proved the effectiveness of data
    sharing through databases and by applying theoretical and
    computational models for solving complex problems in
    the field (Data sharing in Bioinformatics).

   The term "biomedical informatics" is more general.
                    Current WWW
   The majority of data resources on the WWW are in human-readable
    format only (e.g. HTML).

               The Semantic Web
  One major goal of the Semantic Web is that web-based agents can
  process and “understand” data [Berners-Lee et al. 01].
 Ontologies formally describe the semantics of data and web-based
  agents can take web documents (e.g. in RDF, OWL) as a set of
  assertions and draw inferences from them.
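
A minimal sketch of what "a set of assertions" plus inference can look like, written in plain Python rather than with an RDF library. The triples, the `ex:` names, and the `infer` function are hypothetical and chosen only for illustration. The sketch applies two RDFS-style rules until nothing new can be derived: subclass relations are transitive, and an instance of a class is also an instance of its superclasses.

```python
from typing import Set, Tuple

# A triple is (subject, predicate, object), as in RDF.
Triple = Tuple[str, str, str]

# Hypothetical assertions, as an agent might read them from a web document.
ASSERTIONS: Set[Triple] = {
    ("ex:shha", "rdf:type", "ex:ZebrafishGene"),
    ("ex:ZebrafishGene", "rdfs:subClassOf", "ex:Gene"),
    ("ex:Gene", "rdfs:subClassOf", "ex:BiologicalEntity"),
}


def infer(triples: Set[Triple]) -> Set[Triple]:
    """Forward-chain two RDFS-style rules until a fixed point is reached.

    Rule 1: (A subClassOf B) and (B subClassOf C) => (A subClassOf C)
    Rule 2: (x type A)       and (A subClassOf B) => (x type B)
    """
    closed = set(triples)
    while True:
        derived: Set[Triple] = set()
        for (s, p, o) in closed:
            if p not in ("rdfs:subClassOf", "rdf:type"):
                continue
            for (s2, p2, o2) in closed:
                # Follow one subclass edge starting at the object of (s, p, o).
                if p2 == "rdfs:subClassOf" and s2 == o:
                    derived.add((s, p, o2))
        if derived <= closed:
            return closed
        closed |= derived


if __name__ == "__main__":
    for triple in sorted(infer(ASSERTIONS) - ASSERTIONS):
        print(triple)
```

From the three stated assertions, the sketch derives three more, including that `ex:shha` is an `ex:BiologicalEntity`, a fact that appears in none of the input triples.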


