CS4200-2010-DD-Grp-21-SeMap

					    University of Moratuwa
    Computer Science & Engineering




    SeMap – Mapping Dependency
    Relationships into Semantic
    Frame Relationships
    Software Design & Architecture Document
    Version 1.0




Internal Supervisor:    Dr. Shehan Perera
External Supervisor:    Dr. Ben Goertzel

070085B   N. H. N. D. de Silva
070125B   C. S. N. J. Fernando
070298F   M. K. D. T. Maldeniya
070548A   D. N. C. Wijeratne
SeMap                                                  Version:     1.0

Software Design & architecture Document                Date:        03/Dec/2010




Revision History

        Date         Version                  Description                   Author

 03 / 12 / 2010         1.0      Initial Design & Architecture Document     SeMap




                               ©University of Moratuwa, 2010                  i|Page




Table of Contents

Revision History
1. Introduction
   1.1 Purpose
   1.2 Document Conventions
   1.3 Intended Audience
   1.4 Scope
   1.5 Definitions, Acronyms, and Abbreviations
   1.6 References
   1.7 Overview
2. System Overview
   2.1 Overview
   2.2 OpenCog Framework
   2.3 RelEx
   2.4 RelEx2Frame
   Functionality
   2.5 SeMap
3. Design considerations
   3.1 Operating Environment
   3.2 End-User Environment
   3.3 Performance Requirements
4. Architectural Strategies
   4.1 Overview
   4.2 Design Decisions
5. Architectural Overview
   5.1 Overview
   5.2 RelEx2Frame Architecture
       5.2.1 Rationale
       5.2.2 Object (Static) Model Analysis
       5.2.3 Dynamic Model Analysis
   5.3 Learning Agent
       5.3.1 Rationale
       5.3.2 Object (Static) Model Analysis
       5.3.4 Dynamic Model Analysis
   5.4 Data Mining Architecture
6. Policies and Conventions
   6.1 License
   6.2 Discussions
   6.3 Coding Conventions
Appendix A
   Current Class structure for the proposed RelEx2Frame Architecture






Table of Figures

Figure 1: OpenCog Framework
Figure 2: RelEx Semantic Dependency Relations
Figure 3: Modified RelEx2Frame Architecture
Figure 4: Sequence diagram of RelEx2Frame
Figure 5: Activity diagram for sentence processing
Figure 6: Activity diagram for request handling
Figure 7: Activity diagram for scheduling
Figure 8: Activity diagram for indexing
Figure 9: Activity diagram for index querying
Figure 10: Activity diagram for knowledge base loading
Figure 11: Block diagram for learning agent
Figure 12: Activity diagram for learning agent
Figure 13: Flow diagram for learning agent
Figure 14: Data Mining Architecture







1. Introduction

1.1 Purpose

    The purpose of this document is to identify the design requirements of the research
    project “SeMap” and to give the intended audience a clear view of the expected
    architecture and design constraints of the project.

    The objective of SeMap is to develop an improved framework for mapping semantic
    relationships drawn from English sentences to sets of semantic frames. The project
    consists of two primary phases and an additional phase which is expected to be carried
    out once the primary phases are completed.

    This document provides a high-level architecture of the project and the architecture
    expected for the statistical natural language learning agent. Details of the learning
    agent architecture are given to the extent of the current understanding of the
    statistical learning concepts. In addition, the existing architectures of the OpenCog
    framework and RelEx are described to give a better understanding of the entire
    project.

    This document is to be used to understand the basic functionality of SeMap, to
    facilitate extension or change, and to guide developers in implementing the system.



1.2 Document Conventions

    The following document conventions have been used to ensure ease of readability.

           Major Headings                16pt, Cambria, Dark Blue, Bold
           Sub Headings                  14pt, Cambria, Blue, Bold
           2nd Sub Level Headings        13pt, Cambria, Blue, Bold
           Other Headings                13pt, Cambria, Black, Underline
           All Body Text                 12pt, Cambria, Black






1.3 Intended Audience

    This document is intended for the following audience.

           Project supervisors
           Course coordinator and support staff
           Project designers and development team
           Researchers in the field of general-purpose AI development and machine
            learning



1.4 Scope

    This document describes every important aspect of the SeMap project. It covers the
    design considerations taken into account when developing the architectures of both
    the entire system and the learning agent. The overall system architecture, the
    learning agent architecture and the data mining architecture are provided so that
    the final output meets all the intended requirements.

    In addition, class diagrams, activity diagrams and sequence diagrams are provided for
    SeMap, and the existing architectures of OpenCog and RelEx are described, so that the
    intended audience gains a clear understanding of the overall design of the project
    and of how SeMap is ultimately integrated into RelEx.






1.5 Definitions, Acronyms, and Abbreviations

           SeMap            Semantic Mapping
           OpenCog          Open Cognition
           RelEx            Relationship Extractor – A system developed under the OpenCog
                            Project
           AI               Artificial Intelligence
           RelEx2Frame      Relationship Extractor to Frame – A system developed to map
                            dependency relationships to frame relationships (the system
                            expected to be replaced by SeMap)
           NLGen            Natural Language Generator – A system developed under the
                            OpenCog Project
           I/O              Input and Output
           FIFO             First In First Out
           IRC              Internet Relay Chat
           FrameNet         Lexical resource of semantic frames; the output of RelEx2Frame
                            is expressed as FrameNet frame relationships






1.6 References

   [1]    (2010, Aug.). “The Open Cognition Project,” [Online article]. Available:
          http://wiki.opencog.org/w/The_Open_Cognition_Project

   [2]    (2010, May). “RelEx Dependency Relationship Extractor,” [Online article].
          Available: http://wiki.opencog.org/w/RelEx

   [3]    “The Stanford Parser: A Statistical Parser,” [Online article]. Available:
          http://nlp.stanford.edu/software/lex-parser.shtml

   [4]    (2009, Jun.). “RelEx2Frame,” [Online article]. Available:
          http://wiki.opencog.org/w/RelEx2Frame

   [5]    “Open Source Business Rules Management System,” [Online]. Available:
          http://openrules.com/index.htm

   [6]    “Drools Expert,” [Online]. Available:
          http://www.jboss.org/drools/drools-expert.html

   [7]    M. Proctor. (2007, July). “Drools Success Stories - quotes from the mailing
          list,” [Online article]. Available:
          http://blog.athico.com/2007/07/drools-success-stories-quotes-from.html

   [8]    “Google sets labs,” [Online]. Available: http://labs.google.com/sets

   [9]    (2010, Oct.). “Rete Algorithm,” [Online article]. Available:
          http://en.wikipedia.org/wiki/Rete_algorithm

   [10]   I. Sommerville. Software Engineering, 8th edition. International Computer
          Sciences Series. Harlow, UK: Addison-Wesley, 2006.

   [11]   K. Mhashilkar. “Data Mining Technology,” [Online]. Available:
          http://www.executionmih.com/data-mining/technology-architecture-application-frontend.php

   [12]   J. Grzymala-Busse, “Three strategies to rule induction from data with
          numerical attributes,” presented at the International Workshop on Rough
          Sets in Knowledge Discovery (RSKD 2003), associated with the European
          Joint Conferences on Theory and Practice of Software 2003, Warsaw,
          Poland, April 5–13, 2003.

   [13]   “Rule Learner,” [Online]. Available: http://openrules.com/RuleLearner.htm






1.7 Overview

    This software design & architecture document is structured in the following manner.

           System Overview – A background overview of the system, covering the existing
            architectures of OpenCog and RelEx and the high-level architecture of SeMap.

           Design Considerations – Provides information on the operating environment,
            end-user environment and performance requirements taken into account in
            designing SeMap.

           Architectural Strategies – Outlines the design decisions taken into
            consideration in the development and implementation of SeMap.

           Architectural Overview – Presents the architectural designs of the core
            sections of SeMap through block, activity, sequence and flow diagrams.

           Policies and Conventions – Identifies the policies and conventions that must
            be adhered to in the implementation of SeMap.






2. System Overview
2.1 Overview

    Project SeMap is developed as a part of the OpenCog AI framework. It is expected to
    be integrated into the RelEx system, which is a core part of the OpenCog framework.

    This chapter outlines the existing design and architecture of the OpenCog framework,
    RelEx and RelEx2Frame, and the proposed high-level architecture of SeMap.



2.2 OpenCog Framework

    OpenCog is an open source project for developing an artificial general intelligence
    framework; most of the development work is carried out by the nonprofit OpenCog
    Foundation. Its ambition is to eventually develop an artificial intelligence on the
    same level as human intelligence, to be used in satisfying the growing needs of the
    world [1].

    OpenCog is separated into two major parts: the core framework, and separate projects
    which are developed on top of the core framework.


           OpenCog Framework – This provides an operating-system-like infrastructure
            and APIs for the development of separate components. The framework is
            implemented in C++ with the following libraries (Figure 1).


                  AtomSpace: A shared library for in-memory knowledge representation
                   with data structures.

                  CogServer: A container and scheduler for plug-in cognitive algorithms








                               Figure 1 - OpenCog Framework



           Separate projects – These projects are associated with OpenCog and
            communicate with the framework. Examples include RelEx, Link Grammar and
            CogBot.






2.3 RelEx

    RelEx, a plug-in developed for the OpenCog framework, is an English-language
    semantic dependency relationship extractor built on the Carnegie Mellon Link
    Grammar parser [2]. RelEx can identify subject, object, indirect object and many
    other syntactic dependency relationships between the words in a sentence. It does
    so by applying a sequence of rules based on the local context, and thus resembles
    constraint grammar in its implementation. Unlike other dependency parsers, RelEx
    attempts a greater degree of semantic normalization for questions, comparatives,
    entities and prepositional relationships, whereas other parsers (such as the
    Stanford parser [3]) stick to a literal presentation of the syntactic structure of
    the text. This is one factor which makes RelEx well suited to question answering
    and semantic comprehension/reasoning systems.

    Compared to the Stanford parser, RelEx parses text four times faster. In addition,
    it provides a compatibility mode in which it can generate the same relations as the
    Stanford parser.


    Semantic frame outputs

    Semantic framing provides a higher level of abstraction and a semantically more
    traceable description of the parsed sentence. This allows certain facts to be
    inferred from the information in the sentence itself with a relatively small number
    of semantic framing rules, rather than requiring a large common-sense database.


    Framing Rules

    The rules are implemented as “IF..THEN” rules with a simple forward-chaining
    evaluator. The rules are hard-coded, and this mechanism is expected to be replaced
    by an open source rule engine.
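
    As a rough illustration of what a simple forward-chaining evaluator does, the
    following Java sketch fires IF..THEN rules repeatedly until no new fact is produced.
    It is not the RelEx implementation: the class name ForwardChainer, its Rule type and
    the representation of relations as plain strings are assumptions made for this
    example only.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: a tiny forward-chaining evaluator over ground (variable-free) facts. */
final class ForwardChainer {

    /** One IF..THEN rule: if every condition is a known fact, the conclusions become facts. */
    static final class Rule {
        final List<String> conditions;
        final List<String> conclusions;

        Rule(List<String> conditions, List<String> conclusions) {
            this.conditions = conditions;
            this.conclusions = conclusions;
        }
    }

    /** Fires the rules repeatedly until a full pass over them adds no new fact. */
    static Set<String> run(Set<String> initialFacts, List<Rule> rules) {
        Set<String> facts = new LinkedHashSet<>(initialFacts);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Rule rule : rules) {
                // addAll returns true only if at least one conclusion was not already known.
                if (facts.containsAll(rule.conditions) && facts.addAll(rule.conclusions)) {
                    changed = true;
                }
            }
        }
        return facts;
    }
}
```

    Because a conclusion added in one pass can satisfy the condition of another rule in
    the next pass, the order in which the rules are listed does not affect the final set
    of facts.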


    RelEx is used by the NLGen and NLGen2 projects.






2.4 RelEx2Frame

    This is the primary system on which the SeMap project is carried out. The objective
    of SeMap is to replace the current architecture of RelEx2Frame with a new learning
    agent architecture that facilitates learning and extensibility [4].


    Functionality

    The basic function of RelEx2Frame is to map the output of RelEx into a set of frame
    relationships using a hand-coded set of rules (more than 5000 rules). It currently
    uses two semantic resources in the process, but can be extended to use more
    resources in the future.


         i. FrameNet
        ii. Custom relationship names from Novamente

    Two semantic resources are used because certain relationships are available in only
    one of them.

    E.g.:
                            ^2_inheritance($var1,$var2)


    The inheritance relationship above is part of Novamente's ontology only; it is not
    in FrameNet.


    Input & Output

    The input to RelEx2Frame is the output of RelEx for a given sentence. Thus the
    accuracy of the system depends heavily on the accuracy of RelEx.

    E.g.: Consider the sentence

                  Put the ball on the table

    The input to RelEx2Frame (the output of RelEx) would be:

                                    imperative(Put) [1]
                                    _obj(Put, ball) [1]
                                    on(Put, table) [1]
                                    singular(ball) [1]
                                    singular(table) [1]



    To identify the semantic relationships in the above input, the following hand-coded
    FrameNet mapping rules would be used.


$var0 = ball
$var1 = table
# IF imperative(put) THEN ^1_Placing:Agent(put,you)
# IF _obj(put,$var0) THEN ^1_Placing:Theme(put,$var0)
# IF on(put,$var1) & _obj(put,$var0) THEN ^1_Placing:Goal(put,$var1) \
^1_Locative_relation:Figure($var0) ^1_Locative_relation:Ground($var1)




    The following output is generated through FrameNet mapping.


                     ^1_Placing:Agent(put,you)
                     ^1_Placing:Theme(put,ball)
                     ^1_Placing:Goal(put,table)
                     ^1_Locative_relation:Figure(put,ball)
                     ^1_Locative_relation:Ground(put,table)
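
    The mapping above can be sketched in Java as follows. This is a minimal
    illustration, not the RelEx2Frame code: the class FrameRuleMatcher and its methods
    are hypothetical, relations are handled as plain strings, constants are compared
    case-insensitively (so that Put matches put), and a trailing marker such as [1] is
    ignored.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: matches IF..THEN frame rules containing $variables against RelEx relations. */
final class FrameRuleMatcher {

    /** Splits "name(arg1, arg2)" into [name, arg1, arg2]; text after the last ')' is ignored. */
    static String[] parse(String relation) {
        int open = relation.indexOf('(');
        int close = relation.lastIndexOf(')');
        String[] args = relation.substring(open + 1, close).split(",");
        String[] parts = new String[args.length + 1];
        parts[0] = relation.substring(0, open).trim();
        for (int i = 0; i < args.length; i++) {
            parts[i + 1] = args[i].trim();
        }
        return parts;
    }

    /** Extends the bindings so that the pattern equals the fact; returns null on mismatch. */
    static Map<String, String> unify(String pattern, String fact, Map<String, String> bindings) {
        String[] p = parse(pattern);
        String[] f = parse(fact);
        if (p.length != f.length || !p[0].equalsIgnoreCase(f[0])) {
            return null;
        }
        Map<String, String> result = new HashMap<>(bindings);
        for (int i = 1; i < p.length; i++) {
            if (p[i].startsWith("$")) {
                // Bind the variable, or check it against an earlier binding.
                String bound = result.putIfAbsent(p[i], f[i]);
                if (bound != null && !bound.equalsIgnoreCase(f[i])) {
                    return null;
                }
            } else if (!p[i].equalsIgnoreCase(f[i])) {
                return null;
            }
        }
        return result;
    }

    /** Backtracking search for bindings under which every condition matches some fact. */
    static Map<String, String> match(List<String> conditions, int index,
                                     List<String> facts, Map<String, String> bindings) {
        if (index == conditions.size()) {
            return bindings;
        }
        for (String fact : facts) {
            Map<String, String> extended = unify(conditions.get(index), fact, bindings);
            if (extended != null) {
                Map<String, String> complete = match(conditions, index + 1, facts, extended);
                if (complete != null) {
                    return complete;
                }
            }
        }
        return null;
    }

    /** Applies one rule; returns the instantiated conclusions, or an empty list if it does not fire. */
    static List<String> apply(List<String> conditions, List<String> conclusions, List<String> facts) {
        List<String> output = new ArrayList<>();
        Map<String, String> bindings = match(conditions, 0, facts, new HashMap<>());
        if (bindings == null) {
            return output;
        }
        for (String conclusion : conclusions) {
            String instantiated = conclusion;
            for (Map.Entry<String, String> e : bindings.entrySet()) {
                instantiated = instantiated.replace(e.getKey(), e.getValue());
            }
            output.add(instantiated);
        }
        return output;
    }
}
```

    For example, applying the conditions on(put,$var1) and _obj(put,$var0) with the
    conclusion ^1_Placing:Goal(put,$var1) to the RelEx output of the sample sentence
    yields ^1_Placing:Goal(put,table). Variable substitution here is plain text
    replacement, which is enough for a sketch but would confuse names such as $var1 and
    $var10.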






2.5 SeMap


                          Figure 2 - RelEx Semantic Dependency Relations

    (Block diagram. Components: RelEx Semantic Dependency Relations; RelEx2Frame Rule
    Base; Core Framework for Mapping Semantic Relationships to Semantic Nodes;
    Statistical Linguistics Based Learning AI; Human Knowledge Concept Data Mining
    Component; “Common Sense” Knowledge Base; OpenCog Artificial General Intelligence
    Framework: Frame2Atom.)




The Semantic Dependency Relation Extractor RelEx, which has been developed as a module
for the OpenCog Framework, provides the semantic relations of a given English sentence in
a standard format that is compatible with Link Grammar output.





E.g.:

Sample Sentence:

Alice looked at the cover of Shonen Jump.


RelEx Output:


                        at(look, cover)
                        _subj(look, Alice)
                        tense(look, past)
                        of(cover, Jump)
                        DEFINITE-FLAG(cover, T)
                        noun_number(cover, singular)
                        _amod(Shonen, Jump)
                        DEFINITE-FLAG(Shonen, T)
                        noun_number(Shonen, singular)
                        DEFINITE-FLAG(Jump, T)
                        noun_number(Jump, singular)
                        DEFINITE-FLAG(Alice, T)
                        gender(Alice, feminine)
                        noun_number(Alice, singular)
                        person-FLAG(Alice, T)






      RelEx2Frame Rule base

      The rule base for mapping the semantic relations to semantic nodes will consist of
      rules compatible with the Drools rule engine, which is the rule engine selected for
      the project. The following is a sample mapping rule in the format expected to be
      implemented.



rule "1"
    when
        p: Processor( eval(p.existence("_predobj(be,$atLocation)")) )
    then
        eval(p.AppendRule(" ^1_Existence:Place($atLocation,$atLocation)"));
end




      The rule verifies the presence of a combination of semantic relations in a given
      RelEx output and maps it to the corresponding semantic node.
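
      The sample rule refers to a fact object p of type Processor. This section does not
      define that class, so the following Java sketch is only a guess at its shape: the
      semantics given to existence and AppendRule, the getFrames accessor, and the
      treatment of each $variable as a wildcard for a single argument are all
      assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of the fact object that the sample rule refers to as "p". */
final class Processor {
    private final List<String> relations;                  // RelEx output for one sentence
    private final List<String> frames = new ArrayList<>(); // frame elements appended by fired rules

    Processor(List<String> relations) {
        this.relations = relations;
    }

    /** Returns true if some relation matches the pattern; each $variable matches one argument. */
    public boolean existence(String pattern) {
        // Split around $variables, quote the literal pieces, and join them with a wildcard.
        String[] literals = pattern.replace(" ", "").split("\\$\\w+", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append("[^,()]+");
            }
            regex.append(Pattern.quote(literals[i]));
        }
        Pattern compiled = Pattern.compile(regex.toString());
        for (String relation : relations) {
            if (compiled.matcher(relation.replace(" ", "")).matches()) {
                return true;
            }
        }
        return false;
    }

    /** Records one frame element; returns true so the call can be used as an expression (assumption). */
    public boolean AppendRule(String frame) {
        frames.add(frame.trim());
        return true;
    }

    /** Frame elements collected so far, in the order they were appended. */
    public List<String> getFrames() {
        return frames;
    }
}
```

      In this sketch the rule condition only tests for existence; substituting the value
      bound to $atLocation into the appended frame element is left out.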



      Core Mapping Framework

      The mapping framework will receive the RelEx output for a given body of text, match
      it against the rule base and deliver the relevant semantic nodes. The primary
      component of the framework will be the rule engine, which is responsible for
      “firing” the rules in the rule base against the RelEx output. The process of
      verifying the conditions (semantic relation combinations) has been divided into a
      number of independent operations based on the types of semantic relations produced
      by RelEx. The framework will also include support for extracting concept variables
      from the RelEx output and matching them to values in the concept variable value
      store in RelEx.






Statistical Linguistics Based Learning AI

This artificial intelligence component will be responsible for analyzing the RelEx output
in the context of the existing rule base and for using statistical-linguistics-based
learning to extend the rule base as well as the concept variable store. This is an
experimental component which forms the bulk of the research to be carried out in the
project, and a suitable architecture is currently being researched.



Human Knowledge Concept Data Mining Component

This data mining module will receive the semantic nodes from the core framework as input
and use them to extract probabilistic relationships among human knowledge concepts, for
the purpose of automatically developing a “common sense knowledge base” like Cyc. The
module will be developed as an optional component, since this is a specific extended
function which can be ignored in the normal use of the RelEx framework.
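
This section does not fix a mining method, so the following Java sketch uses plain
co-occurrence counting purely to illustrate what a probabilistic relationship between two
concepts could look like. CooccurrenceMiner and its methods are hypothetical names, and
the estimate shown (how often one concept appears in sentences that contain another) is a
stand-in technique, not the project's design.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: estimates P(consequent | antecedent) from sentence-level co-occurrence. */
final class CooccurrenceMiner {
    private final Map<String, Integer> conceptCounts = new HashMap<>();
    private final Map<String, Integer> pairCounts = new HashMap<>();

    /** Records the distinct concepts that appear together in one sentence. */
    void observe(List<String> conceptsInSentence) {
        Set<String> distinct = new HashSet<>(conceptsInSentence);
        for (String a : distinct) {
            conceptCounts.merge(a, 1, Integer::sum);
            for (String b : distinct) {
                if (!a.equals(b)) {
                    pairCounts.merge(a + "\t" + b, 1, Integer::sum);
                }
            }
        }
    }

    /** Returns count(both) / count(antecedent), or 0.0 if the antecedent was never seen. */
    double confidence(String antecedent, String consequent) {
        int base = conceptCounts.getOrDefault(antecedent, 0);
        if (base == 0) {
            return 0.0;
        }
        return pairCounts.getOrDefault(antecedent + "\t" + consequent, 0) / (double) base;
    }
}
```

A relationship would then be kept only if its estimate exceeds some threshold; choosing
that threshold is part of the research and is not specified here.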






3. Design considerations

3.1 Operating Environment

        “SeMap” is developed on Linux-based operating systems such as Mint and Ubuntu,
         using the Java programming language.

        RelEx is the main development environment of “SeMap”, since RelEx2Frame is a
         component of RelEx, which is a dependency relationship extractor. Thus, in
         addition to its own dependencies, “SeMap” requires all the dependencies that
         are needed to use RelEx.

        RelEx is a narrow artificial intelligence component of the OpenCog artificial
         intelligence framework; thus the foundation of the development is OpenCog.

        “SeMap” can be operated on any platform on which RelEx can operate, such as
         Windows or Linux-based operating systems. Users of “SeMap” should set up their
         environment with the relevant dependencies of “SeMap”, such as the Drools
         binaries.



3.2 End-User Environment

        End-users must use RelEx in order to use “SeMap”, and should install all the
         dependencies of RelEx. In addition, “SeMap” depends on a few other external
         binaries, in particular those of Drools. The following Drools binaries are used
         by “SeMap”, and the end-user environment should be set up to point to them by
         adding the “DROOLS_HOME” environment variable.

                drools-core-5.1.1.jar

                drools-compiler-5.1.1.jar

                drools-api-5.1.1.jar

                lib/antlr-runtime-3.1.3.jar

                lib/ecj-3.5.1.jar

                lib/mvel2-2.0.16.jar

                lib/xstream-1.3.1.jar


        • End users may use Linux or Windows based operating systems. Tests may not be
          carried out in the Windows environment, so it is preferred that end users use
          Linux-based operating systems.
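
        The environment setup described above can be checked before SeMap is started.
        The following is a minimal sketch, not part of SeMap, RelEx or Drools: the class
        name DroolsHomeCheck and its methods are hypothetical, and it assumes that the
        listed jars sit at exactly those relative paths under DROOLS_HOME.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of SeMap, RelEx or Drools).
final class DroolsHomeCheck
{
    // Paths relative to DROOLS_HOME, copied from the list in Section 3.2.
    static final List<String> REQUIRED_JARS = Arrays.asList(
        "drools-core-5.1.1.jar",
        "drools-compiler-5.1.1.jar",
        "drools-api-5.1.1.jar",
        "lib/antlr-runtime-3.1.3.jar",
        "lib/ecj-3.5.1.jar",
        "lib/mvel2-2.0.16.jar",
        "lib/xstream-1.3.1.jar");

    // Returns the required jars that are not regular files under droolsHome.
    static List<String> missingJars(Path droolsHome)
    {
        List<String> missing = new ArrayList<>();
        for (String jar : REQUIRED_JARS) {
            if (!Files.isRegularFile(droolsHome.resolve(jar))) {
                missing.add(jar);
            }
        }
        return missing;
    }

    public static void main(String[] args)
    {
        String home = System.getenv("DROOLS_HOME");
        if (home == null || home.isEmpty()) {
            System.err.println("DROOLS_HOME is not set");
            return;
        }
        List<String> missing = missingJars(Paths.get(home));
        if (missing.isEmpty()) {
            System.out.println("All Drools binaries found");
        } else {
            System.out.println("Missing: " + missing);
        }
    }
}
```

        Running the class with DROOLS_HOME set prints either a confirmation or the
        list of jars that could not be found.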



3.3 Performance Requirements

        • Since the system is expected to be used in applications such as chat bots,
          text critiquing and information retrieval, execution time is very important.
          The existing RelEx2Frame component shows an average latency of 500 ms for
          approximately 20 RelEx outputs. Since the objective of SeMap is to replace
          RelEx2Frame, it is required to perform close to or better than the existing
          level of performance.

        • After statistical learning is applied to RelEx2Frame, it is expected to
          recognize more RelEx outputs than the current RelEx2Frame and to provide more
          accurate frame outputs.

        • The commonsense knowledge base developed using data mining techniques should
          contain a reasonable set of rules with coherent probabilistic weightings and
          an acceptable error rate.






4. Architectural Strategies

4.1 Overview

    This section describes the design decisions and strategies that affect the overall
    architecture of the system. It also covers the alternative technologies available
    and the reasons for choosing the selected technology.



4.2 Design Decisions

    Programming Language Selection

    RelEx2Frame is a component of RelEx, so it should use the same programming language
    as RelEx. RelEx is written in Java; therefore RelEx2Frame should also be programmed
    in Java.



    Rule Engine Selection

    As a part of the project, the hand-coded rules that map RelEx relations to FrameNet
    and Novamente frames are replaced with rules executed by a standard rule engine.
    Thus the selection of a rule engine is a major design decision.

    Since this is an open source project, we should select an open source rule engine.
    Also, since the project is programmed in Java, the selected rule engine should be
    accessible from Java.

    Among the few existing rule engines we considered OpenRules [5] and Drools Expert
    [6], which are widely used and have proven to perform well. Another constraint was
    the convertibility of the existing hand-coded rules to the format that the rule
    engine expects. Drools Expert accepts rules in a "When..Then" format, whereas
    OpenRules accepts spreadsheet-formatted rules [5]. Since the hand-coded rules are
    in "IF..THEN" format, they are easier to convert to the "When..Then" format than to
    a spreadsheet format. That was the main reason for selecting Drools Expert over
    OpenRules.





    From Drools 5 onwards backward chaining is supported, which was an added benefit.
    Success stories [7], a highly active community of developers and comprehensive
    documentation strengthened our decision to select the Drools Expert rule engine.



    Caching Knowledge Bases

    Since the knowledge base, which consists of more than five thousand rules, is huge,
    caching it is a must. There are several approaches to caching a Drools knowledge
    base, such as caching the whole knowledge base as a single file, or splitting the
    rules into multiple knowledge bases and caching each of them. Tests were carried
    out for these two options and the latter proved to be better.
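
    The splitting step of the second option can be sketched as follows. This is only
    an illustration under stated assumptions: the class RulePartitioner is
    hypothetical, rules are represented as plain strings rather than Drools rule
    objects, and the chunk size of 100 is taken from the Knowledge Base Manager
    description in Section 5.2.2.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a rule list into fixed-size groups, each of
// which would become one separately cached knowledge base.
final class RulePartitioner
{
    static <T> List<List<T>> partition(List<T> rules, int chunkSize)
    {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < rules.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, rules.size());
            // Copy the view so each chunk is independent of the source list.
            chunks.add(new ArrayList<>(rules.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args)
    {
        List<String> rules = new ArrayList<>();
        for (int i = 0; i < 5000; i++) {
            rules.add("rule-" + i);   // stand-in for a real mapping rule
        }
        List<List<String>> chunks = partition(rules, 100);
        System.out.println(chunks.size() + " knowledge bases");
    }
}
```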



    Statistical Learning of concepts

    The current RelEx2Frame is significantly limited in the concepts or words that it
    detects. Statistical learning methods can be used to reduce this limitation. A
    decision will be taken after analyzing the appropriateness of the new words
    detected for the generated concepts, both by using an existing application and by
    using an implemented statistical learner. Google Sets [8] is one of the existing
    applications that we would consider.



    Selecting the best suited data mining algorithm

    The selected data mining algorithm should be capable of inducing new rules with
    coherent probabilistic weightings. RIPPERk, C4.5 and MLEM2 will be analyzed and the
    most appropriate algorithm will be chosen.






5. Architectural Overview

5.1 Overview

    The core architecture of the project SeMap can be divided into three core sections:

            • Modified RelEx2Frame architecture

            • Statistical learning agent architecture

            • Data mining architecture

    The following subsections provide comprehensive details of the proposed designs of
    the above sections, with activity, sequence, class and block diagrams.



5.2 RelEx2Frame Architecture

5.2.1 Rationale
        The proposed architecture evolved from the basic architecture considered in the
        software requirements specification, based on a number of prototypes used to
        analyze how the architecture affects performance. The results from the
        prototypes provided conclusive evidence that incorporating a standard rule
        engine using the Rete algorithm [9], with over 5000 rules in the rule base,
        results in significant degradation of performance. In order to achieve
        performance comparable to the framework with the native rule engine, it is
        necessary to incorporate concurrency into its operation as well as to use
        techniques such as indexing, buffering and batch processing. The proposed
        architecture either includes or supports a number of the concurrency-based
        concepts and optimization techniques referred to above.








                         Figure 3 – Modified RelEx2Frame Architecture





     Where,

              k_i    Knowledge Base file (i = 1…5)
              r_i    Relation (i = 1…n)
              e_i    Evaluator (i = 1…n)
              s_i    Sentence (i = 1…n)


5.2.2 Object (Static) Model Analysis

        Sentence

        Sentence represents an execution unit of the RelEx2Frame framework: it
        represents a single input sentence to the RelEx framework and retains the RelEx
        output related to that sentence as a collection. Sentences are fed into the
        RelEx2Frame framework in blocks. Sentence plays a key role in the proposed
        architecture in that it is designed as an autonomous unit that "processes"
        itself, requesting the necessary services from management objects. It is
        responsible for generating, using the Condition Index, the list of knowledge
        bases to be executed against it, for requesting and acquiring knowledge bases
        from the Execution Manager, and for retaining the list of semantic nodes that
        fit the represented sentence.


        Evaluator

        The evaluator is responsible for comparing the RelEx relations for a sentence
        with the relations or relation families that a rule requires to be present in
        order to be satisfied. The evaluator divides the space of rules in RelEx2Frame
        into four primary categories and evaluates the presence of unique relationships
        using an index of concept variables and a working memory that is refreshed per
        rule to hold temporary variables.


        Knowledge Base Manager

        The Knowledge Base Manager is a "wrapper" for the standard Knowledge Base object
        of the Drools Rule Engine, designed to serialize the object on creation (if the
        serialized version does not exist) in order to minimize the time taken to load
        the object. Currently each Knowledge Base Manager represents a knowledge base
        with a hundred mapping rules, though this may change based on future
        performance testing.
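
        The serialize-on-creation behaviour can be sketched with plain Java
        serialization. This is a minimal illustration under assumptions, not the SeMap
        implementation: CachedKnowledgeBase and loadOrCreate are hypothetical names,
        and a serializable list of strings stands in for the Drools Knowledge Base
        object.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.function.Supplier;

// Hypothetical wrapper: loads a serialized object if the cache file exists,
// otherwise builds it once and writes the serialized form for later runs.
final class CachedKnowledgeBase
{
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T loadOrCreate(File cache,
                                                   Supplier<T> builder)
    {
        try {
            if (cache.isFile()) {
                try (ObjectInputStream in = new ObjectInputStream(
                        new FileInputStream(cache))) {
                    return (T) in.readObject();
                }
            }
            T built = builder.get();   // the expensive step, done only once
            try (ObjectOutputStream out = new ObjectOutputStream(
                    new FileOutputStream(cache))) {
                out.writeObject(built);
            }
            return built;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args)
    {
        File cache = new File(System.getProperty("java.io.tmpdir"),
                              "semap-kb-demo.ser");
        ArrayList<String> rules = loadOrCreate(cache, () -> {
            ArrayList<String> built = new ArrayList<>();
            built.add("rule-1");   // stand-in for compiling mapping rules
            return built;
        });
        System.out.println(rules);
    }
}
```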




        Condition Index

        As previously mentioned, initial performance testing with the integrated Drools
        Rule Engine indicated a significant degradation of performance. It is proposed
        that if each sentence were compared only against the knowledge bases that have
        at least one rule that would be satisfied, the number of times each knowledge
        base needs to be loaded (identified as an expensive operation) could be reduced,
        improving performance. The Condition Index is an index of all relationships or
        relationship families in the space of mapping rules, where the values pointed to
        by a particular key (relationship) are pointers to the knowledge bases that
        include the rules containing that key. Additional functionality, such as
        providing a disjunction or conjunction of multiple keys to acquire only the
        matching knowledge base pointers, is being considered, since most mapping rules
        contain more than one relationship.
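
        The index described above behaves like an inverted index from relationship
        names to knowledge base identifiers. The following is a minimal sketch under
        assumptions: the class and method names are hypothetical, knowledge bases are
        identified by integers, and the relationship names in main are sample keys
        only.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical inverted index: relationship name -> knowledge base ids.
final class ConditionIndex
{
    private final Map<String, Set<Integer>> index = new HashMap<>();

    void add(String relation, int knowledgeBaseId)
    {
        index.computeIfAbsent(relation, k -> new TreeSet<>())
             .add(knowledgeBaseId);
    }

    private Set<Integer> lookup(String relation)
    {
        return index.getOrDefault(relation, Collections.<Integer>emptySet());
    }

    // Disjunction: knowledge bases indexed under at least one of the keys.
    Set<Integer> anyOf(Collection<String> relations)
    {
        Set<Integer> result = new TreeSet<>();
        for (String relation : relations) {
            result.addAll(lookup(relation));
        }
        return result;
    }

    // Conjunction: knowledge bases indexed under every one of the keys.
    Set<Integer> allOf(Collection<String> relations)
    {
        Set<Integer> result = null;
        for (String relation : relations) {
            if (result == null) {
                result = new TreeSet<>(lookup(relation));
            } else {
                result.retainAll(lookup(relation));
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    public static void main(String[] args)
    {
        ConditionIndex idx = new ConditionIndex();
        idx.add("_subj", 1);
        idx.add("_obj", 1);
        idx.add("_subj", 2);
        idx.add("_to-do", 3);
        System.out.println(idx.anyOf(Arrays.asList("_subj", "_to-do")));
        System.out.println(idx.allOf(Arrays.asList("_subj", "_obj")));
    }
}
```

        A Sentence would query such an index with its own relationships to obtain the
        list of knowledge bases worth requesting.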


        Execution Manager

        The Execution Manager represents the (limited) centralized control mechanism of
        the proposed architecture. It is responsible for scheduling the loading of
        knowledge bases into the working memory of the framework as well as handling the
        requests for knowledge bases from concurrently processing Sentences. The two
        functions are closely related, in that the scheduling of the knowledge bases is
        significantly influenced by the order of the requests.






5.2.3 Dynamic Model Analysis


        Core Activities




                            Figure 4 – Sequence diagram of RelEx2Frame


        The run-time operation of the RelEx2Frame framework will involve multiple
        Sentences processing themselves by requesting the relevant Knowledge Bases. The
        sequence of operations involved in processing a single Sentence is shown in
        Figure 4.





        The processing activity from the perspective of a Sentence object may be visualized
        as follows.




                          Figure 5 – Activity diagram for sentence processing





        The Execution Manager has a key role in the architecture as a control entity
        that is responsible for handling Sentence requests for Knowledge Bases as well
        as scheduling Knowledge Base arrivals into the buffer. For each Sentence block
        that is transferred to RelEx2Frame, the Execution Manager initializes the buffer
        with the set of knowledge bases that will be most requested by the Sentences in
        the block, thus reducing the volume of I/O activity. The request handling
        activity of the Execution Manager may be visualized as shown in Figure 6.




                              Figure 6 – Activity diagram for request handling



        The Execution Manager employs two logs to maintain deferred requests. The
        primary log holds deferred requests for Knowledge Bases that were present in the
        buffer at the time of the request, while the secondary log holds deferred
        requests for Knowledge Bases that were not available in the buffer at the time
        of the request. These logs essentially serve as FIFO queues, which allow the
        Execution Manager to honor deferred requests in the event that a particular
        Knowledge Base is released by a Sentence.
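
        The two-log scheme can be sketched as below. This is an illustration under
        several assumptions that the document does not state: each Knowledge Base is
        held by one Sentence at a time, requests in the secondary log move to the
        primary log when their Knowledge Base is loaded, and all class and method names
        are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of request handling with two FIFO logs.
final class ExecutionManagerSketch
{
    private static final class Request
    {
        final String sentenceId;
        final int kb;

        Request(String sentenceId, int kb)
        {
            this.sentenceId = sentenceId;
            this.kb = kb;
        }
    }

    private final Set<Integer> buffer = new HashSet<>();
    private final Map<Integer, String> holders = new HashMap<>();
    private final Deque<Request> primaryLog = new ArrayDeque<>();
    private final Deque<Request> secondaryLog = new ArrayDeque<>();

    // Grants the knowledge base immediately, or defers the request.
    boolean request(String sentenceId, int kb)
    {
        if (!buffer.contains(kb)) {
            secondaryLog.addLast(new Request(sentenceId, kb));
            return false;
        }
        if (holders.containsKey(kb)) {
            primaryLog.addLast(new Request(sentenceId, kb));
            return false;
        }
        holders.put(kb, sentenceId);
        return true;
    }

    // Loads kb into the buffer; returns the sentence it is granted to,
    // or null if no deferred request was waiting for it.
    String load(int kb)
    {
        buffer.add(kb);
        Iterator<Request> it = secondaryLog.iterator();
        while (it.hasNext()) {
            Request r = it.next();
            if (r.kb == kb) {
                it.remove();
                primaryLog.addLast(r);
            }
        }
        return grantNext(kb);
    }

    // Releases kb and hands it to the oldest deferred request, if any.
    String release(int kb)
    {
        holders.remove(kb);
        return grantNext(kb);
    }

    private String grantNext(int kb)
    {
        if (holders.containsKey(kb)) {
            return null;
        }
        Iterator<Request> it = primaryLog.iterator();
        while (it.hasNext()) {
            Request r = it.next();
            if (r.kb == kb) {
                it.remove();
                holders.put(kb, r.sentenceId);
                return r.sentenceId;
            }
        }
        return null;
    }
}
```

        A real implementation would also need synchronization, since Sentences are
        processed concurrently; the sketch leaves that out to keep the queueing logic
        visible.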




                          Figure 7 – Activity diagram for scheduling




        Supporting Activities

        The core process of the framework is supported by a number of secondary
        activities, including the relationship indexing and querying service as well as
        the Knowledge Base loading and serialization mechanism.

        The indexing service carries out indexing and supports queries based on the
        processes demonstrated in Figures 4 & 6.




                                Figure 8 – Activity diagram for indexing








                         Figure 9 – Activity diagram for index querying



        Once a Sentence block is transferred to RelEx2Frame, each Sentence will generate
        its Knowledge Base request list by querying the index service based on its
        relationship collection. These request lists will be used by the Execution
        Manager to initialize the Knowledge Base buffer prior to Sentence processing.
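
        One way to initialize the buffer from the request lists is to count how often
        each Knowledge Base is requested and keep the most requested ones. The sketch
        below assumes a fixed buffer capacity and integer Knowledge Base identifiers;
        BufferInitializer and mostRequested are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: picks the knowledge bases to preload for a block.
final class BufferInitializer
{
    static List<Integer> mostRequested(
        List<? extends Collection<Integer>> requestLists, int capacity)
    {
        // Count how many times each knowledge base id is requested.
        Map<Integer, Integer> counts = new HashMap<>();
        for (Collection<Integer> requests : requestLists) {
            for (Integer kb : requests) {
                counts.merge(kb, 1, Integer::sum);
            }
        }
        List<Integer> ids = new ArrayList<>(counts.keySet());
        // Most requested first; ties are broken by the smaller id.
        ids.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : Integer.compare(a, b);
        });
        int size = Math.min(Math.max(capacity, 0), ids.size());
        return new ArrayList<>(ids.subList(0, size));
    }

    public static void main(String[] args)
    {
        List<List<Integer>> requestLists = Arrays.asList(
            Arrays.asList(1, 2),
            Arrays.asList(2, 3),
            Arrays.asList(2, 3, 4));
        System.out.println(mostRequested(requestLists, 2));
    }
}
```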

        The Knowledge Base loading and serialization mechanism has been designed to
        minimize I/O operations as well as the performance degradation due to Knowledge
        Base objects being created at run time. This mechanism can optionally be
        executed independently of the framework to generate serialized versions of the
        Knowledge Bases.







                       Figure 10 – Activity diagram for knowledge base loading






5.3 Learning Agent

5.3.1 Rationale

        The proposed concept variable algorithm is based on a weighting system that is
        applied to the candidate (temporary) concept variables. The algorithm manages
        the weight of a given candidate concept variable according to the frequency with
        which it appears in RelEx outputs in the place of the considered concept
        variable class. The reason for adopting this algorithm is that it is simple as
        well as highly extensible. Since the weight system works with float values for
        the weights and a float threshold, it is possible to plug in sub-algorithms that
        look into issues of finer granularity than the currently considered frequency of
        occurrence, and that modify the weights by amounts proportional to the
        significance of the factor each sub-algorithm looks into, resulting in a more
        finely tuned overall algorithm.




                             Figure 11 – Block diagram for learning agent






5.3.2 Object (Static) Model Analysis

        Input handler

        This is the component that receives the current concept variable and the class
        that the variable is supposed to fall into. When there is such input, the input
        handler will pass the values to the Validater. Otherwise it can choose to halt
        or terminate the algorithm execution.


        Validater

        The validater’s task is to report the current standing of the given candidate concept
        variable to the learning agent. For this task it consults the concept variable data
        source and the temporary concept variable data source.


        Learning Agent

        Based on the validater's output, the learning agent decides what to do with the
        candidate concept variable. For example, if it is reported that the particular
        candidate concept variable is in neither the concept variable data source nor
        the temporary concept variable data source, it will decide to add it as a new
        entry to the temporary concept variable data source.






5.3.4 Dynamic Model Analysis




                             Figure 12 – Activity diagram for learning agent


        As is evident from the above activity diagram, the algorithm will add the
        candidate concept variable to the temporary concept variable data source if it
        is not already in either of the data sources. When doing this, the algorithm
        will initialize the weight of the candidate concept variable to a predefined low
        value. If the candidate concept variable is already present in the temporary
        concept variable data source, the algorithm will increase its weight by adding a
        predefined value. Then, depending on the new weight of the candidate concept
        variable and the predefined threshold value, it is decided whether to update the
        entry in the temporary concept variable data source or to promote the candidate
        concept variable to a permanent concept variable by moving it to the concept
        variable data source.
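
        The weight update described above can be sketched as follows. The sketch makes
        assumptions that the document leaves open: it handles a single concept variable
        class, keeps both data sources in memory, promotes a candidate when its weight
        reaches the threshold (greater than or equal), and uses hypothetical names
        throughout.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the candidate weighting for one concept class.
final class ConceptLearner
{
    enum Outcome { ALREADY_KNOWN, ADDED, UPDATED, PROMOTED }

    private final Set<String> permanent = new HashSet<>();
    private final Map<String, Float> temporary = new HashMap<>();
    private final float initialWeight;
    private final float increment;
    private final float threshold;

    ConceptLearner(float initialWeight, float increment, float threshold)
    {
        this.initialWeight = initialWeight;
        this.increment = increment;
        this.threshold = threshold;
    }

    // Processes one occurrence of a candidate concept variable.
    Outcome observe(String candidate)
    {
        if (permanent.contains(candidate)) {
            return Outcome.ALREADY_KNOWN;
        }
        Float weight = temporary.get(candidate);
        if (weight == null) {
            // First sighting: store with the predefined low weight.
            temporary.put(candidate, initialWeight);
            return Outcome.ADDED;
        }
        float updated = weight + increment;
        if (updated >= threshold) {
            // Promote: move from the temporary to the permanent source.
            temporary.remove(candidate);
            permanent.add(candidate);
            return Outcome.PROMOTED;
        }
        temporary.put(candidate, updated);
        return Outcome.UPDATED;
    }

    boolean isPermanent(String candidate)
    {
        return permanent.contains(candidate);
    }

    public static void main(String[] args)
    {
        ConceptLearner learner = new ConceptLearner(0.25f, 0.25f, 1.0f);
        for (int i = 0; i < 5; i++) {
            System.out.println(learner.observe("clever"));
        }
    }
}
```

        Sub-algorithms of the kind mentioned in the rationale could adjust the stored
        float weights by other amounts without changing this control flow.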

        All the details that are explained in the above two diagrams are collectively depicted
        in the following flow diagram.




                            Figure 13 – Flow diagram for learning agent






5.4 Data Mining Architecture



     [Block diagram: Corpus of Text -> (parsing the text using RelEx) -> RelEx ->
      RelEx outputs -> RelEx2Frame -> frame outputs -> Data Manager, which is
      connected to the Data Mining Tool and produces the data mining results]

                                 Figure 14 – Data Mining Architecture


    Data Manager

    As the name suggests, this layer manages the frames output by RelEx2Frame and
    controls the data flow for data mining purposes. It has the following
    functionality [11].

    Manage Frame Sets: The data manager layer will aid in dividing the frames into
    multiple sets so that they can be utilized during various stages of the data mining
    task. The same applies to the results of the data mining task, which might be
    utilized for further processing.

    Input Data Flow: The data flow also needs to be controlled according to the
    requirements of the data mining task, i.e. frame by frame or bulk load. The data
    mining task may also require data in a specific format, so a few transformation
    routines will be necessary to transform the frames into the format required by the
    data mining algorithm.

    Output Data Flow: The results generated by the data mining task will need to be
    managed and altered to fit the needed format.



    Data Mining Tool

    This is the heart of the complete data mining architecture. The Data Mining Tool
    will contain different tasks. The prime functionality of a task will be to analyse
    the data and generate the results.

    The data mining algorithm is the core part of this component. An appropriate
    algorithm will be selected, and it should be a rule induction algorithm. A few rule
    induction algorithms currently exist, such as RIPPERk, C4.5 [12] and MLEM2 [13].



    Input to the Data Miner

    The input to the data mining tool component is a set of frames output by
    RelEx2Frame.

    E.g.: ^1_Grasp:Cognizer(understand,$var0)






    Data Mining Results

    Data mining tool would learn rules like



        IF ^1_Mental_property(stupid) & ^1_Mental_property:Protagonist($var0)

        THEN ^1_Grasp:Cognizer(understand,$var0) <.3>

        IF ^1_Mental_property(smart) & ^1_Mental_property:Protagonist($var0)

        THEN ^1_Grasp:Cognizer(understand,$var0) <.8>



    with coherent probabilistic weightings. The learnt rules mean that stupid people
    mentally grasp less than smart people do [4].
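
    One simple way to attach a weight to such a rule is to estimate it as a
    conditional frequency over the frame sets produced for a corpus. This is only a
    sketch of that idea under stated assumptions, not the algorithm the project will
    select: frames are treated as plain strings with no variable binding, the weight
    is the fraction of frame sets containing the antecedent that also contain the
    consequent, and RuleConfidence is a hypothetical name.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: estimates a rule weight as a conditional frequency.
final class RuleConfidence
{
    static double confidence(List<? extends Set<String>> frameSets,
                             Set<String> antecedent, String consequent)
    {
        int matched = 0;
        int both = 0;
        for (Set<String> frames : frameSets) {
            if (frames.containsAll(antecedent)) {
                matched = matched + 1;
                if (frames.contains(consequent)) {
                    both = both + 1;
                }
            }
        }
        // With no supporting examples the weight is reported as 0.
        return matched == 0 ? 0.0 : (double) both / matched;
    }

    public static void main(String[] args)
    {
        String smart = "^1_Mental_property(smart)";
        String grasp = "^1_Grasp:Cognizer(understand)";
        List<Set<String>> frameSets = Arrays.asList(
            new HashSet<String>(Arrays.asList(smart, grasp)),
            new HashSet<String>(Arrays.asList(smart, grasp)),
            new HashSet<String>(Arrays.asList(smart)),
            new HashSet<String>(Arrays.asList(grasp)));
        Set<String> antecedent = new HashSet<String>(Arrays.asList(smart));
        System.out.println(confidence(frameSets, antecedent, grasp));
    }
}
```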






6. Policies and Conventions

6.1 License

    RelEx is distributed under Apache License version 2.0. Thus RelEx2Frame,
    Frame2RelEx and all the components developed by us will be distributed under
    Apache License version 2.0.



6.2 Discussions

    It is preferred to discuss matters related to RelEx and the existing RelEx2Frame in
    the #opencog IRC channel, where most of the developers are available most of the
    time.

    The project idea was given by Dr. Ben Goertzel, so clarifications about matters
    related to the project idea are discussed with him.



6.3 Coding Conventions

    The following standards were given by OpenCog.org for developers. They will be
    adhered to in our implementations.

           • Use 4 spaces for indentation

           • Never use tabs, only spaces (for instance 8 spaces instead of 2 tabs)

           • Use 80 columns max

           • Use Unix-style end-of-lines (all decent win32 editors support it)

           • Use definition-block brackets on a new line and command-block brackets on
             the same line

           • Use spaces between expression operators ("i + 1" instead of "i+1")
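
    A short Java fragment written to these conventions might look as follows. The
    class is a hypothetical example, and it reads "definition-block" as class and
    method definitions and "command-block" as loops and conditionals, which is our
    interpretation of the guideline.

```java
// Hypothetical example class illustrating the layout conventions.
final class ConventionExample
{
    // Definition blocks (classes, methods): opening brace on a new line.
    static int sumUpTo(int limit)
    {
        int total = 0;
        // Command blocks (loops, conditionals): brace on the same line.
        for (int i = 1; i <= limit; i = i + 1) {
            total = total + i;
        }
        return total;
    }

    public static void main(String[] args)
    {
        System.out.println(sumUpTo(4));
    }
}
```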







Appendix A
Current Class structure for the proposed RelEx2Frame Architecture



        



