

Data Mining Query Languages

             Donato Malerba

     Dipartimento di Informatica
     Università degli studi di Bari
A database perspective on

Most current KDD systems offer isolated
 discovery features using tree inducers,
 neural nets, and rule discovery algorithms
They cannot be embedded into a large
 application and typically offer just one
 knowledge discovery feature
True also for OLAP tools
  This is the first generation of KDD tools

  DMQL – Prof. D. Malerba
Short term research program
 Efficient DM algorithms on top of large
  databases and utilizing the existing DBMS
1. Realization of C4.5 on top of a large database requires
    tighter coupling with the DBMS and intelligent use of
    indexing techniques.
2. Exploitation of caching techniques for association rule
3. Exploitation of special indexing techniques for
See IBM's Intelligent Miner
Long term research program
 KDD should follow one of the key DBMS
  paradigms: building interpreters for query
  languages and compilers for ad hoc queries
  and embedding queries in application
  programming interfaces (API)
 Focus: increasing programmer productivity for
  KDD application development
  Knowledge and Data Discovery Management Systems
     (KDDMS) are the second generation KDD systems.

Imielinski & Mannila’s view
 KDD object
    Rule: probabilistic formula or multidimensional
X.Diagnosis=“heart disease” and X.Age<50 ⇒ X.BMI > 29 [300, 0.80]
    Classifier: decision trees, neural network,
     multidimensional regression
    Clustering: collection of objects
 KDD query: a predicate which returns a set of
  objects that can either be KDD objects or
  database objects (records or tuples)

Imielinski & Mannila’s view
    The KDD objects typically will not exist a priori, thus
     querying the KDD objects requires their generation at
     run time.
    KDD objects may also be pre-generated and stored in
     an “inductive” database, such as metadata.
    In such cases querying can be reduced to retrieval.
    KDDMS should be able to persistently store and
     manage the KDD objects as well as provide the ability
     to query them
    Querying involves
         The generation of new KDD objects
         Retrieval of the ones which were generated before

Imielinski & Mannila’s view
 Closure principle: the result of a query is a
  relation that can be queried further.
 A result of a KDD query may be an argument
  of another compatible type of KDD query.
 In principle a KDD query can be nested within
  a regular relational query.
 KDD queries can be embedded in a host
  programming environment just as SQL queries
  can be embedded in host languages.

Imielinski & Mannila’s view

    Generate a decision tree on a user-defined training set
     (specified through a database query) with user-
     defined attributes and user-specified classification
     categories. Then use all database records misclassified
     by that classifier as training data for
     another classifier.
    Generate all rules with consequent values computed
     by an SQL query (KDD queries may not be completely
     known at a compile time!).
    Find tuples that belong to the largest cluster in a
     clustering constructed according to a user-specified
     distance metrics.
Imielinski & Mannila’s view

Research program:
1. A KDD query language has to be formally defined
2. Query optimization tools have to be developed to
    compile queries into reasonably efficient execution plans
Very challenging!
    KDD queries are much more powerful than SQL

Imielinski & Mannila’s view

Patient(Age, Sex, City, Diagnosis, Height, Weight,
     ClaimAmount, …)
City(State, Population, …)
X.Diagnosis=“heart disease” and Sex=“male” ⇒
     X.Age>50 [1200,0.70]
The user wants to see all the rules about a patient with
     heart disease such that the consequent of the rule
     says something about the age of the patient, there are
     at least 1,000 cases to which the rule body applies, and
     the confidence of the rule is at least 65%.
Imielinski & Mannila’s view
In M-SQL (Imielinski et al., Proc. KDD'96)
WHERE R.Body={(Diagnosis=“heart disease”)} AND
    R.Consequent = {(Age=*)} AND
    R.Support > 1000 AND
    R.Confidence > 0.65
R renames MINE(T)
MINE(T) is an operator that takes a class T and generates
    all propositional rules about T
         Rule discovery: Another type of querying!
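
The idea of rule discovery as querying can be illustrated outside M-SQL. The following is a minimal Python sketch, not M-SQL itself: it assumes the rules have already been generated and are held in memory, and it reads the body condition as "the body contains the given descriptor" (an assumption). The Rule class, the query_rules function and all rules except the first (which is the one shown on the slides) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    body: frozenset      # set of (attribute, value) pairs
    consequent: tuple    # a single (attribute, value) pair
    support: int         # number of tuples to which the body applies
    confidence: float


def query_rules(rules, required_body, consequent_attr, min_support, min_confidence):
    """Select rules whose body contains required_body, whose consequent
    is about consequent_attr, and that exceed both thresholds."""
    return [
        r for r in rules
        if required_body <= r.body
        and r.consequent[0] == consequent_attr
        and r.support > min_support
        and r.confidence > min_confidence
    ]


# Illustrative rule collection; only the first rule comes from the slides.
rules = [
    Rule(frozenset({("Diagnosis", "heart disease"), ("Sex", "male")}),
         ("Age", ">50"), 1200, 0.70),
    Rule(frozenset({("Diagnosis", "heart disease")}),
         ("BMI", ">29"), 1500, 0.80),
    Rule(frozenset({("Diagnosis", "heart disease")}),
         ("Age", "<50"), 300, 0.90),
]

selected = query_rules(
    rules,
    required_body=frozenset({("Diagnosis", "heart disease")}),
    consequent_attr="Age",
    min_support=1000,
    min_confidence=0.65,
)
```

Only the first rule passes: the second has a consequent about BMI and the third falls below the support threshold.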
Imielinski & Mannila’s view

Rules are not necessarily the final product of KDD
A proper API, which embeds a rule query
    language in a more expressive, general-purpose
    host programming environment, is needed
   Iterate over a collection of rules

KDD query languages
Imielinski, Virmani, Abdulghani. Discovery board application
   programming interface and query language for database mining.
   Proc. KDD96
Imielinski and Virmani. MSQL: A query language for database mining.
   Journal of Data Mining and Knowledge Discovery, 3(4), 1999.
Meo, Psaila, and Ceri. A new SQL-like operator for mining association
  rules. Proc. VLDB, 1996.
Han, Fu, Koperski, Wang, and Zaiane. DMQL: A Data Mining Query
  Language for Relational Databases. Proc. SIGMOD'96 Workshop
   on Research Issues on Data Mining and Knowledge Discovery
   (DMKD'96), 1996.
Shen, Ong, Mitbander, and Zaniolo. Metaqueries for Data Mining. In:
  Fayyad et al. Advances in Knowledge Discovery and Data Mining,
  AAAI Press, 1996.
KDD query languages
Giannotti, Manco. Querying Inductive Databases via Logic-Based User-
   Defined Aggregates. PKDD 1999
De Raedt. An Inductive Logic Programming Query Language for Database
   Mining. AISC 1998
De Raedt. A Logical Database Mining Query Language. ILP 2000
De Raedt. Query execution and optimization for inductive databases. Proc.
   EDBT Workshop on Database Technologies for Data Mining, 2002
Boulicaut, Klemettinen, Mannila. Querying inductive databases: a case
   study on the MINE RULE operator. In: Proceedings of the Second
   European Symposium on Principles of Data Mining and Knowledge
   Discovery PKDD'98, LNAI 1510, 1998
Elfeky, Saad, Fouad. ODMQL: Object Data Mining Query Language. In
   Dittrich et al. (eds), Objects and Databases 2000, LNCS 1944, 2001
Johnson, Lakshmanan, Ng. The 3w model and algebra for unified data
   mining. Proc. VLDB, 1998
KDD query languages

Han, Koperski, Stefanovic. GeoMiner: A System Prototype for
  Spatial Data Mining. SIGMOD Conference 1997
Malerba, Appice, Ceci, Vacca. SDMOQL: An OQL-based Data
  Mining Query Language for Map Interpretation. Proc. EDBT
  Workshop on Database Technologies for Data Mining, 2002

  DMQL: just some syntactic
sugar on top of DM algorithms?
 A user can formulate a DM task without paying attention to
    Logical and physical representation problems
    The correct procedural order in which some DM steps should be performed
 The development of decision support applications is
  easier, just as SQL makes the implementation of operational
  information systems easy
 A casual user can find patterns by means of a DMQL in
  the same way he can find data by means of an SQL
  query: no development of ad hoc applications
 A DMQL provides a foundation on which a GUI can be built
                 Spatial Data Mining

    Spatial Data Mining: the extraction of spatial
     patterns from both spatial and aspatial data,
     possibly stored in a spatial database
    Spatial Pattern: a pattern showing the interaction
     of two or more spatial objects or space-depending
     attributes according to a particular spacing or set
     of arrangements

IF a large town intersects the motorway A14
       THEN it is also close to the Adriatic sea (13%, 90%)

Spatial Data Mining & GIS

Geographical Information Systems (GIS) offer an
important application area where spatial data mining
techniques can be effectively used
Example: topographic map interpretation

 Interpreting Topographic Maps
 Topographic map: large scale
  (1:10000 to 1:100000) composite
  map showing relief, vegetation and
  man-made features of a portion of
  a land surface.
 Interpreting the colored lines,
  areas, and other symbols is the first
  step in using topographic maps.
 Easy! Symbols correspond univocally to concepts
  explicitly modelled by the map creator.
 Difficult! locating in a map some geographical objects not
  explicitly modelled (e.g., industrial area)
 Interpreting Topographic Maps
 Solution: embedding intelligent capabilities in geo-based
 Knowledge-based GIS use
   spatial reasoning capabilities
   available domain knowledge
  to support map interpretation
 But operational definitions of some complex concepts
   are difficult to elicit
   are not portable on different data models
   depend on the scale of the map

      Data Mining to Support Map
      Interpretation Tasks

Data Mining tools and techniques to find
 spatial patterns of interest.
INGENS (INductive GEographic iNformation
 System) = GIS + Data Mining Server + …
Training functionality
The user can train the system by providing
 instances of geographical objects to be
 recognized in a map

INGENS Architecture

[Figure: INGENS architecture diagram. Labelled components: GUI (Web Browser), Map Converter, Map Editor, Query, Data mining, Map Storage Subsystem, ObjectStore DBMS, Deductive DBMS, Map Repository, Knowledge Repository. Side labels: Application, Manager.]
The data model for the map

Hybrid tessellation-topological model
Tessellation model: a map is decomposed
 according to a regular grid of cells
Topological model has two structural hierarchies:
  physical (describes the geographical objects by means
   of the most appropriate geometric entity);
  logical (expresses the semantics of geographical

The object-oriented data model in UML

[Figure: UML class diagram of the map data model.
 Core classes: Map, Grid, Gif, Cell, Logical Object (logical structure), Physical Object (physical structure).
 Association labels: Lower scale, N/NE/NW/S/SE/SW/E/W.
 Physical classes: Point, Line, Region, Line vertex, Boundary.
 Logical layers: Hydrography, Orography, Land Administration, Vegetation, Administrative Boundary, Ground Transportation Net., Construction, Built-up Area.
 Logical classes: River, Canal, Lake, Font, Sea, Contour Slope, Slope, Level point, Parcel, Park, Cultivation, Forest, City, Province, County, State, Road, Ropeway, Railway, Bridge, Building, Airport, Wall, Power Station, Factory, Boat Station, Deposit, Hamlet, Town, Chief Town, Regional Capital, Capital.]
Different technologies: what
    support for the user?
 Problem: The user should not suffer from problems
  related to the integration of different technologies, such as
   Data mining
   Deductive databases
 Solution: A data mining query language (DMQL)
  interfaces users with the whole system and hides the
  different technologies.


 DMQL is the data mining query language defined by Han
  et al. (1996) for relational databases
 GMQL (Geo Mining Query Language) is a language for
  spatial data mining, based on DMQL (Koperski 1999)
 Both are inspired by SQL and the relational model ⇒ not
  appropriate for an OO information system like INGENS
 SDMOQL (Spatial Data Mining Object Query Language)
  is a spatial mining query language for INGENS users
  based on OQL

Data Mining primitives
A DMQL must incorporate a set of DM primitives
 designed to facilitate efficient, fruitful
 knowledge discovery.
Primitives include:
  The specification of portions of the database in which
   the user is interested;
  The kinds of knowledge to be mined
  Background knowledge useful in guiding the
   discovery process;
  Interestingness measures for pattern evaluation
  How the discovered knowledge should be visualized
Task-relevant data specification
 In traditional DM applications, it is sufficient to specify
     Database attributes or
     Data warehouse dimensions
   since:
   1. No interaction between objects is assumed, so that
      each object can be effectively described by a tuple
      in the relation
   2. No complex transformation of stored data is required
 Not in spatial data mining, where attributes of the
   neighbors of some spatial object of interest may
   influence the object itself.
     The data set to mine cannot be straightforwardly
     represented by means of a relational table, where
     distinct tuples refer to distinct, independent objects.
 Not in spatial data mining, where working at the level of
   stored data, that is, geometric representations (points,
   lines and regions) of geographic objects, is undesirable.
     The user is interested in working at higher conceptual
     levels, where human-interpretable properties and
     relations between geographical objects are expressed.

Two roads can cross each other, or run parallel,
 or can be confluent, independently of the fact
 that they are represented by one or more tuples
 of a relational table of “lines” or “regions”

A solution
The SDMOQL interpreter allows the user to select the
 geographical objects that are relevant to the
 data mining task, and then it invokes the Map
 Descriptor to produce their high-level conceptual descriptions.
Conceptual descriptions are based on first-order
 logic language, where both properties and
 relations of selected geographical objects can be
 easily represented.

  WHERE x->num_cell = 11
contain(x1,x2)=true, …, contain(x1,x70)=true,
type_of(x1)=cell, …, type_of(x4)=vegetation,…,
subtype_of(x2)=cultivation,…, subtype_of(x7)=cart_track_road,…,
color(x2)=black, …, color(x70)=black,
extension(x7)=111.018,…, extension(x33)=1104.74,
geographic_direction(x7)=north, …, geographic_direction(x68)=north,
line_shape(x7)=straight,…, line_shape(x33)=cuspidal,…,
altitude(x19)=106.00,…, altitude(x43)=102.00,
area(x2)=187525.00, …, area(x62)=30250.00,
density(x2)=high, …, density(x62)=low,
line_to_line(x7,x68)=almost_parallel, …, region_to_region(x2,x21)=meet,…,
distance(x7,x68)=5.00, line_to_region(x8,x27)=adjacent, …,
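
A conceptual description of this kind is a set of ground facts of the form descriptor(arguments)=value. The sketch below shows one possible in-memory representation in Python; it is an illustration only and not the format used by the Map Descriptor. The helper names lookup and objects_with are hypothetical, while the facts are copied from the description above.

```python
# Each fact maps (descriptor, argument tuple) to its value.
description = {
    ("type_of", ("x1",)): "cell",
    ("contain", ("x1", "x2")): True,
    ("subtype_of", ("x2",)): "cultivation",
    ("area", ("x2",)): 187525.00,
    ("line_to_line", ("x7", "x68")): "almost_parallel",
    ("distance", ("x7", "x68")): 5.00,
}


def lookup(description, descriptor, *args):
    """Return the value of descriptor(args), or None if the fact is absent."""
    return description.get((descriptor, args))


def objects_with(description, descriptor, value):
    """Return the argument tuples for which descriptor(...) equals value."""
    return sorted(
        args for (name, args), v in description.items()
        if name == descriptor and v == value
    )
```

Unary descriptors (properties) and binary descriptors (relations) share the same structure, which is the point of using a first-order representation instead of one tuple per object.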
Describing topographic maps
 33 geographical objects: contour_slope, slope, river, canal,
  primary_road, farm_road, interfarm_road, main_road, …
 16 descriptors: contain(x, y), type_of(y), subtype_of(y),
  color(y), area(y), density(y), extension(y),
  geographic_direction(y), line_shape(y), altitude(y),
  line_to_line(y,z), distance(y, z), region_to_region(y,z),
  line_to_region(y,z), point_to_region(y,z)
 Defined together with town planners, the set of descriptors
  is quite general and can capture geometric, topological and
  directional features of geographical objects in a topographic map.
Task-relevant data specification

 In SDMOQL the selection of geographical objects is
  performed by means of simplified OQL queries with a
  SELECT-FROM-WHERE structure.
 Example 1: cell-level query
The user selects cell 26 from the topographic map of Canosa
  (Apulia, Italy)
SELECT x
FROM x IN Cell
WHERE x->num_cell = 26 AND x->part_map->map_name = “Canosa”
The Map Descriptor generates the description of all the
  objects in this cell.
Task-relevant data specification

 Example 2: layer-level query
The user selects the layer Orography from the
  topographic map of Canosa and the layer Construction
  from any map.
SELECT x, y
FROM x IN Orography, y IN Construction
WHERE x->part_map->map_name = “Canosa”

The Map Descriptor generates the description of the objects
  in these layers.

Task-relevant data specification

 Example 3: object-level query
The user selects the objects of the logic class River and the objects
  of type motorway (instances of the class Road), from cell 26 of
  the topographic map of Canosa.
SELECT x, y
FROM x IN River, y IN Road
WHERE x->part_map->map_name = “Canosa” AND
       y->part_map->map_name = “Canosa” AND
       x->log_incell->num_cell = 26 AND
       y->log_incell->num_cell = 26 AND
       y->type_road = “motorway”
The Map Descriptor generates the description of these objects.

Task-relevant data specification
   Example 4: Semantically ambiguous query
SELECT x, y
FROM x IN Cell, y IN River
WHERE x->num_cell = 26 AND
        y->log_incell->num_cell = 26
This query selects the object cell 26 and all rivers in it. However, it is
     unclear whether the Map Descriptor should describe
1. the entire cell 26: formulate a cell-level query instead;
2. only the rivers in it: formulate an object-level query instead; or
3. both: an unusual case; the problem can be solved by the UNION operator,
     applied to the cell-level query and the object-level query.
Task-relevant data specification
The following constraint is imposed on SDMOQL:
the selected data must belong to the same level (cell, layer or
    logic object).
More formally the FROM clause can contain either a group of
   Cells or a set of Layers, or a set of Logic Objects, but
   never a mixture of them.
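
The same-level constraint is easy to check mechanically. The following Python sketch assumes a lookup table from class names to levels; the table LEVEL_OF and the function from_clause_level are hypothetical, and only a handful of the class names from the data model are listed.

```python
# Hypothetical mapping from class names to their level in the data model.
LEVEL_OF = {
    "Cell": "cell",
    "Orography": "layer",
    "Construction": "layer",
    "River": "logic object",
    "Road": "logic object",
}


def from_clause_level(class_names):
    """Return the common level of the classes in a FROM clause,
    or raise ValueError if they belong to different levels."""
    levels = {LEVEL_OF[name] for name in class_names}
    if len(levels) != 1:
        raise ValueError(
            "FROM clause mixes levels: " + ", ".join(sorted(levels))
        )
    return levels.pop()
```

With this check, the FROM clause of the object-level example (River, Road) is accepted, while the ambiguous one (Cell, River) is rejected before any description is generated.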

The kind of knowledge to be mined
<Spatial_Data_Mining_Statement> ::=
<Kind_of_Pattern> ::=
<Classification_Rules> | <Association_Rules>

<Classification_Rules> ::=
        classification as <Pattern_Name>
        for <Classification_Concept>{,<Classification_Concept>}
         [analyze <Descriptor> {, <Descriptor>}]
The analyze clause indicates that the description of the selected data is
  based on the spatial/aspatial descriptors in the list

SELECT x
FROM x in Cell
WHERE x->num_cell >= 5 AND x->num_cell <= 12
mine classification as MorphologicalElements
for class(_)=system_of_farms, class(_)=fluvial_landscape
analyze      contain/2, type_of/1, subtype_of/1,
             area/1, density/1, extension/1,
             line_shape/1, geographic_direction/1,
             line_to_line/2, distance/2, line_to_region/2,
             region_to_region/2, point_to_region/2
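
Each entry of the analyze clause has the form name/arity. As a small illustration of how such a list could be read, the Python sketch below splits the clause into (name, arity) pairs. The function parse_descriptors is hypothetical and is not part of the SDMOQL interpreter.

```python
def parse_descriptors(analyze_clause):
    """Split a comma-separated 'name/arity' list into (name, arity) pairs."""
    specs = []
    for item in analyze_clause.split(","):
        name, arity = item.strip().split("/")
        specs.append((name, int(arity)))
    return specs


clause = (
    "contain/2, type_of/1, subtype_of/1, area/1, density/1, extension/1, "
    "line_shape/1, geographic_direction/1, line_to_line/2, distance/2, "
    "line_to_region/2, region_to_region/2, point_to_region/2"
)
specs = parse_descriptors(clause)

# Binary descriptors express relations between two objects.
binary = [name for name, arity in specs if arity == 2]
```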

Defining background knowledge

 In SDMOQL the BK is defined as a set of definite clauses.
 Example:
define knowledge
  close_to(X,Y)=true :- region_to_region(X,Y)=meet.
  close_to(X,Y)=true :- close_to(Y,X)=true.
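
The two clauses say that close_to holds wherever two regions meet, and that it is symmetric. A minimal Python sketch of their bottom-up evaluation is given below; it only illustrates the meaning of the clauses and is not how the deductive DBMS evaluates them. The function close_to_pairs is hypothetical.

```python
def close_to_pairs(region_to_region):
    """Derive all close_to pairs from region_to_region facts.

    region_to_region maps (a, b) to a relation name such as "meet".
    """
    # First clause: close_to(X,Y) holds when region_to_region(X,Y)=meet.
    base = {pair for pair, rel in region_to_region.items() if rel == "meet"}
    # Second clause: close_to is symmetric. One pass is enough, because
    # swapping the arguments twice gives back a pair already in the set.
    return base | {(b, a) for (a, b) in base}
```

Computing the set bottom-up avoids the non-termination that a naive top-down evaluation of the recursive second clause could run into.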

Defining schema hierarchies

 Define a total or partial order among attributes in the database
  schema.
 Example:

   Activity
     business_activity
       low_business_activity
       high_business_activity
     other_activity

define hierarchy Activity as
  level1:{business_activity, other_activity} < level0: Activity;
  level2:{low_business_activity,high_business_activity} < level1: business_activity;
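
A hierarchy like this lets the mining step replace a value by one of its ancestors. The Python sketch below stores the hierarchy as a child-to-parent map and climbs it to a requested level. It assumes that both level-2 values sit under business_activity, as the tree on the slide suggests; PARENT and generalize are hypothetical names.

```python
# Child -> parent links of the Activity hierarchy (level 0 is the root).
PARENT = {
    "business_activity": "Activity",
    "other_activity": "Activity",
    "low_business_activity": "business_activity",
    "high_business_activity": "business_activity",
}


def generalize(value, level):
    """Return the ancestor of value that sits at the given level."""
    path = [value]
    while path[-1] in PARENT:          # climb from value up to the root
        path.append(PARENT[path[-1]])
    depth = len(path) - 1              # level at which value itself sits
    if not 0 <= level <= depth:
        raise ValueError(f"{value} has no ancestor at level {level}")
    return path[depth - level]
```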
Defining set-grouping
 Organize values for given attributes or dimensions into groups of
  constants or ranges of values
 Example:

   Distance
     far: 2 km .. +∞
     near: 0 m .. 1,999 m

define hierarchy Distance for distance/2 as
  level1:{far, near} < level0: Distance;
  level2:{0, 1999} < level1: near;
  level2:{2000, +inf} < level1: far;
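
Applied to data, this set-grouping hierarchy turns a numeric distance into one of two symbolic values. The Python sketch below shows the mapping; treating every value below 2000 m as near (including non-integer values between 1999 and 2000) is an assumption, and distance_group is a hypothetical name.

```python
def distance_group(metres):
    """Map a distance in metres to the level-1 value of the hierarchy."""
    if metres < 0:
        raise ValueError("distance cannot be negative")
    # Assumption: values between 1999 and 2000 are assigned to "near".
    return "near" if metres < 2000 else "far"
```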
Interestingness measure
 threshold values: e.g. the user can set thresholds such
   as confidence and support as follows:
           ThresholdParameter threshold Value
 search biases in the hypotheses space: The user can
   specify a number of preference criteria, such as
   maximization of the number of covered examples or
   minimization of the number of variables in the body of a
   learned clause, according to the following syntax:
 preference criteria (minimize | maximize ) Criterion
                    with tolerance Value.
 generic input parameter of a data mining algorithm:
                 ParameterName = Value
An example

Problem: Localize a “sistema poderale” (system
 of farms) in Apulian maps.
The user browses the maps with INGENS and
 finds some examples of system of farms …

An example: the data
… and some

An example: the DM query
 Formulate a data mining task through SDMOQL:
SELECT x FROM x in Cell
WHERE(x->num_cell>=1 AND x->num_cell<=6) OR x->num_cell=11
   OR x->num_cell=34 OR (x->num_cell>=15 and x->num_cell <= 17)
mine classification as MorphologicalElements
for class(X)=system_of_farms
analyze contain/2, type_of/1, subtype_of/1, color/1, altitude/1,
   area/1, density/1, extension/1, line_shape/1, geographic_direction/1,
        line_to_line/2, distance/2, line_to_region/2,
        region_to_region/2, point_to_region/2
with preference
minimize negative_example_covered with tolerance 0.6,
maximize positive_example_covered with tolerance 0.4,
minimize cost with tolerance 0.4
number_of_rules threshold 15, consistent threshold 500
An example: the process
[Figure: process flow. Labels: query of spatial data mining, object oriented, Map Descriptor, descriptions, data mining algorithms, knowledge.]


An example: results

class(S1)=system_of_farms ←
   contain(S1,S2)=true, region_to_region(S2,S3)=meet,
   area(S2) ∈ [68437.5 .. 187525],
   region_to_region(S4,S3)=meet, type_of(S1)=cell,
   type_of(S2)=parcel, type_of(S4)=parcel

There are two pairs of adjacent parcels, (S2, S3) and (S4,
  S3), one of which is relatively large (the area is between
  68437.5 and 187525 m²).
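
To make the meaning of such a rule concrete, the Python sketch below checks the rule body against a cell description by brute force. It is only an illustration and does not reproduce the learning or deduction machinery of INGENS. It assumes that distinct variables denote distinct objects; BODY, satisfies and is_system_of_farms are hypothetical names.

```python
from itertools import permutations

# Body literals of the rule: (descriptor, variables, test on the value).
BODY = [
    ("type_of", ("S1",), lambda v: v == "cell"),
    ("contain", ("S1", "S2"), lambda v: v is True),
    ("type_of", ("S2",), lambda v: v == "parcel"),
    ("area", ("S2",), lambda v: 68437.5 <= v <= 187525),
    ("region_to_region", ("S2", "S3"), lambda v: v == "meet"),
    ("region_to_region", ("S4", "S3"), lambda v: v == "meet"),
    ("type_of", ("S4",), lambda v: v == "parcel"),
]
VARIABLES = ("S1", "S2", "S3", "S4")


def satisfies(facts, binding):
    """True if every body literal holds under the variable binding."""
    for descriptor, variables, test in BODY:
        key = (descriptor, tuple(binding[v] for v in variables))
        if key not in facts or not test(facts[key]):
            return False
    return True


def is_system_of_farms(facts, cell):
    """Try every assignment of distinct objects to S2, S3 and S4."""
    objects = sorted({o for (_, args) in facts for o in args} - {cell})
    for combo in permutations(objects, len(VARIABLES) - 1):
        binding = dict(zip(VARIABLES, (cell,) + combo))
        if satisfies(facts, binding):
            return True
    return False
```

The facts use the same (descriptor, arguments) to value layout as the conceptual descriptions, so a cell is classified as soon as one binding of S2, S3 and S4 satisfies all seven literals.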

An example: results

class(S1)=system_of_farms ←
   contain(S1,S2)=true, region_to_region(S2,S3)=disjoint,
   density(S3)=high, region_to_region(S2,S4)=meet,
   region_to_region(S4,S5)=meet, region_to_region(S2,S5)=meet,
   type_of(S1)=cell, area(S2) ∈ [12381.2 .. 25981.2], type_of(S2)=parcel

There are three adjacent regions (S2, S4, S5), one of which is certainly
   a medium-sized parcel (the area is between 12381.2 and 25981.2
   m²), and there is a fourth region (S3) with a high density
   (presumably vegetation), disjoint from the parcel S2.

An example: use of results
 The user asks INGENS to find all cells in the Canosa map
  that are classified as system of farms and contain a main road
   FROM M in Map, C in Cell, R in Road
   WHERE M->name = “Canosa” AND C->map = M AND R->log_incell = C AND
     R->type_road=“main_road” AND class(C) = system_of_farms
 To check the condition defined by the predicate
  class(C)=system_of_farms, the Query Interpreter
  generates the symbolic description of each cell in the map
  and asks the Query Engine of the Deductive Database to
  prove the goal class(C)=system_of_farms given the logic
  program previously learned.
Conclusions and future work
  A query language for spatial data mining based on OQL
  A solution to the problem of integrating different
   technologies (OODBMS, Deductive database, DM, …)
  Differences with respect to traditional DMQL
  Implementation of the interpreter in INGENS.
                        Future Work
  Extension of the set of descriptors automatically
   extracted from a vectorized map
  Extension to other spatial data mining tasks supporting
   quantitative interpretation of maps
