Trends in Spatial Databases

Document Sample
Trends in Spatial Databases Powered By Docstoc
					            Chap8: Trends in DBMS

8.1   Database support for Field Entities
8.2   Content-based retrieval
8.3   Introduction to spatial data warehouses
8.4   Summary
Learning Objectives
    Learning Objectives (LO)
       LO1: Learn about field data
         • Why learn about field data type?
         • What is field data type? How is represented in SDBMS?
         • What are common operations on fit?
       LO2 : Learn about storage and retrieval of field data
       LO3: Learn about spatial data warehouses
    Mapping Sections to learning objectives
       LO1        -         8.1.1
       LO2        -         8.1.2, 8.2
       LO3        -         8.3
Why learn about Field data-sets?
  Field data is timely and abundant
      Sensors (e.g. satellite based ones) provide periodic snapshot of Earth
      Most up-to-date data about current events (e.g. fires, flood)
  Field data are useful
      in creating, revising and evaluating vector data sets
      digital archival of fragile historical paper maps
      to manually get details not captured in vector interpretations
  Example: Location selection for a facility (e.g. a grocery store)
      Consider a set of Aerial photographs of different locations
      Vector interpretation includes roads, water bodies, elevation
      What other information can aerial imagery reveal for construction planning?
       • Trees (types and location), buildings, …
What are Field data-sets?
  Field data set examples
      Satellite images, aerial photographs
      Digitized paper maps
      Earth Science data-sets, e.g. rainfall, temperature maps
  Data types of Spatial field data sets
       • Satellite based, e.g.
       • Aerial photographs
       • Measurements from a Geo-registered sensor networks, e.g. weather
      Video, i.e. time series of images
      Audio data
  Focus: Primarily images,
      though some discussion will apply to other data types
Fields and Rasters: An Sampling of Field values
• Definitions
    • Field: a mapping from a spatial domain
    to a value domain
    •Image: a mapping from a rectangular
    grid to a value domain
    • A rectangular grid is a collection of cells
    called pixels
    • Raster is geo-registered image, i.e. grid
    axis have absolute spatial locations

• Fields are often approximated as rasters
    • Example: Figure 8.1
    • Identify spatial domain, field, rectangular
    grid, raster approximation

• Fields can be approximated as images if
relative spatial locations are adequate             Fig 8.1
Computing with field data
  Field data manipulated using operations of
     map algebra
     image algebra
  An Algebra is a mathematical structure consisting of
     Operands and Operations.
  Map Algebra
     Operand: rasters
     Operations: Can be classified into four groups
      • Local, Focal, Zonal and Global
  Image Algebra
     Operand: images
     Operations: crop, zoom, rotate
Local Operation
A local operation maps a raster into another raster such that the value of a
cell in the new raster depends only on the value of that cell in the original

Examples: unary operation : thresholding
         binary operation: point wise addition

                                  Fig 8.2
Focal Operation
In a focal operation, the value of a cell in the new raster is dependent
on the values of the cell and its neighboring cells in the original raster.
Examples: unary operations: focal sum, gradient, …

                                                   Neighborhoods: Rook, Bishop and Queen

                                                            Fig 8.3
Zonal Operation
In a global operation, the value of a cell in the new raster is a function of
the location or values of all cells in the original or another raster.
Examples: zonal sum, zonal average, ...

                                              Fig 8.4
Global Operation
In a zonal operation, the value of a cell in the new raster is a function of the
value of that cell in the original layer and the values of other cells which
appear in the same zone specified in another raster.

Example: distance from nearest facility

                               Fig 8.5
Image Operations:Trim

•Image Operations
    • ignore the absolute locations of pixels.
    • come from image processing literature
         •Ex. smoothing, low pass filter, high pass filter,

• Example: A trim operation extracts an axis-aligned subset of the original raster.

                            Fig 8.6
Learning Objectives
    Learning Objectives (LO)
       LO1: Learn about field data
       LO2 : Learn about storage and retrieval of raster data
         •   How is raster data stored on secondary storage?
         •   What query families are used for retrieval?
         •   What is content based retrieval (CBR)? Why is it interesting?
         •   How is CBR computationally approached?
       LO3: Learn about spatial data warehouses
    Mapping Sections to learning objectives
       LO1          -          8.1
       LO2          -          8.1.2, 8.2
       LO3          -          8.3
Storage and Retrieval of Raster Data - 1
    Traditional Approach
       store raster data in a file system
       use custom software to retrieve data-items of interest
       Example: personal photographs stored on MS Windows
         • Q? What attributes can one attach to digital photographs ?
         • Q? Is there an easy way to retrieve all pictures taken in San Francisco?
       Rigid schema
         • Limited ability to add and manage additional attributes
       Canned Queries only
         • Limited ability to support ad-hoc queries
       Data quality
         • Limited ability to identify duplicates or similar data-items
Storage and Retrieval of Raster Data in a SDBMS
    A database approach
       Database tables store
        • raster data items
        • attributes (i.e. meta-data), e.g. creation date, geo-location, subject, ...
       use SQL like query language to retrieve desired data-items
        • retrieve all raster data-items overlapping with city of San Francisco (Q1)
        • retrieve latest raster data-item within city of Paris (Q2)
        • retrieve raster data-items similar to a given image (Q3)
        •   table schema definition allows user defined attributes
        •   improve ability to pose ad-hoc queries (Ex. Q1, Q2)
        •   improve data reliability and quality
        •   Example: Query Q3 may be used for duplicate reduction
Storage and Retrieval of Raster Data - Challenges
    Challenges in database based approach
       storage: size( raster data item) > size (disk blocks)
       retrieval: raster has rich content
        • A picture is worth a thousand word!
    Approaches to storage challenge
       1. Delegate storage to DBMS
        • Use Binary Large Object (BLOB) data-type
        create table my_picture(
           image: BLOB;
           creation_date: date;
           place: point;
       2. Do-it-yourself
        • Divide a raster data-item into smaller slices
        • Q? Which way of slicing reduce disk I/Os for common queries?
8.1.2 How is raster data stored on secondary storage?
 • Slicing approaches
      • Linear, e.g. one row per disk block (see Fig. 8.8(b))
      • Tiling - see Fig. 8.8(c )
 • Tiling is preferred
       • for queries extracting rectangular sub-images
       • Example -

                                                                Fig 8.8
8.2 How is raster data queried?
  Retrieval challenge of rich content
     A. Meta-data approach
     B. Content based retrieval
  Meta-data approach
     select a set of descriptive attributes
       • simpler SQL data types, e.g. numeric, string, date, ...
       • Example: source, location, time stamp, subject, resolution, ...
     Store values of descriptive attributes for each raster data-item
     Allow SQL queries on the descriptive attributes

  Limitation of meta-data approach
     Restricts queries to content captured by descriptive attributes
     Does not support “Similarity” based queries
       • Ex. Find all raster data-items similar to a given raster data item.
8.2 Content Based Retrieval (CBR)
    Q1. Find all raster data-items similar to a given raster data item
    Q2. Locate a photograph of a river in Minnesota with trees nearby.
    Q3. Find all images of state parks which have a lake within them, are within a radius of
    one hundred miles from Chicago, and are southwest of Chicago.
• State of the Art
    • However, few robust implementations of CBR are available as of 2002
         • Several research prototypes address similarity query Q1
    • Result quality is similar to those of web searches (e.g.
         • Some of the retrieved raster data-item are useful.
         • Many similar data item are not retrieved in the result
         • Usable in application domains such as publishing
    • Our goal is to understand a current approach to similarity queries
         • involving spatial similarities
 8.2 Content Based Retrieval (CBR)
• Spatial Similarity
     • Consider a pair of raster images with common objects (e.g. parks, lakes)
     • Spatial similarity between raster images can be defined based on
          • similarity of spatial relationships (e.g. topological, directional)

• Q? Which pairs exhibit higher similarity?
     •P1: (inside, disjoint)               or          P2: (inside, covered by)
     •P3: (disjoint, touch)                or          P4: (disjoint, inside)
     •P5: (north west, north)              or          P6: (west, east)

• A graph framework for comparing spatial relationships
     • Nodes = spatial relationships ; Edges = connect most similar nodes
     • Similarity metric = number of edge on shortest path between 2 nodes
     • See Figures 8.9 and 8.10
8.2.1 Topological Relationship Similarity
• Study Fig. 8.9, pp. 234
    • Nodes = topological relationships
    • Edges = most similar
    • Similarity measure = path length

• Inference from Model
    • P2: (inside, covered by) more
    similar than P1: (inside, disjoint)
    • Do you agree?
    • Review Figure 2.3 (pp. 30)

                                          Fig 8.9
8.2.2 Direction Relationship Similarity
• Study Fig. 8.10, pp. 235
    • Nodes = topological relationships; Edges = most similar
    • Similarity measure = path length
    • Inference: P5 (north-west, north) more similar than P6 (west, east)

                                                                Fig 8.10
8.2.3 Distance Similarity
• Distance similarity is based on
    • Euclidean distance between the centroids of the objects.
    • Example: Image R is more similar to P than Q in Fig. 8.11 (pp. 235)

                                     Fig 8.11
8.2.4 A Computational Approach to CBR
• Attribute Relation Graph (ARG)
    • Node = objects in a raster
    • Edges = relationships
    • Ex. Raster of Fig. 8.12(a)
    • ARG in Fig. Fig. 8.12(b)
         • Point object O3
         • Rectangles O1, O2
         • Edge (O1, O2) shows that
         they are disjoint, at 61
         degree direction and 5.2
         units distant.

• Vector representation of ARG
    • Lists objects and edge
                                      Fig 8.12
    • Ex. In Fig. 8.12
A Computational Approach to CBR
• Steps:

1. Represent each raster data
item by its ARG vector
2. Map query raster data item by
its ARG vector
3. Find most similar raster data-
items in the database by
comparing ARG vector
     • Use a distance metric
     • Use a multi-dim. Index

• Comment: Result quality is
similar to those of web
searches. Some of the retrieved
raster data-item are useful.
                                    Fig 8.13
Learning Objectives
    Learning Objectives (LO)
       LO1: Learn about field data
       LO2 : Learn about storage and retrieval of field data
       LO3: Learn about spatial data warehouses
         • What are data warehouses? Why are they interesting?
         • What are aggregate functions? Which ones are easy to compute?
    Mapping Sections to learning objectives
       LO1        -        8.1
       LO2        -        8.1.2, 8.2
       LO3        -        8.3
8.3 Why are Data Warehouses Interesting?
 • Data Warehouse facilitate group decision making
     • Consider a dataset
         • 1 measure (i.e. Sales)
         •3 dimensions (e.g. Company, Year, Region)
 • Analysis questions
      • Q1. Rank Regions by total sales.
      • Q2. Rank years by total sales.
      • Q3. Where are sales consistently growing?
 • Cross tabulates summaries reports used to analyze the trends
      • Example:
8.3 Generating cross-tabulation summaries
 • Traditional Approach
     • Use custom software pulling data out of a DBMS
     • Limitations: redundant of work, inefficient use of resources

 • Data Warehouse approach
     • Cross-tab. Can be generated using a set of simple report
          •Each report is generated from a SQL “Select ... group by” statement
     • Example: Fig. 8.19 (pp. 244) and Table 8.3 (pp. 245)
          • Cross-tab example in last slide is a union of
               • SALES-L0-A, SALES-L1-A, SALES-L1-B and SALES-L2
          • Table 8.3 shows SQL queries to compute each part
     • Advantage
          • Rest of SQL is available for pre/post processing of data
          • Performance gains by eliminating unnecessary copying of data
Example Data Warehouse (Fig. 8.19)

Fig 8.19
8.3.4 Cross-tabulation hierarchy
 • Spreadsheet view of a report
      • Views a report a N-dim. Spreadsheet
      • N = number of dimension attributes
      • Each cell contains value of “measure”
 • Cross-tabulation view of a Report hierarchy
      • Example: report hierarchy for
      • SALES-L0-A, SALES-L2-A, SALES-L1-B, SALES-L2, Fig. 8.19 (pp. 244)
8.3 What is a Data Warehouse?
 • Data Warehouse is a special purpose database
     • Primarily used for specialized data analysis purposes
      • Facilitates generation and navigation of a hierarchy of reports
 • Special purpose data-sets and queries
      • Data consists of
           • a few measure attributes
           • a set of dimension attributes
      • The measure attribute depends on dimension attributes
      • Queries generate reports
           • Report measure for selected values of dimensions
           • Aggregate measure for given subset of dimensions
 • What is a spatial data warehouse?
     • Data warehouses with spatial measures or dimensions
     • Example: census data - census tract is a spatial dimension
     • Example: logistics data - route is a spatial dimension
8.3.4 Data Warehouse Operations
 • Operations on a data warehouse
     • Roll-up, Drill-down
     • Slice, Dice
     • Pivot
 • Roll-up
     • Inputs: A report R, A subset S of dimensions in R
     • Output: A sequence of reports summarizing R
     • Example 1: R = SALES-Base, S = (Year, Region) in Fig. 8.19 (pp. 244)
           • Output consists of reports SALES-L0-A, SALES-L1-B, SALES-L2
     • Example 2: R = SALES-Base, S = (Region, Year)
           • Output consists of reports SALES-L0-A, SALES-L1-A, SALES-L2
 • Drill-down
     • Inputs: A report R, A dimension D not in R
     • Output: A reports detailing R on D
     • Example: R = SALES-L1-B, D = Region in Fig. 8.19 (pp. 244)
          • Output : report SALES-L0-A
8.3.4 Data Warehouse Operations
 • Slice, Dice
     • Reduce dimensions in a table- (Fig. 8.7, pp 232).
     • Inputs: A report R, A value V for a dimension D in R
     • Output: A subset of R where D =V
     • Example: R = SALES-L0-A, D = Year, V = 1994 in Fig. 8.19 (pp. 244)
          • Output: Table 8.5 (pp. 246)
          • includes tuple (ALL, 1994, America, 35)

 Fig 8.7
8.3.4 Data Warehouse Operations
 • Pivot
     • For a spreadsheet view of reports
     • Transposes a spreadsheet

 • Example
     • Inputs: A spreadsheet view of a report R
     • Output: A transposed spreadsheet
     • Ex.: R= SALES-L0-A, Fig. 8.19 (pp. 244)
Logical Data Model of a DWH

 • Purpose of a logical data model
     • Specify a framework to specify computational structure
     • Allow extension of SQL to model new needs
 • Cube operation
     • Input : A fact table
     • Output: A set of summary reports covering all subsets of dimension columns

          • Equivalent to union of all tables and reports in Fig. 8.19 (pp. 244)
     • Ex. Fig. 8.18, pp. 243
     SELECT Company, Year, Region, Sum(Sales) AS Sales
     FROM             SALES
     GROUP BY CUBE Company, Year, Region
Physical Data Model of a DWH
 • Purpose: Computationally efficient implementation
 • Ideas:
     • Pre-computation -
            • pre-compute some of reports and use those to compute other reports
     • New indexing methods, e.g. bit-map index
     • Query Processing Strategies
            • Strategies for aggregate functions
            • New strategies for multi-table joins

 • Let us look at strategies for aggregate functions
DWH Physical Model: Aggregate function strategies
 • Aggregate Functions
     • Compute summary statistics for a given set of values
     • Examples: sum, average, centroid (Table 8.1, pp. 238)

 • Strategies for efficient computation
     • Characterize easy to compute aggregate functions
     • 3 categories
          • Distributive
          • Algebraic
          • Holistic
     • First 2 categories can be computed easily in one scan of the dataset
Definitions of Aggregate Function Categories
• Notation:
    • F, G, G1, G2, … Gn are aggregate functions where n is small
    • S is a set of values, e.g. S = (1, 2, 3, 4)
    • P = (S1, S2, …, Sp) is a partition of S, e.g. P = (S1, S2), S1 = (1, 2), S2 = (3, 4)
•Distributive( F ) if there exists a G such that
    • F( S ) = G ( F(S1), F(S2), …, F(Sn) )
    • Example: sum is distributive
    • Illustration: sum(1, 2, 3, 4) = sum ( sum(1, 2), sum(3, 4)
• Algebraic( F ) if there exists G1, …, Gn, (where n is small) and
    • F( S ) = G ( G1(S1), …, Gn(S1), G2(S1), …, Gn(S2), …, G1(Sp), …, Gn(Sp) )
    • Example: average is distributive
    • Illustration: average(1, 2, 3, 4)
         = { count(1, 2) * average(1, 2) + count(3, 4) * average } / { count(1,2) +
         count(3,4) }
Example: Distributive Aggregate Function
• Examples in cross-tabulation scenario (Fig. 8.14, pp.238):
    • Example 1. Min is distributive
    • Example 2. Count is distributive

                                  Fig 8.14
Examples: Algebraic Aggregate Functions
• Examples in cross-tabulation scenario (Fig. 8.15, pp.239):
    • Average and Variance are algebraic                       Fig 8.15
Discussion - Spatial Data Warehouse
• Example
    • Consider the example in Fig. 8.16, pp. 241
    • A map interpretation may be attached to each report
           • Each row has a spatial footprint, which can be aggregated by geometric-union
    • The collection of maps may be called a mapcube
    • What is needed in OGIS standard to support map-cube operation?
    • Hierarchical collection of maps in mapcube
           • What is an appropriate cartography to convey the relationship among maps?
Spatial Data Warehouses and Mapcube
                 Fig 8.16
  Field data
      useful in many applications due to rich content
      Represented as raster or image
      Operations can be categorized into local, focal, zonal, and global
  Field data storage and retrieval
      Tiling is a preferred way to divide raster data into disk blocks
      Meta-data based query is often used for retrieval
      Content based retrieval may be used for similarity searches
  Data warehouses support analysis e.g. cross-tabulation reports
      SQL CUBE operator support generation of DWH reports
      Distributive and Algebraic aggregate functions can be computed easily

Shared By: