Towards Keyword-Driven Analytical Processing

Document Sample
Towards Keyword-Driven Analytical Processing Powered By Docstoc
					Towards Keyword-Driven
Analytical Processing

   Ping Wu, Yannis Sismanis, Berthold Reinwald

   Presented By – Amit Goyal (
   Date: 21st Nov 2007

   Background
   Motivation
   Solution Framework
       Differentiate
       Explore
   Experiments
   Conclusions

   Problem:
       Given: multiple data sources
       Find: patterns (such as?)
   What were the sales volumes by region and product
    category for the last year?
   Which orders should we fill to maximize revenues?
   Will a 10% discount increase sales volume

   Users don‟t know how to specify what they
    want, but they know it when they see it
   Keyword queries can have different
    semantics based on the intent of user

   Decision Support
       Decision support systems are a class of computer-based
        information systems including knowledge based systems
        that support decision making activities
   Data Warehouse
       Main repository of an organization's historical data for
        analysis purposes
       It contains the raw material for management's decision
        support system
   OLAP (online analytical processing)
       Interactive process of creating, managing, analyzing and
        reporting of data
Facts and Dimensions

    Database is a set of facts (points) in a multidimensional space

   Fact:                                Dimension:
       Measures performance                 Specifies a fact
        of a business                    Example dimensions:
   Example facts:                           Product, customer data,
       Sales, budget, profit,                sales person, store
        inventory                        Example dimension
   Example Fact Table:                   table:
       Transactions (timekey,               Customer (ckey,
        storekey, ckey, units,                firstname, lastname,
        price)                                address, dateOfBirth,
                                              occupation, …)
   Order No                         ProductNO
   Order Date                       ProdName
                    Fact Table      ProdDescr
                    OrderNO         Category
  Customer No                       CategoryDescription
  Customer Name                     UnitPrice
  Address           ProdNo          Date
                    DateKey         DateKey

                    CityName        Date
                    Quantity         City
  SalespersonName   Total Price

   Dataspace DS: entire multi-dimensional
    dataspace in an OLAP database
   Subspace DS‟: subset of dataspace
   Star Join: set of joins when a fact table is
    joined to two or more dimension tables.
    Different interpretations of the keywords
    reflect different star join expressions
Solution Framework
   Phase Differentiate
       Generation of candidate subspaces using user keywords
   Phase Explore
       The system first calculates for the subspace the aggregated values for some
        predefined measures
       Then dynamically finds for each dimension the top-k interesting group-by
        attributes to partition the sub-dataspace

   Application specific
   Focus in the paper:
       Surprising aggregates
       Correlated aggregates
Differentiate Phase

   First generate candidate subspace
   Then, organize them effectively by ranking
   Then, ask user to select one of candidate
   Proceed to Explore Phase
Ambiguity of Keyword Queries

   In large and complex OLAP dataspaces, a
    keyword almost always matches different
    attribute domains in different dimensions
       Creates large number of possible query
   Example: Consider query “Columbus LCD”
       Columbus – holiday or city?
       LCD – projectors or TV or monitor?
Ambiguity of Keyword Queries: WHY?

   Correctness: Different interpretation of keywords
    may result in completely different subspaces. Thus,
    correct interpretation may eliminate error
    propagation to subsequent phases in the system
   Performance: Computation of all possible
    interpretations may be expensive

    Users know exactly the semantic meaning of their
    keywords. Put them in query processing loop
Candidate Interpretation Generation

   Problem Stmt: Given a keyword query q={k1, k2,}, generate candidate interpretations CI
    ={C1,C2, ...,Cm} of q.
   For each keyword ki, the CI generator first probes
    the full-text index to obtain the Hit Set Hi. Hi = {hi1,
    hi2, .. , him}
   Each hit hij represents a triplet of relation name,
    attribute name and the attribute instance value
    {hij.R, hij.Attr, hij.Val}.
   Within a hit set, hits can be organized in Hit Groups
    if they have same relation name and attribute name.
Hit Group : Example

   Consider a query “Columbus LCD”
   The hit set for keyword “Columbus” has 3
    hits: Loc/City/Columbus,
    Holiday/Event/(“Columbus Day”),
    Holiday/Event/(“Columbus Week”).
   Thus, hit group for the keyword “Columbus” is
    {Holiday/Event/(“Columbus Day” OR
    “Columbus Week”)}.
Star Seed and Star Net

   Star Seed (SS): For query q, SS is defined as a set
    of n hit groups, each of which is drawn from a
    different hit set
       E.g. For query “Columbus LCD”, one candidate SS could
        be {{Holiday/Event/(“Columbus Day” OR “Columbus
        Week”)}, PGROUP/Group Name/”LCD Projectors”}
   Star Net (SN): For SS, SN is defined as a join that
    connects all the hit groups from the SS
   Note: A single SS could correspond to multiple SN
Candidate Star Net Generation
Ranking the Candidate Star Nets
   Number of all interpretations may be large. Thus, ranking is necessary
   SCORE(SN,q) =

   Sim :- string matching similarity function between the query q and attribute
    value of hit hij
   |HGk| is the number of hits in the hit group
   ∑ over hits in a Hit Group divided by |HGk| is the average hit similarity value
   Avg hit similarity value is further normalized to penalize hit groups with
    many matched attribute instances.
       “California” – state or a large of distinct address on “California Street”
   Finally, summation of hit group scores are divided by |SN|2 to prioritize
    smaller Star Nets.
       Score(Star Nets with “San Jose{city}”) > Score(Star Nets with “San Antonio{city}”
        and “Jose{Customer First Name}”)
Handling of Phrase Queries

   Consider Query: “San Jose”
   Output hit set: “San Jose”, “San Antonio” etc
   Score function does not take into consideration the
    fact that “San Jose” perfectly matches two
    keywords, thus should be ranked higher.
   Solution: Merge the two hit groups from two hit sets,
       Both groups are from same attribute domain
       The intersection between two groups is not empty
   Can be generalized to phrases containing more than
    2 keywords
Explore Phase

   Till now, a unique sub-dataspace DS‟ has been
    identified by the user
   And the system computes the group-by aggregates
    over the measure from all qualified fact points in DS‟
   Rank the group-by attributes
       Categorical Attributes
       Numerical Attributes
   Organizing Attribute Instances
       Categorical Attributes
       Numerical Attributes
Automatic Facet Construction For Sub-
   After a unique sub-dataspace DS‟ has been
    identified by the user, the system dynamically
    constructs a multi-faceted search (MFS) interface for
    the user to explore detailed level aggregation in DS‟.
   In real world databases, the number of dimensions
    may be large. And each dimension may have many
       So, need to rank group-by attributes dynamically based on
        interestingness of the resulting partitions.
Ranking Group-by Attributes

   Rank the group-by attributes based on
    interestingness of their resulting partitions
   Roll-up Partitioning (RUP)
Roll-up Partitioning

   By looking at the sub-dataspace alone, it is
    impossible to define interestingness of a
    certain partition in a robust way
   Dimensions are hierarchical, lets use it !!
   Roll-up along some dimensions, compare the
    two partitions.
   The more similar the two partitions are, the
    less the candidate group-by attribute is
    considered as surprising

   Determine whether the attribute zipcode of
    dimension store is an interesting group-by
    attribute for the subspace associated with
    Product Television
   Roll-up to bigger space along the Product
    dimension to Home Entertainment
       If distribution deviates -> surprising
       If correlated -> bellwethers
Roll-up Partitioning

     SCORE(attrij, DS‟) = - E((X-µx)(Y-µy))/(σx σy)

X = aggregation values on partition PAR(DS‟, attrij)
Y = aggregation values on partition PAR(RUP(DS‟),
E = expected value
µ = mean
σ = standard deviation
Ranking both Categorical and Numerical
   Categorical: Easy. Previous score function can be
    directly applied.
   Numerical: Correlation depends on how the
    numerical domain is bucketized
       First split the domains of the candidate attributes into
        “sufficiently” many buckets or basic intervals
       Tuples in same bucket are aggregated together to produce
        new attribute values
   Intuition behind splitting is that the correlation value
    of two distributions can be preserved as the bucket
    number becomes large and the interval range
    becomes small
Organizing Attribute Instances

   Till Now, we have ranked attributes
   How to organize the values within each
    attribute domain?
   Categorical Attribute:
   SCORE(attrij.catp, DS‟) =

   G is the aggregate function
Organizing Numerical Attribute Instances

   Given „m‟ basic intervals, merge adjacent intervals
    into „k‟ numeric categories
   3 objectives:
       Number of resulting intervals should not be large (suitable
        for navigation)
       Number of merged intervals should not be skewed.
        Number of intervals in largest range should not exceed L
        times the number of intervals in smallest range
       The merged partition should preserve the original
        interestingness value, i.e. correlation value from basic
Organizing Numerical Attribute Instances


                            • What is neighbor?
                            • How to generate
                            • SCORE function is
                            not defined

   AdventureWorks data warehouse
   Divided in two separate databases:
       AW_ONLINE – 5 dimensions, 3 hierarchical, 10
       AW_RESELLER – 7 dimensions, 4 hierarchical,
        13 tables
Qualitative Sample Results

   Keyword Query: “California Mountain Bike”
   Phase1: System returns a list of star nets
   Analyst selects the first Star Net
Evaluation of Subspace Ranking
   Manually written 50
    keywords queries
   X-axis: Rank of the results
   Y-axis: %age of the queries
   4 ranking methods
   Relevance is checked
   Note that group size
    normalization is not
Effects of Bucket Number in Group-By
Attribute Ranking
Test the assumption that with a “sufficiently” large number of basic intervals,
the actual correlation value can be captured

                                                AW_Online database
                                                Numerical Attributes:
                                                    Yearly-Income from
                                                     Customer table
                                                    Dealer-Price from Product
                                                Roll-up operations:
                                                    StateProvinces to
                                                    Subcategory to Product
Effects of Bucket Number in Group-By
Attribute Ranking
                       AW_Reseller Database
                       Numerical Attributes:
                           AnnualSales,
                            AnnualRevenue from
                            Reseller table
                           NumberOfEmployees
                       Roll-up operations:
                           Product Subcategories to
                       Error percentage is
                        computed by the deviation
                        from the ideal case (where
                        each distinct value has its
                        own bucket)
Study of Numerical Partitioning Methods

   Integrate keyword search with the efficient
    aggregation power of OLAP
   Provides an efficient and easy-to-use solution
    for business analyst
   Ambiguity problem has not been addressed
    by previous research
   Current research on keyword search over
    RDBMS uses indexes on a tuple level instead
    of an attribute level
   Poorly written paper
     Typo mistakes.
           In section 6.3, para 1, it should be “Section 4.3” instead of “Section
           In section 6.5 para 1, it should be “Figure 7(a)” instead of “Figure
       In eqn1, it is not clear what is |HGk|, |SN|
     In eqn2, it is mentioned G is an aggregate function, but didn‟t
        specify it
     In Algorithm 2, the “SCORE” function used is not defined

     In Algorithm 2, the notion of “neighbor” is not defined

   First, they say score function (section 4.3) does not take into
    account that “San Jose” matches two keywords and therefore
    should be assigned much higher score than “San Antonio”; then
    in the next section, they claim that normalization factor |SN|2
    takes this problem into consideration.

Shared By: