The LC Model for OLAP
1. INTRODUCTION ------------------------------------------------------------------------------ 4 2. DISARRAY IN THE OLAP SPACE ------------------------------------------------------ 6
2.1 2.2 Open issues --------------------------------------------------------------------------------------------------- 6 Critique of implicitly existing OLAP models ---------------------------------------------------------- 7
3. ATTRIBUTES OF AN IDEAL MODEL ------------------------------------------------ 12
3.1 3.2 3.3 3.4 3.5 3.6 3.7 Groundedness -----------------------------------------------------------------------------------------------12 Coverage -----------------------------------------------------------------------------------------------------13 Syntax Completeness, Consistency and Efficiency --------------------------------------------------13 Multi-schema support -------------------------------------------------------------------------------------14 Integration of all DSS functions -------------------------------------------------------------------------14 Analytical awareness --------------------------------------------------------------------------------------14 Summary -----------------------------------------------------------------------------------------------------15
4. DATA MODELS AND FOUNDATIONS----------------------------------------------- 16
4.1 4.2 4.3 Logic and the Relational model--------------------------------------------------------------------------16 Basic OLAP Features and the Relational model -----------------------------------------------------17 OLAP Features and Canonical logic -----------------------------------------------------------------18
5. OVERVIEW OF A NEW APPROACH ------------------------------------------------- 21
5.1 5.2 LC Logic: A Functional Approach --------------------------------------------------------------------21 Summary of the LC model for OLAP ------------------------------------------------------------------21
6. DEFINING TYPE STRUCTURES ------------------------------------------------------ 24
6.1 6.2 General Definitions ----------------------------------------------------------------------------------------24 Syntax --------------------------------------------------------------------------------------------------------33
7. TYPE STRUCTURING OPERATIONS---------------------------------------------- 36
7.1 Cross or Cartesian Product: -----------------------------------------------------------------------------37 Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 1/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
7.2 7.3 7.4 7.5 7.6 1-to-1 correspondence -------------------------------------------------------------------------------------38 Correspondence --------------------------------------------------------------------------------------------39 Insertion and Deletion -------------------------------------------------------------------------------------40 Traditional Set Operations -------------------------------------------------------------------------------41 Type Definition Syntax ------------------------------------------------------------------------------------41
8. SCHEMATA BETWEEN TYPES AND DATA --------------------------------------- 44
8.1 8.2 8.3 8.4 8.5 8.6 Schemata in general ---------------------------------------------------------------------------------------44 Minimal schemata ------------------------------------------------------------------------------------------45 Rich schemata -----------------------------------------------------------------------------------------------46 A Single Schema with Multiple Data Sets -------------------------------------------------------------48 Sparsity implications of minimal versus rich schema -----------------------------------------------48 Schema Connector Syntax --------------------------------------------------------------------------------49
9. QUERIES -------------------------------------------------------------------------------------- 50
9.1 9.2 9.3 9.4 9.5 9.6 9.7 Primitive Tokens -------------------------------------------------------------------------------------------50 The Basic Form of a Query -------------------------------------------------------------------------------50 Basic Queries ------------------------------------------------------------------------------------------------51 Queries with Modifiers ------------------------------------------------------------------------------------52 Nested queries -----------------------------------------------------------------------------------------------53 Iterative Queries --------------------------------------------------------------------------------------------55 MD queries --------------------------------------------------------------------------------------------------56
10.
10.1 10.2 10.3
REPRESENTATIONS ------------------------------------------------------------------- 58
The LC structure of a table representation -----------------------------------------------------------58 The LC structure of a line graph representation ----------------------------------------------------59 The LC structure of a hypercube representation ----------------------------------------------------59
11.
11.1
CALCULATIONS ------------------------------------------------------------------------- 60
Basic examples ----------------------------------------------------------------------------------------------60 Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 2/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
11.2 11.3 11.4 11.5 11.6 Examples with application ranges ----------------------------------------------------------------------60 Aggregation functions that apply to hierarchical dimensions -------------------------------------61 Examples that specify aggregations on other dimensions ------------------------------------------62 Moving functions -------------------------------------------------------------------------------------------63 Generalized tokens and Inheritance --------------------------------------------------------------------63
12.
12.1 12.2 12.3
MULTI-SCHEMA SITUATIONS ------------------------------------------------------ 65
A Single Data Set with Multiple Rich Schemata -----------------------------------------------------65 Example of Multischema integration -------------------------------------------------------------------65 Relating Hypercube and Series cube schemata ------------------------------------------------------69
13. 14.
ANALYTICAL SESSION --------------------------------------------------------------- 73 CONCLUSION ---------------------------------------------------------------------------- 77
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 3/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
1. Introduction
Over the past twenty five years, and under a variety of different names, OLAP tools have brought many benefits to the decision support marketplace. Relative to spreadsheets and relational databases, they provide fast, flexible specification and computation of multilevel multivariate multidimensional calculations. For analysts and managers, OLAP tools offer dynamic reporting capabilities with the ability to drill down into detail data, select arbitrary data slices and reorganize views with a click of the mouse. OLAP tools share some of the characteristics of both spreadsheets and databases. To the degree that they offer a cell- and dimension-based logic, focus on calculations and work out of memory, they are spreadsheet-like. To the degree that they store and manage large amounts of disk-based data in a physically optimized multi-user environment, they are database-like. Although spreadsheets work fine for small amounts of calculations, they bog down as the number of derived cells and the number of inputs to each derived cell become large. Major reasons for this include the absence of any spreadsheet data model beyond that of a simple cell and the absence of associated design methods. A data model1 is a body of structuring and operational definitions that apply to all the data that could ever be described or manipulated by software adhering to the model. Data models
1
To contextualize the concept of data model, consider the spectrum of conceptual levels relative to which data models exist. Figure 1 is an illustration of that spectrum. There is a products level; this refers to all the software products available in the marketplace and capable of being used. There is a common product language level; this refers to any industry-standard API specification, (such as ODBC in the relational world and, for the OLAP world, the soon to be OLAP API and Tensor). There is a data model level; this refers to things like the Relational model. And, finally, there is a logico-mathematical level; this refers to things like matrix algebra and predicate calculus. So you can see that a data model is sandwiched between individual products and standard APIs on one end, and math and logic on the other. Figure 1 The relationship between products, data models and foundations.
For example, SQL database products, such as the ones produced by Oracle, Sybase, and Informix, can be driven through ODBC - an industry standard API specification - and they are all grounded in a common logical data model: the “Relational model” of data. The Relational model was first introduced by E.F. Codd in his seminal 1970 paper “A Relational Model of Data for Large Shared Databanks” 1. It introduced a set of primitive data structures including domains, relations, tuples and attributes. It also introduced a set of primitive operations including joins, products, restrictions, and projections. These primitive structures and operations are grounded on the primitive structures and operations of canonical logic (and theory of relations) including functions, arguments, and connectives. While the Relational model was developed and debated before any products were created that claimed to be relational, multidimensional products were developed in the absence of any standard data model. (There was, of course, geometry which had become multidimensional in the early 1800s, and the matrix algebra which appeared later. These were almost certainly a part of the awareness of individuals who created OLAP products.)
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 4/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
benefit the community of users, product developers and researchers by providing a common language of expression independent of any particular tool, and in terms of which any tool may be easily understood. In addition, for researchers especially, the language becomes a thing in its own right to be thought about independently of any particular product or problem. A data model is like a map that can be reasoned with independently of the terrain it depicts. This is the essence of abstract data model-based research. Most commercial database products follow the relational model to some degree. And although it may be the case today that no product is „perfectly‟ relational, the fact that they are more or less relational gives database programmers a common language relative to which they can understand any particular database. Additionally, since the relational model is grounded in logic, it offers a formal, logic-grounded method of database design. There is currently no abstract or logic-grounded model for OLAP. As a result, OLAP applications run the risk of falling into even greater disarray than spreadsheets. Try to imagine N-dimensional spaghetti-like application code! The purpose of this document is to introduce a logically and mathematically grounded model for OLAP. Toward that end it will 1. 2. 3. 4. 5. 6. 7. 8. Identify OLAP issues where the lack of a data model causes problems, Critique the product-based OLAP models that implicitly exist, Describe the ideal features of an OLAP model, Describe why neither the relational model nor canonical logic provides an adequate foundation for OLAP functionality, Give an overview of the LC model for OLAP, Describe in detail the structural aspects of the LC model, Describe in detail the query and calculation aspects of the LC model, and Outline the benefits of adopting the LC model.
Since the focus of this document is the presentation of a software data model for OLAP (and not a theory of logic or mathematics), while some of the key hurdles to articulating such a model stem from problems with canonical logic and mathematics, two endnote-like essays are included on the need for, description, and justification of an alternative, or functional, approach to logic and mathematics.
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 5/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
2. Disarray in the OLAP Space
2.1 Open issues There are a variety of open issues associated with OLAP tools and the lack of a common OLAP model across products. Lack of common terminology
1. Is all structural metadata definable in terms of generic dimensions, or is there a justifiable basis for distinguishing between measures/variables and dimensions? There is not an obvious answer to this question If you look to matrix algebra for guidance, you might think that all structural metadata can be defined in terms of generic dimensions. On the other hand, if you look to the relational model or the predicate calculus, you might be inclined to support the dimension/variable distinction. Some products such as Analysis Server, Express, and Metacube distinguish variables (possibly called measures, facts or metrics) as separate from dimensions. While other products, such as Essbase, Holos, and TM1, purport to treat all dimensions as generically equivalent. These latter products treat variables as members of a variables dimension.
2. Are hierarchies (single or multiple), an intrinsic or necessary part of a dimension, or are hierarchies created by relating members from different dimensions? Although most products support the existence of one or more hierarchies within a dimension, a number of products such as Express from Oracle and Cross Target from Dimensional Insight do not have a built-in notion of intradimensional hierarchy. If dimensions should support hierarchies, what constrains such hierarchies? Could any collection of linked nodes in an N-1 relation constitute a hierarchy or are there certain assumptions that need to be true about the relationship between the nodes? 3. What do empty cells (or sparse regions of cells), mean for an OLAP model? Does an empty cell mean that data is missing but potentially forthcoming as in a late sales report? Or does an empty cell mean that data could never apply to the cell as would be the case in a two dimensional model where one dimension was defined by employee names and the other dimension was defined by employee attributes and you were looking at the cell defined by the intersection of an unmarried employee and the attribute „spouse‟s name‟. Or does an empty cell simply mean that all zeros were suppressed as is typically the case with retail transaction tables where only those products which were sold are reported? Unfortunately, in the OLAP world, the term „sparse‟ has been indiscriminately used to refer to cells that are missing, meaningless, and zero. Given the degree to which we depend on OLAP tools to provide accurate calculations and given that confusing missing, meaningless and zero-valued
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 6/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
data can produce calculation errors2, an OLAP model should clearly distinguish between the various types of sparse data. 4. Most tools provide for some dimension-based calculations as well as some additional rules or calculation scripts. Is this distinction logically-grounded, in which case what is the logical basis for making this distinction? Or is this a matter of convenience? And if so, is this the end user‟s convenience or the product developer‟s convenience? 5. To what degree should OLAP tools support user-defined data types? This covers for example media issues such as text data, and special purpose variables with fixed numbers of potential values which are useful in segmentation analysis and simulation modeling. 6. How can an OLAP model support multiple schemata connected to a single data set? For example, relative to a single sales data set, one analyst may want to look at changes in sales as a function of changes in time or store. While another analyst may want to see a scatter plot of sales and costs for a particular time. Yet a third analyst may want to look at the actual times or perhaps the count of times that occurred for each unique sales amount. Currently, each of these analytical views would be considered a separate schema requiring a separate model. Could there be a schema-neutral structure? As we enter an age of increasingly collaborative computing, it is desirable for multiple analysts, each of whom may have their own analytical schema, to be able to work from and share information within a common data set. 2.2 Critique of implicitly existing OLAP models 2.2.1 any approach that purports to treat all dimensions generically: hypercubes, multicubes The biggest problem with multicubes (such as Holos and TM1), and hypercubes (such as Essbase), is this thing called the data dimension. The data dimension is an implicit dimension created by treating variables as members of a variable dimension. The data dimension (with a declared range of possible values), is generally unaccounted for in a generic dimensional model3. Such models only account for organizing dimensions. Thus, the data being modeled (i.e., the values of the variables) are not accounted for by any domain or metadata construct. This is shown below. Example 1 represents a typical multicube/hypercube expression and shows how the data is not accounted for by any dimension values. 1. data = dim1val1, dim2val2, dim3val3
2
Thomsen OLAP Solutions, pages 179-185 Thomsen Ibid pages 431-2. Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3
Page 7/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
Example 1 essentially says that each unique intersection of a value from every dimension identifies or is associated with an element of data. For instance, the following expression would fit this framework: 500 = Measures:Unit_sales , Time:January , Stores:Cambridge The relationship between the data and metadata that this model implies is illustrated in figure 1.
Time
Geog
Prod. Measures
?
Time Geog Prod Jan Jan Jan
Measures Values 350 375 500
Camb. Shoes Sales Camb. Socks Sales Camb. Hats Costs
Example 2 represents models that explicitly account for data as values of some system dimension. 2. dimyvaly = dimx1val x1, dim x2val x2, dim x3val x3 For instance, the following expression would fit this framework: 500:Units_sold = Time:January , Stores:Cambridge In contrast with example 1, example 2 says that each unique intersection of a value from every dimension identifies or is associated with some value of another dimension (dimyvaly).
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 8/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
The relationship between data and metadata that this model implies is illustrated in figure 2.
Time
Geog
Prod.
Sales Costs
Time Geog Prod Jan Jan Jan
Sales Costs 350 375 500
Camb. Shoes 500 Camb. Socks 450 Camb. Hats 725
Among other things, the first approach makes it impossible to switch schemata and use the so-called data values to be a dimension as when using the unique values of sales or costs to form a dimension. In defense of the generic approach, one might argue that the data dimension is really implicit. Aside from being disingenuous in so much as one is thereby recognizing dimensions and variables; (simply, the data dimension is the variable), and although such a strategy may work when all the variables have the same data type (which, in a relational sense, is to say when all the variables are defined on the same domain), when there are different types (such as string and integer), the data dimension becomes awkward. Although supposedly a single type, it really represents multiple types or domains; one for each data type. This is because the potential values of the data dimension are a function of the specific values of other dimensions. For example, the potential values for sales are different than the potential values for color or address. Since the value range for the data dimension is bound to specific collections of values (qua variables), from other dimensions, it can‟t be reused with different data sets independently of the variables dimensions with which it is associated. Another problem with generic dimensions is their inability to identify assertions in a dataset. Assertions are what are capable of being true or false. For example in a third normal form
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 9/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
relation, every non-key attribute field is an assertion about the union of the key fields. If the keys were store and time and the attributes were sales and costs, each sales field and each cost field would be an assertion about a particular store-time combination. In contrast, the store and time fields would not be assertions. They define the „that‟ (i.e., the logical subject), about which the attribute fields (i.e., the logical predicates) constitute assertions. Finally, by leaving out the dimension/variable distinction, an important means of describing data sets is unnecessarily lost. For example, there is a huge qualitative difference between datasets that have a large number of data points and a small number of variables, such as a census study, and datasets that have a large number of inter related variables but only a small number of data points per variable, such as a typical econometric model. 2.2.2 Approaches that stuff everything into a single cube: hypercubes Strictly speaking, a hypercube makes no distinction between different types of data. (In practice, hypercube products may allow the members of a measures dimension to have a data type attribute. But then, they are not pure hypercubes.) From a hypercube perspective, two variables that share no dimensions in common and which have no way to be compared could still fit as adjacent members of a variables dimension. Simply, the dimensions within the hypercube that are relevant to each variable are different. A consequence of treating the world as a single (hyper), cube is that the greater the variety of data types, the greater the amount of sparsity and the harder the cube is to understand. Hypercubes would seem to violate what is intuitively the most natural criteria for defining a cube: intercomparability or interactivity, meaning that a logical cube defines a contiguous analytical space where the values of any variables (or data associated with the members of a variables dimension), may be meaningfully compared. While I doubt that anyone would advocate putting all the data for an enterprise in a single hypercube (thus implicitly recognizing that hypercubes are not a general model), hypercubes have no natural chunking or segmentation principle. 2.2.3 Approaches that referentially distinguish dimensions and variables: series cubes Although series multicubes (such as Express and Analysis Server), can chunk data into zones of common dimensionality, the problem with series multicubes is that the dimensions and variables are referentially defined. This means that at the metadata level, there are two kinds of primitive structures: dimensions and variables. And, a set of data elements (i.e., a column of sales figures or a column of store names), connects either to a dimension or to a variable. (Hypercubes are also referential as they treat data elements as either dimension members or values.) Thus, there is no generalized notion of structure or type (or domain) covering both dimensions and variables such that sales, for example, would be a type that could be used
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 10/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
sometimes as a variable and sometimes as a dimension. This referential or static notion of metadata reduces the flexibility of any schema.
2.2.4 In summary OLAP should be the backbone of DSS because it is capable of providing an efficient, aggregation-savvy information infrastructure within which all types of analysis may flourish. Yet, without a consensus as to what constitutes an OLAP model, the technology will remain a fragmented collection of niche specific features incapable of fulfilling any larger DSS role. OLAP products will remain just another category of value-added tool. Although OLAP is capable of simplifying DSS application code relative to what that code looks like in a spreadsheet (or relational) environment, without an underlying model, it runs the risk of degenerating into the same morass of garbled application code as did spreadsheets where no two applications of the same type look alike because ways of defining problems and solving them are arbitrary. Furthermore, students and researchers have no way to express multidimensional concepts outside the vernacular of particular products. Consequently, there needs to exist a solidly (i.e., logically and mathematically), grounded model for OLAP. But what exactly should such a model do? What should it look like? In the next section we describe the attributes that an OLAP model needs to have. Following that, we step back and explore the relationship between models and foundations, which is to say the abstract sciences upon which software models are built, in order to ensure that any proposed OLAP model is indeed well grounded.
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 11/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
3. Attributes of an Ideal Model
What follows are a set of criteria for an OLAP model capable of supporting the full range of current products and of pointing to useful future product development not currently embodied in any product. Note, this is not a feature list. A comprehensive feature list can be found in chapter 17 of my book OLAP Solutions. 3.1 Groundedness Groundedness is the degree to which a model is based on (or grounded in) logic and mathematics. It is one of the most important attributes of the relational model. A model that is inconsistent with either logical or mathematical principles is of limited value for decision support. 3.1.1 Logic-based An OLAP model needs to provide for logic-based reasoning. In addition to supporting the usual array of logical connectives, this means that a model needs to be able to identify valid assertions (i.e., have some notion of a well formed formula), and correctly process missing and meaningless assertions in addition to valid assertions. Yet the different implicit models for OLAP as described above imply different definitions of a well formed formula or valid assertion. For example, generic dimensional approaches imply that any combination of dimension members defines a valid assertion. On the other hand, first order predicate logic requires at least one subject (argument) and one predicate (function). So, it would seem that the ubiquitous data dimension needs to play the role of predicate. But this complicates the model. Approaches that stuff everything into a single cube would also seem to have logical difficulties. At the very least, they would need to distinguish valid from invalid intersections. Approaches that distinguish variables and dimensions would seem to have a cleaner interpretation in first order logic namely that the union of the dimension elements at an intersection define the subject and each variable defines a predicate. Though they are not absolved from issues of sparsity. 3.1.2 Mathematical extensibility An OLAP model should provide for a wide amount of mathematics. Explanatory and predictive modeling may rely on sophisticated mathematics: neural nets, sets, algebra, calculus, groups, chaos. They may also rely on the ability to define data types with specified cardinality. For example, in a simulation model one might need to define a large number of state variables where each state variable has a (possibly different) number of potential states. If the model is not open to different mathematical techniques, then data would need to be exported to a different
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 12/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
mathematical environment for analysis. And the results of that analysis would not be fed back into the OLAP environment. Although the calculation language provided by an OLAP model should be rich enough to support a wide variety of mathematical constructs, at the very least it would seem necessary to provide logical places to store the results of value-added mathematical calculations.
3.2 Coverage All the data needs to be described by the model. (Using a relational analogy, all the data needs to fall within some domain.) If the model does not describe all the data, then some of the data is extrasystemic, or is unaccounted for by any domain. As extrasystemic, there would be no way to use that data as a dimension or create queries conditional upon the values of that data. For example, if time, store product and variable were the dimensions, then „value‟ would be extrasystemic. If „value‟ is extrasystemic, then there would be no way to use values, say cost values, as the basis for a dimension. Nor would there be any way to condition a formula based on a cost value. Most systems that officially treat „values‟ as extrasystemic, implicitly treat values as a variable capable of being ranked. This is the only way to formulate queries of the form ”find me the top ten selling products or find me the stores in the top dectile of shoe sales” The implicit value variable only works so long as all the variables qua dimension-members share the same data type. When the different variable-dimension members have different data types, (whether its Boolean and integer or string and float), then the values in the value variable are conditioned by what the member is of the variables dimension. For example, if the variable is „has spouse‟, the potential values are binary. If the variable is „sales‟, the potential values may be integer.
3.3 Syntax Completeness, Consistency and Efficiency An OLAP model needs to provide a DDL, DML and a query language capable of specifying multivariate, multilevel, multidimensional relationships. This covers all possible formula references including hierarchical and positional offsets, conditional overrides, measured and derived values, even sortings. If an OLAP model can‟t express these relationships, then its language is semantically incomplete. Efficiency has two meanings: simplicity of expression and irreducibility of operators. OLAP models need to efficiently express complex relationships. For example, if a model has no notion of hierarchy, then it may inefficiently require the referencing of declared dimensional relationships for aggregation expressions. And, if a model has no notion of multiple hierarchies within a single dimension, then it may inefficiently require multiple dimension declarations to model multiple hierarchies. And, the primitive OLAP structures and operators need to be irreducible.
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 13/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
OLAP models also need to be consistent. This means that it should never be possible to deduce an incorrect output given a set of correct inputs. And (if you ever want to create a working parser), the meaning of syntactic tokens should be consistent with a minimum amount of context. 3.4 Multi-schema support An OLAP model needs to provide for multiple schemata operating simultaneously on the same data with the same metadata. If the same model cannot account for multiple schemata then each schema needs to be represented as a separate model. This is inefficient since it means there will be duplicated metadata. And it means that it would require an additional metadata layer (metametadata?) in order to translate between different schemata changes made within each schema. For example, if one analyst were looking at changes in sales across time and another analyst were looking at changes in the count of times per unique sales, and the first analyst made some projections which added new sales values, these, in a collaborative environment should be passed through to the second analyst. But if the model does not support multiple schemata then it would require a separate translation to update the second metadata schema given changes to the first.
3.5 Integration of all DSS functions OLAP is not some specialized function like vector imaging. OLAP is the architectural backbone of DSS. Thus, an OLAP model needs to support, (though not necessarily incorporate), all DSS functions. This includes structural operations, (operations that function on dimensional structures such as concatenate), statistical operations, non-parametric „data mining‟ operations, mapping data types and operations, text operations (including text searches and summarizations), and visualization. 3.6 Analytical awareness An OLAP model should offer subtle analytical help. This is because such a model would, to some degree, formalize the analytical process. For example, unless one is testing the degree to which stores are independent from time, one does not want to analyze the correlation between stores and time. In a model that distinguishes dimensions and variables, (functionally), one would not analyze (as a part of a routine analysis), the correlation between dimensions. In contrast all analyses of variables by their dimensions may be useful. In a model that does not distinguish dimensions and variables, any analysis is as equally useful as any other. For example, in a hypercube model, it makes as much sense to ask for the variance in sales across stores and time as to ask for the variance in March across measures and stores. By formalizing the analytical process, an OLAP model would open the way for more intelligent automation.
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 14/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
3.7 Summary Having seen the need for and lack of an OLAP model as well as the attributes of an ideal OLAP model, and having seen that one of the critical attributes of an ideal model is groundedness, let‟s turn to the relationship between data models and foundations to see whether we can define an OLAP model on top of existing foundations.
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 15/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
4. Data Models and Foundations
In this section, we focus on the two nearest candidates for grounding an OLAP model: the relational model, and canonical logic, and show why neither provides an adequate grounding for OLAP. 4.1 Logic and the Relational model Logic, of any provenance, is concerned with information elements: the smallest bearers of meaning. In the same way that all molecules and compounds are composed of physical elements, all data sets and publicly exchanged thought are composed of information elements. Whereas logic offers a theory of how any one information element can be true or false, and how arbitrary groups of information elements may connect and be operated on in a truth-state consistent manner, the Relational model focuses on a particular group of information elements called relations, and how relations may be operated on in such a way that the output of any relational operation performed on a relation is always another relation. This is crucial to the Relational model, because the benefits of the model derive from the benefits of representing data in terms of relations. Chief among the benefits is the preservation of truth-states or consistency across update operations. In other words, no edit operation to a relational table should ever unintentionally add information, nor unintentionally remove any information, nor unintentionally change the value of any existent information. Because relations are groups of information elements, the laws of logic must apply to relations. Canonical logic4 is the name for that body of primitives which is accepted by a majority of practicing logicians and analytical philosophers. Most current logical procedures (with the exception of those going back to Aristotle) can be traced to the symbolic innovations of Gottlob Frege‟s Begriffsschrift5 and the Principia Mathematica6 by Bertrand Russell and Alfred North
4
More precisely, by „Canonical‟ I mean „standard‟, non-deviant, predicate logic (Principia-like) and axiomatic set theory (Zermelo-Fraenkel-like) with intended reference to W. V. Quine‟s „canonical notation‟ for refining scientific discourse. In the context of this (rather skeletal) overview, I refer to predicate logic and set theory interchangeably (such that Fx can be expressed as x F and vice versa). Also, I use „set‟ and „class‟ interchangeably. („Set‟ is usually given as a restricted version of class; but the reasons for these restrictions are precisely the ones addressed in L-C logic as found in Thomsen, Shavel [1990]).
5
Begriffsschrift, a formula language, modeled upon that of arithmetic, for pure thought. Gottlob Frege, 1879 reprinted in A Source book in Mathematical Logic, 1879 - 1931 Edited by Jean van Heijenoort Copyright 1967 by the President and Fellows of Harvard College 4th printing Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 16/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
Whitehead. Both works are from around the turn of the century. The Relational model is based, in large part, on Canonical logic.
4.2 Basic OLAP Features and the Relational model Consider the following question. Can an OLAP model be derived from the Relational model? Clearly, if the answer is yes, there is no need to define an independent OLAP model. We simply need to show how multidimensional structures and operations follow from relational primitives. If the answer is no, why is it no. Let‟s focus for the moment on basic OLAP features to see whether the Relational model supports even them. The basic features of any MD model need to include: multi-level or hierarchical dimensions (however defined), formulas that attach to dimension members for intra- and inter- dimensional operations
These features would have to be a part of anything attempting to be a standard multidimensional data model. Consider it a litmus test. And, as I will describe below, most of these features are hard, if not impossible to provide given the features offered by the Relational model. Whereas OLAP products focus on specifying and calculating derived data, the Relational model focuses on structural operations (the basic operations of relational algebra, such as joins, Cartesian products, projections and restrictions, are all structural operations), and on base data. With Relational products, the mathematical functions used to define the values of one attribute of a relation in terms of the values of another attribute of the same relation attach to domains. Functions in Relational products most naturally apply in a relation- and column- specific sense. Oddly enough, the main differences between the Relational model and a multidimensional model have nothing to do with the multiplicity of dimensions. From the moment you have multiple columns in a relation, the relation is multidimensional. The key differences are that the multidimensional model focuses on hierarchical relationships (that is, its dimensions have a multi-level structure), and that it focuses on efficiently defining data derivations by attaching formulas to dimension members. Let‟s look at these differences more closely. Consider a multidimensional structure at outlined in figures 3.14 and 3.15 on pages 55-57 of the book OLAP Solutions. They represent a four- dimensional model consisting of stores, time, products, and variables. Note the hierarchical structure in each of the non-variables dimensions. Assume, for the moment, that all the hierarchies define simple aggregations. What does this
Harvard University Press Cambridge, MA
6
Principia Mathematica
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 17/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
mean? It means that any user of the information can calculate and view any combination of derived values simply by asking for or navigating to them. The aggregations are built in, so to speak, in the definition of the dimensions. In the relational world, these aggregations would have to be defined on a view-by-view basis. The fact that sales sums across stores and time would have to be restated for each aggregate sales table that was created.
Nearly all multidimensional products offer a variety of methods for specifying functions including both dimension-based and cell-based. They are crucial for the efficient specification of large numbers of derived values, and are not directly supported by the relational model. It would take a shift from attribute-oriented formula specifications (where formulas are attached to columns in a table) to unique domain value-oriented formula specifications (where formulas are attached to rows in a table or to all rows containing a particular value) for the Relational model to provide the same efficiencies for specifying derivations. Such a shift would not be minor; and one could argue that the Relational model would cease to be relational if such changes were made. SQL, for example, is not equipped to define intra-dimensional comparisons because they are generally performed between rows within a column. It is far better suited to defining inter-attribute calculations (which generally involve inter-column comparisons). SQL is so weak on this point that in order to perform inter-row calculations in SQL, people frequently transform, in a roundabout fashion, the inter-row problem into an inter-column problem, and then solve the inter-column problem. Thus, basic features of OLAP products are not supported by SQL. Although the domain calculus, which is defined in the Relational model but poorly supported in relational products, would appear to offer support for intra-dimensional operations, the notion of „domain‟ does not appear sufficiently structured at this time to account for the richness of dimensional hierarchies.
4.3 OLAP Features and Canonical logic Canonical logic has three main problems trying to support basic OLAP features (in addition to other problems that it has). It offers no support for multilevel dimensions. It offers no support for multiple schemata into the same data. And, it lacks a theory of meaning capable of being used as a procedure for parsing missing and inapplicable data. This creates problems for the relational model (where debate is still ongoing as to how best to process invalid data) as much as for OLAP where sparsity is a significant issue. [In the discussion that follows, set theory will be treated synonymously with (or as extensionally equivalent to) logic.] 4.3.1 Multiple Levels
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 18/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
On the one hand, it could be argued that level or resolution is the principle ordering feature of set theory, to the degree that it characterizes the primitive relationship between element and set. On the other hand, multiple iterations of a many-one relationship do not necessarily generate a resolutional order. Thus, for example, set theory has no way to distinguish between a 1-6 hierarchical relationship, such as a state and its political subunits, and a 1-6 relationship that defined an object, such as a table, and six sibling-like neighbors such as chairs. It is precisely the primitive nature of the set-membership function that obstructs the construction of a resolutional axis for relating elements, sets and sets of sets. In addition, in ordinary set theory, the set-membership relation isn't transitive: to be a member of a set is not, by virtue of that relation, to be a member of a set of that set. Mereology is the only school of axiomatic set theory that even addresses these concerns7. What is clear is that the primitive distinction made in the OLAP world between ancestral relationships of parent and child versus sibling relationships is not made as a primitive distinction in Canonical logic. 4.3.2 Multiple schemata Canonical logic defines a referential or static relationship between metadata and data. If an item, such as sales, is defined as a function, then it is always a function. This prohibits the same underlying metadata, such as information about possible stores, times, or sales, from being used in multiple ways or schemata on a single data set. Although the referential approach has a long history, it is a major source of problems for canonical logic. For example, canonical logic has gotten itself into paradox when it tried to invert terms (a form of schema change), and have functions operate as arguments and vice versa. 4.3.3 Handling Sparsity: Theory of Meaning Canonical logic avoided trying to define what constitutes an information element or a unit of meaning. (And this avoidance, I believe, was an error.) Typically speaking, valid information elements are defined according to some syntactic conditions wherein only certain (or “well-formed”) formulae are considered valid. And all and only valid formulae possess a truth value, which, in the classical two-valued case, may be either true or false. The problem is that there are many seemingly valid syntactic expressions which do not appear to possess any meaning, or which appear to be both true and false. Logicians call the latter such expressions paradoxes. The liar paradox, which many readers may have seen in college, is one such
7
Specifically, Mereology treats of part/whole relations, modifying the standard theory in such a way as to capture some aspects of resolutional relationships. The basic modifications are simple: the "singleton" (unit class) is postulated as the primitive element; the "parts" of a class (or set) are defined to be all and only its subclasses. Classically, because transitivity was not an automatic feature of the set-element relation, part/whole relations (where transitivity is a given) were detached from and subordinated to the broader context of set-membership. David Lewis, in his Parts of Classes7, inverts this and treats part/whole relations as the more primitive. In his calculus, x is a member of a class y only when x is a member of a singleton that is a part of y. Moreover, we can adapt standard iterative set theory to formally characterize this relationship . Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 19/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
example. A modern version of it is “This sentence is false.” If the sentence is true then it is false; but if it is false, then it is true. The lack of an adequate theory of meaning is a problem for canonical logic that appears in all applications, not just OLAP. In the OLAP world, for example, a hypercube typically defines many empty cells. And it is important to know which cells are empty because they are meaningless, and which cells are empty because their data is missing. In the relational world, it is clear that canonical logic has nothing to say about handling invalid propositions. This is why the debate continues in the relational world between advocates of two-, three-, four- and many-valued logics. There is no Canonical way to decide the issues. So Canonical logic, with its inability to define levels, allow for multiple schemata to be defined on the same data, or adequately account for meaning, does not provide a comprehensive foundation for MD.
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 20/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
5. Overview of a New Approach
5.1 LC Logic: A Functional Approach Located-Content or (LC) logic is the name given to a functionally8-based logic that resulted from research conducted over the past fifteen years by a number of individuals, including myselfi. (The term “functional” refers to any of several approaches to language, math and logic based on the Tractatus Logico-Philosophicus, by the Austrian philosopher Ludwig Wittgenstein.) The LC model for OLAP is a software data model based on that logic. The term locator is roughly analogous to the term dimension for those products that distinguish dimensions and measures. The term „location‟ is roughly analogous to the concept of „the intersection of all non-measure dimensions‟. And the term content is roughly analogous to the terms variable, or measure. The major difference is that the two LC terms refer to functional or use distinctions for a homogeneous set of underlying types and not for any primitive difference between data structures. 5.2 Summary of the LC model for OLAP At a high level, the LC model for OLAP is composed of types and data that are functionally connected through one or more schema. What are called dimensions and variables in an Express or Pilot Analysis or Metacube server model would be treated as individual types in an LC model. Data would be data, and schema would (loosely), correspond to dimensioned variables. Thus, in a typical situation, time and stores would be types as well as sales and costs. Types define the limits of what can be defined, queried and/or calculated. Neighboring concepts include primitive object classes, potentials, root structures, basic categories, metadata, domains and axioms. Types can have any form of single or multiple hierarchy. They can be combined through such structural operators as Cartesian product, Insert, and 1-1 correspondence. Formulas may be associated with the instances of a type and/or between types. Internally, types behave like object classes. They are capable of inheritance, encapsulation, polymorphism, and messaging.
8
The term “functional” refers to any of several approaches to language, math and logic based on the Tractatus Logico-Philosophicus8, by the Austrian philosopher Ludwig Wittgenstein. Wittgenstein wrote the work in the early part of the century, shortly after his studies and collaboration with Bertrand Russell at Cambridge. The major goal of his book was to delineate what can be expressed by means of propositions, these being the proper subject matter for logical treatment. Along the way, it laid a groundwork for distinguishing propositions from non-propositions, by means of usage- or functionally-based criteria. Perhaps because of the difficulty of the work, few of his innovations were incorporated within the main body of Canonical logic, though he is remembered as having been the inventor of the truth tables. Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 21/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
There are no restrictions to the data in an LC model. Any media can be data: numeric, string, audio, and visual. Data may be input from outside a system (a particular LC model) and/or derived. There are no restrictions to the calculations that can be defined in an LC model. For example, data can be entered at a leaf level and flow upward, or it can enter at the top and propagate downward, or it can enter in the middle and expand outward within a level. A schema is defined by associating structured types with the data. By structured, is meant that certain types may be combined or structured as, for example, the way typical dimensions are combined in a Cartesian product as a part of their association with data. Thus, a data set containing columns for store, time, products, indirect sales, direct sales, indirect costs and direct costs could be associated through a schema with a set of similarly structured types where store, time and products are cartesianed together. The schema serves as the backdrop for parsing expressions (queries, definitions and calculations) input into the system. It is analogous to a cube or collection of partially overlapping cubes. The distinction between locator and content can be made on an expression-by-expression basis relative to a schema (set of types connected to a data set). This is the essence of what is meant by the functional approach. It is important to keep track of which terms are being used for which function because the choice of locators and contents affects the processing of blank cells. Blank cells in a locator render the row inapplicable or meaningless (assuming the input data exists in table form). Blank cells in a content have no affect on the rest of the row and are themselves treated as missing - (assuming that the content applies to that location). Since the distinction between location and content is entirely functional, there is no need for a user to declare types as locators or contents, the distinction is apparent from the form of the query, definition or calculation. In general, by parsing input tokens (and assuming here a textual language) into type names „Tx‟ and type instances „ix‟, (for example, if „Time‟ were a type name; then „1996‟ would be an instance of that type), the contents are those types whose instances are unspecified.
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 22/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
Thus, the query
select sales where store = Cambridge and year = 1996
parses into T1: sales T2: store, i2: Cambridge T3: year. i3: 1996. The result set is the collection of instances for the type sales. So. for this query, sales is the content, and stores and time are the locators. In contrast, for the query
select stores where sales > 300
stores are the content since we are only given the name, and sales are the location as we are given the name and instance.
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 23/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
6. Defining Type Structures
A data set without some means of interpretation is meaningless. At best it‟s a BOB (Binary OBject). Types are essential to the interpretation of data. They define all of the categories and values that a system can recognize in a data set for the purposes of definition, query and/or calculation. Types can be defined independently from data or they can be defined in terms of data (for example by performing a select distinct on a column). Instances can be explicitly defined, as for example with a typical products dimension. Or instances can be formulaically defined relative to each other, as for example, with the integers. What follows is a description of the general form of a type and an extension of that basic form to account for variably ordered and hierarchical types. 6.1 General Definitions There is no way to define necessary attributes or properties independently of some set of voluntarily chosen criteria. The following general definition of types is no exception. The criteria that we impose on our definition of types is that the types so defined are rich enough to support a classical truth functional logic. At the very least, (and we are looking for minimal criteria), this translates into the ability to define assertions and negations. Let‟s begin by narrowing down what we mean by the terms „type‟ and „instance‟. The relationship between the terms „type‟ and „instance‟ is roughly analogous to the relationship between the terms „potential‟ and „actual‟, „set‟ and „member‟. That is to say, the identification of a type places a boundary on what may be considered an instance of that type. For example, MA, NY, CT, and VT, are valid instances of the type „state‟, $1, $2, $3, and $4 are valid instances of the type „$‟,. B. Smith, L. Jones, and K. Hardy are valid instances of the type „person name‟. We have chosen to use the terms „type‟ and „instance‟ because they are typically used for similar purposes in computer science and because they are less burdened with ongoing meaning than the other terms. In addition, we introduce the term „metric‟ and while recognizing that it represents a generalized notion of its typical mathematical usage, it is nonetheless the term that comes closest to representing the quantitative relationships embedded in hierarchical type structures.
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 24/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
6.1.1.1 Multiplicity A type needs to have at least two instances.
If some type, Tn had only one instance, in there would be no possibility of creating propositions where Tn served as a predicate capable of assertion and negation. Booleans or binary types are the simplest possible types. 6.1.1.2 Uniqueness The instances of a type need to be unique and mutually exclusive. Given for example a type Tn with instances i1, i2, i3, i4. This means that all i‟s are different from each other (uniqueness) and that if any in is true within the context of a proposition, all other i‟s are false (and that no two i‟s can be true at the same time). To see why this is necessary let‟s examine the converse. Assume that two instances are the same Tn = i1(1), i2, i3, i1(2). Now consider the statement that y = i1(1) is true. We can‟t say that the negation of i1(1) is false because there is another instance of i1, namely i1(2) that would also be true. This violates the rules of logic. What about mutual exclusivity? An example of a collection of non-mutually exclusive instances would be „red, blue, green, small, large‟. Something could be red and large. So red and large, though unique, are not mutually exclusive. The notion of mutual exclusivity is tied to the notion that a type denotes a single measurement or evaluation method or strategy. Clearly we would need two different tools/procedures to measure color and size. For a single evaluation method we can say the output of evaluation strategy „x‟ is i1 or i2 or i3 or i4 or… 6.1.1.3 Metrics All Instances are associated with metrics. Whereas the type name denotes the conditions for being a valid or potential value, the instances (whether formulaically or explicitly defined) depict the potentially actual values which, if explicitly defined, are not necessarily exhaustive. For example, the instances „balls‟, „bats‟, and „mitts‟ would function as actuals (capable of expansion), relative to the potential „baseball equipment‟. The validity or inclusion conditions denoted by the type name acts, in a mathematical sense like a generalized notion of metric or unit9. So stores, times, products, political parties, countries,
9
While the term „metric‟ is usually associated with cardinal measures such as dollars, deutchmarks, ounces, seconds, days, or volumes, such usage reflects a conflation of the relationship between the instances of a metric and Version 0.93, last saved 1/23/2009 7:29:00 PM Page 25/80
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
and customers may be thought of as denoting metrics10, (in a generalized sense of the term). [Note, the reason for introducing the term „metric‟ at this point is that most OLAP dimensions are characterized by hierarchical metric relationships, as for example with the relationship between days, weeks and months. It would be overly confusing to try and use the term „type‟ for these purposes.] Since a type denotes a series of metric-specific instances and since, (as will be shown below) there may be more than one metric associated with the instances of a type, it is necessary to add a metric to each instance of a type.
Tn/m0 = i1m1, i2m2, i3m3, i4m4….
Note that the type name denotes that metric which is common to all instances. In simple cases where there are no hierarchies, as for example with the type „color‟ having instances „red‟, „blue‟, and „green‟, there is no need to specify the metric „color‟ for each instance as this is defined by the term „color‟ appearing as the name of the type. In more complex cases, however, where there are multiple metrics within a single type as, for example, with the type „Geography‟ containing stores, cities and states, the term „Geography‟ serves as a root metric relative to the metrics store, city and state.
6.1.2 Non-hierarchical Types
Non-hierarchical types are types whose instances all share the same metric. Instances with the same metric may be nominally, ordinally or cardinally ordered. This is most completely shown in a matrix of equivalency relationships. The rows and columns represent instances and the cells show the relationships between the instances taking the row labels as left hand sides (of the equivalency relationship) and the column labels as right hand sides.
the metric itself. In as much as one can say 1, 2, 3, days where „day‟ represents the metric and 1,2, and 3 represent the instances of the metric „days‟, one can say red, blue, green color where color represents the metric and red, blue and green represent the instances. One could consider such usage a generalization of the concept of metric. Even numbers require metrics. Consider the integers 1,2,3,4… Although one might think that integers can be compared without resort to a metric, there is an implicit metric with the integers. Take, for example a basic algebraic statement 2+3 = 5. There is an implicit metric. Namely that 2 of some type + 3 of that same type = 5 of that same type. 2 horses + 3 girls do not add into 5 of either.
10
The term „typically‟ is used because any term could be used as either an instance or a type.
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 26/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
6.1.2.1 Nominally ordered instances If instances are nominally ordered then all one can say is that two instances are equal „=‟or not equal „!=‟. i1 = != != != i2 != = != != i3 != != = != i4 != != != =
i1 i2 i3 i4
Most OLAP dimensions - stores, products, customers, channels are nominal to the user, (although to the system there may be an ordinal value such as load time order). Clearly, there is no concept of previous, next, first, or last in a nominal ordering.
6.1.2.2 Ordinally ordered instances If one can rank instances but not assign any specific distance to the difference between them, the instances are ordinally ordered.
i1 i2 i3 i4
i1 = > > >
i2 < = > >
i3 < < = >
i4 < < < =
Ordinalization may occur as a part of the type‟s definition or after the type has already been defined. In either case, an ordinally ordered type needs to support relative positions such as previous and next which can be generalized to be a „starting point‟, „offset‟ and „direction‟. Within the context of a multidimensional type structure, there may be many ways to order the instances of a type. Stores, for example, may be ordered by size, age, sales, profitability, and so forth. Thus one needs to be able to track multiple orderings per type. This is most easily done by providing user-defined and user-named orderings.
Stores.(size_order, this -1 )
could refer to the previous position in the stores dimension
ordered according to size. In general one needs to distinguish ordering principle, starting point, direction and offset.
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 27/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
Stores. (age_order, first +3)
would refer to the fourth position in the stores dimension
ordered by age.
6.1.2.3 Cardinally ordered instances If one can assign distances to the differences between the instances then the relationship between the instances is cardinal. i1 +/- 0 +1 +2 +3 i2 -1 +/- 0 +1 +2 i3 -2 -1 +/- 0 +1 i4 -3 -2 -1 +/- 0
i1 i2 i3 i4
Notice that the integers, as shown above, are co-defined with their operations. And notice that the relationship between the integers is a part of their definition. Another way of describing this to say that the integers represent formulaically defined instances. The dimensions that most naturally possess cardinal properties are numeric, time and space. And note how, as with numeric dimensions, the members of time and space dimensions can be formulaically defined. In other words, unlike such business dimensions as product or employee whose members need to be explicitly defined, the members of a time dimension can be generated by a formula.
In addition, as with ordinally ordering types, there may be many ways to cardinally order the instances of an otherwise nominal type. Stores which are otherwise a nominal type, for example, may be cardinally ordered by size, age, sales, profitability, and so forth. The difference between ordinalizing and cardinalizing stores is a function of how the ordering information is used. Thus one needs to be able to track multiple cardinal orderings per type. This is most easily done by providing user-defined and named orderings. Note that the syntax necessary to support relative positions for ordinal types is the same as that for cardinal types. The key difference is in the properties of the type once defined. For example, although you can count offset-based distances on an ordinally ordered type as easily as with a cardinally ordered type, you can not compare seemingly equal distances within an ordinal type.
Time.( chronological_order, this-1)
refers to the previous time position
6.1.3 Hierarchical Types
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 28/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
Instances with quantitatively inter-translatable metrics denote hierarchies. When there are different metrics within a type, the metrics for the different instances need to be quantitatively inter-translatable. Days, weeks, months, years are quantitatively inter-translatable for example. And they belong in one, hierarchical, type. On the other hand, red.color, blue.color, green.color, small.size, medium.size, large.size have quantitatively non-translatable metrics as there is no quantitative translation to convert colors into sizes. A color‟s truth does not exclude the truth of a size. So the two sets of instances belong on different types.
Hierarchies may be differentiated as a function of the degree to which they are most efficiently described using a named level or linked node vocabulary. Both vocabularies are grounded in the general notion of a set of metric-specific instances
6.1.3.1.1 Pure level hierarchies
A hierarchy that is most efficiently described using only named level terms is frequently called a symmetric hierarchy. Clock and calendar time are the most common symmetric types. Their levels might be called seconds, minutes, hours, days, weeks, months, years. In addition to all the type‟s instances belonging to one or another metric and for all the metrics being quantitatively inter-translatable, all instances within a level are ordinally arranged. So you have the 1st, 2nd and 3rd days or stores or persons. Given the known relationship between levels such as the conversion between seconds and minutes, and the ordinal position of the instances, one can figure out the exact lineage connections. For example, second number 73 is a child of minute number 2. Day number 40 is a child of month number 2. Thus, in a pure level hierarchy, one can refer to any instance by referring to the level and the offset within the level. There are three main hierarchical relationships that can be specified within a pure level hierarchy: distance and direction, endpoint, and a particular collection of instances. 1. One can specify a level offset or number of levels up/down with respect to a starting point. For example, Geog.store3.levelup(2) would refer to the level that is two levels above stores. If the hierarchy contained stores, cities and states, the expression would refer to the state within which store3 is located. 2. Another way of specifying the state within which a store is located, (useful for comparing a variable for a store with a variable for the state of the store, say, sales and tax rate), would be to write Geog.store3.state. There could be any number of levels between stores and states. The expression would always refer to the state within which the store3 is located. Note, in this use of the token „state‟ for level, the token refers to a relative modifier or offset with respect to an instance within a type. 3. There is another sense in which the level tokens need to be used. This is to refer to collections of instances, specifically, the individual states or countries or stores named by the level. For example, one needs to be able to write something like „Geog.states‟ and have
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 29/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
the expression refer to all states. Syntactically, this is expressed by introducing the token „*‟ to refer to all of what precedes it. Thus, to refer to all states one would write Geog.states.*. Examples of level references include:
Geog.countries.*
identifies all countries in the geography dimension identifies all countries except USA identifies all positions that
Geog.countries.* disjunct USA
Geog.countries.* disjunct (USA, Canada).level down (2)
are two levels below countries except USA and Canada.
Geog.store_x.country
refers to the country of store_x identifies all positions that are under the years 1992-1996
Time.year(<97 union >91).under
Given a set of known relationships between levels: level1 = x(level2), level2 = X(level3) etc.. A type could be fully specified with a two column specification table. In other words, the lineage would be implicit from the ordinal ranking of the position and the known relationship between the levels. Instance number i1 i2 i3 i4 i1 i2 i1 Level number level1 level1 level1 level1 level2 level2 level3
6.1.3.2 Pure linked node hierarchies
On the other extreme, a hierarchy that is most efficiently described using only lineage descriptors is frequently named a linked node hierarchy. In such a hierarchy, every instance belongs to a different metric. A family tree where metric resolution is defined as age and where all the members are of different ages would be efficiently described using lineage terms - up „x‟, down „x‟, root, leaves, position „x‟.
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 30/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
Geog.Massachusetts Geog.USA.down(2)
identifies the position Massachusetts
identifies all the grandchildren of USA or all the positions that are below all the positions that are below USA
Geog.USA.up(1)
identifies the parent of USA identifies all the grandchildren of the grandparent of Mass.
Geog.Mass.up(2).down(2)
A table specification would be the familiar parent/child table
Instance name i1 i2 i3 i4 i5
up(1) none i1 i1 i2 i2
(parent) name
6.1.3.3 Mixed hierarchies Hierarchies may have a mixture of named level and lineage characteristics. Instance name i0 i1 i2 i3 i4 i5 i6 i7 level name level0 level1 level1 level2 level2 level2 level2 level3 up1name none i0 i0 i1 i2 i2 i2 i6
When a hierarchy is defined with both lineage and level characteristics, the following two things must be true: Relative to the parent-child hierarchy, (and assuming for the moment there is only one hierarchy in the type), the instances of any two levels A and B must be related so that all instances of A are reached from B either purely by upward movements (towards parents) or purely by downward movements (towards children). For the purpose of any analyses that the levels are to be involved in, the instances of any level A must be in a constant relation with the instances of any level B. The first constraint may be verified by the system, while the second constraint must be validated by the designer.
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 31/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
For mixed hierarchies (unlike the two pure extreme hierarchies), the level descriptors and the lineage descriptors are independent. So it is important to be able to use both sets of descriptors for intratype referencing within the same type.
6.1.4 Types may have more than one hierarchy Finally, a single type may have more than one hierarchy. Each hierarchy is composed of a unique set of levels and hierarchically related nodes. Although levels are unique within a single hierarchy, hierarchies may overlap, effectively sharing levels. For example a time type may be composed of two hierarchies: a fiscal hierarchy and a calendar hierarchy. Both hierarchies may share a day and month level. The hierarchies may then diverge at the quarter and year levels.
Instance name 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10
hierarchy name b s b s b s b s b s b s b s b s b s b s
up1 name 6 9 6 10 7 9 7 9 7 10 8 n/a 8 n/a None n/a n/a None n/a None
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 32/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
6.2 Syntax
For the purposes of querying and calculating, it is necessary to distinguish between the name of the type and all the instances of the type. Thus „geography‟ as a name for a type is distinguished from every instance of geography or „geography.*‟ In general, the „.‟ is used as an inter token separator within a single type. Thus geography.USA denotes the instance USA in the type „geography‟
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 33/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
6.2.1.1 Intratype referencing syntax The following summarizes the syntactic devices required to capture the distinctions brought out in this section.
Intratype expression Typename [ . intratype_instance_spec ]+ (read and modify from left to right; any valid subexpression may be in parentheses.)
intratype_instance_spec term term set-operator intratype_instance_spec term instance_id [hierarchy_id .] hierarchy-specifier level-id * ordering-expression relational-operator value existence-test set-operator union intersect disjunct relational-operator . > (for cardinal types) < etc. existence-test |( boolean_expression ) hierarchy-specifier down ( # ) up ( # ) under / atunder over / atover root [ .* ]
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(possibly inter-type expression)
Page 34/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
leaves [ .* ] level-name [.*] level this level up (#) level down (#) Positional specifier position [thru position] position first [+ #] this [+/- #] last [- #] top [#] bottom [#]
ordering-expression rank ( rank-expression, positional specifier) rank-expression intertype-expression ordering-name existence NA A M P
(not applicable) (applicable) (missing) (present)
6.2.2 Summary Types are a necessary part of any data model. Hierarchies evolve naturally out of types. So-called symmetric and parent/child hierarchies are extreme cases of what we called a mixed-type hierarchy.
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 35/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
7. Type Structuring Operations
Since types are the vehicle through which all data interaction occurs, as a precondition for querying and calculating data, types need to be connected to a data set. The connection between types and data is defined by a schema. As a part of a schema, types may be structured in a particular way. For example, types that are going to be used as dimensions within some schema need to be crossed or cartesianed11 together for that purpose. What kind of operation is „cartesian product‟ and how does it compare to „summation‟ or „averaging‟? Typically, the former type of operation is found along with operations such as „join‟ „union‟ and „restriction‟ as a part of basic relational algebra. These relational-like operators, which figure so prominently in the relational world, receive relatively less attention in the OLAP world. For example, while most OLAP tools allow the user to edit a dimension, few tools provide operators for concatenating dimensions, and for putting certain dimensions in 1-1 correspondence with other dimensions (the essence of attributes), and for taking the union of two dimensions. Relational operators occur all the time in OLAP applications whether or not they are explicitly recognized. For example, every time a cube is built from dimensions, those dimensions were cartesianed together whether or not there was a cartesian operator that was invoked. And every time variables are defined (whether on their own or as a part of a variables dimension), those variables are defined as existing in 1-to-1 correspondence with the dimensional intersections whether or not that relationship was explicitly stated. Since relational style operators are a fact of life for OLAP, they need to be systematically taken into account. So the next question is: where do they belong? Should there be a category for relational operators distinct from OLAP or data operators? Should operators on types be distinguished from operators on data? As stated in section 6, relative to a type, instances are co-defined with their equivalency relationships. Thus, nominally ordered instances were co-defined with the relationships of equality and inequality while integers were co-defined with addition and subtraction. The difference between „relational‟ operators and „OLAP‟ operators is a difference in the ordering required of the type‟s instances in order for the operation to work. All the OLAP-style calculations: summation, averaging, ratios and other functions require cardinally ordered instances. (Alternatively, one could say OLAP-style operations are grounded in algebra which is defined on the domain of the rationals.) Whereas the relational operators: cartesian product, restriction, projection, union, difference, and intersection require only nominally ordered instances for their domain. For example, you can take the cartesian product of two types each of
11
Typically the term „Cartesian product‟ is used in its passive form as in „to take the cartesian product of two columns‟. Needing an active voice term, I will use the terms „cartesian‟ and „cartesianed‟. Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 36/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
whose instances are only nominally ordered (such as a „store‟ type and a „product‟ type), whereas you cannot add, subtract, multiply or divide two nominally ordered instances. Thus, there is no fundamental distinction between type structuring operations and data-oriented calculations. Both kinds of operations are expressed in terms of types. Simply, as a pragmatic issue, most type structuring operations will occur during the course of schema creation. Since data-based queries and calculations are parsed relative to a schema, these type structuring operations will be discussed first. What follows is a description of those primitive nominal operators necessary for producing schemata that require only a single type as an operand. The reader may notice that some, but not all relational operators are present. The ones that are present, specifically, restriction, product, union, intersection, and difference, require only a single type as an operand. There are also other non-relationally grounded operators, namely „1-to-1‟ and „1-to-N‟ correspondence. The omitted relational operations, (projection, join, and divide), require relations consisting of at least two types, as operands. Although omitted from an exposition of primitive operations, they can nonetheless be created from the primitives.
7.1 Cross or Cartesian Product: Any number of types may be cross-producted or cartesianed. This produces an M X N (or many-to-many), relationship or structure. For example, types used as locators in a schema need to be cartesianed before they interact with data. In contrast, types used as contents need to be put in 1-1 correspondence with the locators. The cartesian operator is the symbol „x‟. To illustrate the operation, if some type P has instances {shirts, coats, hats} and another type G has instances {USA, Canada, Mexico}, then P.* x G.* will result in the nine-element two-dimensional array shown below. (Recall from 6.2.3.4 that .* means every instance of the preceding dot-token.)
P/G shirts coats hats
USA shirts, USA coats, USA hats, USA
Canada shirts, Canada coats, Canada hats, Canada
Mexico shirts, Mexico coats, Mexico hats, Mexico
By varying the .*, it is possible to specify cartesian products between subsets of types. This is particularly useful when certain types only apply to a subset of another type. For example, the type color may only apply to certain products. To illustrate a partial cartesian product, and continuing with the current example consider the diagram below.
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 37/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
If instead of saying P.* x G.*, one said
Type ProdRegion = P.* x G.(USA union Canada),
it would create the two-dimensional type shown below.
P/G shirts coats hats
USA shirts, USA coats, USA hats, USA
Canada shirts, Canada coats, Canada hats, Canada
7.2 1-to-1 correspondence A type that is used as a content needs to be put in 1-to-1 correspondence with the set of intersections defined by the cartesian product of the types used as locators. For example, if a sales type, S, were to be used as a content with the product/geography type structure above serving as a locator, it would exist in 1-to-1 correspondence with the intersections of that structure. Thus, there would be 9 sales instances. The symbol for the 1-to-1 operator is „~‟. This can be represented in the following way.
7.1.2.1 S ~ (P.* x G.*)
Note that the .* was not used for the sales type. This is because we are associating, for each intersection of P and G, some instance from S. S is, in a sense, the variable which will be evaluated, presumably by reading in a data set whose key structure matches the cartesian product of P and G.
1-to-1 correspondence is also the operation that needs to be performed to define one type as an attribute of another type. For example, to say that a type O, or opening date, is an attribute of stores in a store type, S, is to implicitly define a 1-to-1 correspondence between each unique instance of stores and an instance of the type opening date. This can be represented in the following way.
7.1.2.2 S.* ~ O
“Each S is 1-to-1 for an O.” Note how in this case, one of the types, S, has the .* suffix. What this means is that S is being used as a locator and O as a content. Stated otherwise, it says that
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 38/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
for each unique instance of sales as defined by the type S, there exists some instance from the type O. 1-to-1 correspondence is also the kind of relationship that exists between a type S, such as stores whose instances were machine generated, and another type, say store names, N, whose instances are user-oriented.
7.1.2.3 S.* ~ N.*
“Each S is 1-to-1 with each N” Note how in this case the .* suffix was used for both types. For this operation to succeed, the number of instances of S needs to be the same as the number of instances of N (unless some other restriction were placed on the instances). And it assumes that there exists some default ordering for S and N that can be followed for the purposes of association. Concerning default orderings, it is useful to be able to manipulate types in terms of the positions of their instances regardless of what the instances mean. As implied above, and here stated, this paper assumes that the instances of a type can be internally represented in an ordinal format. Thus, even a nominal type such as political party has a default ordinal ordering that can be used for the purposes of insertions and correspondences. There is nothing fundamental to this assumption and if a product did not have a default ordering for its types (though most, if not all, products do have such an ordering) lengthier statements would be required to create the 1-to-1 pairings such as the one shown above.
7.3 Correspondence The process of combining two types that have a hierarchical relationship into a single hierarchical type, say a city type and a region type, is a form of 1-to-N correspondence. The „~<‟ symbol denotes a 1-to-N relationship. For example, consider the following city type, C and region type, R. Typically, this would be expressed as R.* ~< C.*, or “each region is 1-to-N with some cities; cities only appear in one region.” Cities Boston Newton Cambridge Springfield Northampton Lenox Regions Eastern Central Western
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 39/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
Without knowing more about the specific parent-child relationships, there is no way to combine specific C with specific R in a single operation defined at the type level. Instead, the 1-to-N operation would need to be invoked for each region instance and its connection to the set of city instances that correspond to the region. For example, the following operations would serve to create a single hierarchy within one type from the two types: Cities, and Regions.
7.1.3.1
Geog.Regions.Eastern.AtChildren = Regions.Eastern ~< Cities.Boston - Cambridge Geog.Regions.Central.AtChildren = Regions.Central ~< Cities.Springfield - Northampton Geog.Regions.Western.AtChildren = Regions.Western ~< Cities.Lenox
7.1.3.2
7.1.3.3
7.4 Insertion and Deletion Clearly, one needs to be able to edit types through inserting and deleting instances. For inserts, (and assuming they are making use of a default ordering), it is necessary to specify an insertion point and a before or after relationship to that insertion point (i.e., insert „x‟ before/after „y‟). Consider the following two types, new_stores and stores. And imagine one wants to insert new_stores into stores following Paris. Expression 7.1.4.1 represents that insertion statement.
New_stores Nice Seville Sienna Florence
Stores Paris Munich Frankfurt Geneva
7.1.4.1
Insert New_stores Into Stores after Stores.Paris
If the keyword „after‟ is omitted the insertion point defaults to after the last instance of the inserted-into type and the operation could be called „concatenation‟. Although there is no equivalent notion of insertion point for deletions, one can make use of the range operator, „thru‟, as in the following expression.
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 40/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
7.1.4.2 Remove Stores.Munich thru Geneva from Stores
One can also make use of the range operator for insertions as in the following expression. 7.1.4.3 Insert New_stores.Nice thru Sienna Into Stores before Stores.Frankfurt
7.5 Traditional Set Operations The traditional set operations of „Union‟, „Disjunct‟, and „Intersection‟ also need to be supported.
7.6 Type Definition Syntax This section brings together some of the syntactic elements described previously in this chapter. A new LC Type can be defined in terms of another existing type. It is assumed that at the very least there will exist a library of “native” types (integer, float, string, etc.) upon which new LC types may be built. As new types are added to the library, they can be used to construct ever more new types. LC keywords are in bold. 7.6.1 Type Declaration New LC types are declared using one of the following statements: Type TypeName = OtherTypeName Type TypeName = Single-Type Instance Specification Type TypeName = Multi-Type Combination Where Single-Type Instance Specification refers to one or more instances of an existing LC type using the Intratype syntax defined in the last chapter.
Multi-Type Combination refers to the joining (cartesian, union, one-to-one association, etc.) of a set of instances from two or more existing LC types.
7.6.2 Instance Insertion and Deletion Instances may be added and removed from a Type by the Insert and Delete keywords.
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 41/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
Insert (InstanceName [, InstanceName …]) Into TypeName [Before | After Instance] Insert Single-Type Instance Specification Into TypeName [Before | After Instance] Delete Single-Type Instance Specification Where Single-Type Instance Specification refers to one or more instances of a Type using the Intratype syntax defined in the last chapter. The Instance specification following the Before and After keywords refers to a single instance of the type and is the insertion point for the new instances. 7.6.3 Definition of Hierarchy 7.6.3.1 Linked Node Hierarchy Simple parent-child relationships can be assigned using the = operator. For example, Product.Shoes.* = Loafers, Pumps, Sandals assigns Loafers, Pumps and Sandals as children of the instance Shoes type Product. 7.6.3.2 Levels and Named Hierarchies Levels can be defined for a type using the Levels keyword: TypeName.Levels.* = LevelName [, LevelName…] Named hierarchies are similarly declared: TypeName.Hierarchies.* = HierarchyName [, HierarchyName…] Having created named levels or named hierarchies, member instances can be specified, such as: TypeName.LevelName.* = InstanceName [, InstanceName…] and TypeName.HierarchyName.* = InstanceName [, InstanceName…]
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 42/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
Note that the instance lists above, denoted by “InstanceName [, InstanceName…]”, can be replaced by any Intratype expression resulting in one or more instances.
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 43/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
8. Schemata Between Types and Data
As stated at the beginning of section 7, a precondition for querying and calculating with data is the connection of some types to that data wherein the connection between types and data is called a schema. Having explored the variety of operations that can be performed on types, we now turn to the relationship between types and data as defined by the schema itself. After discussing minimum conditions for a valid schema, we look at the different ways by which schemata may connect types to data as well as their implications for querying and calculating. Note that the goal of this section is to show how queries and calculations presuppose schemata, and to show the main ways that types connect to data. This section is not meant to serve as a comprehensive treatment of links between types and data. Such a treatment of links, which would include maintaining dynamic links, updating types, updating data tables, recognizing different table forms, and manipulating data in the table, is not a part of a core logical model and so falls beyond the scope of this document12.
8.1 Schemata in general
Regardless of how a schema connects types to data, that connection must define a 1-to-1 correspondence between every individual data element and individual type instances. For example, consider the simple types and table below.
Time January February March April …December
Sales Any $ value between 0 and 10,000,000
Column Row 1 2 3
12
1 Time January February March
2 Sales $150 $200 $250
For a discussion of links, please see OLAP Solutions, pages 82-89 and 187 - 194. Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 44/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
4 April $300
The individual data elements are the contents of the locations defined by the intersections of the rows and columns. They are pictured above as the shaded elements. Regardless of whether the types restrict the set of valid data elements or whether the data is used to define the types, any schema must connect to all eight of those elements. Consequently, it is necessary to have a language for describing data that can be used within schemata for connecting types to data. So long as the language is known to the schema and is capable of individuating data elements, it does not matter what that language is. For the sake of simplicity, we will use the metaphor of rows, columns and tables for describing data. Any data element can be differentiated according to its table name, tn, row name, rn, and column name, cn .
8.2 Minimal schemata
A minimal schema needs to perform two tasks. 1. It needs to connect instance ranges of individual types with ranges of tables, rows and columns. For example, in the types/table above, a minimal schema would assert that time connected to column 1 and sales connected to column 2. 2. It needs to define, for each type/data pair, whether the type values constrain the data values or whether the data values constrain the type values. For example, in the types/table above, a minimal schema might assert that the schema is conditioned by the values in the data table and the values in the data table are limited by the type instances. In practice, a minimal schema is more useful when less is known about the data. A consequence of using a minimal schema is that additional schema-like information needs to be provided for expression parsing. Consider the types and table below.
Time Jan Feb March
Stores 1 2 3
Products a b c
Sales any $
Costs any $
Time Jan Feb March
Stores 1 2 3
Products a b c
Sales 100 200 300
Costs 90 180 260
Page 45/80
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
How would one add a new variable, say profit, defined as sales - costs? Without additional schema-like information, how would the system know how to dimension profit? How would it know that profit should be dimensioned by store, time, and product but not by sales and costs? There is no way for the system to know without being told (or without it doing an analysis of the data). In other words, one would need to define something like the following. Definition
profit, time.*, store.*, products.*
There are no logical keys in a minimal schema. This means that there is no initial paring of the data stemming from blank cells in key columns, because there are no key columns. Any such paring happens on a query-by-query basis. (Please see section 9 below for a treatment of querying.)
8.3 Rich schemata A rich schema contains all the connection information as a minimal schema and in addition it contains information about how the types in the schema are structured, from which one can deduce the relative locator and content roles played by the types, and it contains information about the flow of values between types and data. There are at least two different types of structuring that must be present. Types that are cartesianed for the purposes of defining a schema denote locators or dimensions. This is because every unique instance of each type is associated with every unique instance of every other type. Types that are put in 1-to-1 correspondence with a set of cartesianed dimensions denote contents. Although most rich schemata will have some types cartesianed together and some put into 1-to-1 correspondence with them, there are some simple schemata consisting of only one locator dimension for which no cartesian operator is necessary. In these cases, however, one can still recognize the locator from the fact that all of its instances (as denoted by the token .*) are put into 1-to-1 correspondence with an instance from the other type used as the content. [Recall the dimension-attribute example in 7.1.2.2.] Other structurings are possible, though not necessary. For example, two types may be concatenated or put in 1-to-N correspondence for the purposes of defining a schema.
In addition to specifying the relationship between the types that are connected to the data, it is necessary to specify the relationship between the types and the data. Are the data simply constrained to be a valid instance, or do the data need to match a particular set of instances? Or, are the type instances constrained to match the data as is the case when one defines the types in terms of the data.
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 46/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
Using the table and types shown in 8.2, a typical schema that would connect the types to the data and which constrained the column of locators (stores, time, products), to match the instances of their respective types while allowing the content columns (sales and costs), to be any valid instance, would look as follows.
Note, there are four types of connector symbols: The „>‟ symbol means that any rows that do not correspond to a type instance are invalidated and any type instances not reflected in the rows are added to the rows. The type used by the schema is entirely defined by the type. The „<‟ symbol means that any type instances that do not correspond to a row are invalidated (removed from the schema) and any rows not reflected in the type instances are added to the types. The type as used by the schema corresponds only to what is present in the rows; however, the schema‟s version of the type may be compared with the original version. The „<>‟ symbol means that any rows that do not correspond to a type instance are added to the types and any type instances not reflected in the rows are added to the rows. The type used by the schema is the union of instances that were in the type and instances present in the rows. The „>!‟ symbol means that any data rows not corresponding to a type instance are invalidated without any additional constraints. The type structure ((Stores.* x Time.* x Products.*) ~ (Sales ~ Costs)) where Connect Connect Connect Connect Connect Stores.* Time.* Products.* Sales Costs > Column 1 Column 2 Column 3 Column 4 Column 5 all rows all rows all rows all rows all rows The data (Rows 1-3 x Column 1-5)
A rich schema makes system-testable claims about the data. For example, it claims that the locator dimensions, (in this case stores, time and products), are independent. That claim can be tested in the data set. The schema also claims that sales and costs are dimensioned by stores, time and products. Again, this claim is straightforward to test. With a defined schema, one can also test the sparsity of the locators and the contents. And since the locators are the keys, missing locators can be detected and isolated (i.e., treated as non-applicable).
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 47/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
It is also clear how to dimension new variables. Simply, they get dimensioned by the locators. As will be shown in the following section, one can still treat any type as a locator or a content on a query-by-query basis with the exception that missing data in a locator column will invalidate the entire row regardless of the query-specific schema that is used. 8.4 A Single Schema with Multiple Data Sets A single schema may connect to more than one data set. For example, sales data may exist in multiple tables, one for each region. In that case, and given the same types as used above, except for stores which now have 9 instances, and assuming all three tables have the same structure as the table above, the declaration of type-data connections needs to associate instance ranges in the type with table names as shown below.
Type Structure (Stores.* x Time.* x Products.*) ~ (Sales ~ Costs) (Tables 1 thru 3 Data Table (Rows 1-3 x Column 1-5))
where Connect Connect Connect Connect Connect Connect Connect Stores 1-3 Stores 4-6 Stores 7-9 Time.* Products.* Sales Costs Table 1 Table 2 Table 3 all tables all tables all tables all tables Column 1 Column 1 Column 1 Column 2 Column 3 Column 4 Column 5 All rows All rows All rows All rows All rows All rows All rows
8.5 Sparsity implications of minimal versus rich schema
The use of a type as a locator versus content affects the allowability of different kinds of sparseness. Locators can never be sparse. This is rule number one in the relational model version 2. If a locator is missing and there is no assumption about the value of that missing locator, the entire row or set of contents evaluated for that location are invalidated13. In contrast, when a content is missing, it has no impact on the other contents for that location.
13
One might argue that it is possible to use the content data associated with the missing locators within aggregate statistics. Thus if the locator were time and the content were sales and certain time values were missing, one could still talk about the total mean and standard deviation for sales. However to compute these aggregates one needs to assume that the missing locators would have been valid were they to have been present. Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 48/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
Thus within a single data set, the choice of locators and contents will determine which blank cells are treated as missing and which make other cells not applicable. 8.6 Schema Connector Syntax The syntax for connecting Types to data is as follows: Connect Instance-Specification Connector TableName.ColumnName RowRange Where Instance-Specification is any Intratype expression resolving to one or more instances of a Type. Connector is one of: > < <> >! TableName.ColumnName refers to the name of the table and the name of the column (or column number) pertaining to this type. RowRange refers to the desired rows of the table. It may refer to all rows in the table or to a filtered range: All Rows, which means just what it says, or Where condition, which is a standard SQL where clause.
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 49/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
9. Queries
So far, and as a pretext to querying and calculating, we have explored the need and basis for types, how types support hierarchies, operations that apply to types and the schemata that connect type structures to data. We are now ready to explore the multidimensional querying and calculating enabled by this systematic treatment of types. In this section, we describe queries. In addition to the intratype referencing syntax introduced in 6.2.3.4 and the type structuring syntax introduced in 7, there are two additional tokens, described below that are required for the expression of queries14. 9.1 Primitive Tokens The necessary primitive tokens are , comma; used to separate types () parentheses; used to bound expressions that are calculated at one time. Calculations proceed from inside out; context is passed from the outside in.
9.2 The Basic Form of a Query The basic form of a query is Ta, Tb.Ib, Tc.Ic …. where the token „T‟ refers to a type name, the token „I‟ refers to a type instance and the subscripts a b c refer to distinct types. In other words, given a type name, Ta, and one or more type-instance pairs for other types, Tb.Ib, Tc.Ic, select the instance(s) for the named type, Ta, associated with the set of locations defined by the type instance pairs. For example, given a set of types „Sales‟, „Stores‟, and „Time‟, and an appropriate data set, and a schema connecting the two, the query
Ta Sales Tb.Ib Stores.Cambridge Tc.Ic Time.>1995
will select the set of sales instances for the Cambridge store for all time after 1995.
14
Note that technically, a query does not define a return set in the sense of a data set presented in some form to an output device. Strictly speaking, a query identifies cells that meet the query criteria. This is important because when I talk about queries, I am referring to them in their pure sense. Also, to understand this document correctly, I am focusing on queries in the absence of how the result sets are represented. In a later version of this document I will address the representation issue. Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 50/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
Looking at operations from an LC perspective, the first task of any parser is identifying the locator and content roles of the types involved. This is a straightforward task because the content terms are the ones for which no instance information is given. And the locator or location terms are the ones for which instance information is given.
9.3 Basic Queries As a backdrop for the following queries, let us assume a typical rich schema of time, geography, products, sales and costs.
The type structure ((Geog.* x Time.* x Products.*) ~ (Sales ~ Costs)) where Geog.* Time.* Products.* Sales Costs > Column 1 Column 2 Column 3 Column 4 Column 5 all rows all rows all rows all rows all rows
The data (Rows 1-3 x Column 1-5)
The query
sales
has no instance modifiers, so it will select all sales values from all (time.* x geography.* x product.*) intersections in the data set. The query
sales, Geog.Store_10
will select all sales values from all Store_10 * product * time intersections in the data set. The query
Geog, |( sales.300 )
will return all geographies where the sales value was equal to 300. Note how functionally speaking, „geography‟ is the content (or predicate) and „sales‟ is the locator (or subject). (Recall that sparse data is not allowed in the locators so there is a subtle difference in the handling of sparsity as a function of which types are used as locators and which are used as contents.) The sales.300 predicate will be evaluated over all times and products too since sales is dimensioned by them, but the query will only return the geographies.
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 51/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
The query
sales, |(sales.>100)
will select all sales values where sales are greater than 100. And the query
Geog, sales.>100, Geog.stores.*
will select all stores where sales are greater than 100. Note in this last query the dual uses for the type „geography‟. In the first use, it is as a content for the query. In the second, the „.stores.*‟ refers to a collection of instances. 9.4 Queries with Modifiers As described in section 6.2.3.4, intra-type modifiers may take several forms: 1. absolute location specification 2. relative positionally-offset location specification (within an ordered set) 3. absolute named level specification 4. relative hierarchical specification (through named levels or linked nodes) 9.4.1 Absolute location specification Multiple absolute location modifiers may be used in a query. In the following query, there is an absolute modifier for every location type. Thus
sales, Geog.Store_10, Time.Jun1995, Products.ActionToys
will always select the sales value at the Store_10 * Jun.1995 * ActionToys intersection within the geography, time and product types, respectively. 9.4.2 Relative positionally-offset location specification
Relative position specifications are frequently performed in time. The query
Sales / (Time.this - 1,Sales)
computes the ratio of sales to sales lagged by one time unit, for each time, product, and geography intersection to which the formula is applied. The time lag will be performed at whatever level the reference time instance happens to be; if „this‟ time is a quarter, it will refer to the previous quarter, while if „this‟ time is a month, it will refer to the previous month.
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 52/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
9.4.3 Absolute named level specification Absolute level specification can be performed by named level or hierarchical specification. For example, the query
Sales , Geog.State.*, Time.Year.*
will select the Sales values from the set of all states and years. And the query
Sales, Geog.leaves.*, Time.leaves.*
will select the sales values for the leaf levels of geography and time 9.4.4 Relative hierarchical specification Most queries that use relative hierarchical specifications also contain calculations and so will be discussed in the next section. What follows are examples of pure queries using relative hierarchical specifications.
sales, Geography.states.children
The above query selects the sales for the children of each state in the Geography type.
sales, Time.months.levelup(1)
This query selects the sales for each location that is one level up from months.
Time.Jan95, Geog.NYC_1, product.Toys.ByCategory.AtUnder, sales, costs, margin
selects sales, costs and margin from Time.Jan95, Geog.NYC_1, by all product instances in the ByCategory hirearchy at or under product.Toys. 9.5 Nested queries Nesting is used to select instances from a type, such as certain stores or certain times, based on criteria from other types. In English, for example, one might say, “ Show me the total profits for all stores that sold more than $100,000 worth of furniture during the summer.”. Or, “Show the sales for those time periods where the average The query expression
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 53/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
sales, Geog.stores.|(sales.>1000, Time.1995, Prod.toys), Time.1995.month.*, Prod.total
will select the total product sales for every month in 1995 for those stores with toy sales of over 1,000 for 1995. The query expression
sales, Time.months.|(costs.>1000, Geog.all, Prod.toys), Geog.stores.*, Prod.all
will select total product sales for every store for those months where aggregate toy costs were greater than 1000.
The query expression
profits, products.all, Time.month.*, Geog.store.(rank (sales, 1 thru 10))
will select the profit values for the top ten selling stores per month, for each month.
Sales / (sales, product.ByGender.parent), product.leaf.*, geog.state.*, time.qtr.*
takes the ratio of sales to parent‟s sales along the product‟s ByGender hierarchy, for each leaf product by state by quarter.
rate, geog.stores.|(margin >= 3, product.all), time.month.*, empl.mgr.leaf.rank ((hours, geog.all), bottom 5)
Returns rates for bottom 5 managers in terms of total hours per month, per store/times where total product margins are >= 3.
time.this - time.(sales.P).first, product.All, geog.leaf.*
returns for each leaf geography the number of time units between this time and the time of the first existing sales value, at this level of time scale.
Sum ((sales.P) ? 1 : 0, time.first thru this), time.Dec95, geog.leaf.*
returns, for Dec95 x Product.* x geog.leaf.*, the number of months that sales values were reported.
sales, time.1995, prod.all, geog.stores.|(max (margin, time.qtr.*) > 3)
selects sales for 1995, product total, by stores that had at least one quarter where quarterly margin > 3:
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 54/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
sales, time.1995, prod.all, (geog, geog.stores.*, time.qtr.*, |(margin > 3))
performs the same query! the nested query selects geographes from stores x quarters where margins > 3, and the outer expression selects sales from 1995 by all-product by these stores. Note that .|() restricts the locator type while ,|() restricts all locations of the sub-expression.
sales, time.1995, prod.all, geog.stores.|(count ( |(margin > 3, time.qtr.*)) > 0)
performs the same query again, by including only those stores that have more than zero quarters of margin > 3. A query from hell is
sales, time.(sales > 300).last, geog.(sales > 300), product.(margin (margin, time.this - 1) > 1)
No type can be used to restrict or form a context for the other types before evaluation begins. To evaluate: across all product.* x geog.*, select last sales > 300. Across all of these times by all products, select geogs with sales > 200. Now, across all of these time, geog tuples, select those margins where increase of margins from preceding period to this period was greater than 1. 9.6 Iterative Queries Iterative queries provide a way to selectively refine result sets. At the UI level, they allow the user to only restate those aspects of a query that have changed. At an implementation level, they should allow the query to be answered from local memory without having to reissue the original query. Through an appropriate tree-like numbering schema (visible, perhaps, only to the system), it is possible to keep track of every line of querying during a session. Following a query, there are three basic paths (and associated numbering schemes) that a user can follow in formulating a new query. 1. The new query is not a refinement of any previous query. In this case the new query would be assigned a new top level number. 2. The new query is a refinement of the query before. In this case, the new query would be assigned a new decimal of value „1‟ relative to the query before. For example, if the query before were labeled 1, the new query would be labeled 1.1. 3. The new query is a different refinement of an ancestor of the query before. In this case, the new query would be assigned a value incremented by one of the last decimal of the existing refinements of the query of which it is a new refinement. (Individual digit position is used
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 55/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
instead of decimal points after the decimal point– 1.23 is the third variation of the second modification of the first query.) For example, a new refinement of a query labeled 1.1 where there already exists a 1.11 and a 1.12 would be labeled 1.13.
Given our standard schema of Geography, Time, Products, sales and costs, follow the example below. Note the numbering scheme used. 1. sales, Geog.*, Prod.*, Time.* 1.1 1.11 Prod.toys (refinement of 1) Time.1995 (refinement of 1.1)
1.111 Time.1995.March (refinement of 1.11) 1.12 Time.1996 (new refinement of 1.1)
1.121 Time.1996.May (refinement of 1.12) 1.2 2 Prod.shirts (new refinement of 1) costs (not a refinement of any previous query)
9.7 MD queries In a sense, all queries are multidimensional. What this section outlines is how the query syntax makes it easy to specify tuples versus intersections. The query
sales, sales.>1000
will select the sales for all Time, Geography, Product locations where sales are greater than 1000. Now what if you wanted to track these Geography, time, product triplets so that you could use them for other queries/calculations/reports. What you need is to query the Geographies, Times, and Products as with the following query fragment.
Geog.*, Time.*, Product.*, |(sales .> 1000)
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 56/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
This selects all tuples of geog, time, and product that have sales > 1000. (Of course, since sales is dimensioned by geography, time and product, simply
|( sales .> 1000)
has the same effect.)
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 57/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
10. Representations
As stated at the beginning of the previous section, queries are distinct from representations. To use a relational term, the result of a query is a set of selected tuples. Once selected, there are many possible representations that can be made. For example, a set of tuples could be represented as a one-dimensional table, a two-dimensional static grid, a dynamic hypercube, or a collection of line graphs. Syntactically (and assuming, as will be discussed below, there exists some library of representational forms), assigning a particular representation to a query or calculation result is very simple in the LC model for OLAP. All that is required is the expression “Represent or „Rep‟ as X” where X can be any valid representational form. Consider the following queries with explicit representations.
Rep as table sales, Geog.stores.*. Time.months.*. Products.toys
Rep as hypercube
sales, Geog.stores.*. Time.months.*. Products.toys
There is nothing binding the representational expression to the query. In other words, there could be a default representational form that determines the representation of a query without having to specify it. The reason this can work is that each representational form is understood by the system in terms of its location and content structure. Thus, given a schema, the system can figure out how to connect each type in the query/calculation expression with each component of the representation. This does not ignore the very real differences that exist between different representational forms. For example, if a result set is very sparse in terms of locators, and one doesn‟t want to show nulls, then it is better to represent the data in table form where non-existent tuples can be left out rather than in grid form where it is impossible to not show missing cells. Or, if the locators are nominally ordered, it would not be a good idea to show the data as a line graph, as this would imply that the locator serving as the X axis was cardinally ordered.
Consider, now some descriptions of typical representations in terms of locators and contents. 10.1 The LC structure of a table representation Every column in a table can be represented by a type. And a natural ordering for the types among the columns of a table is for locators or keys to be placed first, followed by contents or attributes.
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 58/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
10.2 The LC structure of a line graph representation The X axis of a line graph represents one locator and the Y axis can represent 1 or more contents. If there is more than one locator in a query being represented as a line graph then the second through Nth locator need to be represented as pages. 10.3 The LC structure of a hypercube representation A hypercube is composed of one or more types in a nested relationship attached to an X, Y, and page axis. A query or calculation result is defined by one or more locator types and one or more contents. Furthermore, it is known for each locator, whether a range of values or a single value is to be represented. This is enough information to determine which types need to be shown as pages; namely, the ones for which only a single instance is given. And it is enough to determine which types need to be shown along the X or the Y axes. The only information lacking is the exact mapping of non-page types to their respective nesting level on an X or Y axis. And while a user could certainly specify an exact mapping, there are many equally valid ways to represent a schema as a hypercube. Rather than force a user to specify how every type is represented in the hypercube, since the hypercube is supposed to be a dynamically re-arrangeable representation, it is easier to have the user specify only that the representation is to be a hypercube, let the system figure out which types are page and which are non-page and then put up a default arrangement of the non-page types that the user can reconfigure in seconds if so desired.
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 59/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
11. Calculations
Calculations are more complex than queries as they involve at least one operation. New.price = Old.price x 3 is an example of a very simple calculation. An important feature of the LC model for OLAP is that the calculation syntax allows the same formula to be expressed either as a relationship between instances of the same type or as a relationship between types. The benefit of this feature is that it allows the user/analyst to model the same event in multiple ways as well as to translate between the different ways. For example, depending on how one designed a schema, the formula Profit = Sales - Costs might represent a relationship between instances of a „Variables‟ dimension, or each of the terms, „Profit‟, „Sales‟, and „Costs‟ might represent separate types. 11.1 Basic examples The assignment operator „=’ is used to bind the name of a type (and possible application range), with a calculation expression that is used to compute it. Ignoring the application range component for the moment, the statement
Profit = Sales - Costs
states that the type Profit will be computed for any location by subtracting costs at that location from sales at that same location. In contrast, the expression
taxes_owed = sales * (state, tax_rate)
says that taxes owed will be computed for any location by taking the sales for that location and multiplying them by the tax rate for the state within which the sales were made. Note (as was described in section 6.2.3.4), how the token „state‟ is used above without the suffix „.*‟ to refer to a relative hierarchical offset.
11.2 Examples with application ranges Incorporating an application range, the statement
Profit, Geog.Store.*, Time.year.* = Sales - Costs
says that, within each intersection of store-level members by Year-level members, Profit will be computed by subtracting costs at that location from sales at that same location. The type instances on the left hand side of the expression (Geog.store.* and Time.year.*), denote the
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 60/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
application range or range of locations for which the formula on the right hand side of the expression will be or can be applied. Absent any other applications of Profit to other location contexts (Geogs other than store-level), Profit will not be applicable anywhere beyond these locations. The expression
profit, |(sales > 100) = sales - costs
calculates profit for all locations where sales > 100. The expression
Profit, Geog.stores.|(sales.>1000, Time.1995, Prod.toys)), Time.1995.month.*, Prod.total = Sales - Costs
calculates total product profits for all months in 1995 for all stores whose 1995 toy sales exceeded 1000. Within an application, there could be an initial declaration for profit such as
Profit, Geog.*, Time.*, Products.* = Sales - Costs
This would serve to set the default formula for calculating profits and the range of locations to which Profit applies. Afterward, succeeding descriptions of Profit that contained application ranges such as the ones below would identify ranges within which Profit had a localized or exceptional definition.
Profit, Geog.USA.under = Sales - Foo_costs Profit, Geog.Europe.under = Sales - (costs + Due_costs)
11.3 Aggregation functions that apply to hierarchical dimensions
Many rollup-like functions can be expressed as a generalized formula applicable to a type that would affect all data that might be aggregated/disaggregated according to the type. For example, in English, one might want to say “All data sums across time”. This would be expressed as
Time.leaves.above = sum (Time.leaves.*)
Of course, within a real-world application, one is likely to see different variables needing to be aggregated in different ways. For example, wages and prices would not sum across time. The
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 61/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
best way to handle this sort of irregularity is to override the default aggregation functions with aggregation functions that attach to specific variables.
11.4 Examples that specify aggregations on other dimensions This section shows how aggregation formulas can be attached to specific variables either as a part of the formulaic definition of the variable or as a separate expression. Start with the by now standard definition for profit 11.4-1 Profit, Geog.*, Time.*, Products.* = Sales - Costs If one wanted to state how Profit aggregates across Geography, Time and Products, it would be necessary divide up the global application for Profit into a local range where profit was input and a local range where Profit was calculated. This is shown below. 11.4-2 Profit, Geog.leaves.*, Time.leaves.*, Products.* = Sales - Costs 11.4-3 Profit, Geog.leaves.above, Time.leaves.above, Products.leaves.above =
sum(Geog.leaves, Product.leaves), Avg(Time.leaves)
If one wanted to express the formula for Profit and any aggregation formulas for Profit in the same expression then it would look like the following. 11.4-4 Output variable name Input function Derivation function
Profit
= (sales, time.leaves.*, geog.leaves.*, product.leaves.*) - (costs, time.leaves.*, geog.leaves.*, Product.leaves.*) = sum(Geog.leaves), sum(Product.leaves), avg(Time.leaves)
Expression 11.4-5 could be further condensed as follows. 11.4-5 Output variable name Input function Derivation function
Profit
= (sales - costs, time.leaves.*, geog.leaves.*, product.leaves.*) = sum((Geog, Product).leaves), avg(Time.leaves) Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 62/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
11.5 Moving functions A basic requirement of DSS calculations is the strong support for moving functions. A combination of standard aggregation functions plus the referencing syntax introduced in 6.2.3.4 is sufficient to define any moving function. For example, a rolling sales variable could be defined in terms of standard averaging and time offset ranges as shown below.
rolling_sales = avg(time.this-3 thru this, sales)
Another way of expressing the function is through a transparent shortcut as shown below. The remainder of the examples in this section are written in „transparent shortcut‟ fashion.
rolling_sales = rollavg(time.this-3, sales)
assumes there is only one ordering for time. Calculates a rolling average based on the last three periods for all levels of time
rolling_sales = rollavg(time.this-3, sales), time.leaves.*
assumes there is only one ordering for time. Calculates a rolling average based on the last three periods at the leaf level of time
cumsales = cumsum(time.first, sales)
assumes there is only one ordering for time and calculates a cumulative sum for sales starting with the first time value for every level.
cumsales = cumsum(time.year.first, sales)
assumes there is only one ordering for time and calculates a cumulative sum for sales starting with the first time value for the year level only.
YToDSales = cumsum (sales, time.year.month.first), time.year.month.*
assumes there is only one ordering for time and calculates a cumulative sum for sales at the month level from the first month to the target month for each year. 11.6 Generalized tokens and Inheritance
Inheritance works smoother if there are tokens for generalized expression components. The L-C model offers a variety of natural tokens: L meaning locator, C meaning content, i meaning instance, and N meaning name.
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 63/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
For example, one could define a generalized averaging function using the Locator and Content terms.
content_avg = avg(locator.leaves)
Then one could define a set of specific contents as inheriting this generalized function.
(prices, wages, tax_rates, production_flow) isa content_avg
Assuming that the relevant schema contains a Time, Geography and Product type used as locators for Prices, the specific formula for Prices would translate into:
Prices = avg (time.leaves, geog.leaves, product.leaves)
The generalized instance term „I‟ is necessary for expressing some one instance of a type. For example the page dimension of a representation takes some L.i.
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 64/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
12. Multi-Schema Situations
12.1 A Single Data Set with Multiple Rich Schemata
A single data set may connect to multiple rich schemata sharing a single minimal connection15. For example, different user groups may use the same base data for different scenario modeling. Or the same external economic data may figure in multiple applications such as payroll (if there are COLA clauses), and marketing forecasts. The various schemata can be tracked in a table as shown below16. column name stores time products sales costs Type number 1 2 3 4 5
Function per schema 1 Function per schema 2
L L L C C
L C L L C
12.2 Example of Multischema integration In this section, we look at an example that integrates data from two schemata into a single schema. The example starts with types geography, time, product, sales, cost, profit, margin, employee, rate, hours, pay. Sales, cost, profit, margin, rate, hours and pay are cardinal types without a set of discrete instances defined. Geography, time, product and employee have a specific set of instances defined (and of these, only time is cardinal).
15
While it is possible for two schemata that are minimally distinct to interact with the same data set, the situation is beyond the scope of this document.
16
The ability to switch schema in the functional approach takes into account what have been called „attribute‟ dimensions by some vendors. An attribute dimension is a type that can be used as a locator or a content. Pilot Software‟s Analysis Server, for example, calls color an attribute dimension because in one scenario it is used as an attribute of the dimension product and at another moment as a dimension in its own right. In our vernacular, color would simply be a type which one is free to use as a locator or as a content. Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 65/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
The following figure shows the structure of the geography, time, product and employee dimensions, with named levels, named hierarchies, and parent-child relationships:
Country Region State City Store Y ear Quarter Month
CO WY West CA LA USA SF
CO_1 WY _1 SF_1 SF_2 LA_1 LA_2 NY C_1 NY C NY C_2 Q4-95 1995 Q3-95 Q2-95 Q1-95
Jan95 Feb95 Mar95 Apr95 May 95 Jun95 Jul95 Aug95 Sep95 Oct95 Nov 95 Dec95
East
NY ROCH
ROCH_1 ROCH_2
MA RI
MA_1 RI_1
CEO Exec GIJoe Dolls Toy s Play Hot Wheels By Category All Sport Basketballs Bats Jacks Chess Games Checkers Risk By Gender Boy s Girls Scrabble Store Emp Mgr Total Admin HR Fin Barbie Tonka HQ VP
Frank Sarah Joe
Tim Jane Jim Lucy
Mike Ralph Cathy Homer Grant
Either Chas Mary Clerk Jill Erin Kara
The types are related in the following way:
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 66/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
(Time.* x Geog.* x Product.*) ~ (Sales, Cost, Profit, Margin) (Time.* x Geog.* x Employee.*) ~ (Rate, Hours, Pay) The presence of the x (cartesian) operator indicates what role a type is to play in the schema, which for clarity‟s sake may be shown as: Global cube Dimension Time Geog Role L L
Local Sales cube
Dimension Product Sales Cost Dimension Employee Rate Hours Pay
Role L C C Role L C C C
Local Employee cube
The formulas defined in the schema are as follows: Intratype formulas: (1) Time.leaf.above = sum (time.leaf) (2) Geog.leaf.above = sum (geog.leaf) (3) Product.leaf.above = sum (Product.leaf) (4) Empl.leaf.above = sum (empl.leaf) Inter-type formulas: (5) profit = sales - costs (6) margin = sales / costs (7.1) pay, time.*, empl.*, geog.* (7.2) pay, time.leaf.*, empl.leaf.*, geog.leaf.* = hrs * rate (8) Rate, locator.leaf.above.* = pay / hrs
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 67/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
There are two data tables that the schema will relate to the type structure, one of product information and one of employee pay information. Table 1: Product sales information
Row 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Geog NYC_1 NYC_1 NYC_1 NYC_1 NYC_1 NYC_1 MA_1 MA_1 MA_1 MA_1 MA_1 NYC MA East Time Jan95 Jan95 Jan95 Jan95 Feb95 Feb95 Jan95 Jan95 Jan95 Feb95 Feb95 Q1-95 Q1-95 Q1-95 Product GIJoe Tonka Barbie Scrabble Tonka Barbie GIJoe Basketballs Risk HotWheels Chess Toys Sport Games Sales 50 80 100 50 30 150 60 90 40 50 40 410 90 130 Cost 15 30 50 25 *** 80 20 30 20 20 10 175 30 55 Profit 35 50 50 25 *** 70 40 60 20 30 30 ? 60 75 Margin 3.3 2.7 2.0 2.0 *** 1.9 3.0 3.0 2.0 2.5 4.0 ? 3.0 2.4
Table 2: Employee-related information. Derived information shown in shaded cells (only some derived rows are shown).
Row 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Geog NYC_1 NYC_1 NYC_1 NYC_1 NYC_1 NYC_1 MA_1 MA_1 MA_1 MA_1 MA_1 MA_1 NYC_1 NYC_1 MA Time Jan-95 Jan-95 Jan-95 Feb-95 Feb-95 Feb-95 Jan-95 Jan-95 Jan-95 Feb-95 Feb-95 Feb-95 Q1-95 Q1-95 Q1-95 Empl Mike Homer Grant Mike Homer Chas Cathy Erin Kara Cathy Erin Kara Mgr Clerk Total Rate 20 12 10 20 12 12 17 10 10 20 10 10 20 11.57 13.27 Hrs 40 40 25 40 30 20 50 40 40 50 40 40 80 115 260 Pay 800 480 250 800 360 240 850 400 400 1000 400 400 1600 1330 3450
12.2.1 Queries that go against one cube as a time Queries: Against this schema, let us examine some queries and their results.
Costs Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 68/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
As elsewhere, would result in all costs across all geography by time by product combinations. No employee values would be involved.
rate, Time.Feb95
results in all rates from Feb95 by geog.* x empl.*
Employee, Time.Feb95, |(Rate < 17)
results in all employees where time is Feb95 and rates < 17 (across all geographies).
12.2.2 Queries that go against both cubes
Hrs, Empl.Mgr, Margin, Product.Toys, Geog.NYC_1, Time.Q195
is a query that (given our schema), specifies values for all locators and will result in {Hrs: 80, Margin: 2.2}
rate, empl.mgr.leaf.*, geog.state.|(margin >= 3.0).store, time.1995
selects the manager rates for stores in all states where the state-level margin is >= 3, at 1995.
Hrs, Margin, Geog.NYC_1, Time.Q195,
results in the set of hours and the set of margins from the Geog.NYC_1, Time.Q195 intersection. Since both hours and margins have a set of additional locators for that location, they are selected too. One very reasonable formula that we could define against these two schemata would be:
unit_profit = (profit, product.all) - (pay, empl.store_empl)
This would implicitly vary by the intersection of the dimensions of profit and pay, namely geog and time. Note how the type „Product‟ and the type „empl.‟ have become a part of the content structure.
12.3 Relating Hypercube and Series cube schemata Imagine a type space that contained stores, time, products, sales, costs, variables and values (i.e., a union of hypercube and series cube types). Keep in mind that the type „value‟ would need to have multiple types of instances (already a problem) namely dollar values and integers. Now imagine two schemas connected to the same data
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 69/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
Schema1 = (stores x times x products)~(sales, costs) Schema2 = (stores x times x products x variables) ~ values In schema 1 if you want to define „margins‟ you would use a formula such as margin = sales - costs In schema 2 it would be Variable.margin,values.integers = (variable.sales,value.$) - (variable.costs,value.$) The reason for this extra clumsiness in schema 2 is that there are multiple data types masquerading as instances of the value type and only integers are applicable to the margin calculation. The term „variable‟ is used because in schema 2, sales and costs are instances not types thus when references it is relative to a type. In practice, commercial OLAP products get around this extra syntax by keeping the value dimension to a single type thereby letting them ignore it. Thus, if all types are IEEE floating-point, then there is only one kind of value and both the variables term and concomitant distinction between variable and value can be ignored, leading to margin = sales - costs Finally, the two schemas need to be equated by creating an equivalency relation between sales, costs and variables and values. (Variables * Values) = (sales + costs) assumes that values has only one type. As an example of this, consider a type space that includes dimensions of stores, time, products, scenario, unit cost, COGS, sales, and units sold. For some application, the domain of the unit costs, COGS, sales, and units sold are all considered to be the same (floating-point number). In schema A, the types are related by
(stores.* x time.* x product.* x scenario.*) ~ (unit costs, COGS, sales, units sold)
while in schema B, the types are related by
(stores.* x time.* x product.* x (unit cost UNION COGS UNION sales UNION units sold)) ~ (actual, budget, forecast)
How are these to be reconciled? In order to treat unit cost, COGS, sales, and units sold as values of a dimension, a dimension needed to be created of which these were all values. Express applications typically call this a „line‟ dimension. In the same way that actual, budget and forecast are scenarios, these cost and
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 70/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
sales factors are “line items.” An abstract type of „line item‟ needs to be introduced into the type space, and each kind of line item needs to be declared as ISA line item. We end up with the following types:
stores (store1, store2, … storen) time (time1, time2, … timen) product (prod1, prod2, … prodn) LineItemType ISA decimal unit costs ISA LineItemType COGS ISA LineItemType sales ISA LineItemType units sold ISA LineItemType LineItemName (LineItemType.inheritors) ScenarioType ISA decimal actual ISA ScenarioType budget ISA ScenarioType forecast ISA ScenarioType ScenarioName (ScenarioType.inheritors)
All line item types inherit a common decimal type, by virtue of the fact that „line item‟ is defined to be a decimal. LineItemName is a type composed of the names of types inheriting from LineItemType All scenario types also inherit a common decimal type. ScenarioName is a type composed of the names of types inheriting from ScenarioType.
Schema A
(stores.* x time.* x product.* x ScenarioName.*) ~ LineItemType.inheritors
Schema B
(stores.* x time.* x product.* x LineItemName.*) ~ ScenarioType.inheritors
In schema A above, all types which descend from LineItemType are related 1-to-1 to the cartesian of stores by time by product by scenario names. Any new types declared as ISA LineItemType will automatically be related as well. Additionally, since the ScenarioName dimension is defined as all types which inherit from ScenarioType, any new types defined as ISA ScenarioType will automatically appear as a locator for these line item types as well. Similarly, in schema B, all types which descend from ScenarioType are related 1-to-1 to the cartesian of stores by time by product by line item names. Changes to the set of types that are declared ISA ScenarioType will automatically be reflected in the set of contents in this schema, while changes to the set of types declared ISA LineItemType will be reflected in changes to the LineItemName locator type.
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 71/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
Each of these schemata maps onto exactly the same set of data. Note that issues of mapping a type structure onto multiple table structures were discussed in section 8.3.3.3. This is the converse issue: mapping multiple type structures onto a single set of data.
Table 1: Financial System records Store Time Product NYC_1 Jan95 GIJoe NYC_1 Jan95 GIJoe MA_1 Feb95 Barbie MA_1 Feb95 Barbie In Schema A, Connect Stores.* Connect Time.* Connect Product.* Connect LineItemType.inheritor Connect Scenario.Actual Connect Scenario.Forecast In Schema B, Connect Stores.* Connect Time.* Connect Product.* Connect LineItemName.* Connect Actual Connect Forecast
Line COGS UnitCosts COGS UnitCosts
Actual 10 5 10 15
Forecast 8 5 10 15
> > > > >! >!
Table1.Column1 Table1.Column2 Table1.Column3 Table1.Column4 Table1.Column5 Table1.Column6
> > > > >! >!
Table1.Column1 Table1.Column2 Table1.Column3 Table1.Column4 Table1.Column5 Table1.Column6
If an additional margin type is defined,
margin ISA LineItemType margin = sales / cost
then it would automatically appear in schema A as a type with the formula
margin = sales / cost
and in schema B as a value of LineItemType with the formula
margin = (lineitemname.sales / lineitemname.cost)
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 72/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
13. Analytical Session
The interactive user scenario described in this section demonstrates the power of the LC approach, which allows the user to dynamically tailor their context to support any query that arises through the analytical thought process. For example, consider the following source dataset:
Row 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Geog NYC_1 NYC_1 NYC_1 NYC_1 NYC_1 NCY_1 MA_1 MA_1 MA_1 MA_1 MA_1 NYC MA East Time Jan95 Jan95 Jan95 Jan95 Feb95 Feb95 Jan95 Jan95 Jan95 Feb95 Feb95 Q1-95 Q1-95 Q1-95 Prod GIJoe Tonka Barbie Scrabble Tonka Barbie GIJoe Basketballs NULL Hot Wheels Chess Toys Sports Games Sales 50 80 100 50 30 150 NULL 90 40 50 40 410 90 130 Cost 15 30 50 25 NULL 80 20 30 20 20 10 175 30 55 Profit 35 50 50 25 NULL 70 NULL 60 20 30 30 235 60 75 Margin 3.3 2.7 2.0 2.0 NULL 1.9 NULL 3.0 2.0 2.5 4.0 2.3 3.0 2.4
Table 1 : Source Data: SALESDATA
Independent of our data set, assume we have defined the following LC Types:
Type Geog Time Prod Sales Cost Profit Margin Description Level Hierarchy with levels of Country, Region, State, City, Store Level Hierarchy with levels of Year, Quarter, Month Products: Linked-Node hierarchy Non-Hierarchy, integer type Non-Hierarchy, integer type Non-Hierarchy, integer type Non-Hierarchy, float type
Table 2: LC Types
Given the above types and source data, we may begin an analytical session. Following is an example of how such a session might progress. Note that when the session begins, we are starting with two entities whose definitions are mutually exclusive of each other: a set of source data and a set of LC types. During the course of our interactive session, we will make various connections between the data and the type to satisfy whatever queries arise. The first thing we have to do is connect our Types to our data source.
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 73/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
LC Command Connect Geog.* > SALESDATA.Column 1 all rows Connect Time.* > SALESDATA.Column 2 all rows Connect Prod.* > SALESDATA.Column 3 all rows Connect Sales >! SALESDATA.Column 4 all rows Connect Cost >! SALESDATA.Column 5 all rows Connect Profit >! SALESDATA.Column 6 all rows Connect Margin >! SALESDATA.Column 7 all rows Comments
Session 1: Connect the Types to the Source Data
Having executed the above statements, we have created a Minimal Schema context. We have not identified any particular types as either Locator or Content. We can now begin our interactive query session. Note that the following ignores representation, although this is covered in the LC document. The “CL” in the left column refers to “Context Level”17.
CL 1 LC Command Geog.Stores, Time.*, Prod.*,Profit.>50 Comments
Select the stores with profits greater than 50 across all time and all products. This query uses Geog as the Content and uses Time, Prod and Profit as Locators.18
2
Sales, Geog.Stores.*, Time.*, Prod.*, Costs.<50
Show all the sales values for all stores across all time whose Costs are less than 50. This query uses Sales as the Content and uses Geog, Time and Costs as the Locators.
2.1
Sales, Time.1995.Quarters.first
This query uses the Rich Schema implied by query 2. We need not specify all Locators in the query, as we are assuming Prod.*, Geog.Stores.* and Costs.<50. This query asks for Sales values for the first quarter of 1995 (implied across all Stores and Products whose costs are less than 50).
2.2
17
Sales, Prod.Toys.down(1)
This query again uses the Rich Schema implied by
The “Context Level” column in the interactive session refers to the context in which the user is operating. In this example, the user starts with a Minimal Schema, and the first query, with Context Level 1, is performed in that context. In general, Context Levels referenced by whole numbers refer to what might be called the “base level” context, or the “initial” context. Query 2.1 is an example of an iterative query making use of a preceding context, 2 in this case.
18
Also note the following about the first query. Time, Prod and Profit are used as Locators, meaning instances on these types will be used as keys in the query. Therefore, rows 5, 7 and 9 from the dataset will be invalidated before the query is processed, as they contain NULL values in what we‟ve designated as key columns. In query 2, rows 5 and 9 will be invalidated, and the value for Sales in row 7 would be interpreted as “Missing.” Version 0.93, last saved 1/23/2009 7:29:00 PM Page 74/80
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
CL LC Command
query 2. It asks for the sales values for all children of Toys (implied across all Time and all Stores whose costs are less than 50)
Comments
3
Prod, Margin>2.0, Costs<20, Geog.Stores.*, Time.*
This query reverts to a minimal schema. It uses Prod as the Content and uses Margin, Costs, Geog and Time as Locators. The query is asking for all products with a Margin greater than 2.0 and Costs < 20 across all stores and all time.
3.1
Prod.Games.under
This query uses the Rich Schema implied by query 3 above. It asks for all products under the Games instance in the hierarchy (using the implied Locators of all time across all stores with costs less than 20 and margin greater than 2.0).
3.2
Costs, Time.Q1-95
This query uses the Rich Schema implied by query 3 above, but it uses a Locator as Content, thus modifying the context further.19
Session 2: Interactive Query Session
Assuming we have access to a second source table called ADVERT that tracks advertising expenses by city and time:
Row 1 2 3 4 5 6 7 8 9 Geog NYC NYC NYC Albany Albany Albany Boston Boston Boston Time Jan95 Feb95 Mar95 Jan95 Feb95 Mar95 Jan95 Feb95 Mar95 AdvertCost 100 110 90 80 70 75 85 88 85
Table 3 : Advertising Table: ADVERT
Now, we can introduce a new type and begin using it in our session right away:
CL LC Command AdvertCost = decimal Connect AdvertCost !< ADVERT.Column 3 all rows 4 AdvertCost, Profit, Cost, Prod.*, Time.*, Geog.Cities.* Comments
Declare a new type and connect it to source data. This query uses our new Type in a minimal schema: Select all
19
Note in this case that the data set of possible results is first constrained by the query at context level 3, namely it considers only Geog instances at the Store level, Margins greater than 2.0 and Costs less than 20. Now, at context level 3.2, we start with that constrained data set and further constrain it to a Time instance of Q1-95. From the constraint imposed at context level 3, there will be no Costs in the result set of 3.2 with a value of >= 20. The results are potentially different than if the query at 3.2 had instead been issued at the “base level” of context. Version 0.93, last saved 1/23/2009 7:29:00 PM Page 75/80
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
AdvertCosts, Profits and Costs across all products, all time and all cities.
4.1
Profit, (AdvertCost/Cost > .5)
Uses Rich Schema implied above (Locators of Prod, Time, Geog) to get the values of Profit where advertising costs are more than 50% of costs.
Session 3 : Interactive Session Continued
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 76/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
14. Conclusion
In part one, this document showed that there is not currently a standard OLAP model, that different products use different implicit models, that relative to an ideal logical model for OLAP, each product-based model is limited in some way, and that neither the Relational model, nor canonical math and logic do an adequate job of grounding OLAP.
In part two, the LC model for OLAP was introduced. It was shown that a functional approach to the relationship between metadata and data wherein the distinction between dimensions and variables is functionally rather than referentially based provides more flexibility and conciseness of expression than referentially-based approaches. (And the end documents describe how the foundations of the LC model can also support a new mathematics and logic consistent with the approach taken by Wittgenstein in the Tractatus Logico-Philisophicus.) Specifically, the functional approach embodies a powerful method for modeling multidimensional data that provides a systematic way of computing with sparse data, of providing for ways that a single data set can be simultaneously connected to multiple schemata, of expressing hierarchical references using both a parent/child and a level-based taxonomy, and of providing a unified calculation environment, all while maintaining a strict economy of terms. In part three, the query and calculation syntax for the LC model was presented and discussed. Building on the type referencing of the previous section, it was shown that the proposed query syntax has far fewer terms than other syntaxes while maintaining human and parser readability. In addition it was shown how the LC schema facilitates the process of connecting a representational framework to a data set, how it supports nested, iterative and multidimensional queries, and how it supports generalized tokens within a single and multicube environment.
1.1
i
Overview of L-C logic
The below-described MD-relevant features of an L-C logic are 1. it distinguishes positional and resolutional adjacencies in dimension structures, 2. it handles sparsity by incorporating a mechanically-grounded model for distinguishing propositions from non-propositions, and works only on valid propositions, and 3. it supports omni-directional data flows. 1.1.1 L-C Dimension Structures L-C logic articulates a principle for ordering propositions that translates into the MD world as a primitive topology for dimensions. This principle is grounded in the primitive concept of adjacency. There are two primitive types of adjacency: position such as between two cities or
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 77/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
persons or between two measurements made in meters, and resolution such as between cities and a state or measurements made in meters with measurements made in angstroms. Traditionally, canonical logic has ignored these distinctions, since it claims to treat only the combinatorial connections between propositions (and, or, not, etc.), irrespective of their meaning. And surely concerns about “adjacency” must belong to their semantics, or meaning. However, the logical treatment of two propositions p and q cannot be the same when the meaning of q is part of the meaning of p, as when the two are inferentially independent. This is more evident in the case of predicate logic, which treats specifically of the cardinality relationships between sets and members. According to the L-C way of thinking, to say that something is a part of something else (in the resolutional sense of “part”) is to express a tautology. Similarly, the relationship between an operation and its results (as used, positionally speaking, to express a series of integers) is also tautologous. And tautologies are precisely the domain of logic. Ignoring the relevance of adjacency means the logician must abstain from specifying any criteria for well-formedness (i.e., the use of those p’s and q’s in real world situations). Note how this parallels the MD distinction between hierarchical (or root - leaf) relationships such as parent-child, and non-hierarchical (or branch - branch) relationships such as inter-sibling. The combination of position and resolution adjacencies constitutes a minimally sufficient set of types for creating any possible specification or statement. In other words, any adjacency relation is determined by mixing from the two “pure” adjacency types (wherein, "pure" position measurements are resolution-specific, while "pure" resolutional measurements are position-specific) in the same way that a location on the globe is determined by mixing from a pure set of longitude, latitude and altitude measurements. 1.1.2 Meaning and Sparsity Traditionally, the term “proposition” has been used in one way or another to indicate the “meaning” of a sentence, following on the observation that two or more sentences may mean the same thing, or (though less often recognized) that the same sentence can have two or more meanings. However, the claim that a consistent logic and a viable theory of representation somehow follow from a theory of propositions is not widely acknowledged. Canonical logic treats of sentences, propositional tokens or signs, well-formed formulae, character strings and propositions indiscriminately (according to various definitions of well-formedness). Historically, the sentence/proposition distinction has only been invoked in the evident cases where a sentence does not uniquely specify a proposition, or an apparently well-formed sign doesn't yield a proposition at all. One reason is that logicians are inclined to believe meaningfulness can be ascertained from the sign itself - else by what criteria would we call a system formal? A second reason is that the very notion of proposition has never been well-defined. Aristotle characterized logic within the context of an affirming/denying game (the
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 78/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
"dialectic," in its original sense), and defined propositions as the primitive units of this game. He further diagnosed a certain compositeness of type as their defining character, distinguishing that which an assertion was asserting from that of which the assertion was being made. Yet since Aristotle equated these types with the grammatical distinctions of predicate and subject respectively, he thereby sanctioned the subsequent focus on indicative sentences. Logicians ever since have been devising ways of preserving that focus. Bertrand Russell, who early on recognized that grammatical form does not always indicate logical form, is a case in point. His theory of descriptions was an attempt to uncover the "hidden" logical form of sentences whose grammatical subject no longer exists, and so maintain their place in the truth-functional calculus-otherwise, well-formed but meaningless (i.e., non-truth-functional) sentences might infect the calculus as a whole. The insight behind Tractarian logic is that well-formedness on the page, however it is defined, does not sufficiently determine meaning. In contrast to the referential approach in Canonical logic, where both the "grammatical" Aristotelian and "logical" Russellian subjects and predicates are distinguished referentially (i.e., by reference to individual substances and universals, for the former, and to individuals of acquaintance and sense-data-properties for the latter), the regulating principle of an L-C logic is a "functional" distinction of types. To this end, it distinguishes two primary or "prototype" functions necessary and sufficient for determining a proposition: location specification and content specification. (For a detailed treatment of these functions please see Thomsen and Shavel [90].) [N.B. The terminology is somewhat reversed in logical parlance, where the term “variable” is usually reserved for the locator (the x of a propositional function f(x)). In an L-C logic, however, both prototype functions ( content and location) are expressed by variables ( the f and x respectively of quantified expressions).] The prototype- specification has quite a lot to say about which sentences enter the calculus, regulating the possible substitutions for the variables p, q, r ... in a logical schema. This is of no small consequence since, even as canonical logic recognizes, not all well-formed sentences determine propositions. The self-referentially interpreted, „This sentence is false‟ (which if true, is as it claims, false; but if false, is thereby true) is a case in point. It is on all accounts well-formed; but if allowed into the calculus, it would undermine the "complementarity" requirement of truth and falsity. Both L-C and canonical logic, then, begin with the insight that certain well-formed sentences do not take a truth value. L-C, however, reasons about these sentences without having to reason with them; that is, without bringing them into the calculus. Because sentences are themselves located contents (as perceptible facts: character strings, tokens, etc.), it is possible to mechanically parse their representational forms from the outside. One distinguishes the genuine from the putative propositions by attempting to make the terms perform their prototype functions. Within a two-valued propositional calculus, then, the only "presupposition" is that propositions alone are treated; and it is expressed in the following manner: propositions occur in other propositions only as the "arguments" of truth-functions. „P is false‟ is identical to the truth-function ~p (or not p). So, even in the case of „This very proposition is false', however
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 79/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139
The LC Model for OLAP
strongly the phrase „this very proposition‟ insists that it is a proposition, it fails the L-C compositeness query. Nor can one retreat to the claim that the phrase "names" a proposition, namely „this very proposition is false', because this string of tokens has to submit to the same query - what proposition? Neither „this very proposition‟ nor the entire „this very proposition is false‟ are (on the standard reading of these terms) possible values of p in „p is false‟ (that is, neither can be its own truth-argument in the logical formula ~p). In L-C logic, then, meaninglessness never infects a logical formula, since it never enters the calculus. Despite appearances, „If the square root of 2 is blue, then Bill is in the bar‟ is not an instance of „if p then q'. It is not the logical formula that is called into question (as if its applicability were no longer global), but the substitution of a non-proposition for the propositional variable p. And if an element of a proposed molecular compound fails the test, the element may be eliminated, generally without having to delete the entire expression.Thus, L-C logic takes a firm stand on the issue of invalid propositions, and can serve as a useful support for the OLAP (and relational) requirements of logical sparsity handling (as it did in chapter 7). 1.1.3 Data flows The position and resolution structuring of the L-C dimensions relativizes all statement locations, and paves the way for data to flow from anywhere to anywhere. Allocations and inferences are as natural as aggregations. In contrast, Canonical logic, as evidenced by Russell‟s empiricism, his logical atomism and his Theory of Descriptions (a decompositional approach to transforming ordinary language into collections of quantified logical statements and assertions of primitive sense data), tacitly assumes that data enters at some base level and then propagates upward.
Version 0.93, last saved 1/23/2009 7:29:00 PM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Page 80/80
Copyright 1998 Dimensional Systems, all rights reserved 222 Third St. suite 3122 Cambridge, MA 02139