									  Enabling Eco-Science Analysis with MatLab and DataCubes in the Cloud

                                  Jayant Gupchup† and Catharine van Ingen*
                          Computer Science Department, The Johns Hopkins University†
                                             Microsoft Research*



Abstract: The ecological sciences are rapidly becoming data intensive sciences. Several groups have been pioneering the use of databases, datacubes, and web-services to address some of the data handling challenges caused by the avalanche/tsunami/flood of data. Science happens only when the data are actually analyzed and today that very often happens with one of the common scientific desktop analysis tools such as Excel, MatLab, ArcGIS, or SPlus. The challenge then is how to connect the data in the cloud to the analysis tool on the desktop without requiring full data download. This article describes our prototype connection between one such service and one such tool. We describe how this approach can be generalized across a number of different science questions and tools. We also explain why this is a good solution from a scientist’s perspective.

1 Introduction

The combination of rapid advances in sensor technology, remote sensing, and internet data availability is causing a dramatic increase in the amount and diversity of ecological science data. At the same time, scientists are collaborating to attempt regional or even global scale studies. These analyses require mixing time-series data from different sources with site property or other ancillary data from still different sources.

Using a database to assemble and curate such data collections has been documented in depth elsewhere [OZER], [SDSS], [CUAHSI-ODM]. At the Berkeley Water Center [BWC], we have been building a number of related environmental datasets. We describe one of these datasets and the kinds of analyses commonly performed. We then describe the important aspects of our databases and datacubes.

Our focus here is on how to enable common data analyses using tools already in use by scientists. We chose to use MatLab for two reasons [MATLAB]. First, a number of our scientists use it. Second, we also wanted to use it for simple visualizations which are tedious to do in Excel. We describe how we have connected MatLab on the desktop over the internet to one of our datacubes. The approach can be generalized to connect other tools to our family of datacubes to give our scientists a choice in analysis tools. We believe this has implications for other researchers exploring how to enable ecological scientists.

1.1 The Ameriflux carbon-climate data set

The Ameriflux network [AMERIFLUX] is a scientific collaboration of over 50 institutions across America and operates approximately 120 measurement sites. Each site provides continuous observations of ecosystem level exchanges of CO2, water, energy and momentum spanning diurnal, synoptic, seasonal, and inter-annual time scales. Ameriflux is one of several regional networks that together form the FLUXNET global collaboration.

Each Ameriflux tower site contributes 22 common measurements to the Ameriflux archive at ORNL. The ORNL archive works with researchers in the sister CarboEuropeIP network to produce science-ready data products which are gap-filled, quality assessed and contain additional computed variables.

The data are used to understand how climate interacts with plants at a systems level to influence carbon flux and global warming. In the past, such studies have been primarily individual site investigations. Today, regional and global analyses are being attempted. At the same time, the data are also used by other non-field scientists to provide ground truth for climate modeling efforts and satellite-based remote sensing data.

Carbon-climate data is similar to many other environmental data sets in the following ways.
• The data has strong temporal characteristics: understanding diurnal, seasonal, long term changes and other time variations is important to the science.
• The data has important spatial characteristics. For example, micro-climate is affected by latitude, longitude, and proximity to the coast.
• There are strong and weak correlations between the observed and computed variables. Understanding those relationships, such as the change in leaf production as the result of temperature and precipitation correlations, is at the center of the science.
• Analysis of the time series data often requires knowledge of other site parameters such as vegetative cover or soil composition or site disturbances such as fires, floods, or harvests.

Likewise, the kinds of analyses performed are similar to those in other environmental sciences. Examples include:
• Look for trends or changes in variables outside of the common diurnal and seasonal fluctuations.
• Look for changes in variables after a relatively rare event or disturbance such as a flood or fire.
• Look for similarities and differences in variables across sites of similar characteristics (such as tropical rainforests).
• Integrate with maps.

These characteristics are very common to other ecological sciences such as hydrology, oceanography, and meteorology.

1.2 Scientific Datacubes

The what-where-when nature of time series data drives much of our database schema and datacube dimensions. A datacube is a database specifically optimized for data mining or OLAP [GRAY1996], [SSAS]. Datacube abstractions include:
• Simple aggregations such as sum, minimum, or maximum can be pre-computed for speed
• Hierarchies such as year to day of year to hour of day can be defined for simple filtering with drilldown capability
• Additional calculations such as median or standard deviation can be computed dynamically or pre-computed
• All operate along dimensions such as time, site, or measurement variable
Datacubes can be constructed from relational databases using commercial tools.

Figure 1. Example environmental datacube dimensions. Common dimensions include what (datumtype and exdatumtype), where (site and offset), when (timeline), which (WorkingGroup), and how (quality).

As shown in Figure 1, our cubes have five common dimensions:
• What or datumtype and exdatumtype: measurement variable such as precipitation or latent heat flux. Because of the large number of variables we handle, we sometimes parse the variable as a primary variable (datumtype) and one or more extensions (exdatumtype). Most analyses need only the primary variable.
• Where or site and offset: (x, y, z) location where (x,y) is the site location and (z) is the vertical elevation at the site. The site dimension also surfaces important site attributes such as climate classification or vegetative cover to allow locations to be grouped or filtered along those characteristics. The site dimension includes hierarchies such as latitude band which enable drilldown as well as grouping.
• When or time: timeline. This dimension allows aggregation across different time granularities such as day of year or hour of day. We also build a number of hierarchies to enable drilldown in time from decade to year through to minute. Some of our cubes
  also include science-specific time attributes and hierarchies such as water-year or MODIS-week.
• Which or working group or dataset: data versioning or other collections such as “all Boreal forest sites” or “real-time data” useful for analyses. As shown in Figure 1, this is a many-to-many dimension: a given site can be a member of multiple datasets.
• How or quality: This dimension varies the most across our cubes, although all have some notion of data quality. This may include spike detection, gap-filling, or other data “goodness” metric.

We’ve been including a few computed members in addition to the usual count, sum, minimum and maximum:
• hasDataRatio: fraction of data present across time and/or variables. This measure includes both original data and any gap-filled data.
• DailyCalc: average, sum or maximum depending on variable and includes units conversion
• YearlyCalc: similar to DailyCalc
• RMS or sigma: standard deviation or variance for fast error or spread viewing
• gapPercent: percentage of contributing data that is either missing or has been gap-filled.

Datacubes are queried by the multidimensional query language MDX [MDX], [MDXTutorial]. MDX is similar to the SQL query language but has some prominent differences. A SQL query returns an array; each column relates to one query element (e.g. time, datumtype, site). An MDX query returns a matrix with a notion of a column axis and a row axis; each cell relates to two or more elements. Each axis can contain one or more dimensions or attributes. Thus each axis can be viewed as a join of all the dimensions on that axis.

1.3 Datacube Clients

Recent years have seen considerable growth in the attention given to simple access to datacubes. Most of the resulting tools are GUI-based and intended for business applications. Tableau [TAB], Proclarity [PRO], and Cognos [COG] are three such business applications which provide a GUI and additional analysis features.

At present, the most common way of accessing a datacube is the Excel PivotTable [EXCEL]. Excel PivotTables allow you to set up a connection with the datacube and then browse and select the data using a drag and drop type mechanism. The MDX query is generated by Excel and passed over an OLEDB connection.

Figure 2 shows how MDX queries are rendered in

    SELECT
      NON EMPTY CROSSJOIN
      (
        {[Datumtype].[Datumtype].[Datumtype]},
        {[Site].[IGBPClass].[IGBPClass]}
      )
      DIMENSION PROPERTIES PARENT_UNIQUE_NAME ON COLUMNS,

      NON EMPTY CROSSJOIN
      (
        {[Timeline].[Year].&[2003]},
        {[Timeline].[day].[day]}
      )
      DIMENSION PROPERTIES PARENT_UNIQUE_NAME ON ROWS

    FROM [LatestAmfluxL3L4Daily]
    WHERE ([Measures].[Average])

    Figure labels: COLUMN AXIS: Data (DIMENSIONS: variables and sites); ROW AXIS: Time (DIMENSIONS: Year, day); Aggregate Measures.

Figure 2: Rendering of an MDX query in Excel. The various fragments of the query and the rendering are marked in the same box style (background color and font) to make it easier to identify the mapping.
Excel. The PivotTable columns correspond to the MDX query columns; the PivotTable rows correspond to the MDX query rows. The returned measures populate the PivotTable array.

Despite its ease of use, Excel has a number of restrictions from a scientist’s viewpoint.
• Excel PivotTables have limited plotting capabilities. To make a scatter plot, you must cut-paste the data from the PivotTable, thereby losing the ability to update the data via query.
• Excel does not have a scripting feature. Scientists often make collections of very similar graphs, for example to look at different variables across sites. To graph each column in a returned PivotTable requires a lot of tedious select-cut-paste.
• While Excel includes some scientific libraries such as histograms or Fourier transforms, the selection is not as wide as in tools intended for scientists such as MatLab. The libraries are also not well integrated with PivotTables, again likely leading to fragile cut-paste.

These same limitations apply to the above commercial tools as well. These tools also suffer from the difference between scientific graphics and business graphics: the colors, shapes and axes labeling are foreign to scientists. Familiarity is important to scientists. At a minimum, the difference means that the plot must be repeated with another tool prior to publication.

Our preliminary survey suggests that pairing a datacube with a rich scientific client application should offer the best of both. The datacube provides simple slice and dice to aggregates; the rich client provides scripting, familiar graphics and powerful analysis libraries.

2 System Overview

The components of our solution are shown in Figure 3. This section explains each as well as identifying those which can be reused with other clients or datacube structures.




[Figure 3 diagram. Components: MatLab, GUI Builder (with Menu Config and Cube Config), MDX Field Picker, Query Builder, Deserializer, Handle Manager, Web Server, Serializer, ADO MD. Labeled flows: Command; GUI selections; Fields and Filters; Credentials and MDX Query (HTTP); Authenticated Credentials and MDX Query; Input MDX, Output Results; Serialized Results; Results Object; Column Index; Handles and Column Names.]

Figure 3: System Architecture
2.1 GUI Builder

The GUI Builder allows the scientist to select the dimension attributes, hierarchy levels, and measures to be retrieved for inclusion in the analysis.

As shown in Figure 4, the GUI is divided into 2 major panes. The Field Axis acts as the column axis whereas the Time Axis acts as the row axis. The Field Axis supplies the what-where-which; the Time Axis supplies the when.
• “What” is determined by the “Select Datum” box. In Figure 4, “LE” (latent heat flux) is selected. Multiple datums can be selected by control-clicking.
• “Where” is chosen with the “Select Groupings” box and the associated “Drill down sites” check box. Latitude and longitude bands are common selection criteria. If the drill down sites check box is selected, data are returned for each site within the latitude bands; if the check box is not selected, data are aggregated across the band.
• “Which” is chosen with the “Dataset” box and the associated “Use dataset filter” check box. By default, all data are included in the returned aggregate. If the check box is selected, only the selected highlighted datasets are included in the query.
• “When” is selected in the “Time Axis” pane. The time range is selected by the start and stop years. The hierarchy to be used and the depth of the hierarchy to be traversed are selected in the Select Time box.
• The data aggregate is chosen in the “Select Measure” box.

Note that the interface does not support setting a filter on a date-time window. This is a limitation of MDX. There is no construct that allows specification of months 04-12 for 1999, and 08-12 for 2006 and full months for the years in between. We chose to set a filter at the year granularity.

The contents of each menu are determined by configuration files. For example, each entry in the “datum.txt” file is entered on a new line, and each entry is of the format <alias, MDX representation>. The aliases are shown in the GUI box and the MDX representations are used by the MDX Field picker to create the lists that are passed to the Query Builder module. As an illustration, the entry for the LE datum shown in Figure 4 is:

  LE,[Datumtype].[Datumtype].&[11]
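The alias-to-MDX lookup described above can be sketched in a few lines. This is a minimal illustration written in Python rather than the prototype's MatLab; the function names `parse_menu_config` and `mdx_set` are hypothetical, and the `Precip` entry is assumed for illustration (only the `LE` line is given in the text).

```python
def parse_menu_config(text):
    """Parse '<alias>,<MDX representation>' lines into an alias -> MDX mapping."""
    entries = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first comma only, so commas inside the MDX part survive.
        alias, mdx = line.split(",", 1)
        entries[alias.strip()] = mdx.strip()
    return entries


def mdx_set(entries, selected_aliases):
    """Build an MDX set expression {m1,m2,...} from the aliases picked in the GUI."""
    members = [entries[alias] for alias in selected_aliases]
    return "{" + ",".join(members) + "}"


# Hypothetical datum.txt contents; the Precip line is an assumption.
sample = """LE,[Datumtype].[Datumtype].&[11]
Precip,[Datumtype].[Datumtype].&[19]
"""

datums = parse_menu_config(sample)
column_set = mdx_set(datums, ["LE", "Precip"])
print(column_set)
```

Keeping the MDX member names in a text file means a new cube or variable can be exposed in the menus without touching the client code, which appears to be the intent of the configuration files.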




   Figure 4. GUI Builder. The GUI exposed by the GUI builder is used to select the what-where-which-when.
Note also that the prototype does not include the “how” or quality dimension.

2.2 MDX Field Picker

The MDX field picker module provides the primary MatLab programming interface. The field picker invokes the GUI Builder, passes the obtained user selections to the Query Builder, and returns the results object. To invoke the GUI to make selections, the MDX field picker is invoked by:

[v1 v2 v3 res] = MdxFieldPicker();

where v1 v2 v3 are GUI variables and res is the returned results object. After the query parameters are selected, the user hits submit to exit the GUI and the above call returns. To retrieve the results, the field picker is invoked a second time:

[v1 v2 v3 res] = MdxFieldPicker('MdxFieldPickerOutputFcn', v1,v2,v3);

2.3 Query Builder

The Query Builder module builds the MDX query based on the GUI Builder selections. The Query Builder module accepts as input:
• List of groups (sites) and datums
• Time hierarchies (Year to day, etc.)
• Filters: time range filter and dataset-filter
• Variable Measure(s)

The selected datum(s) and group(s) form the column axis. The Query Builder looks at the number of dimensions needed and then cross-joins dimensions as necessary. Similarly, the row axis is generated from the time range and hierarchies. The SELECT clause is then constructed by combining the row axis MDX and the column axis MDX. The FROM clause is specified in the cube configuration file. The measures and dataset filters are used to generate the WHERE clause. Finally, the clauses are combined to complete the MDX query.

Our prototype Query Builder can generate queries where each axis can have up to 3 dimensions. This was chosen for coding simplicity and accommodates our family of related eco-datacubes.

2.4 Serializer (Cube Access)

The Query Builder invokes the ASP-based web service Serializer by HTTP post. The Serializer unpacks the post, passes the query to the datacube and then produces a results stream. An example post is below.

http://<xxxx>/mdxconnect/Default.aspx?db=LatestAmfluxL3L4Daily&mdx=SELECT%20%20NON%20EMPTY%20CROSSJOIN%20({[Datumtype].[Datumtype].%26[1],[Datumtype].[Datumtype].%26[19]},{[Site].[Site].&[477]})%20%20DIMENSION%20PROPERTIES%20PARENT_UNIQUE_NAME%20ON%20COLUMNS,%20%20NON%20EMPTY%20{[Timeline].[Year].%26[1990]:[Timeline].[Year].%26[2006]}%20%20DIMENSION%20PROPERTIES%20PARENT_UNIQUE_NAME%20ON%20ROWS%20%20FROM%20%20LatestAmfluxL3L4Daily%20%20WHERE%20%20([Measures].[YearlyCalc])

The natural question is why the results need to be produced as a stream. Excel and other ADO [ADO] compatible applications can talk to the cube using the OLE DB (ADO MD) drivers. The OLE DB driver maintains the relationship of each returned data cell with the associated 2 or more dimensions. After much investigation, we found that no such driver exists for environments that cannot handle ADO objects; MatLab is one such environment. In order to solve this problem, we made use of the underlying structure of an MDX result: we serialize the results in a manner that can be reconstructed at the client end. We:
• convert the query results into a stream using the ADO MD driver
• convert that stream to a text stream
• pass that text stream over the internet
• reconstruct the stream to an object that maintains the cell-dimension(s) association on the client.

The organization of the stream is as follows. The first 2 numbers represent the number of rows and columns. This is followed by the number of dimensions on the column axis followed by the actual column dimension attributes. Based on the number of columns and number of dimensions on the column axis, we can write all the column-dimension attributes. Next we write the number of dimensions on the row axis followed by the row-dimension attributes. Again, as done with columns, by combining the information of number of rows and number of dimensions on the row axis, we can write in the row-dimension attributes. After reading
the dimensions on the row and column axis, we           access      the    datacube      using    the     NT
write the data matrix [row X col].                      AUTHORITY\NETWORK SERVICE account. We
                                                        realize that basic authentication is not a long term
As an illustration, consider the results of the MDX     solution as the credentials are encoded as Base64 in
query.                                                  clear text and can be decoded quite easily [BASIC].
                                                        This does, however, demonstrate that some level of
security can be achieved. Basic authentication also prevents web-crawlers and robots from accessing the data and over-loading the system.

SELECT
NON EMPTY CROSSJOIN
(
       {[Datumtype].[Datumtype].&[11],[Datumtype].[Datumtype].&[19]},
       {[Site].[SiteID].&[477],[Site].[SiteID].&[480]}
)
DIMENSION PROPERTIES
PARENT_UNIQUE_NAME ON COLUMNS,
NON EMPTY
       {[Timeline].[Year].&[2000]:[Timeline].[Year].&[2006]}
DIMENSION PROPERTIES
PARENT_UNIQUE_NAME ON ROWS
FROM
       LatestAmfluxL3L4Daily
WHERE
       ([Measures].[YearlyCalc])

The result of this query is as follows:

6,4,2,LE,US-Ton,LE,US-Var,Precip,US-Ton,Precip,US-Var,1,2001,2002,2003,2004,2005,2006,46.1471300802596,NaN,NaN,14.022134347994,NaN,38.6128757119495,NaN,33.8220144215576,NaN,NaN,NaN,81.4135755203902,87.5986066925887,44.6077928524156,267.782040508823,116.888838782413,267.167004732928,295.245106825869 ...

Reading the stream from the start: the first two numbers give the number of rows and columns in the result, so there are 6 rows and 4 columns. The third number gives the number of dimensions on the column axis. In this example there are 2 dimensions on the column axis and 4 columns, so 2*4 = 8 column attributes follow (LE,US-Ton through Precip,US-Var). The row axis comes next: with only one dimension and 6 rows, there are 6 attributes (2001 through 2006). Lastly the data [Row X Col] are written.

2.6 Deserializer

The results stream is deserialized by reversing the serialization steps. We construct a MatLab object that associates the cells with the dimensions. The pseudo-class is represented as follows:

Struct MdxResults
{
      Integer        : Number of Rows
      Integer        : Number of Cols
      Integer        : Number of dimensions on Col axis
      Integer        : Number of dimensions on Row axis
      Struct Axis[2] : Axis structure
      Double[,]      : Data
}

Struct Axis
{
      String [Number of Dimensions in axis][Number of attributes in each dimension] : Header
}

The MatLab object, res, that implements the above results structure is shown below:

res =
    rows: 27
    cols: 37
    dim1axes: 2
    dim2axes: 1
    axis: [1x2 struct]
    data: [27x37 double]

The MatLab user has access to this structure and the query results, without having to construct the MDX query.
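The stream layout described above can be decoded with a short routine. The following is a minimal sketch in Python rather than the prototype's MatLab; the function name `deserialize` and the dictionary keys are hypothetical. It assumes that column attributes are grouped per column (as the example stream suggests) and that the data matrix is written row by row, neither of which is stated explicitly in the text. The sample stream uses made-up values.

```python
import math


def deserialize(stream):
    """Rebuild counts, axis headers and the data matrix from the text stream."""
    # Tokens are comma-separated; stray whitespace and empty tokens are dropped.
    tokens = [t.strip() for t in stream.split(",") if t.strip()]
    pos = 0

    def take(n):
        """Consume the next n tokens, failing loudly if the stream is too short."""
        nonlocal pos
        chunk = tokens[pos:pos + n]
        if len(chunk) != n:
            raise ValueError("stream ended early")
        pos += n
        return chunk

    rows, cols, col_dims = (int(t) for t in take(3))
    # Assumption: col_dims attributes for column 1, then column 2, and so on.
    col_header = [take(col_dims) for _ in range(cols)]
    row_dims = int(take(1)[0])
    row_header = [take(row_dims) for _ in range(rows)]
    # Assumption: data are written row by row; "NaN" parses to a float NaN.
    data = [[float(t) for t in take(cols)] for _ in range(rows)]
    return {
        "rows": rows,
        "cols": cols,
        "col_dims": col_dims,
        "row_dims": row_dims,
        "col_header": col_header,
        "row_header": row_header,
        "data": data,
    }


# Illustrative stream with made-up values: 2 rows, 2 columns, 2 column dimensions.
sample = "2,2,2,LE,US-Ton,LE,US-Var,1,2001,2002,1.5,NaN,2.5,3.5"
res = deserialize(sample)
print(res["col_header"], res["row_header"], res["data"])
```

Because the counts come first, the reader never has to guess where the headers end and the data begin, which is what lets a plain text stream carry the cell-to-dimension association.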
                                                        A typical MDX result contains many dimensions
Access to the Serializer is secured with                and attributes associated with those dimensions. As
HTTP Basic Authentication [HTTP] and a dedicated        such, we need a mechanism that enables the MatLab
machine-local no-login account. The Serializer then     user to make the column-attribute association using
some form of a search. The Handle Manager is that        The user can also retrieve the columns with names
mechanism.                                               containing “US-Var” (columns 3 and 4), “LE”
                                                         (columns 1 and 3) or “US-Var” and “LE” (column
2.7 Handle Manager                                       3).

The Handle Manager associates the datacube               index = find_dims(hm,'US-Var','LE')
dimensions and attributes with the returned results
columns. The Handle Manager is invoked by:               3 Conclusions

Hm =                                                     We have had a great deal of interest in our prototype
  handle_manager(res.axis(1).dim)                        from our colleague scientists. We are still very early
                                                         in getting experience using the connection. One
Consider an MDX query with 3 dimensions on the           unexpected benefit is that many of our scientists
column axis each of which has 10 attributes              have non-Windows desktops. Macintosh Excel
associates. The total number of columns in the           PivotTables does not support datacube access, so
result set will be 10X10X10 = 1000 columns. The          MatLab is the most accessible access.
Handle Manager provides a MatLab user friendly
way to find the right two columns for a scatter plot.    The scripting facility and improved rendering
                                                         facility is already helping us. A collection of plots
The prototype Handle Manager provides two                from one of our Russian River hydrology cubes is
mechanisms to make the association. The user can:        shown in Figure 6 on the page following. The upper
    provide the column number (index) and get           right pane is a simple time plot of two variables
        back the fully qualified name of the             (discharge and turbidity). The upper left pane show
        column by concatenating the attributes           the results of an FFT (Fast Fourier Transform). This
        along different dimensions.                      can be done with Excel, but requires careful cut-
    provide the attribute names and obtain the          paste which is not updated across PivotTable
        indices (or handles) at which those              changes. The lower pane shows a color-coded plot
        attributes are found. The name can be            of discharge as a function of site (aggregated by the
        either partially or fully qualified.             drainage area property) in 2003.This sort of plot is
                                                         often used by our scientists and is not possible with
To further illustrate this point, consider Figure 5 to   Excel.
be the output of a small, simple MDX query. There        Our solution is also faster than Excel over
are two site (US-Ton and US-Var) and two                 potentially slow lines to a scientist desktop, Excel
datumtypes (LE and Precip).                              uses a SOAP-based approach; the XML headers
                                                         make the result bulkier than our text-based
                                                         approach.

                                                         As the amount of data returned by the query gets
                                                         large, the performance can become sluggish. This is
                                                         a combination of the time necessary to retrieve the
                                                         data the network transport time, and the scaling of
                                                         MatLab when handling large amounts of data. The
                                                         good news is that the datacube approach can
                                                         postpone that slowdown when the analysis is not at
                                                         the leaf nodes of the hierarchies. The datacube can
Figure 5: Sample Result set of an MDX query.
                                                         precompute the aggregate and only those aggregates
Yearly values of two datumtypes (LE and Precip) are
                                                         need to be passed to the desktop application and
returned for two sites (US-Ton and US-Var) for the
                                                         handled by that application.
years 2001 through 2006.
                                                         Of course, we are describing only a prototype. Our
To discover the contents of column 3, the user can
                                                         query generator cannot handle more that 3
retrieve the fully qualified column name “US-
                                                         dimensions on an axis. Thus, the maximum number
Var_LE”.
                                                         of dimensions that the query generator can accept is
Header = get_header(hm,3)                                6 (3 on column axis and 3 on row axis). This is not a
                                                         limitation for our cubes, but could be in the future.
Figure 6: Example MatLab generated plots from our Russian River cube. The lower color coded plot of
discharge in 2003 is not possible to create with Excel.

We have also not attempted to include the (very widely varying) quality
dimension. Lastly, we are using only basic authentication.

4 Future Work

Near term, we want to convert the prototype to an easy-to-deploy technology
artifact. We need to add support for selecting a datacube, including
specifying credentials, and for menu configuration files; we would also like
to move to HTTPS [HTTPS].

Our scientists have asked for a command line interface in addition to the GUI.
They have also suggested returning an n-dimensional array rather than using
the Handle Manager; that would be more intuitive to them.

We need to consider how to abstract the differing quality dimensions across
our data sets; this is much more of a user model than GUI or query generation
question. Lastly, we have some performance testing to do on our generated
queries given the CROSS JOINs; we have demonstrated feasibility and
correctness, but not optimal coding.

5 Acknowledgements

We would like to acknowledge the valuable contributions made by Deb Agarwal,
Monte Goode, Matt Rodriguez, and Robin Weber of the Berkeley Water Center, in
getting the data ready and testing various modules during our development and
deployment. We would also like to thank Rebecca Leonardson, our first user,
for many terrific suggestions. As always, we rely on Stuart Ozer for his
continued datacube wisdom.

6 References

[ADO]: ActiveX Data Objects (ADO), a language-neutral object model that
exposes data raised by an underlying OLE DB Provider,
http://support.microsoft.com/kb/183606

[AMERIFLUX]: AmeriFlux Network, http://public.ornl.gov/ameriflux/
[BASIC]: SSL Man-in-the-Middle Attacks, Peter Burkholder, February 1, 2002,
http://www.sans.org/reading_room/whitepapers/threats/480.php

[BWC]: Berkeley Water Center, http://esd.lbl.gov/BWC/

[COG]: Cognos, http://www.cognos.com/solutions/index.html

[CUAHSI]: Consortium of Universities for the Advancement of Hydrologic
Science, Observations Data Model, http://www.cuahsi.org/his/odm.html

[EXCEL]: Excel Pivot tables,
http://www.microsoft.com/dynamics/using/excel_pivot_tables_collins.mspx

[GRAY1996]: J. Gray, A. Bosworth, A. Layman, and H. Pirahesh, “Data cube: A
relational operator generalizing group-by, crosstab and sub-totals,” ICDE
1996, pages 152–159, 1996.

[HTTP]: J. Franks et al., HTTP Authentication: Basic and Digest Access
Authentication, June 1999. IETF RFC.

[HTTPS]: HTTPS,
http://technet2.microsoft.com/windowsserver/en/library/052d2ea9-586c-4e33-9c56-ecc0c2b203be1033.mspx?mfr=true

[MATLAB]: The language of technical computing,
http://www.mathworks.com/products/MatLab/

[MDX]: Multi Dimensional eXpressions (MDX), a query language to query the SQL
Server Analysis Services (SSAS),
http://msdn2.microsoft.com/en-us/library/ms345116.aspx

[MDXtutorial]: MDX Tutorial,
http://msdn2.microsoft.com/en-us/library/ms144884.aspx

[OZER]: Stuart Ozer, Alex Szalay, Katalin Szlavecz, Andreas Terzis, Razvan
Musǎloiu-E., Joshua Cogan, Using Data-Cubes in Science: an Example from
Environmental Monitoring of the Soil Ecosystem, MSR-TR-2006-134, 2006.

[PRO]: Proclarity, http://www.proclarity.com

[SSAS]: SQL Server Analysis Services, an integrated view of business data for
reporting, OLAP analysis, Key Performance Indicator (KPI) scorecards, and data
mining, http://www.microsoft.com/sql/technologies/analysis/default.mspx

[SDSS]: The Sloan Digital Sky Survey SkyServer, http://skyserver.sdss.org/

[TAB]: Tableau, a tool for querying and analyzing OLAP databases without any
knowledge of MDX,
http://www.tableausoftware.com/info/OLAP_Front_End/OLAP_Front_End_fw.php
								