Testing the Incremental Load Improvements in 10gR2 OLAP - Part I

December 23, 2005
Mark Rittman

One of the questions that often comes up when I'm helping people with Oracle OLAP is: what
is the most efficient way to load incremental data into a cube? Take, for example, a
situation where you have an initial load of 8000 rows into a cube that has five dimensions,
and subsequently you need to load another 2000-odd rows; do you rebuild the cube from
scratch when you need to add the further 2000 rows, or do you leave the cube as it is and
add the 2000 rows on top of what's already there, either as a separate load on their own or
mixed in with the data that was already loaded? If you're using AWM10g, do you select the
"Aggregate the full cube" option or the "Aggregate just for incoming values" option, and if
you choose the latter, what benefits does it bring?




This is actually one of the questions I've addressed in the forthcoming article I've
written for DBAZine on Oracle Database 10g OLAP performance tuning, but I don't think I'm
giving too much away when I tell you that, with 9iR2 and 10gR1, it actually makes no
difference at all whether you choose to aggregate just for new values or for all values in
the cube, or whether you separate out the new data and load it in on its own or just reload
the entire dataset, old data and new - the aggregation process takes the same time
regardless. However, one of the new things promised with 10gR2 is that the loading of
incremental data is much faster than with 9iR2 or 10gR1, as this excerpt from the Oracle
OLAP 10gR2 PDF shows:




Quite an impressive improvement. With this in mind, I thought I'd load up the
dataset that I used for the article and see what sort of improvements I could find.

The cube consisted of two measures (order quantity and ship quantity) and five dimensions
(product, customer, channel, time and promotion). The source table for the measures
contained 11429 rows, which I split into two load batches: the first of 8884 rows and the
second of 2545 rows. I then put together four scenarios:

   1. Load the first 8884 rows into the cube and select "Aggregate the full cube",
      then load these 8884 rows plus the 2545 additional rows, again selecting
      "Aggregate the full cube".
   2. Load the first 8884 rows as before and select "Aggregate the full cube",
      then load these 8884 rows plus the 2545 additional rows, but this time
      select "Aggregate for incoming values only".
   3. Load the first 8884 rows as before, then additionally load just the 2545
      new rows and select "Aggregate the full cube".
   4. Load the first 8884 rows as before, then additionally load just the 2545
      new rows, but this time select "Aggregate for incoming values only".
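
For reference, here's a minimal sketch of one way the 11429-row source table could be split
into the two load batches. The article doesn't show how the split was actually done, so the
ROWNUM cutoff below is purely illustrative; the gs_sales, gs_sales_load_1 and
gs_sales_load_2 table names are the ones used in the scripts later in this post.

-- Illustrative only: split the source fact table into an initial batch of
-- 8884 rows and an incremental batch of 2545 rows. The ORDER BY just makes
-- the split repeatable; it isn't necessarily how the original split was done.
create table gs_sales_load_1 as
select *
from   (select * from gs_sales order by custid, prodid)
where  rownum <= 8884;

-- Everything not in the first batch goes into the second. Note that MINUS
-- also removes any exact duplicate rows, which is fine for a sketch.
create table gs_sales_load_2 as
select * from gs_sales
minus
select * from gs_sales_load_1;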

These scenarios were tested on 10gR1 and 10gR2 on Windows XP SP2, with 10gR2
being the developer preview that's available on OTN. 10gR2 was running in a
VMWare virtual machine with 1GB of RAM allocated, whilst 10gR1 was running on
the OS proper. All settings were left as the default, with no changes to SGA_TARGET,
PGA_AGGREGATE_TARGET and so on.

With 10g Release 1, the results looked like this:

As I said earlier, there's no real difference between any of the load scenarios; each of
them took around the same time to process the additional data. Also, the rate at which
incremental data is processed is about one-fifth of the rate of the original data load - my
take on this is that the load routine within Oracle OLAP has to "merge" the incoming data
in with the existing cells in the measures, which is a lot slower than straight "inserts".
Nonetheless, there's no performance boost to be had either by selecting "Aggregate for
incoming values only" or by separating out new data from old. In fact, if you look at the
last scenario, doing a full load of all 11k rows, you'll see that it's actually quicker to
rebuild the cube from scratch than to try and add in an incremental load.

With 10g Release 2 (remember, it's running in a VMWare virtual machine, so we're not
strictly comparing apples with apples, and it's beta code), the results look like
this:




A few points to note here:

   - The rate at which initial data is loaded is about 50% faster than in 10gR1 -
     not bad.
   - When the new data is mixed in with the old and reloaded into the cube, 10gR2
     is still faster than 10gR1, but the benefit is less than with our initial
     data load.
   - If the new data is split out from the old before it's loaded, 10gR2 is up to
     twice as fast as 10gR1 for loading incremental data. Strangely, it didn't
     seem to matter whether I selected the "Aggregate for incoming values only"
     option or not; in fact, in the tests I carried out, the load was actually
     faster if I just selected "Aggregate the full cube".
   - With these performance improvements, it's now worthwhile doing an incremental
     load rather than rebuilding your cube from scratch each time.

I ran the tests a few times to try and get some stable, averaged-out figures, but you
should bear in mind that the dataset is comparatively small, and the incremental load
"boosting" algorithm may provide more benefit the larger your cube is.

So, what this says is that there are performance improvements in 10gR2 for loading
incremental data, but (from my tests, anyway) they are found when you load just the new
data into a cube, rather than reloading the entire dataset including your new data. I
wasn't, however, able to see any performance benefit from choosing the "Aggregate for
incoming values only" option - I seemed to get this benefit whether the option was selected
or not. The performance benefit isn't huge - certainly not the 18-fold improvement that the
Oracle slide suggests - but it is there, and it made the difference for me between
reloading the cube from scratch each time and putting in place a system where we perform an
incremental load of just the new rows.

Anyone else had a play around with this, perhaps got closer to the improvements in
the slide? Let me know if you do.



Update 14-Sep-05:
Scott Powell dropped me a line this evening with some comments on the above
article. They're actually very significant as they explain why the incremental loads
are taking almost as long as the full loads. Read on...

"I suspect this is what is also causing your performance problems / results with the
incremental loads in the article you published on Sept 8th "Testing the Incremental
Load Improvements in 10gR2 OLAP".

I've attached a quick script I wrote. This script creates three "tuple" tables that hold
all of the composite tuples from the source data.

create table tuples_all as
select distinct
             channelid,
             promoid,
             prodid,
             custid
from    gs_sales;


create table tuples_load1 as
select distinct
             channelid,
             promoid,
             prodid,
             custid
from    gs_sales_load_1;


create table tuples_load2 as
select distinct
             channelid,
             promoid,
             prodid,
             custid
from    gs_sales_load_2;


select count(*) from tuples_all;
select count(*) from tuples_load1;
select count(*) from tuples_load2;

TUPLES_ALL loads from gs_sales, TUPLES_LOAD1 loads from gs_sales_load_1, and
TUPLES_LOAD2 loads from gs_sales_load_2. Here are the results:

tuples_all            1,688 tuples
tuples_load1          1,634 tuples
tuples_load2          1,135 tuples

Basically, with time as the dense dimension, I've removed it as a consideration from the
tuples (because any reference to any time period will have to load the entire page). As you
can see, your "load1" and "load2" data is touching almost all of the tuples. I believe this
is why your incremental loads are almost always taking nearly the same amount of time as a
full load - because essentially you are updating almost every page with both the initial
and the incremental loads.


I suspect (but have not yet tested) that your performance results would be
markedly different if you did the following:


a) redesigned the variables to be dimensioned like this: <channel composite<time
customer product promotion>>
b) split out the gs_sales_load_1 to include all data for time periods < 01-apr-2005
c) split out the gs_sales_load_2 to include all data for time periods >= 01-apr-2005


This gives a better representation of what an incremental load would look like - both
because the variable now has TIME as a sparse dimension (meaning that when the engine tries
to aggregate it, it doesn't have to touch every page), and because performing incremental
loads based on time periods is pretty common from a business perspective.

I will do my best to clean up this documentation, but hopefully you'll find this useful.
I'll try to check the performance of the incremental load in the next couple of days
- if you beat me to it, let me know how it goes."
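
Scott's tuple counts can be pushed one step further with a query along these lines - this
isn't part of his script, just a sketch against the tuple tables it creates - to show
directly how much of the load-2 data lands on tuples that already exist from load 1:

-- Sketch: how many of the tuples touched by the incremental load already
-- exist in the initial load? A high overlap means the incremental load has
-- to revisit pages that were already populated by the first load.
select count(*) as overlapping_tuples
from   tuples_load2 t2
where  exists ( select 1
                from   tuples_load1 t1
                where  t1.channelid = t2.channelid
                and    t1.promoid   = t2.promoid
                and    t1.prodid    = t2.prodid
                and    t1.custid    = t2.custid );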

Thanks, Scott! I'll rerun the tests as soon as possible and report back on the results.
Until then, treat the results of my tests above as unreliable until verified.
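
For anyone who wants to try the re-split themselves, Scott's steps (b) and (c) would look
something like this in SQL. The name of the date column in gs_sales isn't given anywhere in
this post, so sale_date below is an assumption; substitute whatever the time-key column is
actually called.

-- Sketch of a time-based split using the 01-apr-2005 cutoff Scott suggests.
-- "sale_date" is an assumed column name; drop or rename any existing load
-- tables first if re-running.
create table gs_sales_load_1 as
select * from gs_sales
where  sale_date <  date '2005-04-01';

create table gs_sales_load_2 as
select * from gs_sales
where  sale_date >= date '2005-04-01';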



Update 15-Sep-05: Scott sent a follow-up email:

"Mark, just FYI, I had the chance to redefine the cube and do some further testing.
See the results immediately below - saw huge performance gains by choosing
channel as the dense dim instead of time:


Your initial cube <time <customer product promotion channel>>
Load data time 65 seconds
Aggregate data time almost 1.5 hours
Size of variables approx 315 Mb each


New cube <channel <time customer product promo>>
Load data time 1 second
Aggregate data time 33 seconds
Size of variables approx 10 Mb each


Redefining sparsity obviously made a huge difference.


However, I then tried to do an incremental load, and did NOT see the results I
expected. Basically, I broke up the gs_sales table into gs_sales_load_1 containing
everything with a date < 01-apr-2005 (10,675 records), and gs_sales_load_2 with
a date >= 01-apr-2005 (754 records). I expected to see a huge difference between
the initial load and the incremental update, but didn't:


Step 1:    load data from gs_sales_load_1 and do a full cube aggregation:
            Load time: 1 sec
            Full aggregation time: 35 sec (note - a bit longer than the 33
            seconds before, but not significant???)
Step 2:    incremental load from gs_sales_load_2 and do an incremental cube
aggregation:
            Load time: 3 sec
            Incremental agg time: 21 secs


Obviously the 21 secs is better than the 35 seconds, but not by the orders of
magnitude I expected to see (I thought it would take just 3 or 4
seconds to do the whole thing)


I'm not sure exactly what to make of this. With the solve times and disk size so small,
it's hard to judge if these results are a true representation. I'm considering loading
up my big play cube to try a more robust test.


What do you think?


Note also that I haven't tested anything yet with compressed composites turned on.


Thx,
Scott"



Update 16-Sep-05: I had a word with Paul Cannon, our resident OLAP/Express expert, who came
back with the following observations:

"Mark,


This depends a bit on what Scott means by an incremental load. Presumably he means loading
data for one time period at a time. If this is the case, then having time inside the
composite is forcing the creation of a new set of tuples for each load.


For example, if the combination of customer, product and promotion results in, say, 1000
tuples, then with time inside the composite you will have a separate 'set' of these tuples
for each time period, but of course those tuples won't exist until data for each time
period has been loaded. Therefore, when loading a new set of data for a new time period, as
well as loading the data, the AW has the internal overhead of creating the extra tuples. If
time is outside the composite, then the same single set of tuples gets used over again for
each time period. (This is a very simplistic view - the actual number of tuples will
usually vary for each load, but we'll ignore that for this example!)


To put this another way, with time outside the composite, loading a new set of data results
in the variable growing by one dimension value. With time inside the composite, the
variable grows by 1000 dimension values.
Whilst putting time inside the composite may result in gains in one area (aggregations
etc.), you have to bear in mind the negative effect this extra internal maintenance is
having. Hence Scott won't get quite the performance improvement he's expecting.


Unless the time dimension is genuinely sparse (which seems unlikely, as data is being
loaded on a per-time-period basis and presumably aggregated up a time hierarchy), perhaps
Scott should try leaving both time and channel outside of the composite.


Paul."



Update 16-Sep-05: Scott has posted a question on this subject to the OTN Forum. You can
check it out here.



Update 16-Sep-05: Scott subsequently commented on Paul's observations:

"I think Paul is referring to a typical OFA type setup, where almost every composite
value (usually account & org) exists for just about every value of time (which is
usually at the month level), making time a very dense dimension. However, on your
GSW sample data, this is showing customer purchases at the day level spread out
over 8 years. Unless the same customer purchases the same product through the
same channel with the same promotion on just about every day over 8 years, time
is going to be sparse.


If you look at the original .xls worksheet (copied here), the sparsity rating of 433 for
TIME on the base-level data means that the same customer / product / channel / promo tuple
only shows up once for every 433 time periods - meaning that if time is dense, the other
432 cells on a page would be empty. Note also that Paul is completely right that the data
gets more dense after aggregation - in this case, the time sparsity rating goes from 433
down to 41, but that still means that, on average, the same customer / product / channel /
promo tuple shows up only once every 41 time periods... This is why making time sparse
instead of dense resulted in the variable sizes shrinking from 600+ Mb down to about 20 Mb,
and reduced the solve times from 1.5+ hours to only 33 seconds.

One final thing to note is that I did re-split the data for the "initial" vs. "incremental"
load to better highlight the incremental performance problem for Oracle. I split your
GS_SALES table down to have everything prior to 6/28/2005 being in
gs_sales_load_1 (11,411 records), and only 6/28/2005 - 6/29/2005 being in
gs_sales_load_2 (18 records). This means that the incremental load should only be
creating 18 new tuples plus the associated parent tuples (which comes out to exactly
1,466 total tuples). Given that the entire solved cube contains 435,274 total tuples,
I would expect the time required to aggregate should be (1,446 / 435,274) * the 33
seconds the full aggregation takes - or about .11 seconds. Obviously there may be
some overhead on setting up the incremental loads (perhaps that is what is
distorting the numbers on this small sample?)


So long and short - Paul's comments make sense in more of an "OFA" type setup,
but not for a cube that focuses on customer purchases, etc. Also note that this is one
of the big reasons I hate to see recommendations by Oracle that time always be
made a dense dimension...because I have found, for MANY applications, that time is
often one of the LEAST dense dimensions.


Let me know what you and Paul think! I'll let you know what I hear from the Oracle
side (the thread is on OTN also)."

There's obviously a lot more to dealing with sparsity than saying that "time is always the
dense dimension", and Scott in particular has sent me some useful calculations and examples
for working out sparsity both prior to, and after, a cube load. I'll post an article with
my take on all this when I get back from Open World.
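
In the meantime, a rough version of the kind of base-level sparsity check Scott describes
can be sketched in SQL against the source table. This is my own approximation, not Scott's
spreadsheet: it assumes the rating is simply the number of time periods divided by the
average number of rows per tuple, and the sale_date column name is again an assumption. It
uses the tuples_all table created by Scott's script earlier.

-- Rough time-sparsity estimate for the base-level data: on average, how many
-- time periods separate appearances of the same customer/product/channel/promo
-- tuple? "sale_date" is an assumed column name for the time key in gs_sales.
select (t.time_periods * c.tuple_count) / r.row_count as time_sparsity_rating
from   (select count(distinct sale_date) as time_periods from gs_sales) t,
       (select count(*) as tuple_count from tuples_all) c,
       (select count(*) as row_count from gs_sales) r;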

								