Embed
Email

roller

Document Sample

Shared by: panniuniu
Categories
Tags
Stats
views:
0
posted:
12/12/2011
language:
pages:
16
TPC-R

Report









Submitted by: Gunjan K. Gupta & James Roller

Department of Electrical & Computer Engineering

Date: October 17, 2000

TPC-R Report: EE382M









Table of Contents

1. Introduction _______________________________________________ 1

1.1. Background – TPC-D______________________________________ 1

1.2. Microsoft vs. Oracle Challenge _________________________ 1

1.3. Overview _______________________________________________________ 2

1.3.1. Purpose of TPC-R _________________________________________________________ 2

1.3.2. Setup ___________________________________________________________________ 3



2. Decision Support Systems - DSS ____________________ 3

2.1. Schematics ____________________________________________________ 3

to be elaborated – why refresh is needed and how that maps to the refresh functions) _____________ 3



2.2. Examples ______________________________________________________ 4

to be converted into sentences with explanation and insights into why TPC-R is relevant for them) __ 4



3. Definition of TPC-R Benchmark ___________________ 5

3.1. Load Test ______________________________________________________ 5

3.2. Performance Test ___________________________________________ 5

3.2.1. Power Test _______________________________________________________________ 5

3.2.2. Throughput Test __________________________________________________________ 5



3.3. Operation conditions for a TPC-R _____________________ 6

3.4. Implementation guidelines _______________________________ 6

4. Database Tables __________________________________________ 7

4.1. Part Table _____________________________________________________ 7

4.2. Part Supplier Table _________________________________________ 7

4.3. Line-Item Table _____________________________________________ 7

4.4. Orders Table _________________________________________________ 7



- Submitted by: Gunjan K Gupta (gunjan@ece.utexas.edu), James Roller (theroller@mail.utexas.edu)

i

TPC-R Report: EE382M







4.5. Supplier Table _______________________________________________ 7

4.6. Nation Table _________________________________________________ 7

4.7. Region Table _________________________________________________ 8

5. TPC-R Queries and Refresh Functions ___________ 8

5.1. Queries _________________________________________________________ 8

5.2. Refresh Functions __________________________________________ 8

6. Client-Server vs. Host Configuration ______________ 9

7. Scaling and Database Generation __________________ 9

7.1. Scaling Factor _______________________________________________ 9

7.2. Database and Query Generation________________________ 9

7.2.1. DBGEN ________________________________________________________________ 10

7.2.2. QGEN _________________________________________________________________ 10



8. Performance Metrics __________________________________ 10

9. An Implementation of TPC-R by NCR ___________ 10

Some discussion about Figure 8. ____Error! Bookmark not defined.Error! Bookmark not

defined.



10. Conclusions _____________________________________________ 13









- Submitted by: Gunjan K Gupta (gunjan@ece.utexas.edu), James Roller (theroller@mail.utexas.edu)

ii

TPC-R Report: EE382M







1. Introduction

TPC-R is the new decision support benchmark developed by the Transaction Processing

Performance Council (TPC) to measure how well data warehouses perform decision

support workloads with multiple, concurrent users submitting complex business queries.

To best describe TPC-R, we will begin with a brief history of TPC-D because it was the

basis of TPC-R’s development.



1.1. Background – TPC-D

TPC-D was launched in 1994 as an ad hoc query benchmark to be used with DSS

(Decision Support System) databases. However, between 1997 and 1998 problems began

to arise in the benchmark. Databases were being engineered around the 17 queries used

in TPC-D through the use of stored aggregates in order to improve query times. Stored

aggregates are used by optimizers to accelerate queries through pre-calculation of joins,

sums, counts, averages, and etc. Some specific examples of the stored aggregate were

NCR’s introduction of the join index (Teradata), DB2’s automatic summary table (AST),

and Oracle’s Materialized Views. The result was that the system resources went to

database building and updating rather than the intended query processing, so meaningful

performance metrics were no longer being reported for TPC-D’s.



Another drawback of using TPC-D was that the schema was overly simple (only 17

queries). The tables were too simplistic to reveal differences between indexing

techniques and optimizers. Uniform distribution of data was assumed which clearly does

not model the real world. Normally one would find a few large customers and

distributors with whom a business would work with and several smaller customers and

distributors. The data would actually be skewed rather than evenly distributed. Pre-

computation was possible and was not officially disallowed (loophole). Vendors started

optimizing for specific queries since they already had the list of queries.



1.2. Microsoft vs. Oracle Challenge

In November 1998, Oracle issued the “Million Dollar Challenge” to Microsoft to

“demonstrate that SQL Server 7.0 is not 100 times slower than the Oracle database when

running a standard business query against a large DB.” The results they were revealing

were based on Query 5 of TPC-D. Oracle’s database could handle the query in about 70

seconds, whereas the typical time for the query was roughly 1000 seconds. The dramatic

improvement was a result of “materialized views” that was making it possible for

companies to report TPC-D query execution times of less than one second. But

materialized views had fundamentally altered the nature of the test by using

summarization and pre-computation and thus the relevance of TPC-D benchmark for

evaluating database technologies became unclear. In February the General Council of the

Transaction Processing Performance Council abolished and decided to split it into two

separate benchmarks namely TPC-R and TPC-H. TPC-R was designed to measure the





- Submitted by: Gunjan K Gupta (gunjan@ece.utexas.edu), James Roller (theroller@mail.utexas.edu)

1

TPC-R Report: EE382M





performance of DSS systems where the queries were known in advance and thus

technologies like AST, Materialized Views and Join Index could be used. TPC-H, the ad

hoc benchmark, represented the original TPC-D ad-hoc decision support environment in

which users will not know the queries in advance and the execution times can be very

long.



1990





Designed for ad-hoc queries

TCP-D Got misused 1994







1998





Broken up by TPC

ad-hoc queries

reporting (original TPC-D intent)

(unexpected development)

TCP-R TCP-H









Figure 1: A timeline showing birth of TPC-R from TPC-D





A materialized view simply pre-aggregates data for a predefined query. The views are

generated during the database design and load phase. It results in almost instantaneous

responses when the query is sent. The downside is that updates take longer because both

the base data tables and corresponding materialized views must be refreshed whenever

the database is changed. As a result of materialized views, TPC was forced to re-evaluate

the validity of TPC-D that led to its split.



1.3. Overview

1.3.1. Purpose of TPC-R

TPC-R, the business reporting benchmark, represents a decision support environment in

which users know the queries very well and can optimize their DBMS to execute these

queries very rapidly. Queries are expected to be known in advance, so stored aggregates

are a legal engineering solution to database design in TPC-R. The benchmark is meant to

answer critical and frequently asked business questions.









- Submitted by: Gunjan K Gupta (gunjan@ece.utexas.edu), James Roller (theroller@mail.utexas.edu)

2

TPC-R Report: EE382M









1.3.2. Setup

TPC-R is designed to handle a large volume of data (order of 1 GB when using a scale

factor of one). The queries are more extensive with the addition of five new queries and

two update transactions. All seventeen original queries used by TPC-D were kept except

for Query 13, which was modified. Also, multiple concurrent query streams required in

part of the benchmark that was not seen in TPC-D. This had resulted in performance

numbers under TPC-D that were based on single-user query performance alone and could

not be accurately compared to results based on multiple users. In general, the queries are

more complex than OLTP (Online Transaction Processing, TPC-C). Finally, the

performance metric is based on the number of queries executed per hour for a given

database size (scale factor). The actual number used for the performance metric comes

from the square root of the product of the Power and Throughput metrics that will be

discussed shortly. All of the queries must be ACID (Atomic, Consistent, Isolated,

Durable).



The basic table manipulation operations used in the benchmark are –



 Select

 Aggregate

 Join

 Sort







2. Decision Support Systems - DSS



2.1. Schematics

to be elaborated – why refresh is needed and how that maps to the

refresh functions)

•Business Operations represents the OLTP(TPC-C) side. Example: the web-server and the user browsers on Amazon web-site.

•Business Owner side represents the back-end(TPC-R) that uses the data collected from the Business Operations side to support

business decisions.









- Submitted by: Gunjan K Gupta (gunjan@ece.utexas.edu), James Roller (theroller@mail.utexas.edu)

3

TPC-R Report: EE382M









refresh

Decision

OLTP OLTP DSS

DSS Makers

Transact OLTP

Databa Databa

se Queries

ions se OLTP

Databa

Databa

se

se

Business

Business Owner

Operatio

ns



Figure 2: A schematic diagram showing the operation of a DSS system









2.2. Examples

to be converted into sentences with explanation and insights into

why TPC-R is relevant for them)



•Amazon recommending books to its customers – a real-time DSS using collaborative

filtering.

•A non-automated DSS system would involve reports on sales and revenues etc. being

generated by the DSS system on a regular basis and used by managers to make business

decisions; for e.g. whether and when a model of a car should be phased out ?

•Profiling & customer: Garden.com generating categories of users and products and using

this information for optimizing warehouse and customize web-site.









- Submitted by: Gunjan K Gupta (gunjan@ece.utexas.edu), James Roller (theroller@mail.utexas.edu)

4

TPC-R Report: EE382M









3. Definition of TPC-R Benchmark



The TPC-R benchmark consists of two tests; the Load Test and the Power Test.



3.1. Load Test

The first test is the Load Test. Under the Load Test the database tables are created and

the system is brought up to the initial configuration needed for the Performance Tests.

No queries may be run at this time.



3.2. Performance Test

Subsequently a Performance Test must always follow a Load Test. The Performance

Test is further subdivided into 2 runs.



3.2.1. Power Test

The Power Test measures the single user raw query execution power. Under this test a

refresh of the database must be executed both before and after the query set. Figure 2

shows a sample execution block-diagram of the Power Test configuration for the NCR

system.



3.2.2. Throughput Test

The first run is the Following the Power Test is the Throughput Test that measures the

system’s optimum query processing ability. It is designed to demonstrate performance

against multi-user workloads.









- Submitted by: Gunjan K Gupta (gunjan@ece.utexas.edu), James Roller (theroller@mail.utexas.edu)

5

TPC-R Report: EE382M









Figure 3: Power Test configuration for the NCR system.





3.3. Operation conditions for a TPC-R

•Database is continuously available 24/7. (allowance for once a month maintenance).

•Periodic Refresh : DSS database tracks OLTP database.

•Database admin can set locking levels for queries and refresh

-Tradeoff between how often you refresh vs. performance.

•Database admin can set locking levels for queries and refresh





3.4. Implementation guidelines

•Must use a commercially available DBMS (Oracle, DB2..)

•Must use Dynamic SQL (Dynamic SQL is a programming technique that enables you to build SQL statements

dynamically at runtime.)









- Submitted by: Gunjan K Gupta (gunjan@ece.utexas.edu), James Roller (theroller@mail.utexas.edu)

6

TPC-R Report: EE382M









4. Database Tables

Talk a paragraph about the table setup in general – one to many relation ship and which ones are actually

used and how.









4.1. Part Table



Figure 4: Relationship between various tables of TPC-R





4.2. Part Supplier Table

4.3. Line-Item Table

4.4. Orders Table

4.5. Supplier Table

4.6. Nation Table



- Submitted by: Gunjan K Gupta (gunjan@ece.utexas.edu), James Roller (theroller@mail.utexas.edu)

7

TPC-R Report: EE382M







4.7. Region Table



5. TPC-R Queries and Refresh Functions

Each query is defined by the following components: business question, functional query

definition, substitution parameters, and query validation. The business question describes

the business context of the query. The functional query definition describes the function

to be performed by the query in SQL-92. The substitution parameters describe how to

generate values needed to complete the query syntax. Finally, the query validation

describes how to validate the query against the qualification database. [TPC, 19] Figure

5 lists the queries individually by name and number where the name of each query is

sufficient to describe the general intent of the query.



5.1. Queries

1. Pricing Summary Report 12. Shipping Modes and Order Priority

2. Minimum Cost Supplier 13. Customer Distribution

3. Shipping Priority 14. Promotion Effect

4. Order Priority 15. Top Supplier

5. Local Supplier Volume 16. Parts/Supplier Relationship

6. Forecasting Revenue Change Query 17. Small Quantity Order Revenue

7. Volume Shipping 18. Large Volume Customer

8. National Market Share 19. Discounted Revenue

9. Product Type Profit Measure 20. Potential Parts Promotion

10. Returned Item Reporting 21. Suppliers who kept orders waiting

11. Important Stock Identification 22. Global Sales Opportunity





Figure 5: List of TPC-R Queries



5.2. Refresh Functions

Describe the the function of the refresh functions and why you need two.

Refer to figure 2 to explain.



The Refresh Functions are defined by the following components: business rationale,

refresh function definition, and the refresh data set. The business rationale shows the

business context of the function. The refresh function definition describes the actual

function to be performed. Finally, the refresh data set defines the set of rows to be

inserted or deleted from the ORDER and LINEITEM tables that will be 0.1% of the

original population of the two tables. [TPC, 72]

RF1 : Old Sales – Removes old sales info – which is not used for DSS.

RF2 : New Sales – Add the new entries in the OLTP database.



Figure 6: List of TPC-R Refresh Functions







- Submitted by: Gunjan K Gupta (gunjan@ece.utexas.edu), James Roller (theroller@mail.utexas.edu)

8

TPC-R Report: EE382M





Two Refresh Functions are needed to keep the tables current. The first refresh function

removes stale data from the database where the system administrator defines stale data.

Depending on the business, stale data could be a year old or even a day old. It is

dependent upon the requirements of the business being emulated. The second refresh

function keeps the database current according to the activity present on the business

operations (OLTP) side of the business. (Refer to Figure 2) Both refreshes occur

periodically while queries are being run on the database.







6. Client-Server vs. Host Configuration

•Bottleneck depends on query type

– Network

– Client•Client (query execution)

•Server (database access)

•Performance will generally be better for Client Server









Figure 7: Two examples of possible configurations for TPC-R: Client Server vs. Host



7. Scaling and Database Generation

7.1. Scaling Factor

Talk about scaling issues –

•Scale factors are limited to 1,10, 30, 100, 300, 1000, 3000, 10000,

–Encourage comparable results.

–Ensure significant differences between database sizes.









7.2. Database and Query Generation



- Submitted by: Gunjan K Gupta (gunjan@ece.utexas.edu), James Roller (theroller@mail.utexas.edu)

9

TPC-R Report: EE382M







7.2.1. DBGEN

–ANSI C program for generating Data.

–Generates Tables as .tbl text files.

–Command line arguments for passing SF, table type etc.





7.2.2. QGEN

–Used to generate the SQL file (.sql) for each TPC-R Query.

–Uses a template definition language.

–Allows insertion of database and database instance specific information (Oracle, DB2).









8. Performance Metrics

Take slide 16 and put it here and explain it.









9. An Implementation of TPC-R by NCR

As seen in Figure 7, NCR implemented their system using 64 Intel Pentium III Xeon 550

MHz processors each with 1 MB of cache. In total, the hardware setup contained over

11,000 GB of storage space. On the software side, they used the NCR WorldMark 5200

MPP server platform running the Teradata V2R4.0 database manager and Windows 2000

as the operating system. Figure 7 lists the complete hardware system configuration.





- Submitted by: Gunjan K Gupta (gunjan@ece.utexas.edu), James Roller (theroller@mail.utexas.edu)

10

TPC-R Report: EE382M









Some important points to note while observing Figure 8 are that the scale factor was 1000

for their system under test and the Throughput Test used 7 simultaneous streams while

running queries. The final and most important result reported by Figure 8 is the

price/performance metric that came out to be $607 per query per hour using a database

adjusted with a scale factor of 1000. Unfortunately, NCR is the only group that has

provided a Full Disclosure Report for TPC-R, so no comparisons can be made as to the

strength or weakness of their price/performance metric. It should be observed, however,









- Submitted by: Gunjan K Gupta (gunjan@ece.utexas.edu), James Roller (theroller@mail.utexas.edu)

11

TPC-R Report: EE382M









Figure 8: System Configuration of NCR Implementation









Figure 9: Performance Results from NCR



- Submitted by: Gunjan K Gupta (gunjan@ece.utexas.edu), James Roller (theroller@mail.utexas.edu)

12

TPC-R Report: EE382M





that NCR’s implementation was at a cost of almost $13 million when considering the 5-

year upkeep and maintenance for 24 hour a day and 7 day a week vigilance.



An interesting observation of the query times is that Query 13 took over 10,000 seconds

to complete on average for the Throughput test which was more than double the next

closest comparable time. We noted that Query 13 happened to be the single query that

was completely changed from the original set of 17 queries provided by TPC-D. In TPC-

R, Query 13 is the Customer Distribution Query. This query evaluates the size of orders

maintained by every single customer in the database, even if the customer has never

made an order. Just from the query’s description, we can see that the processing of every

customer will be a very intensive operation for the database management system and

explains the rather long query time.



The Full Disclosure Report contains all of the detailed information needed to recreate

NCR’s implementation of the TPC-R benchmark including all C and SQL coding used in

the project. Furthermore, both the TPC and an independent auditor (InfoSizing) have

verified the results disclosed by NCR to ensure validity.





10. Conclusions

One para –

•Results are highly dependent on how well the TPC-R models the customer DSS system.

•Benchmark does not cover exhaustively every real-life DSS requirements.

•Does not cover many data-mining requirements which might be an integral part of a DSS.

•Fast evolving Market : relevance to the real-life future applications.

•Can only compare a TPC-R result with another TPC-R of same size.

•TPC-R measures the unexpected benefit derived from attempting to cheat the TPC-D benchmark

•Results are highly architecture and query dependent (whether the architecture used is Client-Server or

Host).









- Submitted by: Gunjan K Gupta (gunjan@ece.utexas.edu), James Roller (theroller@mail.utexas.edu)

13



Related docs
Other docs by panniuniu
organization_of_slp_working_files_3-23-10
Views: 1  |  Downloads: 0
Lesson 2 2011 key
Views: 0  |  Downloads: 0
Site Survey
Views: 2  |  Downloads: 0
alt energy project SP11
Views: 1  |  Downloads: 0
Effie Biography
Views: 0  |  Downloads: 0
Download-Organization-application-letter
Views: 0  |  Downloads: 0
TWIN_Nomination_form_2010
Views: 0  |  Downloads: 0
Engineering Change Order Master Log
Views: 2  |  Downloads: 0
360654.f1
Views: 0  |  Downloads: 0
By registering with docstoc.com you agree to our
privacy policy

You are almost ready to download!

You are almost ready to download!