"Distributed Databases and Query Processing - PowerPoint - PowerPoint"
Distributed Databases and Query Processing Distributed DB’s vs. Parallel DB’s • Many autonomous processors that may participate in database operations. Difference from a shared-nothing parallel system is in the assumption about the cost of communication. • Shared-nothing parallel systems: message passing cost is small compared with disk accesses and other costs. • Distributed systems: processors are typically physically distant, and so, the cost of message passing is high. Advantages • Like parallel systems, a distributed system can use many processors to accelerate query evaluation. • Further, since the processors are widely separated, we can increase resilience in the face of failures by replicating data at several sites. Disadvantage • Increased complexity of many DB aspects; we have to take care to minimize the number of messages sent between sites. Why distribute data? Organization is itself distributed among many sites Banks • Many branches. Each branch will keep a database of accounts maintained at that branch. • Some data are kept in the central office, such as employee records and current interest rates. Department Stores • Many individual stores. Each store has a database of sales at that store and inventory at that store. • Some data are kept in the central office, such as employee records and a chain-wide inventory, data about credit card customers, and information about suppliers such as unfilled orders. Libraries • Consortium of universities that each hold on-line books and other documents. • Search at any site will examine the catalog of documents available at all sites and deliver an electronic copy of the document to the user if any site holds it. Horizontal Partitioning • For example, the chain of stores might be imagined to have a single sales relation, such as Sales(item, date, price, purchaser) • This relation does not exist physically. Rather, it is the union of a number of relations with the same schema, one at each of the stores in the chain. • Local relations are called fragments. Vertical Partitioning • Example: Suppose we want to find out which sales at the Victoria store were made to customers who are more than 90 days in arrears on their credit card payments. Virtual relation: Sales(item, date, price, purchaser, lastCreditCardPayment) Real relations: Sales(item, date, price, creditCardNumber) – in one site CreditCard(purchaser, creditCardNumber, lastCreditCardPayment) – in another site The Distributed Join Problem R(A,B)S(B,C) Naïve Strategies 1. Send a copy of R to the site of S, and compute the join there. 2. Send a copy of S to the site of R and compute the join there. However: • If the channel has low-capacity, e.g., a phone line or wireless link, then the cost of the join is primarily the time it takes to copy one of the relations. • Even if communication is fast, there may be a better query plan if the shared attribute B has values that are much smaller than the values of A and C. • E.g., B could be an identifier for documents or videos, while A and C are the documents or videos themselves. Semijoin Reductions • Query plan: Only the relevant part of each relation is shipped to the site of the other. • For this, compute first the semijoin: R(X,Y) < S(Y,Z) = R(X,Y) Y(S (Y,Z)) by sending Y(S (Y,Z)) to the site of R. • In this way, R is reduced (hopefully)! • Finally, send the reduced R to the site of S and compute the join. Some arguments The semijoin plan is superior if: • Y(S (Y,Z)) is much smaller than S. – Y(S (Y,Z)) will be small compared with S if: 1. There are many duplicates to be eliminated; 2. The components for the attributes of Z are large compared with the components of Y • e.g., Z includes attributes whose values are audios, videos, or documents. • R(X,Y) < S(Y,Z) is smaller than R. – That is, R must contain many dangling tuples in its join with S.