"CS511 Advanced Database Management Systems"
Lecture 12: Overview of Post-Relational Development
Oct. 12, 2007
ChengXiang Zhai

Outline
• Evolution of data models
• Post-relational research topics

Nine Historical Epochs
• Hierarchical (IMS): late 1960’s and 1970’s
• Network (CODASYL): 1970’s
• Relational: 1970’s and early 1980’s
• Entity-relationship: 1970’s
• Extended relational: early 1980’s
• Semantic: late 1970’s and 1980’s
• Object-oriented: late 1980’s and early 1990’s
• Object-relational: late 1980’s and early 1990’s
• Semi-structured (XML): late 1990’s to present

Pre-Relational Era
• IMS (hierarchical data model): Lessons
  – L1: Physical and logical data independence are highly desirable
  – L2: Tree structured data models are very restrictive
  – L3: It is a challenge to provide sophisticated logical reorganization of tree structured data
  – L4: A record-at-a-time user interface forces the programmer to do manual query optimization, and this is often hard
• CODASYL
  – L5: Networks are more flexible than hierarchies but more complex
  – L6: Loading and recovering networks is more complex than hierarchies

Relational Era
• The “relational” vs. CODASYL debate was settled by
  – The success of the VAX
  – The non-portability of CODASYL engines
  – The complexity of IMS logical data bases
• Lessons:
  – L7: Set-at-a-time languages are good, regardless of the data model, since they offer much improved physical data independence
  – L8: Logical data independence is easier with a simple data model than with a complex one
  – L9: Technical debates are usually settled by the elephants of the marketplace, and often for reasons that have little to do with the technology
  – L10: Query optimizers can beat all but the best record-at-a-time DBMS application programmers

The Entity-Relationship Era
• Proposed in the mid 1970’s by Peter Chen
• Never gained acceptance as the underlying data model implemented by a DBMS
  – No query language?
  – Overshadowed by the relational model?
  – Looked too much like a “cleaned up” version of CODASYL?
• But widely successful for DB schema design
  – DB design using normalization was “dead in the water”
  – It was straightforward to convert an ER diagram into a set of tables in 3rd normal form
• Lessons:
  – L11: Functional dependencies are too difficult for mere mortals to understand. Another reason for KISS (Keep it simple, stupid).

Extended Relational (R++) Era
• Beginning in the early 1980’s
• A sizeable collection of papers following this template:
  – Consider an application, call it X
  – Try to implement X on a relational DBMS
  – Show why the queries are difficult or why poor performance is observed
  – Add a new “feature” to the relational model to correct the problem
• Valuable contributions
  – Set-valued attributes (e.g., available colors of an item)
  – Aggregation (tuple-reference as a data type, e.g., supply(PT, SR, qty, price), where “PT” and “SR” are pointers to tuples)
  – Generalization (inheritance)
• Lessons:
  – L12: Unless there is a big performance or functionality advantage, new constructs will go nowhere.
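
Lessons L4, L7, and L10 contrast record-at-a-time access with set-at-a-time languages. The sketch below is one way to see the difference, using Python's built-in sqlite3 module as a stand-in engine. The supplier and supply tables, their rows, and both function names are hypothetical and only loosely follow the supply(PT, SR, qty, price) example.

```python
import sqlite3

# Hypothetical schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE supplier (sno INTEGER PRIMARY KEY, sname TEXT);
    CREATE TABLE supply (sno INTEGER, pno INTEGER, qty INTEGER, price REAL);
    INSERT INTO supplier VALUES (1, 'Acme'), (2, 'Bolt Co'), (3, 'Crate Inc');
    INSERT INTO supply VALUES (1, 10, 100, 2.5), (2, 10, 50, 3.0),
                              (2, 20, 10, 9.0), (3, 20, 70, 8.0);
""")


def suppliers_record_at_a_time(conn, pno):
    """Record-at-a-time style: the program walks supply records itself and
    looks up each supplier, so the access order is fixed by the programmer."""
    names = []
    supply_rows = conn.execute("SELECT sno, pno FROM supply").fetchall()
    for sno, supplied_pno in supply_rows:
        if supplied_pno != pno:
            continue  # manual filtering, one record at a time
        row = conn.execute(
            "SELECT sname FROM supplier WHERE sno = ?", (sno,)
        ).fetchone()
        names.append(row[0])
    return sorted(names)


def suppliers_set_at_a_time(conn, pno):
    """Set-at-a-time style: one declarative query; the engine's optimizer
    chooses the join order and access paths."""
    rows = conn.execute(
        "SELECT s.sname FROM supplier s JOIN supply y ON y.sno = s.sno "
        "WHERE y.pno = ? ORDER BY s.sname",
        (pno,),
    )
    return [name for (name,) in rows]
```

Both functions return the same supplier names for a given part number. Only the first hard-codes how the data is traversed, which is the manual query optimization that L4 warns about.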
The Semantic Data Model (SDM) Era
• Early 1980’s
• Motivation: relational data model is “semantically impoverished” (can’t easily express a class of data of interest)
• Define more general classes, allowing multiple inheritance
• Most SDMs are very complex, and were generally paper proposals
• Have the same problems as the R++ work

Object-Oriented (OO) Era
• Beginning in the mid 1980’s
• Motivation: “impedance mismatch” between relational DBs and languages like C++
  – DBs have their own naming systems, data type systems, and conventions for returning data as results
  – Need conversions between DB conventions and programming language conventions
  – Like “gluing an apple onto a pancake”
• As a result, persistent programming languages attracted much attention

Persistent Programming Language
• Characteristics
  – Variables can represent disk-based data as well as main memory data
  – DB search criteria = language constructs
• Early prototypes (late 1970’s): Pascal-R, Rigel, …
  – Cleaner than SQL embedding
  – However, compiler must be extended with DBMS-oriented functionality (not very successful)
  – No technology transfer

Object-Oriented Data Bases
• In the mid 1980’s, C++ triggered a resurgence of interest in persistent programming languages
• Research systems: Garden, Exodus
• Startups: Ontologic, Object Design, Versant
• General goal: persistent C++
  – Extend C++ as a data model
  – Any C++ structure can be persisted
  – Support “relationship”
• Application/market domain: engineering DBs
  – Typically, open a large object (e.g., electronic circuit), process it exclusively and close it.
  – No need for a declarative query language (only need to reference objects)
  – No fancy transaction management is needed (one-user-at-a-time)
  – Performance has to be competitive with conventional C++

Current Status of OODB
• Market never got very large (too many vendors competing for a “niche” market)
• The OODB vendors have either failed or repositioned their companies to offer something else
  – E.g., Object Design is now Excelon and selling XML services
• Reasons for the failure
  – For their own market: absence of leverage, no standard, relink the world
  – For competing with Relational DBs: lack of transactions, low-level record-at-a-time interface (with the exception of O2, which embedded a declarative language, OQL, into a programming language)
• Lesson:
  – L13: Packages will not sell to users unless they are in “major pain”

The Object-Relational Era
• Motivated by the need for handling geographic data
• Question: How to extend a relational DB to handle new data types?
• The object-relational proposal: add the following to SQL (Postgres):
  – User-defined data types
  – User-defined operators
  – User-defined functions, and
  – User-defined access methods
• Commercially successful:
  – Postgres -> Illustra (acquired by Informix)
• Lessons:
  – L14: The major benefits of OR are two-fold: putting code in the database (thereby blurring the distinction between code and data) and user-defined access methods
  – L15: Widespread adoption of new technology requires standards and/or an elephant pushing hard

Semi-Structured Data
• Motivation: abundance of semi-structured data, exchange format, …
• Early system: Lore
• Current standards: XMLSchema, XQuery
• Two major points
  – Schema last
  – Complex network-oriented data model

Schema Last
• Application categories
  – Rigidly structured data
  – Rigidly structured data with some text fields
  – Semi-structured data (need to handle semantic heterogeneity)
  – Text
• Very few examples of the 3rd category
• The 3rd category can be converted to categories 1 and 2.
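
The object-relational idea of “putting code in the database” (L14) can be sketched with a user-defined function called from inside a query. This is not Postgres: it uses Python's built-in sqlite3 module as a stand-in, which supports user-defined functions but not user-defined types, operators, or access methods. The site table, its rows, and the distance and sites_within names are hypothetical.

```python
import math
import sqlite3


def distance(x1, y1, x2, y2):
    """Euclidean distance between two points (the user-defined function)."""
    return math.hypot(x2 - x1, y2 - y1)


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL statements can call distance(...).
conn.create_function("distance", 4, distance)
conn.executescript("""
    CREATE TABLE site (name TEXT, x REAL, y REAL);
    INSERT INTO site VALUES ('depot', 0.0, 0.0),
                            ('store', 3.0, 4.0),
                            ('plant', 30.0, 40.0);
""")


def sites_within(conn, x, y, radius):
    """Return names of sites within `radius` of (x, y); the predicate is
    evaluated by the engine, which calls the registered function per row."""
    rows = conn.execute(
        "SELECT name FROM site WHERE distance(x, y, ?, ?) <= ? ORDER BY name",
        (x, y, radius),
    )
    return [name for (name,) in rows]
```

Without a matching user-defined access method, the engine has to call the function on every row. That gap is what the fourth item of the object-relational proposal addresses.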
XML Data Model
• XML records can be hierarchical as in IMS
• Have “links” as in CODASYL
• Have set-based attributes as in SDM
• Inherit from other records as in SDM
• And other features that are known to be hard to implement
• Possible scenarios:
  – XMLSchema will fail
  – A data-oriented subset of XMLSchema will be proposed
  – Repeat the “great debate”
• Lessons:
  – L16: Schema-last is probably a niche market
  – L17: XQuery is pretty much OR SQL with a different syntax
  – L18: XML will not solve the semantic heterogeneity problem either inside or outside the enterprise

Post-Relational Research Topics

Database Technology Timeline
[Figure: timeline running from “Simple Data Management” to “Global Enterprise Management”]
• Periods: Early 80s; Late 80s; Early - Mid 90s; Late 90s - 21st C
• Stages: Pre-relational; Early Relational; Client-server Relational; Enterprise-capable Relational; Internet Computing
• Workloads: Simple OLTP; Active Database; Data Warehouse & Hi-end OLTP; Packaged & Vertical Applications
• Capabilities named in the figure:
  – Simple transactions, on-line backup & recovery
  – Stored procedures, triggers
  – Scalable OLTP, parallel query, partitioning, cluster support, row-level locking, high availability
  – Support for all types of data, extensibility, objects
  – Middleware (messaging, queues, events), Java, CORBA, Web interfaces
Slide from Anil Nori’s presentation

Current State of DBMSs
• OLTP applications
  – Large amounts of data
  – Simple data, simple queries and updates
    • Update statement from debit/credit transaction:
      UPDATE accounts SET abalance = abalance + :delta WHERE aid = :aid;
  – Typically update intensive
  – Large number of concurrent users (transactions)
• Data warehousing applications
  – Large amounts of data
  – Simple data but complex querying
  – Typically read intensive
  – Large number of users
Slide from Anil Nori’s presentation

Current State of DBMSs
• These applications require:
  – Large numbers of users/transactions
  – High performance
  – High availability (7x24 operations)
  – Scalability
  – High levels of security
  – Administrative support
  – Good utilities
Slide from Anil Nori’s presentation

Internet Applications: Challenges
• E-commerce/Apps
  – APIs: Proprietary -> Open
  – Applications: Standalone -> Integrated
  – Site Management: Low TCO, Mission Critical
  – Availability: Occasional -> 24X7
• Information Management
  – Type: Tabular -> Heterogeneous
  – Delivery: Generic -> Personalized
  – Access Operation: Read/write -> Lots of read-only
  – Content: Direct -> Search
Slide from Anil Nori’s presentation

Internet Applications: Challenges
• Transaction Processing
  – Users: Trained -> Self-Service
  – Network Systems: Independent -> Integrated
  – Systems Management: Simple -> Intelligent
  – Operations Hours: Local -> Global
• Data Warehousing
  – Larger User Populations: Analysts -> Every Employee
  – Size: Gigabytes -> Terabytes
  – Usage: Batch -> Immediate
  – Importance: Useful -> Business-Critical
Slide from Anil Nori’s presentation

New Challenges in Databases
• Traditional RDBMS: traditional users, traditional relational data, traditional functions
• New data/info management: New users? New data type? New functions?

New Kinds of Data
• Text data              Ranking in DB
• Multimedia data        “Schema Lean/Last” (Semi-structured data model)
• Scientific data        Complex object indexing
• Sensor data            Stream data
• Log data               Data mining
• Personal data          Data integration
• Web/Email/Blog         Internet computing applications
• ...

New Users
• Everyone?

New Functions
• Information integration
• Navigation                       New/More general Data Model/Architecture? (Object-Oriented)
• Ranking
• Pattern finding (data mining)    New Algorithms
• Decision support                 Adding intelligence to DB

New Computing Environment
• Distributed computing/Networks (Internet)
• Mobile devices (cell phones, PDAs)
Distributed DB; Peer-to-Peer (P2P) DB; Mobile DB?

The Next Database Revolution [Gray 04]
• Object Relational
• Web Services
• Queues, Transactions, Workflows
• Cubes and Online Analytic Processing
• Data Mining
• Column Stores
• Text, Temporal, and Spatial Data Access
• Semi-Structured Data
• Stream Processing
• Publish-Subscribe and Replication
• Late Binding in Query Plans
• Massive Memory, Massive Latency
• Smart Objects: Databases Everywhere
• Self-Managing and Always Up

Selected Current Topics
• Text Database and Information Retrieval
• Ranking in Databases
• Data Integration
• P2P Databases
• Data Warehousing & OLAP
• Data Mining
• Stream Data Processing
• Web Services
• Semi-Structured Data (XML)

What You Should Know
• New developments in databases are mostly driven by new applications
• The impact of a technology depends heavily on the market (the right time, the right environment, …)
• Cycles of data models (complex->simple->complex…)
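
The debit/credit update statement quoted under “Current State of DBMSs” uses the placeholders :delta and :aid. As a minimal sketch, it can be run as a short transaction with Python's built-in sqlite3 module, which accepts the same named-placeholder syntax. The accounts rows and the apply_delta helper are hypothetical, and a real OLTP system would run many such transactions concurrently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (aid INTEGER PRIMARY KEY, abalance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 250)")  # sample rows
conn.commit()


def apply_delta(conn, aid, delta):
    """Apply one debit (negative delta) or credit (positive delta) to an
    account as a single transaction and return the new balance."""
    with conn:  # commits on success, rolls back if an exception escapes
        cur = conn.execute(
            "UPDATE accounts SET abalance = abalance + :delta WHERE aid = :aid",
            {"delta": delta, "aid": aid},
        )
        if cur.rowcount != 1:
            raise KeyError(f"no account with aid={aid}")
    row = conn.execute(
        "SELECT abalance FROM accounts WHERE aid = ?", (aid,)
    ).fetchone()
    return row[0]
```

The statement touches a single row by primary key, which matches the “simple data, simple queries and updates” profile of OLTP workloads on that slide.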