Organizing Data and Information Chapter 5 Topics in Ch. 5 • General data management concepts and terms • Two major approaches to data management • Three database models • The common functions performed by database management systems • Three popular end-user database management systems • Recent database developments. Data and Information What is data? • A collection of raw unprocessed facts, figures, and symbols What is information? • Data that is organized, meaningful, and useful How are data and information related? • Computers process data into information What is a database? • Includes a collection of data organized so you can access, retrieve, and use the data • Database software allows you to – Create a computerized database – Add, change, and delete data – Sort and retrieve data from the database – Create forms and reports for the data in the database • Database software also called a database management system (DBMS) Hierarchy of data • Bit – A single binary digit with either a one or zero state • Character – A byte, which is the basic building block of information • Field – A combination of one or more characters – The smallest unit of data a user accesses – A field name uniquely identifies each field • Record – A collection of related fields • Files or Tables – A collection of related records • Database – A collection of related files or tables The Hierarchy of Data Data Entities, Attributes, and Keys Data entities, Attributes, and Keys • Entity – Generalized class of people, places, or things for which data is collected, stored, and maintained – E.g. at a university, students, library books, and courses; the entity in figure 5.2 is employee. • Attributes – A characteristic of an entity – E.g., employee number, last name, first name, hire date, and department number in figure 5.2 • Key – A field or set of fields that identifies a record • A primary key: a field or a set of fields the uniquely defines a record: e.g. SSN • A secondary key: an alternative key that can be used to access records: e.g., last name The Traditional versus the Database Approach to Data Management Traditional approach (File management approach) What is a file management approach? • Each department or area within an organization has its own set of files • Records in one file often do not relate to the records in any other file The Traditional Approach Weakness of the “Traditional” Approach • Data redundancy – Wastes resources such as storage space and people’s time – The files often store the same fields in multiple files – Requires a larger storage capacity – Compromises data integrity • Data integrity – The degree to which data is correct – When a database contains errors, it loses integrity – Very important because computers and people use information to make decisions and take actions • Program-data dependence – Programs and data developed and organized for one application are incompatible with programs and data organized differently for another application • Isolated data – It is often difficult to access data stored in many files across several departments Database Approach What are the What are the strengths of the weakness of the database approach? database approach? •Reduced data redundancy • A high cost •Improved data integrity involved with •Shared data acquiring and •Reduced development time implementing a database •Easier modification and updating • Increased • so on ( Table 5.1 p. 175) vulnerability How do a database application and a file processing application differ in the way they might store data? File processing example Database example Data Modeling & Database Models Data Design Issues • Content: What data should be collected? • Access: What data should be given to what users? • Logical structure: How will the data be organized to make sense to a particular user? • Physical organization: Where will the data actually be located? Data Modeling • Logical design – How data are grouped together and how that are related to one another • Physical design – Combining or splitting some of the groups identified in the logical design • Planned data redundancy • Data model – A diagram used by a database designer to show the logical relationships among the entities in the database – Entity-relationship (ER) diagrams Entity-Relationship (ER) Diagrams • Fig 5.5 Correspondence with ER Model • Relations (tables) correspond with entity types and with many-to-many relationship types • Rows correspond with entity instances and with many- to-many relationship instances • Columns correspond with attributes • NOTE: The word relation (in relational database) is NOT the same same the word relationship (in ER model) Key Fields • Keys are special fields that serve two main purposes: – Primary keys are unique identifiers of the relation in question. Examples include employee numbers, social security numbers, etc. This is how we can guarantee that all rows are unique – Foreign keys are identifiers that enable a dependent relation (on the many side of a relationship) to refer to its parent relation (on the one side of the relationship) Basic ER Symbols Entity symbol Attribute symbol Relationship symbol (a) CUSTOMER entity type with simple attributes (b) CUSTOMER relation (a) CUSTOMER entity type with composite attribute (b) CUSTOMER relation with address detail Mapping the relationship Foreign key Database Models • Every database and DBMS is based on a specific data model • Consists of rules and standards that define how the database organizes data • Defines how users view the organization of the data • Three types of database models – Hierarchical models • Data is organized in a top-down or inverted tree structure – Network models • An extension of the hierarchical model in which a member may have many owners – Relational models • Stores data in tables that consist of rows and columns • Each row has a primary key and each column has a unique name Operations of a relational database • Selection operation – The operation extracts data from a row (record) • Projection operation • Join operation What is an example of a selection operation? selection operation results Database Management System • A DBMS is a data storage and retrieval system which permits data to be stored non- redundantly while making it appear to the user as if the data is well-integrated • Software that allows you to create, access, and manage a database • Available for many sizes and types of computers. Database Management System Application #1 Application #2 DBMS Database containing centralized shared data Application #3 DBMS manages data resources like an operating system manages hardware resources Database Management Systems (DBMSs) • Provide a user view – Schema: the logical description of an entire database – Subschema: a file that contains a description of part of the database and identifies which users can view or modify the data items in that part of the database • Provide tools to create & modify the database – Data Definition Language (DDL) – Data dictionary: a detailed description of all the data used in the database • Store & retrieve data • Manipulate data and produce reports – SQL (Structured Query Language) – Data manipulation language (DML) Schema • A description of the entire database • Subschema – Shows only some of the records and their relationships in the database • Data Definition Language (DDL) – Command used to create a database – Commands that define a database, including creating, altering, and dropping tables and establishing constraints – Schemas and subschemas are described using a DDL. Fig 5.11 Fig 5.12 Data Dictionary • System tables that store metadata • Contains data about each file in the database and each field within those files • Sometimes called metadata • Users usually can view some of these tables • Users are restricted from updating them What is a default value for a field? • A value that the DBMS initially displays in a field • Data dictionary allows you to specify a default value for a field Data Dictionary Benefits • Reduced data redundancy • Faster program development • Easier data & information modification • Increased data reliability What is a query? • A request for specific data from the database • Four commonly used methods to access data – Query language – Query by example – Form – Report generator What is Structured Query Language (SQL)? • The standard data manipulation language for relational databases • A query language that allows you to manage, update, and retrieve data • Has special keywords and rules that you include in SQL statements (Data manipulation language) projection operation join operation selection operation wizard for querying the Query by Example (QBE) database • The program retrieves records that match criteria you enter in the form fields • Uses a graphical user interface • Available in MS Access • MS Access translates QBE to SQL and vice versa query results What is a form? • Sometimes called a data entry form • A window on the screen that provides areas for entering or changing data in a database Concurrency Control • Problem – in a multi-user environment, simultaneous access to data can result in interference and data loss • Solution – Concurrency Control – A method of dealing with situation in which two or more people need to access the same record in a database at the same time. – Serializability – • Finish one transaction before starting another – Locking Mechanisms • The most common way of achieving serialization • Data that is retrieved for the purpose of updating is locked for the updater • No other user can perform update until unlocked Popular DBMS • Popular mainframe-based DBMSs – Oracle – IBM DB/2 – Microsoft SQL Server • Popular end-user DBMSs – MS Access – Lotus Approach – Inprise’s dBASE Selecting a DBMS • Database size • Number of concurrent users • Performance • Integration • Features • Vendor • Cost Database Developments • Distributed database • Data warehouses – Data marts – Data mining • Online analytical processing (OLAP) Distributed Database • Distributed Database: A single logical database that is spread physically across computers in multiple locations that are connected by a data communications link – Replicated database: a database that holds a duplicate set of frequently used data • Decentralized Database: A collection of independent databases on non-networked computers They are NOT the same thing! Reasons for Distributed Database • Business unit autonomy and distribution • Data sharing • Data communication costs • Multiple application vendors • Database recovery • Transaction and analytic processing Advantages of Distributed Database over Centralized Databases • Increased reliability/availability • Local control over data • Modular growth • Lower communication costs • Faster response for certain queries Disadvantages of Distributed Database compared to Centralized databases • Software cost and complexity • Processing overhead • Data integrity exposure • Slower response for certain queries • Data Warehouse: – A subject-oriented, integrated, time-variant, non- updatable collection of data used in support of management decision-making processes – Subject-oriented: e.g. customers, patients, students, products – Integrated: Consistent naming conventions, formats, encoding structures; from multiple data sources – Time-variant: Can study trends and changes – Nonupdatable: Read-only, periodically refreshed • Data Mart: – A data warehouse that is limited in scope, a single aspect of a company’s business, e.g. finance, inventory, or personnel An enterprise data warehouse Need for Data Warehousing • Integrated, company-wide view of high-quality information (from disparate databases) • Separation of operational and informational systems and data (for improved performance) Comparison of operational and informational systems Data Warehouse vs. Data Mart Data Warehouses • Data mining: an information analysis tool that involves the automated discovery of patterns and relationships among data • E.g.: identifies common characteristics of customers who buy the same products from your company • Predicts which customers are likely to leave your company and go to a competitor • Reveals the difference between a typical customer this month versus last month On-Line Analytical Processing (OLAP) • Used to store and deliver data warehouse information • The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques • Relational OLAP (ROLAP) – Traditional relational representation • Multidimensional OLAP (MOLAP) – Cube structure • OLAP Operations – Cube slicing – come up with 2-D view of data – Drill-down – going from summary to more detailed views Figure 11-22: Slicing a data cube Summary report Figure 11-23: Example of drill-down Drill-down with color added Open Database Connectivity (ODBC) • Standards that make it easier to transfer and access data among different databases. • Advantages of ODBC – ODBC can be used to export, import, or link tables between different applications • For example, a table in an Access database can be exported to a Paradox database or a spreadsheet. Table and data can also be imported using ODBC. • Linking allows an application, such as a database, to use an object in another application, such as a spreadsheet, without actually importing the object into the application. An advantage of linking is that as the object is updated in its original location, the changes will be reflected in the document to which it is linked. Object-Relational Database Management Systems • A DBMS capable of manipulating audio, video, and graphical data • Hypertext • Hypermedia • Spatial data technology – Uses an object-relational database to store & access data via location. For example, NASA’s database of satellite photos of the Earth could be studied & manipulated to identify sources of pollutants.