Introduction to Databases by malj


									Introduction to Databases

      Data Organisation
       Data modelling
       DBMS functions
           Basics of data
DATA HIERARCHY (four categories)

• Fields = represent a single data item
• Records = made up of a related set of
  fields describing one instance of an entity
• File / Table = a set of related records - as
  many as instances (occurrence) in the set
• Database = a collection of related files
      Example of data structure

                Name       First name Telephone

Records         Zidane     Zinedine   45 25 65 65
                Feller     Joe        25 58 96 63
                Clinton    Bill       12 25 28 89
                Henry      Thierry    25 78 85 85

                            + Other files =>complete data
            File / Table
                            Structure = DB
        Database: Definition.

"A collection of interrelated data stored
  together with controlled redundancy, to
  serve one or more applications in an
  optimal fashion; the data is stored so that
  it is independent of the application
  programs which use it; a common and
  controlled approach is used in adding new
  data and in modifying existing data within
  the database."
        Definition - closer look
• A collection of interrelated data stored together
• with controlled redundancy
• to serve one or more applications in an optimal
• the data is stored so that it is independent of
  the application programs which use it
• a common and controlled approach is used in
  adding new data and in modifying existing data
  within the database.
 Advantages of Databases:
• data are independent from applications
  - stored centrally
• data repository accessible to any new
• data are not duplicated in different
• programmers do not have to write
  extensive descriptions of the files
• Physical and logical protection is
      Disadvantages of DBs:
• Centralisation can be a weakness
• Large DBs require expensive hardware
  and software
• specialised / scarce personnel is required
  to develop and maintain large DBs
• Standardisation of data on a central
  repository has implications for the format
  in which it is stored
    Characteristics of DBs…

• High concurrency (high performance under
• Multi-user (read does not interfere with write)
• Data consistency – changes to data don’t affect
  running queries + no phantom data changes
• High degree of recoverability (pull the plug test)
                  ACID test

•   Atomicity                            All or nothing
•   Consistency        Preserve consistency of database
•   Isolation             Transactions are independent
•   Durability        Once committed data is preserved
   DataBase Management
      System (DBMS):
• program that makes it possible to:
  – create
  – Use (insert / update / delete data)
  – maintain     a database

• It provides an interface / translation
  mechanism between the logical
  organisation of the data stored in the DB
  and the physical organisation of the data
      Using a database:
Two main functions of the DBMS :

• Query language – searching answers in
  data (SQL)
• Data manipulation language - for
  programmers who want to modify tha data
  model in which the data is stored
• + Host Language - the language used by
  programmers to develop the rest of the
  application - eg: Oracle developer 2000
          Relational DBs:
• Data items stored in tables
• Specific fields in tables related to
  other field in other tables (joint)
• infinite number of possible viewpoints
  on the data (queries)
• Highly flexible DB but overly slow for
  complex searches
• Oracle, SyBase, Ingres, Access,
  Paradox for Windows...
   Describing relationships
• Attempt at modelling the business
  elements (entities) and their
  relationships (links)
• Can be based on users’ descriptions of
  the business processes
• Specifies dependencies between the
  data items
• Coded in an Entity-Relationship Diagram
Types of Relationships
• one-to-one: one instance of one data item
  corresponds to one instance of another

• one-to-many: one instance to many instances

• many-to-many: many instance correspond to
  many instances

• Also some relationships may be:
   – compulsory
   – optional
• Student registering system

• What are the entities?

• What type of relationship do they have?

• Draw the diagram
Entity Relationship Diagram
      Example 2 – Sales Order
• Entities
• Relationships

• Use a business object based approach?
Next step - creating the data

• Few rules - a lot of experience
• Can get quite complex (paramount for the
  speed of the DB)
• Tables must be normalised - ie redundancy
  is limited to the strict minimum by an
• In practice, normalisation is not always the
       Data Structure Diagrams
• Describe the underlying structure of the DB: the
  complete logical structure
• Data items are stored in tables linked by pointers
   – attribute pointers: data fields in one table that will link
     it to another (common information)
   – logical pointers: specific links that exist between
• Tables have a key
• Is it an attribute or an entity?
    ORDER              Customer
    order number       Customer number
2   Item description   Customer name
    Item Price         Customer address
    Quantity ordered   Customer balance
3   Customer number    Customer special rate
    Item number


    Item number
    Item description
    Item cost           * compulsory attributes
    Quantity on hand    0 optional attributes
• Process of simplifying the relationships amongst data
  items as much as possible (see example provided -
• Through an iterative process, structure of data is
  refined to 1NF, 2NF, 3NF etc.
• Reasons for normalisation:
    – to simplify retrieval (speed of response)
    – to simplify maintenance (updates, deletion,
    – to reduce the need to restructure the data for each
      new application
       First Normal Form
• design record structure so that each record
  looks the same (same length, no repeating
• repetition within a record means one relation
  was missed = create new relation
• elements of repeating groups are stored as a
  separate entity, in a separate table
• normalised records have a fixed length and
  expanded primary key
     Second Normal Form
• Record must be in first normal form first
• each item in the record must be fully
  dependent on the key for identification
• Functional dependency means a data
  item’s value is uniquely associated with
• only on-to-one relationship between
  elements in the same file
• otherwise split into more tables
        Third normal form
• to remove transitive dependencies
• when one item is dependent on an item
  which is dependent from the key in the file
• relationship is split to avoid data being lost
• this will give greater flexibility for the design
  of the application + eliminate deletion
• in practice, 3 NF not used all the time -
  speed of retrieval can be affected
       Beyond data modeling
• Model must be normalised
  – Optimised model
  – “no surprise” model
  – resilience
• Outcome is a set of tables = logical design
• Then, design can be warped until it meets
  the realistic constraints of the system
• Eg: what business problem are we trying
  to solve? – see handout [riccardi p. 113, 127]
        Realistic constraints
• Users cannot cope with too many tables
• Too much development required in hiding
  complex data structure
• Too much administration
• Optimisation is impossible with too many
• Actually: RDBs can be quite slow!
      Key practical questions
• What are the most important tasks that the
  DB MUST accomplish efficiently?
• How must the DB be rigged physically to
  address these?
• What coding practices will keep the coding
  clean and simple?
• What additional demands arise from the
  need for resilience and security?
Analysis - Three Levels of Schema

    External Schema 1 External Schema 2 External Schema …

 Tables                       Logical Schema

                       Internal Schema
        4 way trade-off


Performance                     Ease of use

              Clarity of code
               Key decisions
• Oracle offers many different ways to do things
  – Indexes
  – Backups…
• Good analysis is not only about knowing these
  => understanding whether they are appropriate
• Failure to think it through => unworkable model
• Particularly, predicting performance must be
  done properly
  – Ok on the technical side, tricky on the business side
          Design optimisation
• Sources of problems:
   – Network traffic
   – Excess CPU usage
• But physical I/O is greatest threat (different from
  logical I/O)
• Disks still the slowest in the loop
• Solution: minimise or re-schedule access
• Also try to minimise the impact of Q4 (e.g.
  mirroring, internal consistency checks…)
     Using scenarios for analysis
•   Define standard situation for DB use
•   Analyse their specific requirements
•   Understand the implications for DB design
•   Compare and contrast new problems with
    old ones
Categories of critical operations
• Manual transaction processing = complex DE by
  small number of operators
• Automatic transaction processing: large number
  of concurrent users performing simple DE
• High batch throughput: automatic batch input
  into DB of very large number of complex
• Data warehousing: large volumes of new data
  thrown on top every day at fixed intervals +
  intensive querying
 Manual transaction processing
• Insurance telemarketing broker
  – Data entry
  – Retrieving reference info
  – Calculations
• On-line human-computer interaction!!
  – Instant validation (field by field)
  – Drop-down lists (DE accelerators)
• Quick response time
• Critical issue = user-friendly front end, but
  minimise traffic between interface and back end!
        Automatic transaction
• Large number of user performing simple tasks
• Real-time credit card system (e.g. authorisation)
  or check out (EPOS)
• Human interaction at its most simple – eg typing
  a code or swiping a card
• Minimum validation, no complex feed back…
• Large numbers mean potential problems are:
  – Connection opening / closing rate
  – Contention between concurrent users
  – SQL engine pbs + data consistency costs
• Design with multiple servers
       Automatic transaction
• Another eg: on-line shopping
• What specific problems would arise from
  shopping cart type applications?
• How do you handle lost customers?
 High batch throughput
• Eg mobile phone network operator
• Real time + huge volume of simultaneous
  complex transactions
  –   Number checks
  –   Account info
  –   Price info
  –   Pattern checks
• Large processing capacity required + need to
  tackle all transactions together in batches
  – DB query may not be only solution (or quickest)
  – Move customer account to cache
  – Copy updated figures for accounts to a log and
    updated accounts in slack periods (2.5GB an
             “Data warehouse”
• Huge store of data
• Large volume added every day
  – 99% new data, 1% corrections to existing data
• Substantial analysis required prior to
  – What to include
  – How to aggregate and organise it
  – Where data comes from
• Real Oracle territory because schedule is lax – ie
  not a real time application
• Key issues:
  – Getting partitioning right
  – Deciding how many summary levels
• Oldest trick in the book to speed up retrieval
   – Smaller bunch of data
   – Well labeled so it can be easily found
   – Smaller index
• Data manipulation – maintenance, copy and
  protection far easier
• Break down big problem (eg table) into small
          Internet Databases
• In between types 1 and 2
  – Many concurrent sessions
  – Reduced interaction front end back end
  – Internet = Extra response time (2 secs!)
• In practice, many sites are quite slow
• Key issues
  – “thin client”
  – Reduced dialogue
  – Management of sessions (eg coockies) to avoid
    multiple restarts
      Conclusion: Key issues
• At one end: very large numbers of small
• Threat of network or process contention
• At other end: small number of processes
  with complex data crunching and time
• Design of DB and application must reflect
  these constraints

To top