Docstoc

Slide 1 - envgennoxacuk

Document Sample
Slide 1 - envgennoxacuk Powered By Docstoc
					Implementing a Relational Database
Joe Wood anjw@ceh.ac.uk

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Implementing a relational database
establish requirements Data Requirements data analysis Conceptual Data Model database design Logical Schema Specification implement

schema and database

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Our design so far:
relation PublicDatabase DatabaseName: String DataType: String URL: String primary key: DatabaseName relation Sequence AccessionNumber: String ID: String primary key: AccessionNumber foreign key: SourceDatabase references PublicDatabase foreign key: SourceOrganism references Organism relation Organism OrganismNo: Integer Species: String Strain: String GenomeSeq: Boolean CommonName: String primary key: OrganismNo relation Feature FeatID: String Name: String primary key: FeatID foreign key: SourceSequence references Sequence

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Implementing our database
• Most of hard work is already done • Create relations using SQL • Define the data types for our columns • Define primary and foreign keys • Add constraints • Add any appropriate default values

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Our design so far:
relation PublicDatabase DatabaseName: String DataType: String URL: String primary key: DatabaseName

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Create Table
relation PublicDatabase DatabaseName: String DataType: String URL: String primary key: DatabaseName

CREATE TABLE publicdatabase ( databasename datatype url );

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Add data types
Reminder: • Numerical • integer,float,numerical • String/Text • varchar,text • Date/Time • timestamp,date • Boolean

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Add data types
relation PublicDatabase DatabaseName: String DataType: String URL: String primary key: DatabaseName

CREATE TABLE publicdatabase ( databasename varchar(50), datatype varchar(20), url varchar(200) );

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Primary Keys
relation PublicDatabase DatabaseName: String DataType: String URL: String primary key: DatabaseName

CREATE TABLE publicdatabase ( databasename varchar(50), datatype varchar(20), url varchar(200), primary key (databasename) );

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Foreign Keys
relation Sequence AccessionNumber: String ID: String primary key: AccessionNumber foreign key: SourceDatabase references PublicDatabase foreign key: SourceOrganism references Organism

CREATE TABLE sequence ( accessionnumber varchar(50), id varchar(50), sourcedatabase varchar(30), sourceorganism integer, primary key (accessionnumber), foreign key (sourcedatabase) references publicdatabase(databasename), foreign key (sourceorganism) references organism(organismnumber) );

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Foreign Keys
relation Sequence AccessionNumber: String ID: String primary key: AccessionNumber foreign key: SourceDatabase references PublicDatabase foreign key: SourceOrganism references Organism

CREATE TABLE sequence ( accessionnumber varchar(50), id varchar(50), sourcedatabase varchar(30), sourceorganism integer, primary key (accessionnumber), foreign key (sourcedatabase) references publicdatabase, foreign key (sourceorganism) references organism );

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

feature table
CREATE TABLE feature ( featid varchar(50), name varchar(100), sourcesequence varchar(50), primary key (featid), foreign key (sourcesequence) references sequence(accessionnumber) );

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

organism table
CREATE TABLE organism ( organismnumber integer species varchar(100), strain varchar(100), genomeseq boolean, commonname varchar(100), primary key (organismnumber) );

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Constraints
• Constraints restrict the values that can be inserted or updated in columns • Types of constraints • NOT NULL • UNIQUE • Simply add to column definition
url varchar(100) NOT NULL or url varchar(100) UNIQUE

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Constraints
• NOT NULL and UNIQUE implicit when primary key • CHECK constraint
numberoflegs integer check (numberoflegs>2)

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

publicdatabase table
CREATE TABLE publicdatabase ( databasename varchar(50), datatype varchar(20), url varchar(200) UNIQUE, primary key (databasename) );

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Constraints
• Foreign key constraints • To keep links between tables working you need to preserve the matching values – referential integrity • These constraints automatically set up for you by RDBMS when you declare the primary and foreign keys • This will prevent you from deleting a record with a primary key before you have deleted all the child foreign key records

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Constraints
• Example from 'BigHit' database

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Constraints

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

organism table
CREATE TABLE organism ( organismnumber integer, species varchar(100), strain varchar(100), genomeseq boolean, commonname varchar(100), primary key (organismnumber) );

How shall we create the unique primary key values for organismnumber?

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Sequences
• Could do complicated checks to work out the last number entered, add 1 and insert for each new record • Sequence is a database object in PostgreSQL which essentially an automatically incrementing numeric value (equivalent to 'autonumber' in Access)
CREATE SEQUENCE 'my_seq' (can specify increment,min and max) SELECT NEXTVAL('my_seq')

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Default values
• Still don't want to have to select the value each time • Can set a default value for column which is automatically filled in every time a record is inserted
CREATE TABLE organism ( organismnumber integer DEFAULT NEXTVAL('my_seq'), species varchar(100), strain varchar(100), genomeseq boolean, commonname varchar(100), primary key (organismnumber) );

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Create your database
• To create your database run your SQL table and other object creation statements in a single script • Example - demodatabase.sql • Be sure create tables in the right order • Can't create table that refers to a primary key in a table that doesn't exist yet • You also need data...

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Populate your database
• Insert data using INSERT sql statements
INSERT INTO organism (species,strain,genomeseq,commonname) VALUES ('Oryctolagus cuniculus',NULL,'false','rabbit');

• Default values will inserted automatically
CREATE TABLE organism ( organismnumber integer DEFAULT NEXTVAL('my_seq'), ...

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Populate your database
• Be sure to insert data in correct order • Don't try and insert a foreign key value when the primary key value hasn't been inserted yet • Run the demodatabase.sql script

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Querying your database
• Now that your database is set up and data has been inserted we can query it

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Database Name

Sequence Type

ID
Phosphorylase B kinase alpha regulatory chain

Accession Number KPB1_Rabit

Organism Rabbit

Swissprot UniProt Swissprot Swissprot EMBL dbEST TrEMBL EMBL PDB

Protein Protein Proten protein Nucleotide Nucleotide protein Nucleotide
protein_structure

Troponin T Glycogen phosphorylase Troponin I rabbit muscle phosphorylase mrna

Trt3_rabit PHS2_RABIT TRIC_RABIT OCPHOS2 rabbit rabbit

CK829726
pol protein Rabbit phosphorylase Glycogen Phosphorylase

Rabbit Rabbit Rabbit rabit

Q8MJF7 OXPKA 1ABB

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Querying your database

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Querying your database

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Querying your database

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Querying your database
• What have we gained? • No data redundancy • Data is consistent • Enforced quality control – no missing data • Only have change data once • Flexibility to run a variety of queries

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Views
• Views are queries that are saved in the database as objects • Appear much like a table which can be queried in the same way • Good if underlying query is very complex
CREATE VIEW viewname AS query

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Views

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Views

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Indices
• Searching data is slow • Indices make this searching faster • Implicit indices are set up for primary keys as these are used a lot for searching data

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Indices
• An index can be created on any column
CREATE INDEX orgname_idx on organism (commonname)

• An index is helpful on a column that is regularly searched on (i.e. Used in the WHERE clause)

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Normalisation
• Normalisation is the process of reducing data redundancy in your database design • Denormalisation increases data redundancy but is sometimes necessary to increase the usability of a database • Example – ensembl

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Denormalisation
select gene.seq_region_id, gene_stable_id.stable_id, transcript_stable_id.stable_id, xref.display_label from external_db inner join xref on external_db.external_db_id = xref.external_db_id inner join object_xref on xref.xref_id = object_xref.xref_id inner join translation on object_xref.ensembl_id = translation.translation_id inner join transcript on translation.transcript_id = transcript.transcript_id inner join gene on gene.gene_id = transcript.gene_id inner join seq_region on gene.seq_region_id = seq_region.seq_region_id inner join coord_system on seq_region.coord_system_id = coord_system.coord_system_id left outer join gene_stable_id on gene_stable_id.gene_id = gene.gene_id left outer join transcript_stable_id on transcript.transcript_id = transcript_stable_id.transcript_id where coord_system.name = 'chromosome' and seq_region.name = '1' and gene.seq_region_end < 30000000 and external_db.db_name = 'SWISSPROT';

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Denormalisation

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk

Denormalisation
• Implement a complex query as a table • Need to set up “triggers” to populate table and maintain data integrity • Triggers could run on each insert or as bulk loading processes • This increases database administration but can increase its usability • Note that this different from a view which is a saved query

Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk


				
DOCUMENT INFO
Shared By:
Stats:
views:43
posted:1/22/2010
language:English
pages:40
Description: Slide 1 - envgennoxacuk