Docstoc

BioPerl

Document Sample
BioPerl Powered By Docstoc
					SQL, Databases, and
 Ensembl Modules




                      1
     Please look for next lecture
Ensembl API Tutorial:
http://www.ensembl.org/info/software/core/core_tutorial.html




                                                               2
       Survey of Databases
                 (if time)

Ensembl web interface
Ensembl Biomart




                             3
                          RDBMS
• Oracle and Sybase (many others)
   – industry standard, commercial products
   – development and management tools
• PostgresSQL
   – full-featured relational DBMS
   – open source
   – found in most linux distributions
   – handles unusual datatypes well which adds flexibility for future
     extensions
• MySQL
   – open-source relational DBMS
   – easy to setup and use
   – Linux/Windows/Mac
• Each has variations
    – SQL, datatypes, functions, features

                                                                        4
    Structured Query Language (SQL)
• SQL is the standard language used to create, modify,
  maintain, and query relational databases.
• SQL commands are issued within the context of the
  DBMS interface
• SQL commands can be passed to the DBMS by other
  programs (such as perl, PHP, Java, manually, etc.)

•   Many venders have extensions (Sybase-SQL)
•   SQL Example:
•   select FIELD from TABLE where CONDITION
•   select TABLE.FIELD from TABLE where CONDITION
•   select program from software where program like
    'blast%';                                            5
            SQL datatypes
• datatype -- describes the data stored in a
  particular column of a table
• typically is either numeric or character
  strings
• SQL defines subtypes that set different
  upper limits on the size of text or
  numerical data
• Also -- special types such as DATE,
  MONEY
                                               6
               SQL datatypes
• INT
• FLOAT
• REAL (larger float)
• CHAR -- fixed length text string
• TEXT -- variable length text
• BLOB -- variable length binary field
• DECIMAL -- real number stored as character string
• DATE
• TIMESTAMP -- value updates every time the record is
  modified
• ENUM -- limited set of options (numeric or named)
• SET -- value is one of a limited set
                                                        7
     Creating a new table in the
             database
CREATE TABLE tablename (columnane type [modifiers],
   columanme type [modifiers])
create table software_package (
  packid INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
   packname VARCHAR(100),
   packurl VARCHAR(255),
   function TEXT,
   keyword ENUM,
   os SET,
   format SET,
   archfile VARCHAR(255) );
                                                8
               Inserting data
INSERT INTO table(colname1, colname2, columname3)
  VALUES ('value1','value2','value3')




                                                    9
   Accessing Data -- SELECT
SELECT [fields] FROM [table] where [condition]
select * from [table] where [condition]

select packname, packurl from software

select packname, packurl from software where
  keyword = "sequence alignment";
select packname, packurl from software where
  packname like "B%";
                                                 10
         Joining multiple tables
• SELECT can "join" two related tables
• relationships between tables are created by
  replicating information ("primary key") from one
  table as a "foreign key" in another table

select attribute from table1, table2 where primarykey1 =
  foreignkey2

SELECT packname, packurl, reference_date FROM
  software, reference WHERE software.package_id =
  reference.package_id AND reference_date >= 1998;
                                                           11
                                              Example
Agents
AgentID      First        Last             HireDate          Phone
       100   Mike         Hernandez                5/16/95   553-3992
       101   Greg         Piercy                 10/15/95    790-3992
       102   Kathy        Ehrlich                   3/1/96   351-4993

Clients
ClientID    AgentID       First            Last              Phone      City
       9001        100    Stew art         Jameson           553-3999   El Paso
       9002        101    Shannon          McLain            555-5555   Atlanta
       9003        102    Estela           Pundt             541-4993   El Paso

Entertainers
EntertainerIDAgentID      First            Last              É.
      3000          100   John             Slade             É
      3001          101   Mark             Jebavy            É
      3002          102   Teresa           Weiss             É

Engagements
ClientID    EntertainerIDEngagementDate                      StartTime StopTime
       9003       3001            4/1/96                        1:00 PM       3:30 PM
       9002       3000           4/13/96                        9:00 PM       1:30 AM
       9001       3002            5/2/96                        3:00 PM       6:00 PM



       select Clients.Last, Clients.First, Clients.Phone from Clients where
       Clients.City = “El Paso” order by Clients.Last, Clients.First

       select Last, First, Phone from Clients where City = "El Paso" order
       by Last, First                                                                   12
                         SQL Basics
•   select * from table_name
•   select count(*) from table_name
•   Find start times of Entertainers where Jameson is the Client
•   Nested query (may not always be supported)
     – select StartTime from Engagements where ClientID = (select ClientID
       from Clients where Last = “Jameson”)
• Join
     – select Engagements.StartTime, Clients.ClientID from Engagements,
       Clients where Engagements.ClientID = Clients.ClientID and Client.Last
       = “Jameson”
     – select Engagements.StartTime from Engagements,Clients where
       Engagements.ClientID = Clients.ClientID and Client.Last = “Jameson”
• insert into TABLE values (val1, val2, …)
• insert into TABLE (FIELDa, FIELDb, …FIELDn) values (value1,
  value2, …)
• insert into Clients values (“Terry”, “Braun”, 11/10/03,555-6666)

                                                                             13
                                 …as a spreadsheet
Agen                                   Client                                         Enter   Enter                   Enga     Enga
ts                                     s                                              taine   taine                   geme     geme
                                                                                         rs      rs                     nts      nts

Agen   First   Last     HireD   Phon   Client   Agen   First   Last    Phon   City    Enter   Agen    First   Last    Client   Enter   Enga    Start    Stop
ts                      ate     e      ID       tID                    e              taine   tID                     ID       taine   geme    Time     Time
                                                                                      rID                                      rID     ntDat
                                                                                                                                       e

 100   Mike    Hern     5/17/   553-   9001      100   Stew    Jame    553-   El      3000     100    John    Slade   9001     3002    5/3/0    3:00    6:00
               ande        99   3992                   art     son     3999   Paso                                                         0      PM      PM
               z

 101   Greg    Pierc    10/1    790-   9002      101   Shan    McLai   555-   Atlan   3001     101    Mark    Jebav   9002     3000    4/14/    9:00    1:30
               y        6/99    3992                   non     n       5555   ta                              y                           00      PM     AM

 102   Kath    Ehrlic   3/2/0   351-   9003      102   Estel   Pund    541-   El      3002     102    Teres   Weis    9003     3001    4/2/0    1:00    3:30
       y       h            0   4993                   a       t       4993   Paso                    a       s                            0      PM      PM




   Flattened table structure, and lined up the rows.


   Appears to be okay….
   Except…
   The relationships in this trivial example are more complicated than a
   simple flat structure such as a spreadsheet.
                                                                                                                                                   14
   Example) The first Engagement is between Client 9001, and Entertainer 3002
            SQL Basics
delete TABLE where FIELD = value

update TABLE set FIELD = value

drop DATABASE




                                   15
        exon                         intron                         primer_pair
        transcript_id                transcript_id                  id
        exon_num                     intron_num                     transcript_id
        sequence_start               sequence_start                 left_primer_id
        sequence_stop                sequence_stop                  right_primer_id


 project                               transcript                                     sequence
                                       id                                             id
 id
                                       sequence_id                                    target_id
 name                                  source                                         type
                                       source_id                                      sequence
 description
                                                                                      chr_name
                                                      target
 date                                                                                 strand
                                                      id
                                                                                      genomic_start
                                                      date
                                                                                      genomic_stop
                                                      gene_name
                         set_target_join                                              source
set_table
                                                      description
                         set_id                                                       source_id
id
                                                      accession
project_id               target_id                                                    refresh
                                                      status
                         rank
name
                         cas_rank
date
                         cas_options
description                                                                                           16
                                                                    Sample Data
        exon                         intron                         primer_pair
        transcript_id                transcript_id                  id
        exon_num = 3                 intron_num = 3                 transcript_id
        sequence_start               sequence_start                 left_primer_id
        sequence_stop                sequence_stop                  right_primer_id


 project                               transcript                                     sequence
                                       id                                             id
 id
                                       sequence_id                                    target_id
 name = pro1                           source = Ensembl                               type = nucleotide
                                       source_id                                      sequence = ATG…
 description
                                                                                      chr_name = 15
                                                      target
 date                                                                                 strand = 1
                                                      id
                                                                                      genomic_start = 15,123,120
                                                      date
                                                                                      genomic_stop = 16,378,131
                                                      gene_name = BBS4
                         target_set_info                                              source
set_table
                                                      description
                         set_id                                                       source_id
id
                                                      accession
project_id               target_id                                                    refresh
                                                      status
                         rank = 5
name = testset
                         cas_rank
date
                         cas_options
description                                                                                                   17
    Extracting Data from a DB
• This simple figure provides me with
  enough understanding of the database
  structure to extract data




                                         18
               Local SQL Example:
• We have gene name (BBS4), and would like to find project name,
  and the exon number, start/stops
• select id from target where gene_name = "BBS4"
    – 11
• select set_table.id from set_table, target_set_info where
  target_set_info.set_id = set_table.id and target_set_info.target_id =
  TARGET_ID
    – 3
• select project.name from project, set_table where
  set_table.project_id = project.id and set_table.id = SET_ID
    – project333
• select id from sequence where target_id = TARGET_ID
    – 33
• select id from transcript where sequence_id = SEQUENCE_ID
    – 68
• select exon_num, sequence_start, sequence_stop from exon where
  transcript_id = TRANSCRIPT_ID

• Whew.
                                                                      19
        MySQL demo (gscr)
• mysql -h gscr -u tabraun –p
• use trapss;




                                20
        TrAPSS Database
Show tables;
Show columns from table TABLE_NAME;




                                      21
        exon                         intron                         primer_pair
        transcript_id                transcript_id                  id
        exon_num = 3                 intron_num = 3                 transcript_id
        sequence_start               sequence_start                 left_primer_id
        sequence_stop                sequence_stop                  right_primer_id


 project                               transcript                                     sequence
                                       id                                             id
 id
                                       sequence_id                                    target_id
 name = pro1                           source = Ensembl                               type = nucleotide
                                       source_id                                      sequence = ATG…
 description
                                                                                      chr_name = 15
                                                      target
 date                                                                                 strand = 1
                                                      id
                                                                                      genomic_start = 15,123,120
                                                      date
                                                                                      genomic_stop = 16,378,131
                                                      gene_name = BBS4
                         target_set_info                                              source
set_table
                                                      description
                         set_id                                                       source_id
id
                                                      accession
                         target_id                                                    refresh
project_id
                                                      status
                         rank = 5
name =testset
                         cas_rank
date
                         cas_options
description

     select id from target where gene_name = "BBS4";                                                          22
         exon                         intron                         primer_pair
         transcript_id                transcript_id                  id
         exon_num = 3                 intron_num = 3                 transcript_id
         sequence_start               sequence_start                 left_primer_id
         sequence_stop                sequence_stop                  right_primer_id


  project                               transcript                                     sequence
                                        id                                             id
  id
                                        sequence_id                                    target_id
  name = pro1                           source = Ensembl                               type = nucleotide
                                        source_id                                      sequence = ATG…
  description
                                                                                       chr_name = 15
                                                       target
  date                                                                                 strand = 1
                                                       id = 23
                                                                                       genomic_start = 15,123,120
                                                       date
                                                                                       genomic_stop = 16,378,131
                                                       gene_name = BBS4
                          target_set_info                                              source
 set_table
                                                       description
                          set_id                                                       source_id
 id
                                                       accession
                          target_id = 23                                               refresh
 project_id
                                                       status
                          rank = 5
 name = testset
                          cas_rank
 date
                          cas_options
 description
select set_table.id from set_table, target_set_info where target_set_info.set_id =
set_table.id and target_set_info.target_id = 23; (3 rows ==> in 3 sets)           23
         exon                         intron                         primer_pair
         transcript_id                transcript_id                  id
         exon_num = 3                 intron_num = 3                 transcript_id
         sequence_start               sequence_start                 left_primer_id
         sequence_stop                sequence_stop                  right_primer_id


 project                                transcript                                     sequence
                                        id                                             id
 id
                                        sequence_id                                    target_id
 name = pro1                            source = Ensembl                               type = nucleotide
                                        source_id                                      sequence = ATG…
 description
                                                                                       chr_name = 15
                                                       target
 date                                                                                  strand = 1
                                                       id
                                                                                       genomic_start = 15,123,120
                                                       date
                                                                                       genomic_stop = 16,378,131
                                                       gene_name = BBS4
                          target_set_info                                              source
set_table
                                                       description
                          set_id =                                                     source_id
id = 7
                                                       accession
                          target_id                                                    refresh
project_id
                                                       status
                          rank = 5
name =testset
                          cas_rank
date
                          cas_options
description

  select project.name from project, set_table where set_table.project_id =                                     24
  project.id and set_table.id = 7;
        exon                         intron                         primer_pair
        transcript_id                transcript_id                  id
        exon_num = 3                 intron_num = 3                 transcript_id
        sequence_start               sequence_start                 left_primer_id
        sequence_stop                sequence_stop                  right_primer_id


 project                               transcript                                     sequence
                                       id                                             id
 id
                                       sequence_id                                    target_id = 23
 name = pro1                           source = Ensembl                               type = nucleotide
                                       source_id                                      sequence = ATG…
 description
                                                                                      chr_name = 15
                                                      target
 date                                                                                 strand = 1
                                                      id
                                                                                      genomic_start = 15,123,120
                                                      date
                                                                                      genomic_stop = 16,378,131
                                                      gene_name = BBS4
                         target_set_info                                              source
set_table
                                                      description
                         set_id                                                       source_id
id
                                                      accession
                         target_id                                                    refresh
project_id
                                                      status
                         rank = 5
name = testset
                         cas_rank
date
                         cas_options
description

     select id from sequence where target_id = 23;                                                            25
        exon                         intron                         primer_pair
        transcript_id                transcript_id                  id
        exon_num = 3                 intron_num = 3                 transcript_id
        sequence_start               sequence_start                 left_primer_id
        sequence_stop                sequence_stop                  right_primer_id


 project                               transcript                                     sequence
                                       id                                             id
 id
                                       sequence_id = 2216                             target_id
 name = pro1                           source = Ensembl                               type = nucleotide
                                       source_id                                      sequence = ATG…
 description
                                                                                      chr_name = 15
                                                      target
 date                                                                                 strand = 1
                                                      id
                                                                                      genomic_start = 15,123,120
                                                      date
                                                                                      genomic_stop = 16,378,131
                                                      gene_name = BBS4
                         target_set_info                                              source
set_table
                                                      description
                         set_id                                                       source_id
id
                                                      accession
                         target_id                                                    refresh
project_id
                                                      status
                         rank = 5
name = testset
                         cas_rank
date
                         cas_options
description

     select id from transcript where sequence_id = 2216;                                                      26
        exon                           intron                         primer_pair
        transcript_id = 5053           transcript_id                  id
        exon_num = 3                   intron_num = 3                 transcript_id
        sequence_start                 sequence_start                 left_primer_id
        sequence_stop                  sequence_stop                  right_primer_id


 project                                 transcript                                     sequence
                                         id= 5053                                       id
 id
                                         sequence_id                                    target_id
 name = pro1                             source = Ensembl                               type = nucleotide
                                         source_id                                      sequence = ATG…
 description
                                                                                        chr_name = 15
                                                        target
 date                                                                                   strand = 1
                                                        id
                                                                                        genomic_start = 15,123,120
                                                        date
                                                                                        genomic_stop = 16,378,131
                                                        gene_name = BBS4
                           target_set_info                                              source
set_table
                                                        description
                           set_id                                                       source_id
id
                                                        accession
                           target_id                                                    refresh
project_id
                                                        status
                           rank = 5
name = testset
                           cas_rank
date
                           cas_options
description

     select exon_num, sequence_start, sequence_stop from exon where                                             27
     transcript_id = 5053;
SQL Examples – One Large Query
select distinct exon_num, project.name,
  sequence_start, sequence_stop from exon,
  transcript, sequence, target, target_set_info,
  set_table, project where target.gene_name =
  "BBS4" and project.name = "pro1" and
  set_table.name = "testset" and
  set_table.project_id = project.id and
  target_set_info.set_id = set_table.id and
  target_set_info.target_id = target.id and
  sequence.target_id = target.id and
  transcript.sequence_id = sequence.id and
  exon.transcript_id = transcript.id order by
  exon_num;                                        28
Executing SQL commands from
           a text file
          http://www.mysql.org/doc/refman/5.0/en/batch-commands.html
•    However, it is also possible to put your SQL statements in a file and then
    tell mysql to read its input from that file. To do so, create a text file text_file
    that contains the statements you wish to execute. Then invoke mysql as
    shown here:

•   shell> mysql db_name < text_file

•   If you place a USE db_name statement as the first statement in the file, it is
    unnecessary to specify the database name on the command line:

•   shell> mysql < text_file

•   If you are already running mysql, you can execute an SQL script file using
    the source or \. command:

•   mysql> source file_name
•   mysql> \. file_name                                                                   29
               Remotely
mysql -h gscr.eng.uiowa.edu -u tabraun -p




                                            30
             Setting up MySQL
• server -- database application that is running on
  the machine where the data is stored
  – runs as a "daemon" on Unix machines
  – a process that is always on, listening for and
    responding to requests (from clients)
  – mysqld -- server
• client -- programs that connect to the server and
  request data
  –   can be on same machine as server, or
  –   located on other machines and connected by network
  –   mysql -- client
  –   mysqladmin -- administration tool                31
             MySLQ help
http://dev.mysql.com/

Show tables;
Show columns from table TABLE_NAME;

Show TrAPSS (time permitting)


                                      32
       Survey of Databases
                 (if time)

Ensembl web interface
Ensembl Ensmart




                             33

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:3
posted:9/29/2011
language:English
pages:33