Docstoc

AdvancedSQLQuerying

Document Sample
AdvancedSQLQuerying Powered By Docstoc
					Advanced SQL Querying of Bioinformatics Databases
Yannick Pouliot, PhD
Bioresearch Informationist lanebioresearch@stanford.edu Lane Medical Library & Knowledge Management Center 3/5/2008

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

The Bioinformationist: At Your Service


Yannick Pouliot, Lane Bioresearch Informationist


≈ computational biologist in residence



Role: Support laboratory researchers regarding biocomp resources and their use


…especially postdocs



lanebioresearch@stanford.edu

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

2

Objectives


Understanding biologically-meaningful complex SQL queries
 

How they are built What to expect

And please: Don’t get hung up - ask questions

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

3

So Why Are We Here?

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

4

Bioinformatics Databases: Who Supports Direct Querying? Direct Queryability of Selected Bioinformatics Databases
Database ArrayExpressWarehouse Direct Querying? Eventually How? http://www.ebi.ac.uk/aedw/ Modality SQL DB Engine Oracle

BioWarehouse

Yes

http://biowarehouse.ai.sri.com/ SQL - need account http://www.ensembl.org/info/d SQL ata/download.html ask for account SQL

Oracle/MySQL

Ensembl

Yes

MySQL

Mouse Genome Database

Yes

Sybase

PharmGKB Saccharomyces Genome Database

Yes

http://www.pharmgkb.org/hom SOAP-based e/projects/webservices/

Oracle

EventuallyMaybe

Oracle Oracle
5

Stanford Microarray Database No
Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

Flavors of SQL




ANSI SQL: The “standard” Vendor-specific variants of SQL exist
 



Mostly peripheral add-ons to the language Some minor changes in syntax (MS Access) Examples:
  

Oracle MySQL MS Access

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

6

Comparing Vendor SQLs
Statistical functions Oracle Ext ensive Data mining Large collect ion of dat a mining t ools (neural net s, decision t rees, Other Installation Pow erf ul regular Somew hat non-t rivial expressions t ext search f or beginners t o t o engine inst all Comput at ional biology f unct ions (e.g., BLAST) Free graphical dat a mining environment Need t o w rit e your ow n f unct ions f or st at ist ical analysis, dat a mining SQL somet imes dif f ers f rom " norm" Weaknesses Memory leaks in new er f unct ions common

MySQL

Limit ed

None

Easy

MS Access

Good

Use Excel

Trivial

Noteworthy: MySQL has limited statistical functions
→ E.g., no median function
Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

7

DBs and Tools Discussed Here
Databases


BioWarehouse


SRI International, Menlo Park
European Bioinformatics Institute, UK



Ensembl


Querying tools


Generic query generators
 

WinSQL MySQL Query Browser Excel MATLAB - FAQ
Spotfire – FAQ Excel
MATLAB
8



Calculation tools:
 



Visualization tools
 


Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

A few words on query builders

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

9

MATLAB’s Visual Query Builder
 

Simple to use Great way to import data into powerful calculating tool (MATLAB)

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

10

MS Query: Excel’s SQL Query Builder
Pros 1. Part of MS Excel (“free”) 2. Returns data straight into Excel 3. Combines data safety of relational DBs with calculating power of Excel See Lane class Biodata Management Using MS Access for extensive description
Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

Cons 1. MS Query can feel retarded at times 2. Works best with MS database products…

11

Selected MySQL Resources
Invaluable: VisiBone’s MySQL CheatSheet

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

12

And now for increasingly complex SQL queries using MySQL
Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

13

Caveat: Publically-Queryable SQL Databases (“Internet querying”) Have Limitations


 

Often only partially populated → can’t get results for queries that should otherwise return data (PublicHouse) May not have enough compute power (Ensembl) May not be sufficiently indexed → slow/impossible to get result (Ensembl)

So why bother? These DBs are useful for…  Quick/simple queries  Learning system before installing
Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

14

SELECT Statement: A Refresher
SELECT [DISTINCT] select_list FROM table_list

[WHERE select_list]
[START WITH] [CONNECT BY] [GROUP BY parameter] [HAVING search_conditions] [ORDER BY order_list [ASC | DESC] ]

Functions usable in select_list: summarization:  DISTINCT  COUNT  SUM  MIN/MAX  AVERAGE/STDEV  LIKE (used in WHERE clause) “Modifiers” of select_list:  ORDER BY  LIMIT

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

Full description of select

15

Data mining queries using PublicHouse
(a publically accessible instance of BioWarehouse)

(MySQL-based)

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

16

BioWarehouse: An Integration of Multiple Public Domain Bioinformatics Databases

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

17

BioWarehouse: Inter-Relating Heterogeneous Biological Data Types

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

18

Caution #1: PublicHouse is only partially populated → sometimes can’t get results for queries that would otherwise return data

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

19

Caution #2: Think Before Querying


Easy to write queries that…



… retrieve nonsense … never complete


Scotty to Captain Kirk: “Where going in circles, and at warp 6, we’re going mighty fast…”



Understanding schema is only way to prevent this
Use “count” to determine the number of rows in table BEFORE returning large datasets Remember: the more tables are joined, the slower the query If fields not indexed, response time may be 1000X slower
Lane Medical Library & Knowledge Management Center http://lane.stanford.edu



Tips






20

Query #1: Return proteins from CMR with predicted molecular weight between 125 and 135 AND predicted pI between 4.7 and 7.7

SELECT d.Name as Species, b.Name as ProteinName, b.MolecularWeightCalc FROM BioSourceWIDProteinWID a, Protein b, BioSource c, Taxon d WHERE a.ProteinWID=b.WID and a.BiosourceWID=c.WID and c.TaxonWID=d.WID and b.DataSetWID=5 and (b.MolecularWeightCalc >= 125 and b.MolecularWeightCalc <= 135) and (b.PICalc >= 4.7 and b.PICalc <= 7.7)

group by Species order by Species, b.MolecularWeightCalc;
Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

21

Query #2: Same as Query #1, But Count The Number Of Proteins Per Species

select count(b.wid) as NumberOfProteins, d.Name as Species from BioSourceWIDProteinWID a, Protein b, BioSource c, Taxon d where a.ProteinWID=b.WID and a.BiosourceWID=c.WID and c.TaxonWID=d.WID and b.DataSetWID=5 and (b.MolecularWeightCalc >= 125 and b.MolecularWeightCalc <= 135) and (b.PICalc >= 4.7 and b.PICalc <= 7.7) group by Species order by NumberOfProteins desc, Species asc
Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

22

Query #3: Same as Query #1, But Calculate Range of Protein Mol Weight as a Function of Species

select d.Name as Species, avg(b.MolecularWeightCalc) as AvgMolWeight, min(b.MolecularWeightCalc) as MinMolWeight, max(b.MolecularWeightCalc) as MaxMolWeight from BioSourceWIDProteinWID a, Protein b, BioSource c, Taxon d where a.ProteinWID=b.WID and a.BiosourceWID=c.WID and c.TaxonWID = d.WID and b.DataSetWID=5 and (b.PICalc >= 4.7 and b.PICalc <= 7.7) group by Species order by AvgMolWeight asc, Species asc;
Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

23

Query #4: Generate Frequency Distribution For Avg Mol Weight ( Variant of Query #3)
select If(b.MolecularWeightCalc < e.AvgMolWeight, "Below avg", If(b.MolecularWeightCalc >= e.AvgMolWeight, "Above avg", "" )) AS MolWeightCalc, COUNT(*) AS Frequency from BioSourceWIDProteinWID a, Protein b, BioSource c, Taxon d, ( select avg(b.MolecularWeightCalc) as AvgMolWeight from BioSourceWIDProteinWID a, Protein b, BioSource c, Taxon d where a.ProteinWID=b.WID and a.BiosourceWID=c.WID and c.TaxonWID=d.WID and b.DataSetWID=5 and d.WID = 1205383 This is basically Query #3; I group by d.wid) e restricted to single species here where a.ProteinWID=b.WID and a.BiosourceWID=c.WID and c.TaxonWID=d.WID and b.DataSetWID=5 and d.WID = 1205383 and (b.PICalc >= 4.7 and b.PICalc <= 7.7) group by If(b.MolecularWeightCalc < e.AvgMolWeight, "Below avg", If(b.MolecularWeightCalc >= e.AvgMolWeight, "Above avg", ""))
Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

24

Query #5: Using Limit
select a.Name as GeneName, a.Type as GeneProductType, b.Length as SourceDNALength from Gene a, Subsequence b where a.NucleicAcidWID = b.NucleicAcidWID order by GeneName limit 1,100

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

25

Query #6: Using Distinct

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

26

Data mining query examples using Ensembl
(MySQL-based)

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

27

Understanding The Ensembl Schema
Ensembl CORE schema documentation




First place to go to answer: “what does this table store?”



Problem: no graphical representation of overall schema
 Relationships harder to appreciate  Use e.g. WinSQL’s or MySQL Query Browser’s Catalog function and go from there…

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

28

Understanding the Ensembl Schema Using the Catalog function
(generic query generator function)

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

29

Ensembl “Fundamental” Tables
Fundamental tables assembly assembly_exception attrib_type coord_system dna dnac exon exon_stable_id exon_transcript gene gene_stable_id karyotype meta meta_coord prediction_exon prediction_transcript seq_region seq_region_attrib supporting_feature transcript transcript_attrib transcript_stable_id translation translation_attrib translation_stable_id Features and analyses alt_allele analysis analysis_description density_feature density_type dna_align_feature map marker marker_feature marker_map_location marker_synonym misc_attrib misc_feature misc_feature_misc_set misc_set prediction_transcript protein_align_feature protein_feature qtl qtl_feature qtl_synonym regulatory_factor regulatory_factor_coding regulatory_feature regulatory_feature_object regulatory_search_region repeat_consensus repeat_feature simple_feature ID Mapping (Map identifiers between releases) gene_archive mapping_session peptide_archive stable_id_event

Exernal references (IDs to objects in other dbs) external_db external_synonym go_xref identity_xref object_xref xref

Miscellaneous interpro

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

30

Query #7: Retrieving First 100 Genetic Markers on Human Chromosome 19, Using WinSQL Query Generator Tool

… using inner join
Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

31

Defining Inner and Outer Joins

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

32

Query #8: Retrieving First 100 Transcripts Possibly Related to Cadherins Using Text Parsing

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

33

Beyond Stand-Alone Query Generators: Using Excel As A Query & Report Generator

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

34

ExampleExcelQueryingEnsembl1.xls

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

35

Oracle Data Mining

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

Oracle vs. MySQL


Oracle has many more built-in

 

Statistical functions Algorithms Add-on tools provided by Oracle


“ODM graphical data mining environment”

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

37

SOME of Oracle’s built-in data mining functions

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

38

The Future (Now): Merging Data Retrieval With Data Analysis
Objective: identify optimal candidate genes for orphan enzymatic activities

Approach: data mining using BioWarehouse
Relies on 1. Oracle ODM (Oracle 10g) 2. PL/SQL (pipeline)

Process 1. Input data is retrieved from DB 2. Data are computed within DB
BLAST, regular expressions for parsing text
3.

Results written back to DB

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

39

Query #9: Data retrieval and computational biology analysis merged into one

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

40

Selected Resources

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

Oracle Resources

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

42

MySQL Reference Manual → Use It

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

43

Excel Data Mining Resources

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

44

SQL Resources

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

45

Oracle Documentation



All Oracle documentation:
http://www.stanford.edu/dept/itss/docs/oracle/10g/nav/portal_3.htm SQL reference: http://www.stanford.edu/dept/itss/docs/oracle/10g/server.101/b1075 9/toc.htm


Quick SQL reference:
http://www.stanford.edu/dept/itss/docs/oracle/10g/server.101/b10759/toc.htm



Oracle data mining:
http://www.stanford.edu/dept/itss/docs/oracle/10g/datamine.101/b10698/toc.htm

Lane Medical Library & Knowledge Management Center http://lane.stanford.edu

46


				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:49
posted:8/2/2009
language:English
pages:46