ArrayExpress approved database list - Free

DB               Link
atcc             ATCC®
affymetrix       Affymetrix
astra_hpylori    Helicobacter pylori
                 Genome Database

blocks           Blocks

candidadb        CandidaDB

catma            CATMA

compugen         Compugen

cp450            Cytochrome P450
dbsnp            dbSNP

embl             DDBJ/EMBL/GenBank

ensembl          Ensembl

entrez           Entrez

entrez_protein   Entrez Protein

ec               Enzyme Commission

expasy           ExPASy

flybase          FlyBase
flybase_bt       FlyBase: Body Part
flybase_dv       FlyBase: Developmental
gdb              GDB

genecards        GeneCards™

genedb           GeneDB

genesnps         GeneSNPs

genew            Genew
genmapp          GenMAPP
go               GO

gpcrdb           GPCRDB

gxd              GXD

hgmd          HGMD®

hgvbase       HGVbase
howdy         HOWDY

hugo          HUGO

image         IMAGE Consortium

incyte        Incyte Genomics
              BioKnowledge® Library
interpro      InterPro
jsnp          JSNP

kegg          KEGG
locus         LocusLink
medline       MEDLINE
mgc           MGC
mgd           MGD
MO            MGED Ontology

mips          MIPS
mtb           MTB
ncbitax       NCBI Taxonomy
netaffx       NetAffx™
nextdb        NEXTDB
nia_nih       NIA/NIH Mouse

omim          OMIM™
omni1         OmniArray

omni2         -
pdb           PDB
pfam          Pfam

pharmgkb      PharmGKB

pir           PIR
pkr           PKR
pkr_hanks     PKR: Hanks Classification

populusDB     PopulusDB

pseudomonas   Pseudomonas genome
pubmed        PubMed
refseq        RefSeq
rgd           RGD
rgd_qtl       RGD: QTL
riken         Riken
rzpd          RZPD

sanger                  Sanger Institute Human
                        Genome Project
scop                    SCOP
sgd                     SGD™
stack                   STACKdb™

subtilist               SubtiList
sulfolobus              Sulfolobus P2
swall                   SWALL

swissprot               Swiss-Prot

tair                    TAIR
tigr_atdb               TIGR: AtDB
tigr_cmr                TIGR: CMR
tigr_cmr_hpylori26695   TIGR: CMR H. pylori
tigr_cmr_hpylorij99     TIGR: CMR H. pylori J99

tigr_egad               TIGR: EGAD
tigr_ego                TIGR: EGO

tigr_mgi                TIGR: MGI
trembl                  TrEMBL

tsc                     The SNP Consortium Ltd.

toxoest                 ToxoEST
tuberculist             TubercuList

unigene                 UniGene
uw_ecoli                UW E. coli genome project

wormbase                WormBase

Name/Comments                                              Examples
American Type Culture Collection.                          HB-10204, CRL-2567
Affymetrix                                                 -
Contains annotation and relationships for all putative     JHP244, TIGR Hp244
open reading frames from H. pylori strain J99 and strain
Multiple alignments of conserved regions of protein        IPB001206, PR00717
Genomic and protein sequence information and relevant      CA1611, CA4276
annotation related to the human fungal pathogen
Candida albicans.
Complete Arabidopsis Transcriptome MicroArray              CATMA4a39650, CATMA1a00035
contains Gene Sequence Tags (GSTs) covering most
Arabidopsis genes, primarily for use in transcription
profiling DNA arrays.
Database for Compugen 75bp oligo sets which carry an       CGEN_MOUSE_3001299_1, CGEN_B_SUBTILIS_1003153_0
ID of type: CGEN_MOUSE_3001299_1
Cytochrome P450 homepage.                                  CYP2X1, CYP253A1
Single nucleotide polymorphisms database. NCBI             rs241, ss3453405
assigns reference SNP (rs) IDs to SNPs that appear to
be unique in the database.
DNA Data Bank of Japan/European Molecular Biology          X55054, AL416345
Laboratory/genetic sequence database.
Up-to-date sequence annotation for eukaryotic              ENSRNOG00000002833, ENSG00000105723
Ensembl family ID.                                         ENSF00000001212
Ensembl gene ID.                                           ENSRNOG00000002833, ENSG00000105723
Ensembl transcript ID.                                     ENSMUSP00000023507, SINFRUP00000149297, CG2621-PB
Text-based search tool at NCBI for major databases         -
(incl. PubMed, Nucleotide and Protein Sequences,
Protein Structures, Complete Genomes, Taxonomy).
Protein entries from various sources (incl. SwissProt,     -
PIR, PRF, PDB, and translations from annotated coding
regions in GenBank and RefSeq).
NC-IUBMB, general information on enzyme                    EC, EC
nomenclature plus a list of EC numbers.
Expert Protein Analysis System. Analysis of protein        See Swiss-Prot
sequences and structures as well as 2-D PAGE.
Drosophila genome database.                                FBgn0003721, FBgn0006354
FlyBase controlled vocabularies for body parts.            FBbt:00005151, FBbt:00003209
FlyBase controlled vocabularies for developmental          FBdv:00005362, FBdv:00005286
The Genome Database. Human genes and genomic               ?
Database of human genes, their products and their          GC03U990103, GC0YM020047
involvement in diseases.
Database resource for Schizosaccharomyces pombe,           Tb10.61.1880, LmjF16.0300, SPAC1002.09c
Leishmania major and Trypanosoma brucei.
Integrates gene, sequence and polymorphism data into       78, 273
individually annotated gene models.
Human gene nomenclature database search engine.            9726, 4699
Gene MicroArray Pathway Profiler.                          ?
Gene Ontology (biological process, cellular component      GO:0004672, GO:0008150
and molecular function).
Information system for G protein-coupled receptors         See Swiss-Prot
Gene Expression Database (mouse).                          MGI:1270901, MGI:1204331

Human Gene Mutation Database. Contains known                   CD984127, CI962218
(published) gene lesions underlying human inherited
Curated human polymorphisms.                                   SNP000002345, SNP000006551
Human Organized Whole genome Database. Integrated              -
human genomic information.
Human Genome Organisation - human gene symbols.                -

Integrated Molecular Analysis of Genomes and their             IMAGE:5535369, IMAGE:38269
Expression (I.M.A.G.E) consortium clone resource
Private database - access restricted to Incyte                 -

Useful resource for whole genome analysis.                     IPR004014, IPR006415
Database of Japanese Single Nucleotide                         IMS-JST120307, IMS-JST070157
Kyoto Encyclopedia of Genes and Genomes.                       cdi:DIP1552, pst:PSPTO3125
Contains information on genetic loci.                          18033, 173149
Bibliographic database.                                        11832201, 12640006
Mammalian Gene Collection.                                     7348, 6001
Mouse Genome Database.                                         MGI:1270901, MGI:1204331
The Microarray Gene Expression Data Society Ontology.          -
An ontology for microarray experiments.
Munich Information Center for Protein Sequences.               See SGD for cerevisiae; N. crassa: 1nc100_360, 6nc360_040
Mouse Tumor Biology database.                                  MTB:12285, MTB:9404
Taxonomy browser.                                              9606, 7227
Affymetrix NetAffx™ analysis center.                           1552656_s_at, 1555286_at
The Nematode Expression Pattern DataBase                       CELK01662, CELK02199
Mouse genomics home page of Laboratory of Genetics,            C0001C09-3, J0705A10-3
National Institute on Aging, National Institutes of Health.

Online Mendelian Inheritance in Man.                           *605004, #232220
OmniArray MicroArray Analysis tool. B. pseudomallei            ?
sequence database 1
B. pseudomallei sequence database 2                            ?
Protein Data Bank                                              1Q3W, 1PYX
Multiple sequence alignments and hidden Markov                 PF02116, PF05462
models of common protein domains.
The Pharmacogenetics and Pharmacogenomics                      PA356, PA36679
Knowledge Base. Variation in drug response based on
human variation.
Protein Information Resource                                   S23506, I48691
The Protein Kinase Resource.                                   1010067, 1003334
Eukaryotic protein kinase superfamily organised into           See Swiss-Prot
distinct families that share basic structural and functional
properties (classification by Steven K. Hanks).

Populus tremula x tremuloides genomic sequence                 F028P01, UB44DPG10
Pseudomonas aeruginosa genome annotation.                      PA1498, PA0555

Bibliographic database.                                        15111598, 789142
NCBI Reference Sequence project.                               AP_123456, NM_123456
Rat Genome Database.                                           RGD:632259, RGD:620351
Rat Genome Database: Quantitative Trait Locus.                 RGD:61455, RGD:61365
                                                               4930483C13, 2810043G22
Resource Center and Primary Database.                          IMAGp998E178880Q, RZPDp988C03106D

Human mapping and sequencing information.                See Ensembl

Structural Classification of Proteins.                21953, 49268
Saccharomyces Genome Database.                        YOL128c, YKR031c, Q0010
Sequence Tag Alignment and Consensus                  cn32980, cl5775
Knowledgebase. Non-redundant, gene-oriented clusters.

Bacillus subtilis database.                              BG13816, BG12792
Sulfolobus P2 annotation database.                       SSO5479, SSO8380
Non-redundant protein sequence database (Swiss-          See Swiss-Prot
Curated protein sequence database that provides a high   O77438, Q9NBW1
level of annotation, a minimal level of redundancy and
high level of integration with other databases.
The Arabidopsis Information Resource.                    At4g00010, At4g00650
The TIGR Arabidopsis thaliana Database.                  See TAIR
Comprehensive Microbial Resource at TIGR.                EC2528, Rv2204c
Comprehensive Microbial Resource at TIGR for             HP1508, HP0121
Helicobacter pylori 26695.
Comprehensive Microbial Resource at TIGR for             NT01HP0007, NT01HP1594
Helicobacter pylori J99.
The Expressed Gene Anatomy Database at TIGR.             HG32820, HT1920
TIGR ortholog database - linking orthologous genes       See TIGR: MGI
across eukaryotic organisms.
TIGR Mouse Gene Index.                                   TC1110030, NP799252
Computer-annotated supplement of SWISS-PROT that         See Swiss-Prot
contains all the translations of EMBL nucleotide
sequence entries not yet integrated in SWISS-PROT.
The TSC database contains details of single nucleotide   TSC1267585, TSC1103198
polymorphisms (SNPs) that have been discovered and
characterised by the TSC.
Toxoplasma gondii clustered EST database.                Ctoxoqual_20, 7587794
Genomic information on tubercle bacilli such as M.       Rv1326c, Rv1328
Non-redundant, gene-oriented clusters.                   Gga.4719, Rn.10124
The University of Wisconsin E. coli Genome Project.      23909, 423

Genome and biology of C. elegans (as of 28/8/2002, the   Y18D10A.5, CBG07972
genome browser shows preliminary gene predictions for
C. briggsae).

Perl regexp (UNTESTED! just shorthand for now)

/^(JHP\d{1,4}|TIGR Hp\d{1,5})$/









/^EC \d+\.\d+\.\d+\.\d+$/









SGD-like for cerevisiae, /^\dnc\d{3}_\d{3}$/ for N. crassa
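
The three patterns in this column are marked untested, so a quick check against the example accessions listed earlier may be useful. The sketch below is a minimal illustration in Python, whose `re` module accepts these patterns unchanged from the Perl shorthand. The pairing of each pattern with a DB key is inferred from the example accessions, not stated in the list; the key `mips_ncrassa` and the helper `matches` are hypothetical names introduced only for this example.

```python
import re

# Patterns transcribed from the untested Perl shorthand in this list.
# The DB key attached to each pattern is an assumption inferred from the
# example accessions; "mips_ncrassa" is a hypothetical key, not a listed DB.
PATTERNS = {
    "astra_hpylori": re.compile(r"^(JHP\d{1,4}|TIGR Hp\d{1,5})$"),
    "ec": re.compile(r"^EC \d+\.\d+\.\d+\.\d+$"),
    "mips_ncrassa": re.compile(r"^\dnc\d{3}_\d{3}$"),
}


def matches(db, accession):
    """Return True/False if a pattern is known for db, or None if it is not."""
    pattern = PATTERNS.get(db)
    if pattern is None:
        return None  # no regexp recorded for this database yet
    return pattern.match(accession) is not None


if __name__ == "__main__":
    # Example accessions copied from the Examples column of the list.
    for db, accession in [
        ("astra_hpylori", "JHP244"),
        ("astra_hpylori", "TIGR Hp244"),
        ("mips_ncrassa", "1nc100_360"),
        ("mips_ncrassa", "6nc360_040"),
    ]:
        print(db, accession, matches(db, accession))
```

Returning None for databases without a recorded pattern keeps "no regexp yet" distinct from "accession rejected", which matters while most rows of the list still lack a pattern.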








Other comments
More work needed
See NetAffx?

updated link to

NB. regexp doesn't cover Fugu or Drosophila IDs
Multiple database search tool (see other IDs)

Multiple ID formats from other databases (e.g. EMBL)

lack of ':' not a typo.

N.B. regexp is a bit of a guess; database connection timeout

more species available, not very practical to create full list at this stage

gene_ids are simple numbers; SNPs seem to link directly to dbSNP

Does not appear to have accession numbers

Multiple ID formats from other databases (e.g. EMBL)

Gene symbols; generally quite free-form

Accessions available via Entrez

These accessions for GENES only

semi-free text

species-dependent; like SGD for S. cerevisiae; many species

not sure about the regexp

irritating registration for login

failed connection


links to Swiss-Prot entries

not sure about the RE

unsure which id is the accession

all over the place

Multiple species make for complex IDs
no seq db?

no seq db?

EST cluster IDs; presumably the clone ids originally come from somewhere else? tgzz34f05.r1, TgESTzz72h06.y1

Depends on whether C. elegans or C. briggsae; C. elegans IDs quite complex


