									CODATA 2002

A Proposition of XML Format for Proteomics Database
Ken’ichi KAMIJO, Toshimasa YAMAZAKI, and Akira TSUGITA
Proteomics Research Center, Fundamental Research Labs., NEC Corp.

Data Format Standardization

- Download entries from public DBs as a flat file
  - easy for a person to read
  - different formats for every DB
  - sometimes needs special access methods and special applications for each format
- Needs machine-readable formats for software tools
- To boost studies by exchanging data among researchers

Activates standardization

XML format

- XML (eXtensible Markup Language)
  - Highly readable for machine and person
  - Can represent information hierarchy and relationships
  - Details can be added right away
- Convenient for exchanging data
  - Easy to translate to other formats
  - Logical check by a Document Type Definition (DTD)

Example:
  <tag_source element_growth="8 weeks"> rice leaf </tag_source>
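As a small illustration of "highly readable for machine", the example element from this slide can be read with Python's standard `xml.etree.ElementTree` module. This is only a sketch: the variable names are chosen here for illustration, and the typographic quotes of the slide are written as straight quotes so that the snippet is well-formed XML.

```python
import xml.etree.ElementTree as ET

# The example element from the slide, with straight quotes (required for well-formed XML).
snippet = '<tag_source element_growth="8 weeks"> rice leaf </tag_source>'

element = ET.fromstring(snippet)

tag = element.tag                        # element name
growth = element.get("element_growth")   # attribute value
content = element.text.strip()           # text content without the surrounding spaces

print(tag, growth, content)
```

The same three pieces (name, attribute, content) are what a converter would map onto another format.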

XML in Bioinformatics

"The Extensible Markup Language (XML) is the universal format for structured documents and data on the Web." - W3C XML Web site, 2000-07-06.

[Diagram labels]
- Public DBs: GenBank, EMBL, DDBJ, PIR, PDB, etc.
- Converter / Wrapper to XML: easy to handle, easy to distribute, easy to re-use
- Applications, used by the User (Researcher) through local access or the Internet
- Private DBs: XML DB, Security Gate, XML Wrapper with item selection (easy to control priority level)

Analysis flow in Life Science

Experiment Design -> Sample preparation -> Experiment (Analysis) -> Data Acquisition -> Result Analysis -> Data Mining -> Knowledge Discovery -> Report

Proteome Analysis, steps shown along the flow:
- Tissue disruption
- Extraction
- Concentration
- 2DE
- Spot picking
- (LC)
- Mass Spectrometer
- (Detector)
- (N-/C-terminal seq.)
- Protein identification (PMF, PST)
- Chromosome
- Genome
- Functions/Structure
- Related proteins
- Bindings

Conventional XMLs in Life Science

Experiment Design -> Sample preparation -> Experiment (Analysis) -> Data Acquisition -> Result Analysis -> Data Mining -> Knowledge Discovery -> Report

XML formats placed on the flow:
- DNA array data (MAGE-ML)
- Gene/Protein Sequence and Features (AGAVE, BSML, PSDML, BioML, ProML)

Our XML-based data model

[Diagram: "Our XML" drawn over the analysis flow: Experiment Design, Sample preparation, Experiment (Analysis), Data Acquisition, Result Analysis, Data Mining, Knowledge Discovery, Report]

- Proteome-analysis oriented
- Describes
  - Sample preparation
  - Methodology
  - 2D gel image / LC results
  - Spot information
  - Sequence and feature
  - 3D structure
- Includes other open XMLs used in life science

Now Available: HUP-ML (Human Proteome Markup Language) DTD and Editor http://www.jhupo.org/

XML for Proteomics

Information Structure:
- Proteome
- Gel info.
- Source info.
- Sample preparation info.
- Methodology info.
- Gel Image / LC info.
- Spot info.

Tags, in the order shown on the slide:
  <proteome> <gel id="1"> <source_info> <gel_img> <sample_preparation> <gel_conditions> <marker> <detection> <gel_image> <spot id="1"> <spot id="2"> <gel id="2">
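The tag list above suggests a tree with one `<proteome>` root, one `<gel>` per gel, and `<spot>` entries inside each gel. The sketch below assumes exactly that nesting, which the slide does not state explicitly, and uses empty placeholder elements; it shows how a program could walk such a document and collect the spot IDs per gel.

```python
import xml.etree.ElementTree as ET

# Skeleton only: the nesting (gel under proteome, spot under gel) is an assumption
# made for this illustration, and all elements are empty placeholders.
skeleton = """
<proteome>
  <gel id="1">
    <source_info />
    <sample_preparation />
    <gel_conditions />
    <gel_image />
    <spot id="1" />
    <spot id="2" />
  </gel>
  <gel id="2" />
</proteome>
"""

root = ET.fromstring(skeleton)

# Map each gel ID to the list of spot IDs found directly under that gel.
spots_per_gel = {
    gel.get("id"): [spot.get("id") for spot in gel.findall("spot")]
    for gel in root.findall("gel")
}

print(spots_per_gel)
```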

Example: Human Kidney Glomerulus Proteome
By A. Tsugita et al. (2002)

[Figure: nephron and glomerulus, with labels]
- Nephron
- Glomerulus
- Afferent arteriole
- Efferent arteriole
- Granule cell
- Macula densa cell
- Extraglomerular mesangial cell
- Mesangial cell
- Mesangial matrix
- Glomerular epithelial cells (podocyte)
- Glomerular endothelial cell
- Glomerular basement membrane
- Bowman’s capsule epithelial cell
- Proximal tubule epithelial cell

Sample of ProteomeXML (1)

Source information

  <source_info source_info_ID="HKG-1"
               creDate="2002-07-20T12:00:00"
               modDate="2002-08-10T17:20:00">
    <source>Homo sapiens</source>
    <common_name>Human</common_name>
    <strain />
    <cultiva />
    <cell_line />
    <tissue>Kidney Glomerulus</tissue>
    <plasmid />
    <growth_phase unit="year">48</growth_phase>
    <induction />
    <host />
    <description>Normal</description>
  </source_info>
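A minimal sketch of how a tool might read this source information, using Python's standard `xml.etree.ElementTree`. The XML is the sample from the slide; the variable names (`filled`, `empty`, `age_unit`) are introduced here only for illustration. It separates the fields that carry a value from those left empty, which is the kind of check a registration tool could run.

```python
import xml.etree.ElementTree as ET

# Source information sample from the slide.
source_info_xml = """
<source_info source_info_ID="HKG-1"
             creDate="2002-07-20T12:00:00"
             modDate="2002-08-10T17:20:00">
  <source>Homo sapiens</source>
  <common_name>Human</common_name>
  <strain />
  <cultiva />
  <cell_line />
  <tissue>Kidney Glomerulus</tissue>
  <plasmid />
  <growth_phase unit="year">48</growth_phase>
  <induction />
  <host />
  <description>Normal</description>
</source_info>
"""

info = ET.fromstring(source_info_xml)

# Child elements with text content, keyed by tag name.
filled = {child.tag: child.text.strip() for child in info if child.text and child.text.strip()}

# Child elements that are present but empty, in document order.
empty = [child.tag for child in info if not (child.text and child.text.strip())]

# Unit attribute attached to the growth phase value.
age_unit = info.find("growth_phase").get("unit")

print(info.get("source_info_ID"), filled, empty, age_unit)
```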

Sample of ProteomeXML (2)

Sample preparation

Procedure: (action, target, condition) lists
Solution list: solution item information

  <sample_preparation>
    <tissue-disruption>Standard sieving technique using four stainless sieves. The glomeruli on the 150 micro m sieves were collected ice cold phosphate-buffered saline (PBS).</tissue-disruption>
    <extraction>
      <procedure>
        <process seq="1" action="spin-down" sample="collection" />
        <process seq="2" action="homogenize" sample="precipitate">
          <add_solution solution_ID="sol-A" />
        </process>
        <process seq="3" action="stand" time="60" time_unit="min" temp="37" temp_unit="degree in C" />
        <process seq="4" action="centrifuge" sample="suspension" time="20" time_unit="min">
          <times_g>12000</times_g>
        </process>
        <process seq="5" action="store" sample="supernatant" temp="-80" temp_unit="degree in C" time_unit="min" />
      </procedure>
      <comment_extraction />
    </extraction>
    <solution solution_ID="sol-A" label="2-DE lysis solution">
      <item_solution con="9.8" unit="M" name="Urea" />
      <item_solution con="2" unit="% w/v" name="NP-40" />
      <item_solution con="2" unit="% v/v" name="Pharmalyte(pH3-10)" />
      <item_solution con="10" unit="mM" name="DDT" />
      <item_solution con="0.5" unit="micro g/mL" name="E-64" />
      <item_solution con="0.5" unit="mM" name="PMSF" />
      <item_solution con="40" unit="micro g/mL" name="TLCK" />
      <item_solution con="1" unit="micro g/mL" name="aprotinin" />
      <item_solution con="10" unit="micro g/mL" name="chymostain" />
      <item_solution con="0.5" unit="mM" name="EDTA" />
      <item_solution con="0.01" unit="% w/v" name="BPB" />
      <comment_solution />
    </solution>
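The procedure is a list of (action, target, condition) steps, and a step can point to a solution through `solution_ID`. The sketch below uses a shortened copy of the sample to show how these references could be resolved. Two things are assumptions made for the illustration: `<solution>` is treated as a sibling of `<extraction>` inside `<sample_preparation>`, and the closing `</sample_preparation>` tag is added so the fragment parses.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the slide's sample; the placement of <solution> and the closing
# </sample_preparation> tag are assumptions made so that the fragment is well-formed.
sample_xml = """
<sample_preparation>
  <extraction>
    <procedure>
      <process seq="1" action="spin-down" sample="collection" />
      <process seq="2" action="homogenize" sample="precipitate">
        <add_solution solution_ID="sol-A" />
      </process>
      <process seq="4" action="centrifuge" sample="suspension" time="20" time_unit="min">
        <times_g>12000</times_g>
      </process>
    </procedure>
  </extraction>
  <solution solution_ID="sol-A" label="2-DE lysis solution">
    <item_solution con="9.8" unit="M" name="Urea" />
    <item_solution con="2" unit="% w/v" name="NP-40" />
  </solution>
</sample_preparation>
"""

root = ET.fromstring(sample_xml)

# Index the solutions by their ID so that <add_solution> references can be resolved.
solutions = {s.get("solution_ID"): s for s in root.findall("solution")}

# Build (seq, action, sample, [solution labels]) tuples in the order given by seq.
steps = []
for process in sorted(root.iter("process"), key=lambda p: int(p.get("seq"))):
    labels = [
        solutions[ref.get("solution_ID")].get("label")
        for ref in process.findall("add_solution")
    ]
    steps.append((int(process.get("seq")), process.get("action"), process.get("sample"), labels))

# Composition of sol-A as (name, concentration, unit) tuples.
composition = [
    (item.get("name"), item.get("con"), item.get("unit"))
    for item in solutions["sol-A"].findall("item_solution")
]

for step in steps:
    print(step)
print(composition)
```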

Sample of ProteomeXML (3)

Gel condition

Gel Information: Size, pH, .....
Running: (action, condition) lists

  <gel_conditions gel_conditions_ID="" creDate="2002-07-20T12:00:00" modDate="2002-08-10T17:20:00">
    <first_dim>
      <gel_info>
        <gel_name maker="">linear dry strip</gel_name>
        <gel_pH low="3" high="10" />
        <gel_size length="24" unit="cm" />
      </gel_info>
      <protein_solution solution_size="400" solution_unit="micro L" protein_amount="100" protein_unit="micro g" guiding_dye="PBP">
        <description>including standard proteins</description>
      </protein_solution>
      <rehydrate temp="20" temp_unit="degree in C" time="12" unit="hour" />
      <running>
        <apply step="1" current="50" current_unit="micro A" voltage="500" voltage_unit="V" temp="20" temp_unit="degree in C" time="1" unit="hour" />
        <apply step="2" current="50" current_unit="micro A" voltage="1000" voltage_unit="V" temp="20" temp_unit="degree in C" time="1" unit="hour" />
        <apply step="3" current="50" current_unit="micro A" voltage="8000" voltage_unit="V" temp="20" temp_unit="degree in C" time="10" unit="hour" />
      </running>
      <IEF pH_low="3" pH_high="10" load_direction="cathode to anode" />
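Because each `<apply>` step carries its voltage and duration with explicit units, a program can derive summary values from the running conditions. The sketch below computes the total running time and the accumulated volt-hours for the three steps of the sample. Volt-hours is a summary chosen here for illustration; the slide itself does not mention it.

```python
import xml.etree.ElementTree as ET

# The <running> element from the slide's first-dimension gel conditions.
running_xml = """
<running>
  <apply step="1" current="50" current_unit="micro A" voltage="500" voltage_unit="V"
         temp="20" temp_unit="degree in C" time="1" unit="hour" />
  <apply step="2" current="50" current_unit="micro A" voltage="1000" voltage_unit="V"
         temp="20" temp_unit="degree in C" time="1" unit="hour" />
  <apply step="3" current="50" current_unit="micro A" voltage="8000" voltage_unit="V"
         temp="20" temp_unit="degree in C" time="10" unit="hour" />
</running>
"""

running = ET.fromstring(running_xml)
steps = running.findall("apply")

# The arithmetic below is only valid if every step uses volts and hours.
assert all(s.get("voltage_unit") == "V" and s.get("unit") == "hour" for s in steps)

total_hours = sum(int(s.get("time")) for s in steps)
volt_hours = sum(int(s.get("voltage")) * int(s.get("time")) for s in steps)

print(total_hours, "h,", volt_hours, "Vh")
```

For the sample this gives 12 hours and 500 x 1 + 1000 x 1 + 8000 x 10 = 81500 Vh.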

Sample of ProteomeXML (4)

[Screenshot labels: PIR data area; Spot information area]

XML Editor for Proteomics Information

[Screenshot labels: Our XML Document; Gel Image; Gel Info.; Spot Info.]

XML Editor (Example)

[Screenshot label: Spot list]

XML Editor (Browsing)

[Screenshot: browsing in the XML Editor, with three "Click!" callouts]

XML Editor (Source Information)

Source Information fields:
  <source> <common_name> <strain> <cultiva> <cell_line> <tissue> <plasmid> <induction> <host> <growth_phase>

It is possible to import from ‘templates’ or other XML documents.

Features of our data model

Our proteomics XML:
- describes sample preparations
  - improves reliability of analysis results
- can distribute experimental information
  - share know-how
  - improves skills
- handles both gel image and analysis results
- describes analysis information
  - image recognition

Future works

- Open DTD and/or XML Schema
  - Collaboration with AOHUPO
- Develop XML viewer for free distribution
- Prototype WWW-based management system
  - for registration, viewing, and retrieval of entries
- Convert from other XML formats
- Relation to other analysis tools
  - image-analysis software
  - homology-analysis tools, etc.

AOHUPO: Asia Oceania Human Proteome Organization

Our XML Workflows

[Diagram labels: DTD or Schema; XML Document; XML Editor; Validate; Transform; Stylesheet; XML Application; DB; MS]

Legend:
- could be supported by AOHUPO
- could be developed by third party
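The Validate and Transform steps of the workflow can be sketched with the Python standard library alone. This is not DTD validation and not a stylesheet transform: the standard library has no validating parser, so `validate` below is a simplified stand-in that checks a hypothetical list of required fields (`REQUIRED_CHILDREN` is not taken from the HUP-ML DTD), and `transform` simply flattens the document to text lines.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule set for this illustration; not taken from the HUP-ML DTD.
REQUIRED_CHILDREN = ("source", "common_name", "tissue")


def validate(info):
    """Return the required child elements that are missing or empty."""
    return [name for name in REQUIRED_CHILDREN if not (info.findtext(name) or "").strip()]


def transform(info):
    """Render the filled child elements as flat 'tag: value' lines."""
    return "\n".join(
        f"{child.tag}: {child.text.strip()}"
        for child in info
        if child.text and child.text.strip()
    )


document = ET.fromstring(
    '<source_info source_info_ID="HKG-1">'
    "<source>Homo sapiens</source>"
    "<common_name>Human</common_name>"
    "<strain />"
    "<tissue>Kidney Glomerulus</tissue>"
    "</source_info>"
)

problems = validate(document)      # empty list when all required fields are filled
flat_text = transform(document)    # flat-file style rendering of the same entry

print(problems)
print(flat_text)
```

A real deployment would more likely validate against the published DTD with a validating parser and transform with a stylesheet, as the diagram indicates.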


								