ICR ASBP Analysis Service Parameters Proposal
Draft proposal
In alphabetical order: Brian Davis, Kiran Keshav, Ted Liefeld, Curt Lockshin, Patrick McConnell, Martin Morgan, Sal Mungal, Jared Nedzel, Baris Suzek, Claire Wolfe. Nov 8, 2007
Analytic Services are DIFFERENT from Data Services
• Differences
• Data Services, e.g. caArray
  • Long lifetimes
    • Remain useful for many years
    • May be extended/grow, but seldom disappear
  • Grow/change slowly
  • Few in number
    • 10s-100s of services
• Analytic Services, e.g. Hierarchical Clustering
  • Short(er) lifetimes
    • Replaced by newer algorithms or variants frequently
    • E.g. BLAST: 13 variants at http://www.ncbi.nlm.nih.gov/BLAST/
  • Change often
    • Some GenePattern algorithms have had >10 updates
    • Parameters added/removed, implementations improved
  • Many in number
    • GenePattern + Bioconductor + geWorkbench have >400 between them
Analytic Services are DIFFERENT from Data Services
• Class registration of input and output for caGrid supports (relatively stable) data services
  • Data models have long lives
  • Overhead of registration is small compared to service implementation
  • Registered classes remain valid for a long period
  • Geared towards supporting new services
    • Starting from a new data model to be put on the grid
• Analytic services: more of them, more variable, shorter lifespan
  • Overhead of class registration is a significant portion of development effort
  • (Many) analytic services are preexisting
    • GenePattern + Bioconductor + geWorkbench have >400, ~9 on caGrid
    • Developers must ‘go back’ to re-model the service parameters in the caBIG way
  • Parameters change often; each version may have different parameters
Conclusion: need to modify the registration process in caBIG to get more analytic services on caGrid
Process for analytic services
• Re-annotating reused classes (Solution: Service Loader)
• Annotation of parameter classes
• Modeling reused classes (Solution: Service Loader)
• Modeling parameters
• SIW roundtrip partially working for reused model (Solution: bug fixes in SIW 3.2.1 + Service Loader)
• Reloading reused classes (Solution: Service Loader)
• Loading parameter classes
Outstanding issues in RED
caGrid and analytical services: steps in the Introduce toolkit
Diagram labels (Introduce steps): XSD File; XML File; Import datatypes; Create operations; Add Service Metadata and Domain Model; Create Skeleton / Implement Methods
Diagram labels (external systems): caDSR; GME; caDSR
Issues annotated on the diagram:
• XSD generation: wrong XSD using EA and caCORE SDK not used for Analytical Services
• Redefinition of interfaces/operations modeled in EA
• Annotation and tagging (CDE id) of parameter classes in EA (needs caDSR load)
• Loading classes to caDSR/schemas to GME before development
Outstanding issues in RED
Issues and Solutions: 1) Use specialized “service loader” and improvement in “Roundtrip”
Issue 1 - Model Reuse
• Significant time investment to reuse models
  • Hard to include in UML
  • Round trip did not work well
  • Required re-annotation, re-generation of XSDs
Solution 1 - Use New “Service Loader” to Register Reuse of Models
• Significantly reduce registration time (~2 developer FTE weeks) to register models reusing other models
• Replace with Service Loader based process
  • Re-used models not included in UML, unless modified/extended
  • Register model re-use in Introduce; recorded in service metadata
  • Service metadata submitted to Service Loader to record use in caDSR
• Prevents partially re-used model mismatch problems (e.g. GP/caArray/caB2B)
NOTE: Still need to “test drive” the new process to ensure it works
• Created Demo Service to test using the Service Loader process for model reuse
  • Used as an example for new Analysis Service developers
  • Used to provide scaffolding for developing a white paper describing how to create analysis services
Exploring Further Solutions & Additional Time Savings: Parameters
Issue 2 – Modeling and registration of parameters
• Parameters change frequently, requiring model changes, re-annotation, and loading into caDSR (3 months?)
• Parameters, unlike input and output data classes, are not intended for semantic interoperability or reusability
• Parameters are not semantically rich or meaningfully annotatable
• Parameters are meaningful only within the context of the service
Solution 2 – Treat parameters differently
• Time savings estimated at ~1 developer FTE week effort per service over 2-3 calendar months
• Additional Curator FTE savings due to reduced model loading workload
Proposal - Generic Parameter Passing Model
Use a generic parameter model to pass parameters to the services
• Reuse model and allow Service Loader to register our model reuse
• This model is registered once, reused often
Simple reusable metadata model facilitates auto-generation of Parameter metadata & service implementation
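A minimal sketch of what a generic parameter passing model could look like, under stated assumptions: the names `Parameter`, `ParameterSet` and `execute_analysis` are hypothetical and are not taken from caGrid or from the draft model, and values are assumed to travel as strings. The point illustrated is that parameters are carried as plain name/value pairs in one reusable structure, so adding or removing a parameter changes the values passed rather than the registered classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Parameter:
    """One name/value pair; string values are an assumption of this sketch."""
    name: str
    value: str


@dataclass
class ParameterSet:
    """Generic container that would be modeled and registered once, then reused."""
    parameters: List[Parameter] = field(default_factory=list)

    def add(self, name: str, value: object) -> "ParameterSet":
        # Values are stringified so the container never needs a new class.
        self.parameters.append(Parameter(name, str(value)))
        return self

    def as_dict(self) -> Dict[str, str]:
        return {p.name: p.value for p in self.parameters}


def execute_analysis(params: ParameterSet) -> str:
    """Hypothetical service operation whose signature stays fixed across versions."""
    values = params.as_dict()
    # The service, not the shared model, interprets each parameter.
    neighbors = int(values.get("num.neighbors", "50"))
    return f"finding {neighbors} neighbors of {values['gene.accession']}"


if __name__ == "__main__":
    # "ABC123" is a placeholder accession, not real data.
    request = ParameterSet().add("gene.accession", "ABC123").add("num.neighbors", 10)
    print(execute_analysis(request))
```

The trade-off matches the slides: the container itself carries no semantic annotation for individual parameters, so their descriptions have to live elsewhere, for example in service metadata.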
Proposal 2- Generic Parameters Metadata Model
• Extend caGrid Service Metadata (already supported) with Parameter Metadata Model (as discussed with caGrid team)
• All metadata is handled at caGrid level
Draft model: (diagram not reproduced)
Exploring Further Solutions & Additional Time Savings: Parameters
Issue 2 – Modeling and registration of parameters
• Parameters change frequently, requiring model changes, re-annotation, and loading into caDSR (3 months?)
• Parameters, unlike input and output data classes, are not intended for semantic interoperability or reusability
• Parameters are not semantically rich or meaningfully annotatable
• Parameters are meaningful only within the context of the service
Solution 2 – Treat parameters differently
• Time savings estimated at ~1 developer FTE week effort per service over 2-3 calendar months
• Additional Curator FTE savings due to reduced model loading workload
• Generic Parameter Passing Model reused in Domain Model
• Generic Parameter Metadata Model to be in service metadata ONLY
  • Enhanced service metadata to define parameters
• Parameters are NOT registered as CDEs in caDSR (not semantically annotated)
  • The parameters are found in the index service
Pros and Cons
Pros:
• SAVE TIME (~1 developer FTE week per service)
• More analytic services on caGrid/available to caBIG
• Actual parameters and descriptions of parameters are still available at Grid level
• No caDSR/GME registration dependency (if all classes are reused)
Cons:
• No parameter re-use
• No concept-based discovery of services
• No semantic interoperability based on parameters (is this likely, anyway?)
• No CDEs for parameters
• A different place to look for parameter metadata (not caDSR)
• Proposed model is not appropriate for non-caGrid services
  • Could be adapted to support non-caGrid silver services
Time Savings from Adoption of Proposals
Time to semantically annotate and grid-enable an Analytic Service
Process | Total Calendar Time | Total Developer Time | Comments
No Change to Process | 6-9 Months | 5 FTE Weeks (200 hours) | Estimated time it took for Reference Implementations in 2006-7 (GenePattern, geWorkbench, Bioconductor)
Adoption of Analytic Service Loader Proposal | 3.5 Months | 3 FTE Weeks (120 hours) | Parameters still need to be registered in domain model
Adoption of Analytic Service Loader AND Parameter Proposal | 1.5 Months | 2 FTE Weeks (80 hours) | Time for using Service Loader and using generic model proposal
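The developer-time figures above can be cross-checked with a small calculation. This is only a sketch: the dictionary keys are shorthand labels introduced here, and the 40-hour FTE week is inferred from the slide's own pairing of 5 FTE weeks with 200 hours.

```python
HOURS_PER_FTE_WEEK = 40  # inferred from the slide: 5 FTE weeks = 200 hours

# Developer hours per analytic service, as quoted on the slide.
developer_hours = {
    "no change": 200,
    "service loader": 120,
    "service loader + parameters": 80,
}


def saved_fte_weeks(before: str, after: str) -> float:
    """FTE weeks saved when moving from one process to another."""
    return (developer_hours[before] - developer_hours[after]) / HOURS_PER_FTE_WEEK


if __name__ == "__main__":
    print(saved_fte_weeks("no change", "service loader"))                    # 2.0
    print(saved_fte_weeks("service loader", "service loader + parameters"))  # 1.0
    print(saved_fte_weeks("no change", "service loader + parameters"))       # 3.0
```

The first two results agree with the ~2 developer FTE weeks attributed to the Service Loader and the ~1 developer FTE week attributed to the parameter proposal elsewhere in the deck.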
NEXT STEPS
Suggested Next Steps
• Modifications to proposal based on input from NCI
  • Create final draft of generic parameters proposal
• Meeting with caGrid team on extension to Service Metamodel
• Presentation to Arch and VCDE Workspaces
• Develop proof of concept services
  • Test drive of Analytic Service Loader
  • Test use of generic parameter services
• Register generic parameter models in caDSR
Appendix: additional slides
• Extra slides
• Not in any order
ASBP Meeting Logistics
• Meeting 2nd Friday of every month
• Next Meeting: November 9th @ 2:00 pm EST
• Topics
  • Continued development of demonstration service & white paper
  • Analysis of CGEMs model for demo service (Hrishi)
  • Follow-up about registration of Parameters
• liefeld@broad.mit.edu
Parameter Modeling Comparison
• Modeling one parameter set (below)
  • Including all caDSR tags & stereotypes
  • Clean up of tags, adding concept codes
• Generating extended metadata using new model for GenePattern modules
  • EA -> Schema -> JAXB -> custom Java code
Example parameter set (markup lost in extraction; values in original order):
executeAnalysis
• java.lang.String; reference gene accession from data file to find neighbors for; true; gene.accession; 1
• java.lang.String; 50; number of neighbors to find; true; num.neighbors; 2
…
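As a rough illustration of auto-generating parameter metadata from a module's parameter declarations, the sketch below emits a small XML fragment for the two example parameters above. The element and attribute names (`operation`, `parameter`, `required`, `index`, `default`) are assumptions made for this sketch, not the draft metadata model, and reading `true` as "required", `50` as a default and `1`/`2` as positions is an interpretation of the flattened example.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ParameterInfo:
    """Hypothetical description of one service parameter."""
    name: str
    java_type: str
    description: str
    required: bool
    index: int
    default: Optional[str] = None


def to_metadata_xml(operation: str, params: List[ParameterInfo]) -> str:
    """Serialize parameter descriptions into an illustrative XML fragment."""
    root = ET.Element("operation", {"name": operation})
    for p in sorted(params, key=lambda item: item.index):
        element = ET.SubElement(root, "parameter", {
            "name": p.name,
            "type": p.java_type,
            "required": str(p.required).lower(),
            "index": str(p.index),
        })
        if p.default is not None:
            element.set("default", p.default)
        ET.SubElement(element, "description").text = p.description
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    example = [
        ParameterInfo("gene.accession", "java.lang.String",
                      "reference gene accession from data file to find neighbors for",
                      True, 1),
        ParameterInfo("num.neighbors", "java.lang.String",
                      "number of neighbors to find", True, 2, default="50"),
    ]
    print(to_metadata_xml("executeAnalysis", example))
```

Because the generator loops over whatever declarations a module already has, the per-module cost is close to zero once the code exists, which is the effect the comparison on the next slide describes.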
Modeling Time Comparison
One-time cost to create metadata-generation code: <120 minutes
Approach | Time to model & first pass annotation | Estimated registration & semantic annotation | # of parameters | # of value domains | # modules | Estimate for all modules to draft introductory stage
Parameter Set | 95 min | ?? working days, 2-3 calendar months | 4 | 2 | 1 | ~3 and a half person weeks + registration & XSD
Generic Parameter metadata generation | ~1 min | 0 min | 270 | 125 | 82 | ~1 min
Note: This does not include the semantic annotation or XSD editing which are typically the most time consuming portions of the process
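The "~3 and a half person weeks" estimate can be roughly reproduced from the slide's figures. The calculation below is a sketch that assumes the 95 minutes measured for one parameter set applies to each of the 82 modules and that a person week is 40 hours; neither assumption is stated on the slide. It also treats the "<120 minutes" one-time cost as an upper bound.

```python
MINUTES_PER_PERSON_WEEK = 40 * 60  # assumes a 40-hour week

MODULES = 82                  # number of modules, from the slide
MANUAL_MINUTES_PER_MODULE = 95   # modeling + first-pass annotation of one parameter set
GENERATOR_ONE_TIME_MINUTES = 120  # upper bound on the one-time cost ("<120 minutes")
GENERATOR_RUN_MINUTES = 1         # "~1 min" to generate metadata for all modules

# Manual modeling: the per-module effort repeats for every module.
manual_minutes = MODULES * MANUAL_MINUTES_PER_MODULE
manual_weeks = manual_minutes / MINUTES_PER_PERSON_WEEK

# Generic metadata generation: one-time code cost plus a single run.
generic_minutes = GENERATOR_ONE_TIME_MINUTES + GENERATOR_RUN_MINUTES

if __name__ == "__main__":
    print(manual_minutes)            # 7790
    print(round(manual_weeks, 2))    # 3.25
    print(generic_minutes)           # 121
```

Under these assumptions manual modeling comes to about 3.25 person weeks, close to the slide's estimate, against roughly two hours for the generic approach, before any registration, semantic annotation or XSD editing is counted.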
caGrid Service Metadata
From caDSR Registration
Current caGrid Parameter Modeling
A current caGrid parameter representation
Cons:
• Modeling, semantic annotation and caDSR registration: significant cost
Pros:
• CDE-based discovery
• Parameter information on caDSR