Open Grid Services Infrastructure

The Open Grid Services Infrastructure (OGSI) was published by the Global Grid Forum
(GGF) as a proposed recommendation in June 2003.[1] It was intended to provide an
infrastructure layer for the Open Grid Services Architecture (OGSA). OGSI addresses the
statelessness of Web services (among other issues) by extending them to accommodate grid
computing resources that are both transient and stateful.

Obsolescence
Web services groups started to integrate their own approaches to capturing state into the Web
Services Resource Framework (WSRF). With the release of GT4, the open-source Globus
Toolkit migrated back to a pure Web services implementation (rather than OGSI) via
integration of WSRF.[2]

"OGSI, which was the former set of extensions to Web services to provide stateful interactions --
I would say at this point is obsolete," Jay Unger said. "That was the model that was used in the
Globus Toolkit 3.0, but it's been replaced by WSRF, WS-Security, and the broader set of Web
services standards. But OGSA, which focuses on specific service definition in the areas of
execution components, execution modeling, grid data components, and information
virtualization, still has an important role to play in the evolution of standards and open source
tool kits like Globus."[2]

References
   1. "Open Grid Services Infrastructure, Version 1.0, Proposed Recommendation" (PDF).
      Global Grid Forum. 2003-06-23.
   2. "Grid Stack: Security debrief: Grid's ties to Web services". IBM. 2005-05-17.
OGSA-DAI is an innovative solution for distributed data access and management.

OGSA-DAI allows data resources (e.g. relational or XML databases, files or web services) to be
federated and accessed via web services on the web or within grids or clouds. Via these web
services, data can be queried, updated, transformed and combined in various ways.

OGSA-DAI is about sharing data, whether it be within a single organisation, between a group of
partners or with the public. By sharing data we can identify, understand and exploit complex
interactions between disparate variables and so convert data into information. This in turn can
help us increase our scientific knowledge or our business advantage.

Here, we provide a high-level overview of OGSA-DAI's novel approach to distributed data
access and management.

OGSA-DAI: An Overview
To introduce the importance of sharing distributed data, we'll look at a key motivating
example: the early warning of disease outbreaks.

       Detecting possible disease outbreaks
       How distributed data complicates analysis
       How to analyse distributed data
       Avoiding bloated clients - why a server is useful
       Making distributed analysis easier with query processing
       Converting data into a useful form and delivering it where it's needed
       A demo
       Other examples of OGSA-DAI:
          o Analysing transport data
          o Visualising social sciences data

Detecting possible disease outbreaks
Every time we visit a health centre or hospital, or a doctor visits us at home, this information is
recorded in our medical records. These include our personal details, e.g. name and address,
and medical details, e.g. visits, symptoms and treatments.

Now, suppose we wanted to detect whether an outbreak of swine flu was imminent. One way
we could detect this would be to look at this patient data to see if the number of patients
displaying swine flu symptoms exceeded some critical threshold within a given region.

So, imagine we have a region. Recording the locations of patients exhibiting swine flu
symptoms might show us these occurrences across our region. If our critical threshold was 10
then the cluster of 12 points in the centre would give us cause for concern.
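The thresholding step just described can be sketched in a few lines of Python. The post codes and case counts below are made up for illustration:

```python
from collections import Counter

def detect_outbreak(case_postcodes, threshold=10):
    """Count symptomatic patients per post code and flag any post
    code whose count exceeds the critical threshold."""
    counts = Counter(case_postcodes)
    return {pc: n for pc, n in counts.items() if n > threshold}

# Hypothetical data: 12 cases in the central post code, fewer elsewhere.
cases = ["EH1"] * 12 + ["EH2"] * 4 + ["EH3"] * 7
print(detect_outbreak(cases))  # {'EH1': 12}
```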
This is not too difficult. But it does assume that all the patient data is readily available and can
be easily accessed and analysed. In reality things aren't this straightforward.

How distributed data complicates analysis

Our region may be covered by a number of health centres whose catchment areas overlap.

Plotting patients with swine flu symptoms for the yellow health centre alone shows no cluster
greater than our threshold of 10.

And plotting the occurrences for the green health centre alone again shows no cluster greater
than our threshold.
This is because our cluster is within the area where the catchment areas overlap. Only by
combining the data from both health centres do we see the true picture, that there is a cluster of
patients with swine flu symptoms that is above our threshold.

How to analyse distributed data

So, if our data is held within multiple sources, or databases, to identify clusters of patients there
are a number of activities we need to do.

       Firstly, we need to get the data from each health centre on the numbers of patients they
       have recorded who have swine flu symptoms together with their post codes.
       We then need to collect, or union, this data together.
       Then we can get the final total counts of occurrences for each post code.

These activities can be expressed as queries in the popular query language SQL. These ask
each database for the number of patients with a "FLU" symptom and output the total counts of
patients per post code.
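A minimal sketch of these three activities, assuming a hypothetical `patients` table with `postcode` and `symptom` columns in each health-centre database (SQLite stands in here for whatever products the centres actually use):

```python
import sqlite3

def count_flu_by_postcode(connections):
    """Ask each health-centre database for its per-postcode counts of
    patients with a FLU symptom, then union the partial results and
    total them per post code."""
    totals = {}
    for conn in connections:
        rows = conn.execute(
            "SELECT postcode, COUNT(*) FROM patients "
            "WHERE symptom = 'FLU' GROUP BY postcode"
        )
        for postcode, n in rows:
            totals[postcode] = totals.get(postcode, 0) + n
    return totals
```

Each database only ever sees its own query; the union and final totals happen after the partial counts come back.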
Avoiding bloated clients - why a server is useful

We could write an application to do this: a client that would get the data from the databases,
combine it, and then visualise it. The client would need to handle the fact that the databases are
located at different sites, may have different data formats or be different products. They may
also have different ways of authenticating and logging in. Worse, if there were a number of
clients and a health centre changed its usernames and passwords, or its database product,
then all the clients would need to be updated.

So, it can be useful to introduce a server. The server can manage the connections with each
database, so if a health centre changes its database, only the server needs to be updated. The
client only needs to connect to the server and so is protected from such changes.

The server would also manage execution of activities on the client's behalf. All the client needs
to do is tell the server what activities it wants the server to run.

There is another reason for introducing a server and that is that a client might only be interested
in a very small subset of the data. Using a server with large amounts of processing power
means that server can access and filter large amounts of raw data on the client's behalf and
only return to the client exactly what it needs. Clients can then be very lightweight.
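A toy sketch of such a mediating server may make the idea concrete. All hosts, credentials and activity names below are invented, and a real server would of course connect and run the work rather than just record where it would go:

```python
class DataServer:
    """Toy mediating server: only the server knows each database's
    location and credentials, so clients are insulated from changes
    to them."""

    def __init__(self):
        # Connection details live on the server, never on clients.
        self._registry = {
            "yellow-centre": {"host": "db1.example.org", "user": "svc"},
            "green-centre": {"host": "db2.example.org", "user": "svc"},
        }

    def run(self, activities):
        """Execute named activities on the client's behalf, returning
        only results, never connection details."""
        results = []
        for activity, target in activities:
            settings = self._registry[target]
            # A real server would open the connection and run the
            # activity here; we just record where it would be sent.
            results.append((activity, settings["host"]))
        return results
```

If a health centre changes its credentials, only `_registry` changes; every client request stays exactly the same.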

If we have our server then our client would now just tell the server to carry out these activities on
its behalf. The server would carry out these activities and return the data to the client.

We have three activities to get the information we need from the database, two to query the
databases and one to combine and summarise this data. It would be easier for the client if the
databases could be made to look like a single database instead of two separate databases.
Then they'd only need to request the execution of one query activity.
Making distributed analysis easier with query processing

Furthermore, given the expressive power of query languages, the client could express how the
data should be combined and summarised within the query itself, rather than request a
separate activity for this. In other words, it would be easier if the client could just specify a
single query and have the server determine which queries need to be sent to each database
and which additional activities need to be run to answer it.

This is called distributed query processing (DQP).

With distributed query processing the activities that the client needs to tell the server to do
become much simpler.
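In miniature, a distributed query processor turns the client's one federated query into a plan of per-database sub-queries plus a merge step. The queries and source names in this sketch are illustrative only:

```python
def plan_distributed_query(sources):
    """Toy illustration of what a distributed query processor does:
    derive one sub-query per database plus a final merge activity
    from the client's single federated query."""
    subquery = ("SELECT postcode, COUNT(*) AS n FROM patients "
                "WHERE symptom = 'FLU' GROUP BY postcode")
    plan = [("query", source, subquery) for source in sources]
    plan.append(("merge", "sum n per postcode"))
    return plan

# The client submits one query; the server produces this plan itself.
plan = plan_distributed_query(["yellow-centre-db", "green-centre-db"])
```

A real DQP also parses the query, chooses join orders and decides where each operation runs; the point here is only that the decomposition moves from the client to the server.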

Converting data into a useful form and delivering it where it's needed

If the client is visualising the data it will need to transform it into a suitable format, for example a
JPG binary image file or a document written in the geographical markup language KML. As this
is just a data transformation why can't our server handle that too?
And, instead of delivering the visualisation data to the client, why doesn't the server just hold it
until the client is ready for it? The server could return a URL which tells the client where to
get the data from. This would allow the client to do other things while the activities are running,
and also allow other clients to access the results without having to rerun what might have
been a time-consuming query. They can just get the results from the URL.

So, adopting this solution gives us a new set of activities where the server now converts the
data into a visual format, stores it on the server and returns a URL to the client from which they
can get their data later.
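One way to picture the server-side conversion and hold-until-fetched delivery, sketched with made-up post codes and coordinates (this is not OGSA-DAI's actual API):

```python
import uuid

def counts_to_kml(counts, coordinates):
    """Convert per-postcode counts into a minimal KML document,
    one Placemark per post code."""
    placemarks = []
    for postcode, n in sorted(counts.items()):
        lon, lat = coordinates[postcode]
        placemarks.append(
            f"<Placemark><name>{postcode}: {n} cases</name>"
            f"<Point><coordinates>{lon},{lat},0</coordinates></Point>"
            "</Placemark>"
        )
    return ('<?xml version="1.0" encoding="UTF-8"?>'
            '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>'
            + "".join(placemarks) + "</Document></kml>")

def store_result(document, result_store):
    """Keep the result on the server and return a URL-style key from
    which this client, or any other, can fetch it later."""
    key = f"/results/{uuid.uuid4().hex}"
    result_store[key] = document
    return key
```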

Now, how might the server execute our activities?


OGSA-DAI is a framework that allows groups of activities like this to be executed, activities that
involve accessing, updating, combining, transforming or delivering data that could be distributed
across a number of databases and held in various formats.

It consists of a workflow executor which executes groups of activities, or, as they're called in
OGSA-DAI, workflows.

It also has a distributed query processor which allows a single query to reference tables in
multiple databases. It will automatically parse this query and output a query plan which
specifies the workflows to execute to get the required data from each database.

Data is streamed through OGSA-DAI and different activities work on different parts of the data
stream at the same time. For example data retrieved from a database by a query may be
transformed and delivered while other data from the same query is still being retrieved. This
leads to more efficient execution times and reduced memory overheads.
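Python generators give a simple picture of this streaming execution model; this is a sketch of the idea, not OGSA-DAI's implementation:

```python
def retrieve(rows):
    """Activity 1: stream rows from a (simulated) database query."""
    for row in rows:
        yield row

def transform(stream):
    """Activity 2: transform each row as it arrives, without waiting
    for retrieval to finish."""
    for postcode, n in stream:
        yield f"{postcode},{n}"

def deliver(stream):
    """Activity 3: consume and deliver the transformed rows."""
    return list(stream)

# Rows flow through all three activities one at a time, so memory use
# is proportional to a single row, not the whole result set.
delivered = deliver(transform(retrieve([("EH1", 12), ("EH2", 4)])))
```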

OGSA-DAI is a 100% Java, free, open-source product licensed under the flexible Apache 2.0
licence.

It is independent of any specific application area, having been designed to be highly
customisable to satisfy data management requirements in a wide range of fields.

A demo

We have produced a demonstrator that implements our early warning scenario. This demo:

       Runs a query across two health databases and uses a third database to map post codes
       to latitudes and longitudes.
       Converts the data into KML, the geographical markup language.
       Uses Google Maps in the client to visualise the KML.
       Includes a demonstration server that provides the client WWW pages and manages
       interactions with the OGSA-DAI server.

Please feel free to visit the demo.

In this demonstration the only application-specific components, that is, the only components
specifically relating to health data are the databases and the client code.

The workflow executor, distributed query processor, database query, and KML conversion
components are all standard OGSA-DAI components, independent of health or any specific
applications domain.

Other examples of OGSA-DAI at work
Analysing transport data

OGSA-DAI has been used in a number of application areas.

FirstDIG was a project that involved EPCC and FirstGroup plc, the UK's largest transport
operator. They had data spread across their departments. This data included:

       Customer contact, for example questions, compliments and complaints from customers.
       Daily vehicle mileages for bus services.
       Daily tickets sold and the money taken for the bus services.
       Schedule adherence, recorded via a satellite tracking system that notes whether a bus
       arrives and departs on time from each bus stop.
This data was held in relational databases and COBOL files.

This was in OGSA-DAI's early days so data integration was done by the client, but nowadays it
could be done on the server. Here, OGSA-DAI served as a single access point for the
databases. The client didn't need to handle individual database connections, locations, logins or
passwords. This allowed the data to be easily mined to see how late buses would affect ticket
revenues and complaints, for example.

Visualising social sciences data

SEE-GEO was a project that looked at SEcurE access to GEOspatial services. One aspect of
this work was combining census data and borders data. In SEE-GEO the data sources were not
traditional databases but web services.

Using OGSA-DAI they constructed a portal which allowed a user to submit a query, for
example "show me the population distribution of Leeds according to census output areas".

       The query parameters would populate an OGSA-DAI workflow. This workflow would get
       the relevant census data and then the relevant data on geographical regions (the
       borders data).
       It would then join these, producing a set of geographical regions annotated with the
       census data.
       This data would be transformed into an image file by the use of an image creation
       service - converting the annotated regions into a set of shaded polygons.
       The image would then be delivered to a map server and the URL of the image returned
       to the portal.
       The portal would then get the image and display it to the user.

OGSA-DAI is an innovative distributed data management product that contributes to a future in
which researchers and business users move away from technical issues such as data location,
data structure, data transfer and the ins and outs of data integration and instead focus on
application-specific data analysis and processing.

OGSA-DAI has been under development since 2002 and is currently an open source project
managed by EPCC, The University of Edinburgh. It has contributed to, and continues to
contribute to, the success of projects and organisations worldwide.
