
CTCD Data Management Strategy




        Prepared by: Paul Taylor
        Version:     1.0 Draft
        Date:        27/09/2011


                                                        Contents

1      INTRODUCTION
2      CTCD DATA ASSETS AND REQUIREMENTS
    2.1        Introduction
    2.2        Requested data sets
    2.3        Internally generated data sets
    2.4        Progress of data delivery
3      DATA MANAGEMENT STRATEGY
4      DATABASE ISSUES
    4.1        Database system selection
    4.2        MySQL specific issues
    4.3        Data backups
5      DATABASE STRUCTURE
    5.1        CTCD Users
    5.2        Metadata structure
    5.3        Other data tables
6      DATABASE USER INTERFACE
    6.1        Options
    6.2        Metadata queries
    6.3        Data Queries
7      SUMMARY


Table 2.1 Requested third party data sets
Table 2.2 Potentially useful third party data sets
Table 2.3 Third party Earth Observation (EO) data sets used by the CTCD
Table 2.4 Third party data sets obtained by the centre predating the data manager's appointment
Table 2.5 Internal CTCD data sets
Table 2.6 Missing data sets
Table 4.1 Features of MyISAM and InnoDB tables
Table 5.1 DATA_SET table description
Table 5.2 DR_QUI table description
Table 5.3 DATA_CONTACTS table description
Table 5.4 CONTACTS table description
Table 5.5 DATA_EQUIPMENT table description
Table 5.6 EQUIPMENT table description
Table 5.7 DATA_VARIABLE table description
Table 5.8 VARIABLE table description
Table 5.9 DATA_REFERENCE table description
Table 5.10 REFERENCE table description



1 Introduction

The CTCD officially came into existence on January 1st 2002. Following a period of
recruitment, the centre managed to fill posts for the research strands. However, after twice
failing to recruit a full time data manager, an alternative was required quickly. Paul Taylor
(based at Forest Research, Alice Holt) was appointed in May 2003 as an interim data
manager.

The CTCD requires many different forms of data to meet its objectives. These data sets are
held in different formats and are spread across many different institutions – each with its own
data release policy. The acquisition, storage and transfer of data were identified as key issues
to be addressed by the data manager.

The major tasks associated with the data manager are as follows:
    Assess the data needs for each project – including prediction of future projects
    Identify information on data holdings and access rights
    Set up data transfer channels
    Clarify common data needs and resolve compatibility issues
    Define an archiving policy
    Define the best way to collaborate with the NERC data centres

According to the data manager contract, the first three deliverables are:
   1. A report summarising internal and third party data sets required by CTCD partners
       together with a proposed data management approach and how to work with the
       NERC data policy
   2. Acquisition of required third party data sets
   3. A report describing the proposed data management strategy.

This report is intended to cover deliverable 1, provide an update on progress with the ongoing
deliverable 2 and give a broad outline of deliverable 3. As yet, the metadata has not been fully
collected, so there are limits to the level of detail possible in the database design. However,
most of the design points are already known and the remaining detail can be added as
required. The metadata tables are an integral part of the database, as they can be queried
efficiently to locate the actual data required. They are also a good place to store licensing
requirements, so that these can be provided alongside any data supplied by the database.
Intellectual property rights and licensing are a vital part of the CTCD's data management
strategy. Most of the data sets that the CTCD uses are not wholly created by the centre and it
is vital to uphold the wishes of those who have provided the data we need.






2 CTCD data assets and requirements

2.1   Introduction

The CTCD produces and requires many different data types to meet its deliverables. In
particular, the models in use by the CTCD require large data sets as inputs and additional data
to validate their outputs.

With any model, there is a disparity between what data the modeller would ideally like to use
and what data are actually available. The availability of meteorological data is an interesting
case in point. Met data are critical for growth models as they tie in with most elements of
physiology. The SDGVM model creates its own soil information (as reliable soils data sets are
rare); however, a lengthy time series of meteorological data is required to model the build-up
of soil, and long, consistent data sets are few and far between. From the CTCD's point of
view, it is important for the models to have a large spatial range, which is simply not possible
based on measured daily data. Modelled meteorological data (based on long-term climatic
data) are more widely available, but in this case the metadata needs to be clearly available and
understood so that caveats can be included with the results.

This section includes details of the requested and delivered data sets that are important to the
centre.

2.2   Requested data sets

One of the first tasks for the data manager was to identify and prioritise the different data used
by the centre. All the data sets listed in this section are based on returns from the CTCD
research assistants. The main (urgent) data sets requested for use by the centre are listed in
Table 2.1 below. This includes data sets identified at the start of the contract and those sets
that were subsequently identified as important. Table 2.2 includes the data sets identified as
being of possible use but which have not yet been requested.

Table 2.1 Requested third party data sets
Data Set                                       Requested by             Data owner
UK Soil Carbon Map                             York                     DEFRA
Weather station met data for Kielder           Sheffield                Met Office/BADC
Central England Met data                       Sheffield                Met Office/BADC
Sub compartment database                       Sheffield/ York          Forestry Commission/
                                                                        Forest Enterprise
Woodland Inventory                             Sheffield                Forest Research
Sitka Spruce data (various)                    UCL                      Forest Research &
                                                                        University of
                                                                        Edinburgh
River discharge (various UK and                Sheffield                Various (incl. National
European)                                                               Water Authority -
                                                                        CEH)
Various Coalburn datasets (e.g. flow data)     Sheffield                CEH
Various Coalburn datasets – daily met data,    Sheffield                BADC, Cranfield
soils etc.
Harwood FC soil classification map              York                 Forest Research (TSU)
Various soils data                              York                 Cranfield (NSRI)
Various soils data                              York/FR              Macaulay (MLURI)
ECMWF-40 data                                   Sheffield            BADC
2001 5km gridded monthly met data for the       Sheffield, FR        Met Office
UK
2001 met data from individual stations          Sheffield, FR        BADC
CEH catchment boundaries                        York                 CEH
Water colour change records                     York                 University of Leeds
Geological data                                 York                 BGS
UKCIP data                                      Sheffield, FR        Met Office
LCM 2000                                        Sheffield            CEH
Table 2.2 Potentially useful third party data sets
Data set                                    Data owner                 Comments
Flux data                                   CEH
Forest Research data other than that        FR
mentioned in 2.1
EU soils map                                EU-JRC                     Basic soils data on a
                                                                       European scale.
National vegetation classification          Lancaster University

Table 2.1 and Table 2.2 do not include the Earth Observation (EO) data sets, which can be
found in Table 2.3. These data sets are treated separately because of their size and because
they require specialist knowledge to handle correctly. It was decided at an early stage that,
given the familiarity of the workers already involved with these data sets and their sources,
they (and not the data manager) should continue to be responsible for collection and storage
to meet their own needs. The raw data for these sets come from third parties. However, they
are all processed to produce meaningful results, and there is a question as to the point at
which these products become essentially CTCD data sets. The data sets generated as part of
the CTCD (such as the flux data recorded by Edinburgh) are not included in this section.
Table 2.3 Third party Earth Observation (EO) data sets used by the CTCD
Data set                             CTCD             Notes
                                     contact
VGT S10 Europe                       T Quaife         Surface reflectance based on a 10-day
                                                      maximum NDVI composite
VGT NPP/NEP Europe                   T Quaife         NEP and NPP estimates derived from
                                                      SPOT-4/5 VGT data
MODIS level 4 surface                T Quaife         Most recent reprocessing of the daily
reflectance                                           surface reflectance product from MODIS
MODIS level 4 surface                T Quaife         ERS data for East Anglia
reflectance
MODIS (Terra) vegetation             M Disney         MODIS (Terra) NDVI and EVI over UK
continuous fields product                             – 4 MODIS tiles stitched and masked for
                                                      GB and NI
MODIS (Terra) NPP                    M Disney         MODIS (Terra) net primary productivity
MODIS LAI/fAPAR data                 M Disney         MODIS LAI/fAPAR over UK – 4
                                                      MODIS tiles stitched and masked for
                                                      GB and NI
MODIS land cover data             M Disney         MODIS land cover over UK – 4 MODIS
                                                   tiles stitched together and masked for
                                                   GB and NI.
AVHRR FASIR GLOBAL                M Disney         Fourier-Adjustment, Solar zenith angle
NDVI                                               corrected, Interpolated, Reconstructed
                                                   (FASIR) Normalised Difference
                                                   Vegetation Index (NDVI)
Phenology data                    M Disney         Monthly and 10-day composite AVHRR
                                                   NDVI
MODIS nadir BRDF-adjusted         M Disney         MODIS land surface temp and
reflectance and albedo                             emissivity data over UK – 4 MODIS
                                                   tiles unprocessed
MODIS (Terra) NDVI and EVI        M Disney         MODIS (Terra) NDVI and EVI over UK
                                                   – 4 MODIS tiles unprocessed
MODIS Land surface                M Disney         MODIS land surface temp and
temperature and emissivity                         emissivity data over UK – 4 MODIS
                                                   tiles unprocessed.
AVHRR GLOBAL NDVI                 M Disney         Monthly and 10-day composite AVHRR
                                                   NDVI
ERS SLCI                          P Drezet         C band microwave backscatter measured
                                                   coherently by SAR (phase information is
                                                   coded as real and imaginary parts)
Envisat ASAR                      P Drezet         From ESA EO
Envisat                           P Drezet         Calibrated intensity

Finally, Table 2.4 includes other third party data sets already held by the CTCD before the
data manager joined.

Table 2.4 Third party data sets obtained by the centre predating the data manager’s
appointment
Data set                          CTCD             Notes
                                  contact
OS Map data                       P Drezet         Includes DEM and raster data for
                                                   Kielder, Harwood and Thetford
Global Simulation of carbon       S LaFont         GPP and NPP, autotrophic and
fluxes from the TURC model                         heterotrophic respiration
LCM 2000                          A Heinemeyer     From CEH
                                  (only eligible
                                  user)
CEH countryside information       A Heinemeyer     York only. From CEH
system
Hadley temperature scenarios      A Heinemeyer     At 10’ scale up to 2100
                                  / P Taylor
UKCIP data                        A Heinemeyer
UK soil carbon and vegetation     A Heinemeyer     1km gridded data (from CEH). 1998 data
data                                               only
CORINE landcover map              A Heinemeyer     SEI licence only
Worldwide organic soil carbon     A Heinemeyer     From ORNL
and nitrogen data
Global CO2 flux in soil              A Heinemeyer      From ORNL
respiration
Worldwide bulk density and soil      A Heinemeyer      IGBP
carbon database
Worldwide soil carbon and soil       A Heinemeyer      0.5° grid. ISRIC
pH database

2.3   Internally generated data sets

Most of the CTCD research strands generate their own data products. These vary from actual
field measurements (such as the flux measurements) to modelled data (from the models – or
from modules of the models – developed within the centre: ForestETp, SDGVM and
SPA). The field data is of great importance and, apart from quality assurance and formatting,
remains essentially the same as when it was measured. As the models are still evolving, their
products can vary significantly from week to week. Table 2.5 lists the various internally
produced data sets. I have not included the products of the analysis of EO data sets here as
they are already listed in Table 2.3.
Table 2.5 Internal CTCD data sets
Data Set                             Contact           Notes
Products of ForestETp                S LaFont          Modelled data (still being updated)
Products of SDGVM                    M Lomas           Modelled data (still being updated)
Products of SPA                      M Williams        Modelled data (still being updated)
Products of ACM                      M Williams        Modelled data (still being updated)
UCL fieldwork                        M Disney          Reflectance, LAI, canopy cover,
                                                       atmospheric optical depth at Harwood,
                                                       Barton Bendish, East Anglia and San
                                                       Rossore
Airborne remote sensing              M Disney          Airborne TM, CASI and LiDAR data
Sampling of soils data               A Heinemeyer      Variety of parameters sampled at
                                                       Harwood
Harwood eddy covariance              C Nichol          Ongoing data collection
Hyperspectral data collection        C Nichol          Measurement of hyperspectral
                                                       reflectance from a tower and airborne
                                                       platform
Leaf biochemical data                C Nichol          Leaf biochemical data from a High
                                                       Performance Liquid Chromatography (HPLC)
                                                       system


2.4   Progress of data delivery

The data sets requested by the CTCD involve establishing relationships with a number of
different groups with their own working practices and expectations. It is the data manager’s
responsibility to act as a central point of contact for all external data requests except for those
involving Earth observation data sets. As a general principle, this approach is sensible.
However, at first this did lead to some confusion as the data manager took over from people
who had previously been chasing up some data sets.


There have been noticeable delays in obtaining some of the data sets specified in section 2.2.
In some cases, this is due to problems with the data sets themselves (sets that were thought to
be ready for release until problems were identified). Technical issues have been a problem with
some of the meteorological data sets (for example the ECMWF-40 data had to be recalculated
to fit a grid that would be useful to the centre, and during that process the BADC’s hardware
encountered problems). In another example, the BADC have been grateful to the centre for
specific requests (individual station data for 2001), as it allowed them to present evidence to
the Met Office about the needs of their clients.

There was also the issue of being listed as a partner under the NERC entry in the Civil Service
Year Book (CSYB). An entry in the CSYB would give the CTCD access to cheaper data from
institutions such as the Ordnance Survey. The CTCD was not included earlier in the year;
however, after contacting NERC, the centre should appear in the next publication (the CSYB
is updated infrequently).

Most of the problems can be dealt with on an individual data set basis. However, there are
still several data sets that have not yet been obtained. These are listed in Table 2.6, below.
The table also lists the reasons why the data has not yet been obtained for the CTCD. Most of
these cases are ongoing.

Table 2.6 Missing data sets
Data Set                               Notes
Soil characteristics (Scotland)        Macaulay (the data holders) are requesting a
                                       large fee for the use of their data.
LCM 2000                               Unclear as to how useful this would be to the
                                       CTCD as a whole except as a comparison
                                       with the MODIS land use data set. This set
                                       would also be quite an expensive data set to
                                       obtain for the centre. Awaiting conclusions
                                       from a phenology/land use meeting held
                                       recently.
Woodland inventory                     Results from England and Wales are already
                                       available on the ftp site. However, the
                                       Scottish data is not yet ready.
2001 5km gridded meteorological data   The set has not yet been generated. However,
                                       the Met Office is in the process of doing so and
                                       is deciding whether there is enough demand to
                                       make it widely available.
Harwood soils map                      The soil map has been identified and appears
                                       promising; it should be very useful for future
                                       validation work.
CEH catchment boundary                 Unclear status
Water colour change from University of Unclear status
Leeds
Geological data from BGS               Unclear status

Apart from the above data sets, the requested data has been obtained.





3 Data management strategy

A cohesive data strategy is vital to the centre. As the centre comprises researchers based at
different institutions working on diverse yet interconnected strands of work, the underpinning
links need to be strong. Data sharing is an important part of this.

For the centre to operate effectively, inter-partner communication and awareness is
fundamental. Part of the solution to this is a regularly updated website, and this is also
instrumental in raising the CTCD's profile in the research community. Data-related issues are
key to the smooth running of the centre.

Several issues need to be addressed. The data manager of the CTCD acts as a central point for
all external data requests for the centre. This allows the centre’s researchers to have a
consistent, individual contact when they need new data sets and means that external bodies
only ever have to deal with one person. This makes it much easier to deal with NERC data
centres.

As each of the research strands is connected, it is important for every member of the centre
to be aware of what data is available within the centre. This extends not only to knowing in
general terms what data exists, but also details such as the method of collection, the period
and area for which the data is relevant, as well as many other factors. The obvious place for
this is a metadata database held centrally and easily accessible to all of the CTCD’s
researchers. Given the different hardware and software in use across the centre, the database
needs to be accessible in a format available to all. This is discussed in more detail in the
following sections, but the
most satisfactory method is to produce a web site that is able to query the database. The
querying facility will be available through a number of forms and will produce files for the
user to download.
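
As a rough illustration of how such a form-driven query might work (a sketch only: the table
and column names follow the draft metadata structure in section 5, and the output path is a
placeholder), a keyword-and-area search could be translated into a MySQL statement that
writes its results to a file for the web site to offer as a download:

    -- Hypothetical query behind a "search by keyword and area" form; the output
    -- file path is a placeholder and the schema follows the draft in section 5.
    SELECT ds.SET_ID, ds.TITLE, dq.AREA_NAME, dq.START_DATE, dq.END_DATE
    INTO OUTFILE '/tmp/ctcd_query_results.txt'
    FROM DATA_SET ds, DR_QUI dq
    WHERE ds.SET_ID = dq.SET_ID
      AND ds.TITLE LIKE '%soil%'
      AND dq.AREA_NAME LIKE '%Harwood%';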

As well as metadata, the database is the ideal place to store key datasets. Exactly which data
sets will be stored in it is still a matter for discussion, as the details of the datasets regularly
used by the centre are still being identified as part of the metadata collection exercise. The
structure of the database cannot be finalised until this exercise is complete. The Earth
Observation data (largely images) could be stored in the database, but as this would not be the
primary storage area for these images this is inefficient. Instead, links to the data sets (either
as actual links, or as contact email addresses) will be stored.

Ideally, the data sets should be as complete as possible. In practice, this is often not the case,
given mechanical and other problems. How data gaps are approached is therefore a matter for
discussion.

Gap-filled data has intrinsic problems. One of the main issues is that most users never stop to
question the values presented to them: if someone asks what the maximum temperature was
on a certain date and gets a value back, they will not stop to ask whether that particular value
was actually measured. Gap-filled data (usually modelled data) therefore needs to be flagged
as such in the data storage (usually a database). One of the simplest ways of gap filling
meteorological data is to assume that the weather on the missing day was the same as that on
the previous day. It is as valid as any other method in use, and the same approach could be
applied to a month of missing data, but it has obvious problems. This is possibly the simplest
form of modelled
data. More complex models can be used to gap fill, but still have the disadvantage that they
are essentially guesses. The two main solutions to this problem are that either the database
could hold modelled data which is flagged as not being original or the gaps can be left and the
user can identify where modelled data are required. In the former case, checking whether all
the data are real (i.e. looking at the data flags) is the responsibility of the user. In the latter
case, the user needs to identify gaps in the data and decide how to fill them. As a compromise,
the front end to the database can ask the user whether or not they want the data to be gap
filled. This brings the problem to the notice of the user without enforcing a blanket decision.
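
As a rough sketch of the first approach (the table and column names here are hypothetical,
since the data tables have not yet been designed), a daily meteorological table could carry a
flag alongside each stored value, and the front end would then issue one of two queries
depending on the user's choice:

    -- Hypothetical daily met table with a gap-fill flag; all names are illustrative only.
    CREATE TABLE MET_DAILY (
      STATION_ID  SMALLINT NOT NULL,
      OBS_DATE    DATE     NOT NULL,
      MAX_TEMP    DOUBLE,
      GAP_FILLED  ENUM('Y','N') NOT NULL DEFAULT 'N',  -- 'Y' = modelled/filled value
      PRIMARY KEY (STATION_ID, OBS_DATE)
    ) TYPE = MyISAM;

    -- Measured values only (gaps appear as missing rows):
    SELECT OBS_DATE, MAX_TEMP FROM MET_DAILY
    WHERE STATION_ID = 101 AND GAP_FILLED = 'N';

    -- Gap-filled series, with the flag returned so modelled values remain visible:
    SELECT OBS_DATE, MAX_TEMP, GAP_FILLED FROM MET_DAILY
    WHERE STATION_ID = 101;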

One of the main data sets in current use is the 5km gridded meteorological data set. Thirty
year averages (1961-1990) of monthly data are fed into a weather generator model. The
gridded data set is already interpolated from available individual met station means (based on
position and availability of data). The number of stations available varies greatly over a thirty
year time step and an unreliable station might be used in one month’s calculations and not in
another’s. Whether a station is used in the calculation for a given month depends on the
number of missing days of data. One or two missing days is considered acceptable, so the
data produced is already skewed. To create a map of ForestETp outputs for the UK, the model
is run for a grid of points. The grid used for this is determined by the grid of the NSRI soil
data, which is offset from the met data grid. The result of this is that the daily meteorological
input to the model is modelled data based on the interpolation of an already interpolated data
set: one built on thirty-year average values (with no measure of spread given), from a variable
number of stations of varying quality, whose data collection may itself have changed over the
period. This is a data-filling exercise, although there are obviously a number of issues that the
user needs to be aware of.

Using actual daily data from nearby stations is also limiting. For example, loading, formatting
and checking the data is a long process. There may not be conveniently placed stations close
to our selected grid points. Missing variables, short runs of data and gaps in the time series are
also problematic. Gap filling is important in these cases, but the protocols used have to be
transparent and consistent.

Data ownership issues are important in the CTCD, with the mix of third party and internal
data sets. With the third party data sets, clear licensing arrangements have been drawn up and
it is important (both legally and in the interest of fostering good relations) that these
agreements are kept. To that end, the licence agreements will always be presented to those
accessing the data (or metadata) through the website. In many cases, the agreement simply
states that acknowledgement of the data source be made in any publications involving the
data. There is also the question of the ownership of data sets generated within the CTCD. The
experiments have been financed by the centre, but it is still not clear whether the ownership of
the data lies with the CTCD or the researcher’s institution.

 The question of GIS is an important one within the centre and the issue has been raised
several times. There is a lack of GIS expertise within the CTCD but a significant need has
been identified. Many of the data sets and research strands are geographically broad in nature
and GIS would be a good way to tie them together. Training on GIS software is pending at the
University of Sheffield (at the time of writing), and final decisions have not yet been made.
However, one of the probable outcomes is the saving of data (on the same disk as the
database and most likely linked to it) in a GIS-compatible format. There are potential
software compatibility issues associated with this, but these may be simple to resolve.


4 Database issues

The database will be stored on the large (terabyte) disk at Sheffield University. More workers
are based at the University of Sheffield than at any of the other centres. It has a very good
communications network, allowing other researchers (particularly those based at universities)
easy and fast access.


4.1   Database system selection

Given the cost of Oracle, there was reluctance to use it as the database system. Despite its
advantages (rollback, security, good support, wide adoption and ease of use from many
different applications), the purchase and licensing costs for Oracle are significant. The
alternative solutions include DB2, Microsoft SQL Server, Microsoft Access and MySQL.
Each of these has potential problems. Microsoft SQL Server is not renowned for its reliability
– particularly with multiple users. Microsoft Access is likely to already be on most people's
machines; however, from the server's point of view, it requires around ½ MB of RAM per
user of the website.

MySQL is an open source relational database system. It is free, with an optional payment for
a limited support service (a level of support that some organisations require before they will
rely on a system). As the system is open source, it is easily available and is constantly being
updated. It is very fast, and for simple queries it is one of the fastest relational databases
around.


4.2   MySQL specific issues

As MySQL is easily available and cheap, with many of the benefits of the other systems on
the market, it was the obvious choice. However, there are still drawbacks in using MySQL as
the database.

Currently, it does not have all the features within SQL that would be useful – there is no
capability to run nested queries or create views. Both of these features are quite significant;
nested queries are scheduled for inclusion in version 4.1 (the current production version of
MySQL is 4.0.16, although 4.1 is available as an alpha release), with views expected in a later
release. Other missing features include triggers and clusters. Unfortunately, the development
of MySQL relies on volunteers and a firm release schedule is difficult to identify.
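
In practice, the lack of nested queries is more of an inconvenience than a barrier, since most
subqueries can be rewritten as joins. As a sketch (using the draft metadata tables described in
section 5), a search for all data sets that have at least one associated reference could be
written as follows:

    -- Standard SQL would allow a nested query such as:
    --   SELECT TITLE FROM DATA_SET
    --   WHERE SET_ID IN (SELECT SET_ID FROM DATA_REFERENCE);
    -- MySQL 4.0 cannot run this, but an equivalent join works:
    SELECT DISTINCT ds.TITLE
    FROM DATA_SET ds, DATA_REFERENCE dr
    WHERE ds.SET_ID = dr.SET_ID;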

MySQL forms part of the LAMP (Linux, Apache, MySQL, and PHP) software model. This is
a popular set of tools for software development, particularly for the development of database-
enabled web sites. Following the LAMP model means low costs, reasonably high
performance and relatively easy configuration. As such, these tools are quite widely used.
There are few problems with licensing: the software can be used without restriction. It can be
sold to
others as part of a developed product, but the source code has to be made available to others.
There are some problems with open source software – not all companies will use free
products for fear of lack of support; the documentation is not always comprehensive; less
popular projects die off due to lack of developer support and developers can be arrogant when
responding to queries from those they consider inexperienced. However, none of these are a
problem with MySQL.


PHP (the PHP Hypertext Preprocessor) will be used on the website to query the database
(this is covered further in section 6). One problem with this set-up (on a Unix server) is that
Apache, PHP and MySQL may all have to be rebuilt if the version of MySQL is ever updated,
which is an argument for waiting until version 4.1 of MySQL is properly available. Reloading
the software poses no problems for the database itself, as taking the database temporarily
offline to complete the update will not have significant repercussions.

Table 4.1 Features of MyISAM and InnoDB tables
MyISAM                                               InnoDB
Do not support transactions                          Transactions supported
Each table stored as a separate file                 All tables stored in one file
Very fast query speed                                Slower query speed
Tables are independent                               Foreign keys supported, allowing greater
                                                     internal consistency
FULLTEXT supported                                   FULLTEXT not supported

MySQL uses two main sorts of data table structures – MyISAM and InnoDB. The differences
between the table types are summarised in Table 4.1. In general, databases constructed from
InnoDB tables are more structurally robust. However, transactions are not going to be an
important part of the database management in this case. Adding FULLTEXT to a table allows
a searching capability similar to that used by internet search engines. Essentially it creates
another table full of statistical information about the words contained in the relevant fields, so
that results can be returned efficiently and in the order of significance. Since one of the
primary uses of the metadata database is to search for specific data sets, the tables in this
database will be largely MyISAM tables.
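
As an illustration of the searching this enables (a sketch only, using two of the FULLTEXT
columns from the draft DATA_SET table in section 5), an index on the title and abstract
supports relevance-ranked keyword searches:

    -- Add a FULLTEXT index covering the title and abstract (MyISAM tables only):
    ALTER TABLE DATA_SET ADD FULLTEXT ft_title_desc (TITLE, DATA_DESC);

    -- Keyword search; with no ORDER BY, MySQL returns rows in order of relevance:
    SELECT SET_ID, TITLE
    FROM DATA_SET
    WHERE MATCH (TITLE, DATA_DESC) AGAINST ('soil carbon');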

4.3    Data backups

Details of the backups to the database have yet to be finalised. MySQL does not have the
advanced rollback facilities of Oracle, which means that, between forced backups, any data
added to the database could be lost. While this would matter in most databases, the risk is
smaller in this case: the database is not one that will be constantly updated during working
hours, and a dump¹ of the database could either be generated automatically after tables have
been appended to or run manually at regular intervals.

The mysqldump command creates a file which consists of a series of SQL statements that
could be used to recreate the entire database. It is possible that another user could make
changes to the table as the backup was running, which would result in a backup that was not
absolutely up to date. Given the likely frequency of updates to the tables in this database, this
should not be an issue if the dumps are properly scheduled. It is possible to stop the database
server
and copy the entire directory, guaranteeing an identical copy of the database. This would be
viable as long as specified down-times were not a problem.
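
As a sketch of what the scheduled dump and a subsequent restore might look like (the user,
database and file names below are placeholders rather than agreed values), the command lines
would be along these lines:

    # nightly dump (run manually or from cron; names are placeholders)
    mysqldump -u ctcd_admin -p ctcd_db > /backups/ctcd_db_dump.sql

    # restore by replaying the SQL statements in the dump file
    mysql -u ctcd_admin -p ctcd_db < /backups/ctcd_db_dump.sql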

In most cases, the data on the database will not be the only copy. The individuals who supply
the data will each have their own copy. Even in the event that all backups of the database are
lost, it should not be a catastrophic event. Given a timetable of regular dumps and the disk
itself having a backup system in place, these precautions should be sufficient. Running
backups of the data itself will be the responsibility of someone on site.

¹ In this document the word 'dump' refers to the process of generating a file that can be used to set up a
copy of the database. It is essentially a backup mechanism, but that term was avoided to prevent
confusion with a backup of the disk.






5 Database structure

The design of the database is covered in this section. This includes the metadata database and
a brief overview of the structure of any tables storing the actual data. The structural
description includes the table descriptions and a relationships diagram. Any additional tables
added to the structure will be linked to the metadata tables; hence the ability to search the
data will be the basis of the entire design.

The design of the tables is very important. Re-designing the tables (such as adding or
removing columns) once the database is in use means severe reductions in efficiency. It also
means that the underlying data files all have to be changed. The structure of the database as a
whole also affects efficiency. There is a compromise to be made between the repetition of
data within tables and the proliferation of tables which need to be linked to extract meaningful
data. The more links required in queries, the longer it takes to generate results; the fewer the
tables, the more repetition of data and the less efficiently it is stored. This database has been
designed to optimise query times and minimise the amount of duplication.


5.1   CTCD Users

The first set of tables will not be reproduced in any detail as they are automatically generated
by most relational database systems. These are simple data management tables that contain
details of the system users (user names, passwords, etc) and what rights they have to query
which tables. There will be a default username for non-CTCD visitors to the web site, which
will allow the visitor to the site to run select queries of the metadata database. CTCD users
will have more rights – so they will be able to run select queries on the actual data tables
(depending on the specific licensing issues).
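
A minimal sketch of how these rights might be granted in MySQL is shown below (the
account names, host patterns and passwords are placeholders, and the exact privilege list is
still to be agreed):

    -- Read-only account for non-CTCD visitors, limited to the metadata tables:
    GRANT SELECT ON ctcd_db.DATA_SET TO 'visitor'@'%' IDENTIFIED BY 'example_password';
    GRANT SELECT ON ctcd_db.DR_QUI TO 'visitor'@'%' IDENTIFIED BY 'example_password';

    -- CTCD researcher account with read access to the whole database:
    GRANT SELECT ON ctcd_db.* TO 'ctcd_user'@'%' IDENTIFIED BY 'example_password';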

There is an unresolved question as to whether the data manager or the actual CTCD
researchers will be responsible for updating tables. For example, a form could be supplied on
the web site that would allow individuals to add metadata to the tables. It is technically
possible for people to upload their actual data to the database remotely via the web site (e.g.
this would allow Dr Nichol to add her flux data at the click of a button). However, there are
potential problems with this process, and it may be simpler to add the data via the data
manager so that there is a degree of control over the process and somebody on hand to deal
with any hiccups.

It should be noted that whoever adds the data is responsible for the data quality. Before being
entered on the database, the researcher should be satisfied that the quality of the data meets
their standards. As previously mentioned, there are different methods for dealing with
dubious data points and a central policy needs to be agreed in the near future.


5.2   Metadata structure

The following table descriptions show the structures in which the data will be stored. Most of
the terms used here are self-explanatory, but some require greater elaboration. The data
types listed (mainly text based) are dependent on the type of data held and the size of the data.
These tend to be fixed. In the Misc. column, PK refers to a primary key and FT refers to a
column defined as FULLTEXT (see section 4.2). Note that where more than one column in a
table is labelled as being a primary key, they form a joint primary key.

The DATA_SET table is the key part of the structure, and all the other tables link to it directly
or indirectly. Note that there is a need for a number of linking tables – tables that do not
contain information that is going to be extracted at any time. In each case they are required
for structural reasons. The metadata structure is represented in Figure 5.1.




Figure 5.1 Structure of the metadata part of the database


Table 5.1 DATA_SET table description
Column Name                Description                                      Data         Misc
                                                                            Type
SET_ID          A unique identifier for each main data set                  TINYINT      PK
TITLE           The full title of the data set                              TEXT         FT
LAST_DATE       The last revised date                                       DATE
PRIN_INVEST     ID relating to an entry in the contacts table               TINYINT
OWNER_CONTACT   ID relating to an entry in the contacts table.              TINYINT
                In this case, the entry will usually be part of
                an organisation
LICENCE         Details of any restrictions on the use of the               TEXT
                data. If acknowledgements are required, they
                are noted in the references table.
DATA_DESC       An abstract giving a brief description of the               TEXT         FT
                data set
MEAS_METHOD     Description of how the data was originally                  TEXT
                collected
ANALYSIS_METHOD Brief description of the analysis used to                   TEXT
                reach the final products
ANALYSIS_SEQ    Clear, ordered list of the sequence of steps in             TEXT
                the analysis
ANALYSIS_TECH   Any special software, simple                                 TEXT
                algorithms/formulae used
QUAL_CONTROL           Details of quality control applied to the data    TEXT
                       set
ACC_JUDGEMENT          Subjective estimate of the level of data          TEXT
                       quality
QUANT_ERROR            Any quantitative measures of error in the         TEXT
                       data set
DATA_ISSUES            Any issues that might arise from the data set     TEXT
                       – particularly from those not used to this type
                       of data
FILE_FORMAT            The way that the data is stored (e.g. on the      TEXT
                       database, ASCII files, jpegs, etc)
BYTE_ARCHITECT         The platform used to produce the files          TINY
                                                                       TEXT
DATA_FORMAT            The layout of the files – type of delimitation, TEXT
                       column orders, header information
DATA_SET_SIZE          Estimated total size of data set                TEXT
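
To make this concrete, a draft MySQL definition corresponding to Table 5.1 might look as
follows (a sketch only: TYPE=MyISAM is the MySQL 4.0 syntax for selecting the table type,
the FULLTEXT index covers the two FT columns, and UNSIGNED is an assumption made
here to allow more identifiers):

    CREATE TABLE DATA_SET (
      SET_ID           TINYINT UNSIGNED NOT NULL,
      TITLE            TEXT,
      LAST_DATE        DATE,
      PRIN_INVEST      TINYINT UNSIGNED,   -- ID in the CONTACTS table
      OWNER_CONTACT    TINYINT UNSIGNED,   -- ID in the CONTACTS table
      LICENCE          TEXT,
      DATA_DESC        TEXT,
      MEAS_METHOD      TEXT,
      ANALYSIS_METHOD  TEXT,
      ANALYSIS_SEQ     TEXT,
      ANALYSIS_TECH    TEXT,
      QUAL_CONTROL     TEXT,
      ACC_JUDGEMENT    TEXT,
      QUANT_ERROR      TEXT,
      DATA_ISSUES      TEXT,
      FILE_FORMAT      TEXT,
      BYTE_ARCHITECT   TINYTEXT,
      DATA_FORMAT      TEXT,
      DATA_SET_SIZE    TEXT,
      PRIMARY KEY (SET_ID),
      FULLTEXT ft_set (TITLE, DATA_DESC)
    ) TYPE = MyISAM;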


Table 5.2 DR_QUI table description
Column Name            Description                                       Data      Misc
                                                                         Type
LOCALITY_ID            A unique identifier for the space and time        TINYINT   PK
                       coordinates of the data set
SET_ID                 The ID from the DATA_SET table                    TINYINT
PROJECTION             For any EO data, the full details of the          TEXT
                       projection used
SPATIAL_TYPE           Whether the data are point, gridded, raster or    ENUM
                       vector data.
AREA_NAME              Name of the area being studied (e.g.              TINY
                       Harwood, UK, etc.)                                TEXT
SW_EAST                The south west limit of the data coverage, in     SMALL
                       GB Ordnance Survey Eastings (in m)                INT
SW_NORTH               The south west limit of the data coverage, in     SMALL
                       GB Ordnance Survey Northings (in m)               INT
NE_EAST                The north east limit of the data coverage, in     SMALL
                       GB Ordnance Survey Eastings (in m)                INT
NE_NORTH               The north east limit of the data coverage, in     SMALL
                       GB Ordnance Survey Northings (in m)               INT
SW_LAT                 The latitude of the south west corner of the      DOUBLE
                       data coverage.
SW_LONG                The longitude of the south west corner of the     DOUBLE
                       data coverage.
NE_LAT                 The latitude of the north east corner of the      DOUBLE
                       data coverage.
NE_LONG                The longitude of the north east corner of the     DOUBLE
                       data coverage.
SPATIAL_RES            The spatial resolution of the data set (if        TINY
                       appropriate) – the size of the smallest cell,     TEXT
                       including units
TEMP_TYPE            Whether the data was a one-off measurement        ENUM
                     or a series of ongoing measurements
START_DATE           Start of the data coverage                        DATE
END_DATE             End of the data coverage                          DATE
TEMP_RES             The temporal resolution (or the frequency of      TINY
                     measurements)                                     TEXT
DB_HELD              Is the data held in the database? (Y/N)           ENUM
DATA_LOCATION        The location of the file holding the data. This   TINY
                     will ideally be the name of the database table    TEXT
                      or a URL.

Table 5.3 DATA_CONTACTS table description
Column Name          Description                                       Data      Misc
                                                                       Type
SET_ID               ID from the DATA_SET table                        TINYINT   PK
CONTACT_ID           ID from the CONTACTS table                      TINYINT   PK

Table 5.4 CONTACTS table description
Column Name          Description                                       Data      Misc
                                                                       Type
CONTACT_ID           Unique identifier for the contact                 TINYINT   PK
CONTACT_NAME         The name of the contact as supplied      TINY
                                                              TEXT
ORGANISATION         The name of the organisation which the TINY
                     contact works for                        TEXT
ADDRESS_POSTAL       The postal address of the contact        TEXT
ADDRESS_EMAIL        The email address of the contact         TINY
                                                              TEXT
WEB_PAGE             Web page of the contact                  TINY
                                                              TEXT
FAX_NUM              The fax number of the contact            TINY
                                                              TEXT
TEL_NUM              The main telephone number of the contact TINY
                                                              TEXT
CTCD                 Is the contact a CTCD partner? (Y/N)     ENUM

Table 5.5 DATA_EQUIPMENT table description
Column Name          Description                                       Data      Misc
                                                                       Type
SET_ID               ID from the DATA_SET table                        TINYINT   PK
EQUIPMENT_ID         ID from the EQUIPMENT table                       TINYINT   PK

Table 5.6 EQUIPMENT table description
Column Name          Description                                       Data      Misc
                                                                       Type
EQUIPMENT_ID         Unique identifier for the piece of equipment      TINYINT   PK
EQUIP_NAME            Name of the piece of equipment used    TINY
                                                             TEXT
EQUIP_OPS      Brief overview of the scientific basis of the TEXT
               equipment
MANUFACT_NAME Name of the company that manufactured the TINY
               piece of equipment                            TEXT
MANUFACT_POST  Postal address of the manufacturer            TEXT
MANUFACT_PHONE Phone number of the manufacturer              TINY
                                                             TEXT
MANUFACT_EMAIL Email address of the manufacturer             TINY
                                                             TEXT
MANUFACT_WEB   Web page of the manufacturer                  TINY
                                                             TEXT
CALIBRATION    Calibrations required on the equipment/ TEXT
               adjustments required on the raw data before
               they are meaningful.

Table 5.7 DATA_VARIABLE table description
Column Name           Description                               Data        Misc
                                                                Type
SET_ID                ID from the DATA_SET table                  TINYINT   PK
VARIABLE_ID           ID from the VARIABLE table                  TINYINT   PK
VARIABLE_TYPE         Whether the data is measured, calculated or ENUM
                      modelled

Table 5.8 VARIABLE table description
Column Name           Description                               Data        Misc
                                                                Type
VARIABLE_ID           Unique identifier for the variable used   TINYINT     PK
VARIABLE_NAME         The name of the variable used            TINY
                                                               TEXT
UNITS                 The units of the variable                TINY
                                                               TEXT
DEFINITION            A definition of the variable             TEXT
MAX_VALUE             Maximum reasonable value                 DOUBLE
MIN_VALUE             Minimum reasonable value                 DOUBLE
MISSING_VALUE         What has been used to indicate a missing VAR
                      value in any data files submitted        CHAR(5)

Table 5.9 DATA_REFERENCE table description
Column Name           Description                               Data        Misc
                                                                Type
SET_ID                ID from the DATA_SET table                TINYINT     PK
REFERENCE_ID          ID from the REFERENCE table               TINYINT     PK





Table 5.10 REFERENCE table description
Column Name                Description                                       Data         Misc
                                                                             Type
REFERENCE_ID               Unique identifier for the reference           TINYINT           PK
REF_AUTHORS                List of comma separated names in the form: TEXT
                           Surname Initials.
REF_YEAR                   The year of publication                      SMALL
                                                                        INT
REF_TITLE                  The title of the paper                       TEXT
REF_JOURNAL                The name of the journal containing the paper TINY
                                                                        TEXT
JOURNAL_NUM                Details of which volume of the journal TINY
                           contains the paper                           TEXT
JOURNAL_PAGES              The page numbers of the paper                TINY
                                                                        TEXT
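
As an example of how these tables work together (a sketch only; the search term is
illustrative), a query for all data sets covering Harwood, together with the contacts associated
with each, joins the metadata tables through the DATA_CONTACTS linking table:

    -- Data sets covering Harwood, with the name and email address of each contact:
    SELECT ds.TITLE, dq.START_DATE, dq.END_DATE, c.CONTACT_NAME, c.ADDRESS_EMAIL
    FROM DATA_SET ds, DR_QUI dq, DATA_CONTACTS dc, CONTACTS c
    WHERE dq.SET_ID = ds.SET_ID
      AND dc.SET_ID = ds.SET_ID
      AND c.CONTACT_ID = dc.CONTACT_ID
      AND dq.AREA_NAME LIKE '%Harwood%';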


5.3   Other data tables
The structure of the rest of the database has not yet been finalised. Large images should not
be stored inside the database if it can be avoided: it is not an efficient use of space,
particularly as much of the data will already be held at a different site. Instead, a system of
links to the well-organised data structure at UCL is both more efficient and more practical.
The EO data held at Sheffield will probably be stored on the same disk as the database, and a
logical filing system similar to that used at UCL would be the sensible way to proceed.

Other data types held by the CTCD include GIS vector coverages. While it is possible to hold
these data sets in the database itself, there is little point, for the same reasons as for the EO
images. The use of GIS to extend the capabilities of the centre is an important issue, but there
is currently little GIS expertise within the centre, which makes it difficult to reach a decision
on these matters. For now, the most practical arrangement is for the data to be stored in a
central area, allowing researchers to access it and run GIS software on their own machines.

Each of the other data sets held by the CTCD needs to be considered for inclusion in the
database. Adding tables to the database at a later date is not much of a problem as long as the
structure is well thought through. Each set of data will require a substructure within the
database, which can be linked to the metadata part of the database via the DATA_SET and
DR_QUI tables. An additional table is required which lists all the data tables and identifies
the data set(s) they apply to. Most GIS packages allow access to databases, so GIS software
can be used to query the data held within the database if the user feels it is appropriate.
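
A minimal sketch of such a lookup table is given below; the table and column names (DATA_TABLE_LIST, TABLE_NAME, NOTES) are assumptions used purely for illustration, not the final design.

    -- Illustrative only: records which database table holds data for which data set.
    CREATE TABLE DATA_TABLE_LIST (
        TABLE_NAME  VARCHAR(64) NOT NULL,          -- name of the table holding the data
        SET_ID      TINYINT UNSIGNED NOT NULL,     -- ID from the DATA_SET table
        NOTES       TEXT,                          -- e.g. where any external files are held
        PRIMARY KEY (TABLE_NAME, SET_ID)
    );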

Where the data is not stored in the database and direct links to the data are not possible (e.g.
large data sets stored on CDs), users who need the data (both internal and external to the
CTCD) can request it using the contact details provided.






6 Database user interface

There are several options available for delivering data to the user. At one end of the scale, a
new piece of software could be designed for CTCD partners to install on their machines. At
the other end, regular dumps from the database could be left on the ftp site, in which case the
responsibility for checking that they have the most recent version, and for any querying of the
data, rests with the user. Neither of these solutions is viable given the restrictions on time and
expertise (on the part of either the data manager or the database users).


6.1   Options

One potential solution is to set up CTCD machines as slaves of the master MySQL server on
the large disk at Sheffield. Given the complications of the various networks, this is only really
viable for machines at Sheffield. Under this arrangement the database is copied onto each
slave machine and the master forwards every subsequent change it records, so the slave copies
are kept up to date automatically. Standard MySQL replication is one-way, so updates should
be made on the master; if the other machines were also to accept writes, a master-master
arrangement would be needed, which can cause problems if two machines update the same
table simultaneously, although this would be rare.
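
For reference, a minimal sketch of the statements involved in setting up replication is given below. It assumes a MySQL 5.x master with binary logging enabled (server-id and log-bin set in my.cnf); the host name, account details, log file and position are placeholders to be replaced with the values reported by the master.

    -- On the master (assumes server-id and log-bin are already set in my.cnf):
    GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%' IDENTIFIED BY 'secret';
    SHOW MASTER STATUS;    -- note the current binary log file name and position

    -- On each slave (placeholders; use the file and position reported above):
    CHANGE MASTER TO
        MASTER_HOST     = 'master.example',
        MASTER_USER     = 'repl',
        MASTER_PASSWORD = 'secret',
        MASTER_LOG_FILE = 'mysql-bin.000001',
        MASTER_LOG_POS  = 4;
    START SLAVE;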

The master-slave relationship essentially provides further backups of the database and spreads
the query load, increasing the speed of queries. It has the advantage that the failure of one
machine will not wipe out the database (if the master fails, it only prevents writing to the main
database). However, this system is probably overkill for the CTCD's requirements; simply
logging on to the machine across the network should be sufficient.

However, adopting this as a method of querying the database presupposes that the users will
want to learn how to use the MySQL version of SQL, or that the data manager will supply
scripts for standard queries. While both of these are plausible, the former can be time-
intensive and the latter is too restrictive.

The best way of reaching everyone in the centre is to allow access to the data through a
dynamic web site. As every worker in the CTCD has access to an internet connection, this is a
simple solution to the problem of access to the data, and PHP and MySQL work together very
well. Using a server-based query system means that no software beyond a web browser is
required on the user's machine and no specialist database knowledge is required of the user.
The main drawback is that it is potentially restrictive with regard to querying the data, since
each query has to be explicitly designed. However, it would be possible to give users the
option of writing their own SQL commands.

Of the methods discussed above, the dynamic web site is the best and most robust way of
disseminating the data, and it is therefore the method that will be used.


6.2   Metadata queries

The ability to examine the metadata in order to locate key data sets is one of the database's
most important features; this section of the dynamic web site is therefore the key part.


Initially, the user will be asked to log in, and their details will be checked against the
information held in the database. If the fields match, the user will be allowed to query the
metadata or, if they have the clearance (as all CTCD workers will), add new metadata via a
form on the site.

The user will be able to query the database based on location (name of site, within given
bounds of lat/long or eastings/northings, within a certain distance of a specified point, or via a
clickable map of the UK), time of measurement, key variables and/or key words in the title
and description. The site will also allow the user to specify which pieces of information they
would like returned from the search (there will, of course, be preset defaults). The results will
be listed either on the web site or in a series of text files (again, at the user's option). The
licensing issues relevant to the data set will always be returned (and will therefore not be
listed as an option).
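
As an illustration, a variable-and-keyword search of the kind described above might translate into a query along the following lines. The DATA_SET column names used here (SET_NAME, DESCRIPTION, LATITUDE, LONGITUDE, START_DATE), and the example values, are assumptions made for the sake of the sketch and do not necessarily match the final metadata schema.

    -- Illustrative sketch only: DATA_SET column names and values are assumed.
    SELECT ds.SET_ID, ds.SET_NAME, ds.DESCRIPTION
    FROM DATA_SET AS ds
    JOIN DATA_VARIABLE AS dv ON dv.SET_ID = ds.SET_ID
    JOIN VARIABLE AS v ON v.VARIABLE_ID = dv.VARIABLE_ID
    WHERE v.VARIABLE_NAME = 'LAI'                     -- key variable
      AND ds.LATITUDE  BETWEEN 53.0 AND 54.0          -- lat/long bounding box
      AND ds.LONGITUDE BETWEEN -2.0 AND -1.0
      AND ds.DESCRIPTION LIKE '%flux%'                -- key word in the description
      AND ds.START_DATE >= '2003-01-01';              -- time of measurement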

The option to add metadata will be available to CTCD members only. The form used to enter
the information will include a number of drop-down boxes based on entries already in the
database (so that the database does not, for example, accumulate countless copies of the same
contact details). This will also help with those data sets (such as the EO data) which share the
same broad description but differ in their location details.
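
Each drop-down of this kind could be populated with a simple query such as the one below; the CONTACT table and its column names are assumed here purely for illustration.

    -- Illustrative only: assumes a CONTACT table with these column names.
    SELECT CONTACT_ID, CONTACT_NAME
    FROM CONTACT
    ORDER BY CONTACT_NAME;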


6.3   Data Queries

Queries of the actual data will be quite restricted at first, until clearer user requirements are
identified. In many cases, the user will simply want to know where the data is. Those who
want to select information from a specific data set will be able to query it using a simple
interface (for example, restricting results to a particular data set; simple grouped summaries
will also be available). An SQL interface will be supplied for those with a more intimate
knowledge of the database, allowing them to define their own queries and links between
tables. Query results will be supplied as a text file for the user to download. Given the high
level of IT competence within the CTCD, this should be sufficient for most users. Every query
result of the actual data will also include a header giving details of the licence agreements the
user needs to be aware of.
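
For example, a simple grouped query might look like the sketch below; the FLUX_DATA table and its columns (SET_ID, OBS_DATE, NEE), along with the example SET_ID value, are assumptions used purely for illustration and not part of the final schema.

    -- Illustrative only: FLUX_DATA and its columns are assumed.
    SELECT SET_ID,
           YEAR(OBS_DATE)  AS obs_year,
           MONTH(OBS_DATE) AS obs_month,
           AVG(NEE)        AS mean_nee
    FROM FLUX_DATA
    WHERE SET_ID = 12
    GROUP BY SET_ID, YEAR(OBS_DATE), MONTH(OBS_DATE);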

Data requests that are likely to be made regularly, but are not catered for by the above
arrangement, can be added to the site at a later date. In the event that the user needs
something more specific (as a one-off query), contact details for the data manager will be
provided.

Most of the select queries on the data sets will be restricted to CTCD members. This applies
particularly to third party data sets, as we cannot pass these on to outside bodies. With data
sets owned by the CTCD, it is possible that the release policy may be more flexible.

Certain members of the CTCD will want to add their data to the database in a simple way, and
it is possible to do this via the web site. The initial conditions need to be agreed carefully
between the user and the data manager before the facility is arranged. For certain types of
data, however (e.g. the flux data), this would be a simple way of updating information with a
minimum of fuss.
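
An upload of this kind would ultimately reduce to an insert statement issued by the web site on the user's behalf, along the lines of the sketch below; as before, the FLUX_DATA table and its columns are assumptions used only to illustrate the idea.

    -- Illustrative only: FLUX_DATA and its columns are assumed.
    INSERT INTO FLUX_DATA (SET_ID, OBS_DATE, NEE)
    VALUES (12, '2003-07-14', -4.2);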







7 Summary

The key tasks of the data manager are to: assess data needs; act as a central contact for data
requests and supply the data requested; ensure that the licensing requirements of the third
party data sets are known to those who use the data; design a robust and accessible system for
storing and disseminating the data; and create the data management system.

Obtaining data sets for the centre is an ongoing process, which will continue throughout the
lifetime of the contract. However, the concept of a central contact is now well established
within the centre.

The best way of allowing everyone within the centre access to the data is to have some form
of server-based system. The simplest method is to create a web site with the ability to access
the data dynamically. A logical data structure is a prime requirement of such a system, and
should consist of a combination of a database and a well-organised filing structure. Using
metadata to list all data sets relevant to the centre and to keep track of their locations is the
most important part of the organisational structure. The web site can be used to query the
database simply and requires no specialist skills or knowledge on the part of the user.
Querying the database through a web site also means that the return of licensing information
to the user can be enforced.

MySQL was selected as the database system for reasons of cost and convenience. The system
has some drawbacks (the most serious of which are due to be resolved in the next version),
but none are insurmountable. It also works well with PHP, the scripting language used to
build the dynamic web site.

The CTCD requires many different types of data to meet its objectives. These data sets are
held in a variety of formats and are spread across many different institutions, each with its
own data release policy. The acquisition, storage and transfer of data were identified as key
issues to be addressed by the data manager.

The database and web server are due to be set up on the large disk newly purchased by
Sheffield University. Regular backups of the disk will need to be overseen by someone on
site.

When the system is up and running, CTCD users will be able to add data and metadata to the
database themselves. Which data sets will be included in the database has not yet been
decided. Initially, the protocols (mainly relating to formatting and quality control) for adding
data to the database will be made clear to the data providers; thereafter, responsibility for the
data will rest with the supplier.



