(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 8, 2010

    Optimization of Workflow Execution in ETL Using a
               Secure Genetic Algorithm
                             Raman Kumar, Saumya Singla, Sagar Bhalla and Harshit Arora
                                 Department of Computer Science and Engineering,
                      D A V Institute of Engineering and Technology, Jalandhar, Punjab, India.
                                            er.ramankumar@aol.in

   Abstract— Data Warehouses (DW) typically grow asynchronously, fed by a variety of sources which all serve a different purpose, resulting in, for example, different reference data. ETL is a key process for bringing heterogeneous and asynchronous source extracts into a homogeneous environment. The range of data values or data quality in an operational system may exceed the expectations of designers at the time validation and transformation rules are specified. Data profiling of a source during data analysis is recommended to identify the data conditions that will need to be managed by transformation rules and their specifications. This leads to the implementation of the ETL process. Extraction-Transformation-Loading (ETL) tools are sets of processes by which data is extracted from numerous databases, applications and systems, transformed as appropriate and loaded into target systems, including, but not limited to, data warehouses, data marts and analytical applications. Usually an ETL activity must be completed in a certain time frame, so there is a need to optimize the ETL process. A data warehouse contains multiple views accessed by queries, and one of the most important decisions in designing a data warehouse is selecting views to materialize for the purpose of efficiently supporting decision making. Therefore heuristics have been used to search for an optimal solution, and evolutionary algorithms for materialized view selection based on multiple global processing plans for queries have also been implemented. Conventional ETL systems work on the theory of random numbers; this research paper shows that the optimal solution for ETL systems can be reached in fewer stages using a genetic algorithm. Reaching the optimal solution earlier saves bandwidth and CPU time, which can then be used for other tasks. Therefore, the proposed scheme is secure and efficient for ETL information processing.

    Keywords- Extract, Transform, Load, Data Warehouse (DW), Genetic Algorithm (GA), Architecture, Information Management System, Virtual Storage Access Method and Indexed Sequential Access Method

                       I.    INTRODUCTION

   Companies know they have valuable data lying around throughout their networks that needs to be moved from one place to another, such as from one business application to another or to a data warehouse for analysis. The only problem is that the data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats. To integrate data into one warehouse for analysis, a tool is required which can integrate data from the various systems. To solve the problem, companies use extract, transform and load (ETL) software. Usually an ETL activity must be completed in a certain time frame, so there is a need to optimize the ETL process. A typical ETL activity consists of three major tasks: extraction, transformation and loading.

   This research paper studies the extraction and transformation stages. Data extraction can be seen as a reader-writer problem, which is reformulated here using multiple buffers instead of a single finite buffer. Transformation is a set of activities which convert the data from one form to another. This paper studies the use of a Genetic Algorithm to optimize the ETL workflow.

       A. Basic concepts of ETL

  a) Extract

   The first part of an ETL process involves extracting the data from the source systems. Most data warehousing projects consolidate data from different source systems, and each separate system may use a different data organization or format. Common data source formats are relational databases and flat files, but sources may also include non-relational database structures such as Information Management System (IMS), or other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even data fetched from outside sources by web spidering or screen-scraping. Extraction converts the data into a format suitable for transformation processing. An intrinsic part of the extraction is the parsing of the extracted data, which checks whether the data meets an expected pattern or structure. If not, the data may be rejected entirely.

  b) Transform

   The transform stage applies a series of rules or functions to the extracted data to derive the data for loading into the end target. Some data sources will require very little or even no manipulation of data. In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the end target (a short illustrative sketch follows the list):
    •     Selecting only certain columns to load (or selecting null columns not to load).

    •     Translating coded values (for example, if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female); this calls for automated data cleansing; no manual cleansing occurs during ETL.

    •     Encoding free-form values (for example, mapping "Male" to "1" and "Mr" to "M").

    •     Deriving a new calculated value (for example, sale_amount = qty * unit_price).

    •     Filtering.

    •     Sorting.

    •     Joining data from multiple sources (for example, lookup, merge).

    •     Aggregation (for example, rollup: summarizing multiple rows of data, such as total sales for each store and for each region).

    •     Generating surrogate-key values.

    •     Transposing or pivoting (turning multiple columns into multiple rows or vice versa).

    •     Splitting a column into multiple columns (for example, turning a comma-separated list stored as a string in one column into individual values in different columns).

    •     Applying any form of simple or complex data validation. If validation fails, it may result in a full, partial or no rejection of the data, and thus none, some or all of the data is handed over to the next step, depending on the rule design and exception handling. Many of the above transformations may raise exceptions, for example, when a code translation parses an unknown code in the extracted data.
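   To make the above transformation types concrete, the following minimal Python sketch shows how coded-value translation, derivation of a calculated value and filtering might be expressed as row-level rules. The names and rules here are hypothetical illustrations, not part of the system described in this paper:

    # Hypothetical sketch of row-level transformation rules.
    GENDER_CODES = {"1": "M", "2": "F"}  # source code -> warehouse code

    def transform_row(row):
        """Translate coded values and derive a new calculated column."""
        out = dict(row)
        # Translating coded values (1 -> M, 2 -> F); an unknown code raises
        # an exception, as noted for code translations above.
        out["gender"] = GENDER_CODES[row["gender"]]
        # Deriving a new calculated value.
        out["sale_amount"] = row["qty"] * row["unit_price"]
        return out

    rows = [{"gender": "1", "qty": 3, "unit_price": 2.5}]
    # Filtering: keep only rows with a positive quantity.
    loaded = [transform_row(r) for r in rows if r["qty"] > 0]
    print(loaded)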
  c) Load

   The load phase loads the data into the end target, usually the Data Warehouse (DW). Depending on the requirements of the organization, this process varies widely. Some data warehouses may overwrite existing information with cumulative, updated data every week, while other DWs (or even other parts of the same DW) may add new data in a historized form, for example, hourly. The timing and scope of replacing or appending are strategic design choices that depend on the time available and the business needs. More complex systems can maintain a history and audit trail of all changes to the data loaded in the DW. As the load phase interacts with a database, the constraints defined in the database schema, as well as in triggers activated upon data load, apply (for example, uniqueness, referential integrity, mandatory fields), and these also contribute to the overall data quality performance of the ETL process.

                       II.   ETL REQUIREMENTS

   A practical and secure optimization of workflow in ETL must satisfy the following basic requirements [1], [2], [3].

   ETL stands for extract, transform and load: the processes that enable companies to move data from multiple sources, reformat and cleanse it, and load it into another database, a data mart or a data warehouse for analysis, or onto another operational system to support a business process. Companies know they have valuable data lying around throughout their networks that needs to be moved from one place to another, such as from one business application to another or to a data warehouse for analysis. The only problem is that the data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats. For instance, a CRM (Customer Relationship Management) system may define a customer in one way, while a back-end accounting system may define the same customer differently. To solve the problem, companies use extract, transform and load (ETL) software, which includes reading data from its source, cleaning it up and formatting it uniformly, and then writing it to the target repository to be exploited. The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, an Excel spreadsheet, even a message queue. Extraction can be done via Java Database Connectivity, Microsoft Corporation's Open Database Connectivity technology, proprietary code, or by creating flat files. After extraction, the data is transformed, or modified, according to the specific business logic involved, so that it can be sent to the target repository. There are a variety of ways to perform the transformation, and the work involved varies. The data may require reformatting only, but most ETL operations also involve cleansing the data to remove duplicates and enforce consistency. Part of what the software does is examine individual data fields and apply rules to consistently convert the contents to the form required by the target repository or application. In addition, the ETL process could involve standardizing name and address fields, verifying telephone numbers, or expanding records with additional fields containing demographic information or data from other systems. The transformation occurs when the data from each source is mapped, cleansed and reconciled so it can all be tied together, with receivables tied to invoices and so on. After reconciliation, the data is transported and loaded into the data warehouse for analysis of things such as cycle times and total outstanding receivables. In the past, companies doing data warehousing projects often used homegrown code to support ETL processes. However, even those that had done successful implementations found that the source data file formats and the validation rules applying to the data evolved, requiring the ETL code to be modified and maintained, and companies encountered problems as they added systems and the amount of data in them grew. Lack of scalability has been a serious issue with homegrown ETL software. Providers of packaged ETL systems include Microsoft, which offers Data Transformation Services bundled with its SQL Server database; Oracle, which has embedded some ETL capabilities in its database; and IBM, which offers a DB2 Information Integrator component for its warehouse offerings. More than half of all development work for data warehousing projects is typically dedicated to the design and implementation of ETL processes. Poorly designed ETL processes are costly to maintain, change and update, so it is critical to make the right choices in terms of the technology and tools used to develop them.
   The basic steps for the development of the ETL life cycle are as follows:

     1. Cycle initiation.

     2. Build reference data.

     3. Extract (from sources).

     4. Validate.

     5. Transform (clean, apply business rules, check for data integrity, create aggregates).

     6. Stage (load into staging tables, if used).

     7. Audit reports (for example, on compliance with business rules; also, in case of failure, these help to diagnose and repair).

     8. Publish (to target tables).

     9. Archive.

     10. Clean up.

  1) Architecture of Connector

   The basic architecture of this connector is shown in Figure 1. The following operations will be performed by the connector:

     • Reading data from the source into the Quick ETL buffer.

     • Writing data to the target from the Quick ETL buffer.

     • Managing the metadata.

                Figure 1 - Architecture of Connector

   The various components of the connector are (a skeleton sketch is given after the list):

     • Reader.

     • Writer.

     • Client GUI.
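   A minimal Python sketch of how such a connector's reader and writer components could be organized is given below. The class and method names are illustrative assumptions, not the paper's implementation:

    # Hypothetical skeleton of the connector described above.
    class Connector:
        """Reads from a source into a buffer, writes the buffer to a target."""

        def __init__(self, source, target, buffer_size=1024):
            self.source = source          # iterator over source records
            self.target = target          # list standing in for the target system
            self.buffer_size = buffer_size
            self.metadata = {"rows_read": 0, "rows_written": 0}

        def read_chunk(self):
            """Reader: fill the Quick ETL buffer from the source."""
            buf = []
            for record in self.source:
                buf.append(record)
                if len(buf) == self.buffer_size:
                    break
            self.metadata["rows_read"] += len(buf)
            return buf

        def write_chunk(self, buf):
            """Writer: flush the Quick ETL buffer to the target."""
            self.target.extend(buf)
            self.metadata["rows_written"] += len(buf)

        def run(self):
            while True:
                buf = self.read_chunk()
                if not buf:
                    break
                self.write_chunk(buf)

    conn = Connector(source=iter(range(10)), target=[], buffer_size=4)
    conn.run()
    print(conn.metadata)  # {'rows_read': 10, 'rows_written': 10}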
                     III.   GENETIC ALGORITHMS

   Melanie, M., in her book "An Introduction to Genetic Algorithms", stated that, generally speaking, genetic algorithms are simulations of evolution, of whatever kind. In most cases, however, genetic algorithms are nothing other than probabilistic optimization methods based on the principles of evolution. She further suggested that if there is a solvable problem, a definition of an appropriate programming language, and a sufficiently large set of representative test examples (correct input-output pairs), a genetic algorithm is able to find a program which (approximately) solves the problem [12].

   Goldberg, D., in his book "Genetic Algorithms in Search, Optimization and Machine Learning", stated that crossover encourages information exchange among different individuals. It helps the propagation of useful genes in the population and the assembly of better individuals. In a lower-level evolutionary algorithm, crossover is implemented as a kind of cut-and-swap operator [6].

   In 2003 Ulrich Bodenhofer suggested that in solving problems with genetic algorithms the following steps can be taken [4]:

        Algorithm

        t := 0;
        Compute initial population B0;
        WHILE stopping condition not fulfilled DO
        BEGIN
            select individuals for reproduction;
            create offspring by crossing individuals;
            eventually mutate some individuals;
            compute new generation;
        END

   As is obvious from the above algorithm, the transition from one generation to the next consists of the following basic components:

   Selection: the mechanism for selecting individuals (strings) for reproduction according to their fitness (objective function value).

   Crossover: a method of merging the genetic information of two individuals; if the coding is chosen properly, two good parents produce good children.

   Mutation: in real evolution, the genetic material can be changed randomly by erroneous reproduction or other deformations of genes, for example, by gamma radiation. In genetic algorithms, mutation can be realized as a random deformation of the strings with a certain probability. The positive effect is the preservation of genetic diversity and, as a consequence, that local maxima can be avoided.
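   The loop above can be rendered directly in code. The following short, self-contained Python sketch follows the structure of Bodenhofer's algorithm under simple assumptions of our own: bit-string individuals, a toy fitness function that counts ones, and tournament selection:

    import random

    # Minimal genetic algorithm following the loop above (illustrative only).
    # Individuals are bit strings; fitness counts the number of ones.
    def fitness(ind):
        return sum(ind)

    def tournament(pop, k=2):
        """Selection: the fitter of k randomly sampled individuals."""
        return max(random.sample(pop, k), key=fitness)

    def crossover(a, b):
        """Crossover: cut both parents at one point and splice them."""
        point = random.randrange(1, len(a))
        return a[:point] + b[point:]

    def mutate(ind, p=0.01):
        """Mutation: flip each gene with a small probability."""
        return [1 - g if random.random() < p else g for g in ind]

    random.seed(0)
    pop = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]  # B0
    for t in range(50):                            # stopping condition
        pop = [mutate(crossover(tournament(pop), tournament(pop)))
               for _ in range(len(pop))]           # select, cross, mutate
    print(max(fitness(ind) for ind in pop))        # best fitness found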
   The following inferences about genetic algorithms can be drawn from the study and research carried out by Bodenhofer:

    1. Genetic algorithms manipulate coded versions of the problem parameters instead of the parameters themselves, i.e. the search space is S instead of X itself.

    2. Genetic algorithms use probabilistic transition operators, while conventional methods for continuous optimization apply deterministic transition operators. More specifically, the way a new generation is computed from the current one has some random components.

    3. Normal genetic algorithms do not use any auxiliary information about the objective function, such as derivatives. Therefore, they can be applied to any kind of continuous or discrete optimization problem. The only thing to be done is to specify a meaningful decoding function.

    4. While almost all conventional methods search from a single point, genetic algorithms always operate on a whole population of points (strings). This contributes much to the robustness of genetic algorithms: it improves the chance of reaching the global optimum and, conversely, reduces the risk of becoming trapped in a local stationary point.

   Banzhaf [1999] drew the inference that evolutionary algorithms have been shown to solve many real-world problems. They use population-based stochastic search strategies and are unlikely to be trapped in a poor local optimum. They make few assumptions about a problem domain, yet are capable of incorporating domain knowledge in the design of the chromosome representation and variation operators. They are particularly suited to large and complex problems where little prior knowledge is available [2].

   Materialized view selection based on multiple query processing plans is a hard combinatorial optimization problem. A good selection of materialized views can only be found by taking a holistic approach and considering the optimization of both global processing plans and materialized view selection. A two-level structure for materialized view selection should be followed in order to get a proper result; this has greatly facilitated the development of several hybrid algorithms. From this literature survey the inference is drawn that there are several hybrid heuristic and evolutionary algorithms.

   Pure evolutionary algorithms were found to be impractical due to their excessive computation time. Pure heuristic algorithms were unsatisfactory in terms of the quality of the solutions they found. Hybrid algorithms that combine the advantages of heuristic and evolutionary algorithms seem to perform best. The literature shows that applying an evolutionary algorithm to either global processing plan optimization or materialized view selection for a given global processing plan can reduce the total query and maintenance cost significantly. The literature survey further reveals that simply combining or merging optimal local processing plans will not produce an optimal global processing plan in most cases; finding an optimal global processing plan with optimal materialized views requires a two-level hierarchy. While the hybrid algorithms perform better than the heuristic algorithm in terms of cost savings, they often require longer computation time: where the heuristic algorithm typically took seconds to run, a hybrid algorithm typically took minutes, or even hours. Finding a suitable trade-off between computation time and cost saving is therefore the central objective.

   Once a data warehousing design is completed and implemented, it will be used frequently and may last for a long time. Hence it is very important to optimize the design as much as possible, even if this means a relatively long design time. To optimize the ETL process in the data warehouse, we set up the theoretical framework for the problem by modeling it as a state-space search problem, with each state representing a particular design of the workflow as a graph. Since the problem is modeled as a state-space search problem, it is mandatory to define transitions from one state to another.

        IV. EFFICIENCY IMPROVEMENT IN THE READER-WRITER PROBLEM

   The job of the reader is to get data from the source system and hand it to the transformation unit for processing. In the transformation stage this data is processed and finally written to the target system.

                Figure 2 - ETL workflow execution timing.

   The typical steps to read data from the source and transfer it to the transformation stage are as follows (a sketch of this single-buffer loop follows the list):

     •    Fetch data from the Oracle server into a local buffer.

     •    Transform the data type.

     •    Fill the target buffer.

     •    Flush the buffer.

     •    Repeat while data is available on the Oracle server.
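   The following Python fragment is a minimal sketch of that single-buffer loop; the fetch and transform functions are hypothetical stand-ins for the database and target operations, not the paper's code. Its strictly sequential structure is what leaves the CPU idle during fetches and the network idle during processing, as discussed next:

    # Hypothetical single-buffer read loop; each phase waits for the previous one.
    def fetch(source, n):
        """Stand-in for fetching up to n rows from the server (network-bound)."""
        return [source.pop(0) for _ in range(min(n, len(source)))]

    def transform(buf):
        """Stand-in for data type conversion (CPU-bound)."""
        return [str(x) for x in buf]

    source = list(range(10))
    target = []
    while True:
        buf = fetch(source, 4)         # CPU idle while fetching
        if not buf:
            break                      # repeat while data is available
        target.extend(transform(buf))  # network idle while transforming/flushing
    print(target)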
   The above mechanism for reading and processing data has the following issues:

     • When data is fetched from the Oracle server to the local buffer, the CPU is almost idle.

     • During data transformation the network bandwidth is not utilized.

     • During flushing of data the CPU is idle.

   The network bandwidth and CPU time can be used efficiently by using two buffers instead of one. The second buffer can be filled while the first buffer is being processed; after the processing of the first buffer, the second buffer is ready to process, so the waiting time is reduced. The steps are:

     • Fetch data into the first buffer.

     • While transforming and flushing, fetch data from the Oracle server into the second buffer.

     • Adjust the buffer size so that the time to fetch equals the time for transformation and flushing.

   Using two buffers, the following scenarios exist:

     • Fetch time is less than the sum of transformation time and loading time.

     • Fetch time is greater than the sum of transformation time and loading time.

     • Fetch time is the same as the sum of transformation time and loading time.

   For optimal utilization of the network and CPU time, the fetch time should equal the sum of the transformation time and loading time.

      Figure 3 - Scenario for Two Buffer Implementation of Reader

   To make the fetch time equal to the transformation time plus loading time, the buffer size is changed dynamically. In the proposed algorithm two threads are used, one for fetching and the other for transformation and loading. They work as follows (a runnable sketch of this scheme is given after the lists).

  Thread to fetch data:

            •     Check the status of buffer1.

            •     If empty, fetch data into buffer1 and change the status to filled.

            •     Update the time TF1 used to fill the buffer.

            •     If no more data is available, stop.

            •     Check the status of buffer2.

            •     If empty, fetch data into buffer2 and change the status to filled.

            •     Update the time TF2 used to fill the buffer.

            •     If no more data is available, stop.

  Thread for transformation and uploading:

            •     Check the status of buffer1.

            •     If filled, transform the data and flush it.

            •     Change the status to empty.

            •     Note the time for transformation and flushing: TP1 = TT1 + TU1.

            •     If TF1 < TP1, increase the buffer size.

            •     Check the status of buffer2.

            •     If filled, transform the data and flush it.

            •     Change the status to empty.

            •     Note the time for transformation and flushing: TP2 = TT2 + TU2.

            •     If TF2 > TP2, decrease the buffer size.

  Adjusting the buffer size (S is the buffer size, TF the fill time, TP the processing time, and Rnd a random value between 0 and 1):

        IF abs(TF - TP) > 2            // minimum time difference
            IF TF > TP
                ΔS = S * TF / TP * Rnd
                New buffer size = S - ΔS
            IF TF < TP
                ΔS = S * TP / TF * Rnd
                New buffer size = S + ΔS
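   A compact Python sketch of this two-thread, two-buffer scheme is given below. It is a simplified illustration under assumptions of our own: two reusable buffers circulating through queues stand in for buffer1/buffer2 and their status flags, time.monotonic() supplies the TF and TP measurements, and the fetch and process bodies are hypothetical stand-ins:

    import queue
    import threading
    import time

    # Double buffering: two buffers circulate between a fetch thread and a
    # process thread, so fetching (network) overlaps processing (CPU).
    empty = queue.Queue()   # buffers ready to be filled
    filled = queue.Queue()  # buffers ready to be processed
    for _ in range(2):      # buffer1 and buffer2
        empty.put([])

    source = list(range(20))
    BUF_SIZE = 4            # adjusted dynamically in the full algorithm

    def fetcher():
        while source:
            buf = empty.get()              # wait for an empty buffer
            t0 = time.monotonic()
            buf[:] = source[:BUF_SIZE]     # fetch: stand-in for a server read
            del source[:BUF_SIZE]
            tf = time.monotonic() - t0     # TF, the fill time
            filled.put(buf)                # status -> filled
        filled.put(None)                   # signal: no more data

    def processor():
        while True:
            buf = filled.get()             # wait for a filled buffer
            if buf is None:
                break
            t0 = time.monotonic()
            print([str(x) for x in buf])   # transform + flush: stand-in
            tp = time.monotonic() - t0     # TP = TT + TU
            buf.clear()
            empty.put(buf)                 # status -> empty

    t1 = threading.Thread(target=fetcher)
    t2 = threading.Thread(target=processor)
    t1.start(); t2.start(); t1.join(); t2.join()

   The dynamic resize rule given above would then compare tf and tp after each cycle and grow or shrink BUF_SIZE accordingly.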
                     V. WORKFLOW OPTIMIZATION

   To formulate ETL workflow optimization as a state-space search problem, the workflow is represented by a directed acyclic graph, where each node represents a transformation and each edge represents the flow of data. Consider an example: two different databases contain purchase information from different vendors, each holding date, amount and vendor id. Both data sets need to be merged into a single database, so a possible workflow is shown in Figure 4, where S is a source node, SK is surrogate key assignment, U is union, SL is a selection transformation and T is the target.
                  Figure 4 - Initial Workflow

  Solution Encoding

   The solution for the current problem is encoded as an array of nodes, each node representing a transformation, for example (1,2,6,7,3,4,5).

  Initial Population

   The initial population is generated by applying random transformations to the given workflow.

  Fitness Function

   The objective function of an individual is based on the total cost of executing the workflow. The cost of any transformation is defined by the chosen cost model. Since the problem is to minimize the cost, a fitness function is derived which needs to be maximized.
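   For example, the cost-to-fitness conversion can be as simple as the following sketch. The cost model here, summing a hypothetical per-node cost weighted by position, is an assumed placeholder; the paper does not fix a specific cost model:

    # Hypothetical fitness: minimizing cost == maximizing fitness.
    NODE_COST = {1: 5.0, 2: 3.0, 3: 4.0, 4: 2.0, 5: 1.0, 6: 6.0, 7: 2.5}

    def workflow_cost(encoding):
        """Assumed cost model: earlier nodes process more rows, so they
        are weighted more heavily. Purely illustrative."""
        n = len(encoding)
        return sum(NODE_COST[node] * (n - i) for i, node in enumerate(encoding))

    def fitness(encoding, c_max=500.0):
        """Turn a cost to be minimized into a fitness to be maximized."""
        return c_max - workflow_cost(encoding)

    print(fitness((1, 2, 6, 7, 3, 4, 5)))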
  Selection

   Tournament selection is used to generate the new population.

  Crossover

   Single-point crossover is implemented to test the results. In single-point crossover a random node is chosen, the parents are cut at the crossover point, and the pieces are joined to produce the new individuals. For example, consider the following two individuals (a code sketch follows the figures):

              Figure 5 - Selected first individual

              Figure 6 - Selected second individual

   After crossover the new individuals will be:

             Figure 7 - First individual after crossover

            Figure 8 - Second individual after crossover
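   In code, single-point crossover on the array-of-nodes encoding can be sketched as follows. This is illustrative only; real workflow crossover must also check that the offspring remain valid workflow DAGs, which this fragment does not do:

    import random

    def single_point_crossover(parent1, parent2):
        """Cut both parents at one random point and swap the tails."""
        point = random.randrange(1, len(parent1))  # crossover site
        child1 = parent1[:point] + parent2[point:]
        child2 = parent2[:point] + parent1[point:]
        return child1, child2

    random.seed(1)
    p1 = [1, 2, 6, 7, 3, 4, 5]
    p2 = [1, 2, 3, 4, 6, 7, 5]
    print(single_point_crossover(p1, p2))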




  Mutation

   Mutation with probability 1/3 is used. Mutation is applied as follows (see the sketch after this list):

     • Generate a random number.

     • If the random number is greater than the mutation probability, exit.

     • Find the list of transformations applicable to the node.

     • Select a random transformation.

     • Apply the transformation to the node.

   In the experiments the following transformations are supported:

     • Swap.

     • Factorize.

     • Distribute.
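   A sketch of this mutation step is given below. The applicable-transformations lookup and the transformation implementations are hypothetical stubs; in the real system they depend on the workflow semantics:

    import random

    MUTATION_PROBABILITY = 1 / 3

    def applicable_transformations(workflow, node):
        """Hypothetical stub: which of swap/factorize/distribute apply here."""
        return ["swap", "factorize", "distribute"]

    def apply_transformation(workflow, node, name):
        """Hypothetical stub: 'swap' exchanges the node with its successor."""
        if name == "swap" and node < len(workflow) - 1:
            w = list(workflow)
            w[node], w[node + 1] = w[node + 1], w[node]
            return w
        return list(workflow)  # factorize/distribute omitted in this sketch

    def mutate(workflow):
        if random.random() > MUTATION_PROBABILITY:
            return workflow                      # exit: no mutation this time
        node = random.randrange(len(workflow))   # select a random node
        name = random.choice(applicable_transformations(workflow, node))
        return apply_transformation(workflow, node, name)

    random.seed(2)
    print(mutate([1, 2, 6, 7, 3, 4, 5]))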
   The algorithm used to generate the optimal workflow is as follows:

     Initialize population from the given workflow by random mutation;
     Do
         Generate the cost of each workflow;
         Keep a log of the lowest-cost workflows;
         Determine which individuals should survive using the fitness function;
         Reproduce the survivors;
         Select parents randomly from the survivors for crossover;
         Select the crossover sites of the parents;
         Produce the next new generation of workflows;
         Mutate the new generation of workflows according to the mutation probability;
         If iteration limit exceeded
             output the optimal solutions
             exit
         endif
     loop
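   Tying the pieces together, a condensed Python rendering of this driver loop might look as follows. Illustrative assumptions of our own apply throughout: the toy cost model from the fitness sketch, swap-only mutation, a fixed iteration limit standing in for the stopping condition, and, as in the crossover sketch above, no offspring validity checks:

    import random

    NODE_COST = {1: 5.0, 2: 3.0, 3: 4.0, 4: 2.0, 5: 1.0, 6: 6.0, 7: 2.5}

    def cost(w):
        n = len(w)
        return sum(NODE_COST[x] * (n - i) for i, x in enumerate(w))

    def fitness(w):
        return 500.0 - cost(w)  # minimize cost == maximize fitness

    def tournament(pop):
        return max(random.sample(pop, 2), key=fitness)

    def crossover(a, b):
        p = random.randrange(1, len(a))
        return a[:p] + b[p:]

    def mutate(w, prob=0.9):
        w = list(w)
        if random.random() < prob:           # mutation probability
            i = random.randrange(len(w) - 1)
            w[i], w[i + 1] = w[i + 1], w[i]  # swap transition only
        return w

    random.seed(4)
    base = [1, 2, 6, 7, 3, 4, 5]
    pop = [mutate(base, prob=1.0) for _ in range(200)]  # initial population
    best_log = []                                       # log lowest-cost workflows
    for generation in range(10):                        # iteration limit
        best_log.append(min(pop, key=cost))
        pop = [mutate(crossover(tournament(pop), tournament(pop)))
               for _ in range(len(pop))]
    print(cost(best_log[-1]), best_log[-1])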
  A) Generating Initial Population

   The initial population is generated from the given ETL workflow by randomly applying transitions to it. The pseudocode for initial population generation is (a Python sketch follows):

     While not pop_size
         Copy the given workflow to create a new individual.
         Select a random node.
         Select a random transition.
         Apply the transition to the selected node.

   During plain sequential processing the network and CPU are not fully utilized; the system is made more efficient by using more than one buffer, since data can be fetched into another buffer during processing time. Workflow optimization is formulated as a state-space search problem, and the state-space search is implemented using a genetic algorithm.
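   A minimal Python rendering of that pseudocode is given below; the random_transition helper is a hypothetical stand-in for applying one of the supported transitions:

    import random

    def random_transition(individual, node):
        """Hypothetical stand-in for applying a random transition (here: swap
        the selected node with its successor, wrapping at the end)."""
        w = list(individual)
        j = (node + 1) % len(w)
        w[node], w[j] = w[j], w[node]
        return w

    def generate_initial_population(workflow, pop_size):
        """Create pop_size individuals, each via one random transition."""
        population = []
        while len(population) < pop_size:             # While not pop_size
            individual = list(workflow)               # copy the given workflow
            node = random.randrange(len(individual))  # select a random node
            population.append(random_transition(individual, node))
        return population

    random.seed(3)
    print(generate_initial_population([1, 2, 6, 7, 3, 4, 5], pop_size=3))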
  B) Workflow Optimization

   The following experiments have been conducted for the workflow:

  Experiment 1:

     Initial Population: 200
     Crossover: Simple
     Selection: Tournament Selection
     Crossover Probability: 0.1
     Mutation Probability: 0.9

   Below is the average of 100 runs over 10 generations.

     Table 1: Generation No. vs Average Fitness

     Generation No.    Average Fitness
            0             396.9521
            1             402.2411
            2             408.1349
            3             414.5278
            4             420.2802
            5             425.2696
            6             430.4964
            7             434.1143
            8             440.2575
            9             444.36096

           Figure 9 - Average Fitness vs Generation No.

  Experiment 2:

     Initial Population: 200
     Crossover: Simple
     Selection: Tournament Selection
     Crossover Probability: 0.9
     Mutation Probability: 0.1
   Below is the average of 100 runs over 10 generations.

     Table 2: Generation No. vs Average Fitness

     Generation No.    Average Fitness
            0             397.9563
            1             403.0553
            2             409.2708
            3             416.0089
            4             420.0008
            5             426.4026
            6             431.6523
            7             437.5481
            8             443.3943
            9             447.2889

          Figure 10 - Average Fitness vs Generation No.

  Comparative Study

   Figure 11 - Average Fitness vs Generation No. for different crossover probabilities

  Experiment 3:

     Random Initial Solution: 200

   Below is the average of 100 runs over 10 generations.

     Table 3: Generation No. vs Average Fitness

     Generation No.    Average Fitness
            0             397.9763
            1             403.0553
            2             397.9763
            3             416.0089
            4             397.9763
            5             397.9763
            6             431.6523
            7             397.9763
            8             416.0089
            9             403.0553
   Figure 12 - Average Fitness vs Generation No. for random workflow generation

  Comparison of GA vs Random

          Figure 13 - Comparison of random workflow vs GA

  Results:

   It is clear from the above graphs that as the generation number increases, the average fitness value also increases.

  Limitations of the System

   This approach is used to generate the optimal workflow as a search-space solution. The technique does not consider an analytical analysis of the problem: fitness is proportional to the total cost but does not include an analysis of the relations between operators. From the results it is clear that the genetic algorithm outperforms random workflow optimization. For the workflow optimization this study has used only the swap, factorize and distribute transitions. It is further observed that as the generations of the genetic algorithm increase, the optimization of the ETL system increases accordingly. However, the above conclusions relate only to the comparison of random numbers and genetic algorithms.

                       VI. CONCLUSION

   A good selection of materialized views can only be found by taking a holistic approach and considering the optimization of both global processing plans and materialized view selection. Pure evolutionary algorithms were found to be impractical due to their excessive computation time. Pure heuristic algorithms were unsatisfactory in terms of the quality of the solutions they found. Hybrid algorithms that combine the advantages of heuristic and evolutionary algorithms seem to perform best.

   To optimize the ETL process in the data warehouse, we set up the theoretical framework for the problem by modeling it as a state-space search problem, with each state representing a particular design of the workflow as a graph. Since the problem is modeled as a state-space search problem, it is mandatory to define transitions from one state to another. For the workflow optimization this study has used only the swap, factorize and distribute transitions. It is concluded that the results generated by optimizing the data warehouse using random numbers are less optimal than those obtained using genetic algorithms. It is further observed that as the generations of the genetic algorithm increase, the optimization of the ETL system increases accordingly. However, these conclusions relate only to the comparison of random numbers and genetic algorithms.

   The efficiency of the reader is improved by the use of two buffers. The buffer size depends on factors such as network bandwidth, memory size and CPU speed, and there is a need to further investigate the relation between these parameters and the optimal number of buffers which can improve the performance. For the optimization of the workflow, new workflows have been generated by the genetic algorithm. The cost of the newly generated workflows produced by the genetic algorithm is less than that of those produced randomly, because the genetic algorithm explores the search space very quickly with respect to the desired objective function.

                       VII. FUTURE SCOPE

   There is a need to further investigate all the possible transitions and their applicability conditions. It has not been derived up to what stage the optimization keeps increasing monotonically using genetic algorithms, or whether it decreases after some time depending upon the complexity of the Data Warehouse (DW). It can be a topic of further research to determine up to which stage the optimality of the ETL system increases with the use of genetic algorithms.
                            ACKNOWLEDGMENT

   I (Raman Kumar) am deeply indebted to my beloved master, my supervisors, my parents and my research laboratory, whose help, stimulating suggestions and encouragement helped me at all times during the research for and writing of this paper. The authors also wish to thank the many anonymous referees for their suggestions to improve this paper.

                               REFERENCES

[1]  Abadi, D., Carney, D. and Çetintemel, U., "A new model and architecture for data stream management", The VLDB Journal, vol. 12, no. 2, 2003, pp. 120-139.
[2]  Banzhaf, W., "Genetic and Evolutionary Computation Conference (GECCO)", Proceedings of the 1999 conference, San Francisco, CA: Morgan Kaufmann, July 1999.
[3]  Baralis, E., Paraboschi, S. and Teniente, E., "Materialized view selection in a multidimensional database", in Proc. 23rd Int. Conf. Very Large Data Bases (VLDB), 1997, pp. 156-165.
[4]  Bodenhofer, U., "Genetic Algorithms: Theory and Applications", Fuzzy Logic Lab Linz, 2003.
[5]  Galhardas, H., Florescu, D., Shasha, D. and Simon, E., "Ajax: An Extensible Data Cleaning Tool", SIGMOD'00, Texas, 2000, pp. 590.
[6]  Goldberg, D., "Genetic Algorithms in Search, Optimization and Machine Learning", Reading, MA: Addison-Wesley, 1989.
[7]  Gupta, H., "Index selection for OLAP", in Proc. Int. Conf. Data Eng. (ICDE), 1997, pp. 208-219.
[8]  Gupta, H. and Mumick, I., "Selection of views to materialize under a maintenance cost constraint", in Proc. Int. Conf. Database Theory (ICDT), 1999, pp. 453-470.
[9]  Ho, A. and Lumpkin, G., "The genetic query optimizer", in Genetic Algorithms at Stanford, J. R. Koza, Ed. Stanford, CA: Stanford Univ., 1994, pp. 67-76.
[10] http://www.dbnet.ece.ntua.gr/~asimi/publications/SiVS04.pdf.
[11] Ioannidis, Y., "Query optimization", ACM Comput. Surv., vol. 28, no. 1, Mar. 1996, pp. 121-123.
[12] Melanie, M., "An Introduction to Genetic Algorithms", MIT Press, 1998.
[13] Jennings, N. R. and Wooldridge, M. J., "Agent Technology: Foundations, Applications, and Markets", Springer-Verlag, 1998.
[14] Majuca, R. P., Yurcik, W. and Kesan, J. P., "The Evolution of Cyberinsurance", tech. report cs.CR/0601020, ACM Computing Research Repository, Jan. 2006.
[15] Rahm, E. and Do, H., "Data Cleaning: Problems and Current Approaches", Bulletin of the Technical Committee on Data Engineering, 23(4), 2000.
[16] Ross, K., Srivastava, D. and Sudarshan, S., "Materialized view maintenance and integrity constraint checking: Trading space for time", in Proc. ACM SIGMOD International Conference on Management of Data, 1996, pp. 447-458.
[17] Baer, S., "Rewarding IT Security in the Marketplace", Proc. 31st Research Conf. Comm., Information, and Internet Policy, TPRC, 2003; www.tprc.org/papers/2003/190/BaerITSecurity.pdf.
[18] Sellis, T., "Multiple-query optimization", ACM Trans. Database Syst., vol. 13, no. 1, Mar. 1988, pp. 23-52.
[19] Simitsis, A., Vassiliadis, P. and Sellis, T., "State-Space Optimization of ETL Workflows", IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 10, Oct. 2005.
[20] Simitsis, A., Vassiliadis, P. and Sellis, T., "Optimizing ETL Processes in Data Warehouse Environments (long version)". Available at
[21] Stillger, M. and Spiliopoulou, M., "Genetic programming in database query optimization", in Proc. First Annual Conference on Genetic Programming, Stanford, CA, July 1996.
[22] Baer, W. and Parkinson, A., "Cyberinsurance in IT Management", IEEE Security & Privacy, vol. 5, no. 3, 2007, pp. 50-56.
[23] Widom, J., "Research problems in data warehousing", in Proc. 4th International Conference on Information and Knowledge Management (CIKM), 1995, pp. 25-30.
[24] Zhang, C. and Yang, J., "Genetic algorithm for materialized view selection in data warehouse environments", in Proc. First Int. Conf. Data Warehousing and Knowledge Discovery, Lecture Notes in Computer Science, Florence, Italy, 1999.
[25] Zhang, C., Yao, X. and Yang, J., "An evolutionary approach to materialized views selection in a data warehouse environment", IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 31, no. 3, August 2001, pp. 282-293.

                            AUTHOR'S PROFILE

   Mr. Raman Kumar (er.ramankumar@aol.in) is working as a Lecturer with the Department of Computer Science and Engineering, D A V Institute of Engineering and Technology, Jalandhar. Before joining D A V Institute of Engineering and Technology, Jalandhar, he did his Bachelor of Technology with honours in Computer Science and Engineering from Guru Nanak Dev University, Amritsar (a 5-star NAAC university), and his Master of Technology with honours in Computer Science and Engineering from the same university. His major areas of research are cryptography, security engineering and information security. He has various publications in national as well as international conferences and journals in his research areas.