					     IJCSIS Vol. 9 No. 6, June 2011
           ISSN 1947-5500




International Journal of
    Computer Science
      & Information Security




    © IJCSIS PUBLICATION 2011
                               Editorial
                     Message from Managing Editor
Journal of Computer Science and Information Security (IJCSIS ISSN 1947-5500) is an open
access, international, peer-reviewed, scholarly journal with the focused aim of promoting and
publishing original, high-quality research on theoretical and scientific aspects of all
disciplines of computing and information security. The journal is published monthly, and articles
are accepted for review on a continual basis. Papers that provide both theoretical analysis
and carefully designed computational experiments are particularly welcome.

The IJCSIS editorial board consists of several internationally recognized experts and guest editors.
Wide circulation is assured because libraries and individuals worldwide subscribe to and reference
IJCSIS. The journal has grown rapidly to its current level of over 1,100 articles published and
indexed, with distribution to librarians, universities, research centers, researchers in computing,
and computer scientists.

Other fields covered include: security infrastructures; network security and Internet security; content
protection; cryptography, steganography and formal methods in information security; multimedia
systems; software; information systems; intelligent systems; web services; data mining; wireless
communication; networking and technologies; and innovation technology and management. (See the
monthly Call for Papers.)

IJCSIS is published using an open access publication model, meaning that all interested readers
can freely access the journal online without a subscription. We wish to make IJCSIS a first-tier
journal in the computer science field, with a strong impact factor.

On behalf of the Editorial Board and the IJCSIS members, we would like to express our gratitude
to all authors and reviewers for their sustained support. The acceptance rate for this issue is 35%.
I am confident that the readers of this journal will explore new avenues of research and academic
excellence.



Available at http://sites.google.com/site/ijcsis/
IJCSIS Vol. 9, No. 6, June 2011 Edition
ISSN 1947-5500 © IJCSIS, USA.


Journal Indexed by (among others):
                 IJCSIS EDITORIAL BOARD


Dr. M. Emre Celebi,
Assistant Professor, Department of Computer Science, Louisiana State University
in Shreveport, USA

Dr. Yong Li
School of Electronic and Information Engineering, Beijing Jiaotong University,
P. R. China

Prof. Hamid Reza Naji
Department of Computer Engineering, Shahid Beheshti University, Tehran, Iran

Dr. Sanjay Jasola
Professor and Dean, School of Information and Communication Technology,
Gautam Buddha University

Dr Riktesh Srivastava
Assistant Professor, Information Systems, Skyline University College, University
City of Sharjah, Sharjah, PO 1797, UAE

Dr. Siddhivinayak Kulkarni
University of Ballarat, Ballarat, Victoria, Australia

Professor (Dr) Mokhtar Beldjehem
Sainte-Anne University, Halifax, NS, Canada

Dr. Alex Pappachen James, (Research Fellow)
Queensland Micro-nanotechnology center, Griffith University, Australia

Dr. T.C. Manjunath,
ATRIA Institute of Tech, India.
                                      TABLE OF CONTENTS


1. Paper 19051107: Additive Model of Reliability of Biometric Systems with Exponential Distribution of
Failure Probability (pp. 1-4)

Zoran Ćosić, Director, Statheros d.o.o., Kaštel Stari, Croatia
Jasmin Ćosić, IT Section of Police Administration, Ministry of Interior of Una-sana canton, Bihać, Bosnia and
Hercegovina
Miroslav Bača, Professor, Faculty of Organisational and Informational science, Varaždin, Croatia


2. Paper 26051129: IP Private Branch eXchange of Saint Joseph University, Macao: A Design Case Study
(pp. 5-11)

A. Cotão, R. Whitfield, J. Negreiros
Information Technology Department, University of Saint Joseph, Macau, China


3. Paper 30051137: A Novel and Secure Data Sharing Model with Full Owner Control in the Cloud
Environment (pp. 12-17)

Mohamed Meky and Amjad Ali
Center of Security Studies, University of Maryland University College, Adelphi, Maryland, USA


4. Paper 17051101: Performances Evaluation of Inter-System Handover between IEEE802.16e and
IEEE802.11 Networks (pp. 18-24)

Abderrezak Djemai, Mourad Hadjila, Mohammed Feham
STIC laboratory, University of Tlemcen, Algeria


5. Paper 24051124: Recognizing the Electronic Medical Record Data from Unstructured Medical Data Using
Visual Text Mining Techniques (pp. 25-35)

Prof. Hussain Bushinak, Faculty of Medicine, Ain Shams University, Cairo, Egypt
Dr. Sayed AbdelGaber, Faculty of Computers and Information, Helwan University, Cairo, Egypt
Mr. Fahad Kamal AlSharif, Collage of Computer Science, Modern Academy, Cairo, Egypt


6. Paper 31051149: Creating an Appropriate Programming Language for Student Compiler Project (pp. 36-
39)

Elinda Kajo Mece, Department of Informatics Engineering, Polytechnic University of Tirana, Tirana, Albania


7. Paper 31051152: The History of Web Application Security Risks (pp. 40-47)

Fahad Alanazi, Software Technology Research Laboratory, De Montfort University, Leicester, LE1 9BH UK
Mohamed Sarrab, Software Technology Research Laboratory, De Montfort University, Leicester, LE1 9BH UK
8. Paper 31051156: Improving the Performance of Translation Wavelet Transform using BMICA (pp. 48-56)

Janett Walters-Williams, School of Computing & Information Technology, University of Technology, Jamaica,
Kingston 6, Jamaica W.I.
Yan Li, Department of Mathematics & Computing, Centre for Systems Biology, University of Southern Queensland,
Toowoomba, Australia


9. Paper 31051157: Hole Filing IFCNN Simulation by Parallel RK(5,6) Techniques (pp. 57-64)

S. Senthilkumar and Abd Rahni Mt Piah,
Universiti Sains Malaysia, School of Mathematical Sciences, Pulau Pinang-11800, Penang, Malaysia


10. Paper 31051160: Location Estimation and Mobility Prediction Using Neuro-fuzzy Networks In Cellular
Networks (pp. 65-69)

Maryam Borna, Department of Electrical Engineering, Iran University of Science and Technology, Tehran, Iran
Mohammad Soleimani, Department of Electrical Engineering, Iran University of Science and Technology, Tehran,
Iran


11. Paper 31051161: A Fuzzy Clustering Based Approach for Mining Usage Profiles from Web Log Data (pp.
70-79)

Zahid Ansari (1), Mohammad Fazle Azeem (2), A. Vinaya Babu (3) and Waseem Ahmed (4)
(1, 4) Dept. of Computer Science Engineering, P.A. College of Engineering, Mangalore, India
(2) Dept. of Electronics and Communication Engineering, P.A. College of Engineering, Mangalore, India
(3) Dept. of Computer Science Engineering, Jawaharlal Nehru Technological University, Hyderabad, India


12. Paper 31051168: Inception of Hybrid Wavelet Transform using Two Orthogonal Transforms and It’s use
for Image Compression (pp. 80-87)

Dr. H.B. Kekre, Senior Professor, Computer Engineering Department, SVKM’s NMIMS (Deemed-to-be University),
Vile Parle(W), Mumbai, India.
Dr. Tanuja K. Sarode, Assistant Professor, Computer Engineering Department, Thadomal Shahani Engineering
College, Bandra(W), Mumbai, India.
Sudeep D. Thepade, Associate Professor, Computer Engineering Department, SVKM’s NMIMS (Deemed-to-be
University), Vile Parle(W), Mumbai, India


13. Paper 31051177: A Model for the Controlled Development of Software Complexity Impacts (pp. 88-93)

Ghazal Keshavarz, Computer department, Science and Research Branch, Islamic Azad University, Tehran, Iran
Nasser Modiri, Computer department, Islamic Azad University, Zanjan, Iran
Mirmohsen Pedram, Computer department, Tarbiat Mollem University, Karaj, Iran


14. Paper 31051178: A Hierarchical Overlay Design for Peer to Peer and SIP Integration (pp. 94-99)

Md. Safiqul Islam & Syed Ashiqur Rahman, Computer Science and Engineering Department, Daffodil International
University, Dhaka, Bangladesh
Rezwan Ahmed, American International University – Bangladesh, Dhaka, Bangladesh
Mahmudul Hasan, Computer Science and Engineering Department, Daffodil International University, Dhaka,
Bangladesh
15. Paper 31051199: Evaluation of CPU Consuming, Memory Utilization and Time Transfering Between
Virtual Machines in Network by using HTTP and FTP techniques (pp. 100-105)

Igli TAFA, Elinda KAJO, Elma ZANAJ, Ariana BEJLERI, Aleksandër XHUVANI
Polytechnic University of Tirana, Information Technology Faculty, Computer Engineering Department, Tiranë,
Albania


16. Paper 17051103: A Proposal for Common Vulnerability Classification Scheme Based on Analysis of
Taxonomic Features in Vulnerability Databases (pp. 106-111)

Anshu Tripathi, Department of Information Technology, Mahakal Institute of Technology, Ujjain, India
Umesh Kumar Singh, Institute of Computer Science, Vikram University, Ujjain, India


17. Paper 19051108: Abrupt Change Detection of Fault in Power System Using Independent Component
Analysis (pp. 112-118)

Satyabrata Das, Asstt Prof., Department of CSE, College of Engineering Bhubaneswar, Orissa, India-751024
Soumya Ranjan Mohanty, Asstt Prof., Department of EE, Motilal Neheru National Institute of Technology,
Allahabad, India-211004
Sabyasachi Pattnaik, Prof., Department of I&CT, Fakir Mohan University, Balasore, India -756019


18. Paper 19051113: Modeling and Analyze the Deep Web: Surfacing Hidden Value (pp. 119-124)

Suneet Kumar, Associate Professor; Computer Science Dept., Dehradun Institute of Technology, Dehradun, India
Anuj Kumar Yadav, Assistant Professor; Computer Science Dept., Dehradun Institute of Technology, Dehradun,
India
Rakesh Bharati, Assistant Professor; Computer Science Dept., Dehradun Institute of Technology, Dehradun, India
Rani Choudhary, Sr. Lecturer; Computer Science Dept., BBDIT, Ghaziabad, India


19. Paper 24051122: Instigation of Orthogonal Wavelet Transforms using Walsh, Cosine, Hartley, Kekre
Transforms and their use in Image Compression (pp. 125-133)

Dr. H. B.Kekre, Sr. Professor, MPSTME, SVKM’s, NMIMS (Deemed-to-be University, Vileparle(W), Mumbai-56,
India.
Dr. Tanuja K. Sarode, Asst. Professor, Thadomal Shahani Engg. College, Bandra (W), Mumbai-50, India.
Sudeep D. Thepade, Associate Professor, MPSTME, SVKM’s, NMIMS (Deemed-to-be University, Vileparle(W),
Mumbai-56, India.
Ms. Sonal Shroff, Lecturer, Thadomal Shahani Engg. College Bandra (W), Mumbai-50, India


20. Paper 25051125: Analysing Assorted Window Sizes with LBG and KPE Codebook Generation
Techniques for Grayscale Image Colorization (pp. 134-138)

Dr. H. B. Kekre, Sr. Professor, MPSTME, SVKM’s, NMIMS (Deemed-to-be University, Vileparle(W), Mumbai-56,
India.
Dr. Tanuja K. Sarode, Asst. Professor, Thadomal Shahani Engg. College, Bandra (W), Mumbai-50, India.
Sudeep D. Thepade, Associate Professor, MPSTME, SVKM’s, NMIMS (Deemed-to-be University, Vileparle(W),
Mumbai-56, India.
Ms. Supriya Kamoji, Sr.Lecturer, Fr.Conceicao Rodrigues College of Engg, Bandra (W), Mumbai-50, India
21. Paper 27051132: Evolving Fuzzy Classification Systems from Numerical Data (pp. 139-147)

Pardeep Sandhu, Maharishi Markandeshwar University, Mullana, Haryana, India
Shakti Kumar, Institute of Science and Technology, Klawad, Haryana, India
Himanshu Sharma, Maharishi Markandeshwar University, Mullana, Haryana, India
Parvinder Bhalla, Institute of Science and Technology, Klawad, Haryana, India


22. Paper 27051134: A Low-Power CMOS Implementation of a Cellular Neural Network for Connected
Component Detection (pp. 148-152)

S. El-Din, A.K. Abol El-Seoud, and A. El-Fahar
Electrical Engineering Department, University of Alexandria, Alex, Egypt.
M. El-Sayed Ragab, School of Electronics, Comm. and Computer Eng., E-JUST. , Alexandria, Egypt.


23. Paper 30041170: An Integrated Framework for Content Based Image Retrieval (pp. 153-157)

Ritika Hirwane, SOIT, RGPV, Bhopal
Prof. Nishchol Mishra, SOIT, RGPV, Bhopal


24. Paper 30051135: A Novel Approach for Intranet Mailing For Providing User Authentication (pp. 158-163)

ASN Chakravarthy †, Sri Sai Aditya Institute of Science & Technology, Suram Palem,E.G.Dist , Andhra Pradesh,
India
A.S.S.D. Toyaza ††, Sri Sai Aditya Institute of Science & Technology, Suram Palem,E.G.Dist , Andhra Pradesh,
India


25. Paper 31011189: Visualization of Fluid Flow Patterns in Horizontal Circular Pipe Ducts (pp. 164-170)

Olagunju, Mukaila, Department of Computer Science, Kwara State Polytechnic, Ilorin, Nigeria
Taiwo, O. A (Ph.D), Department of Mathematics, University of Ilorin, Nigeria.


26. Paper 31031183: Iris Image Pre-Processing and Minutiae Points Extraction (pp. 171-174)

Archana R. C., J. Naveenkumar, Prof. Dr. Suhas.H.Patil
Computer Engineering, VDUCOE, Pune, Maharashtra, India


27. Paper 31051142: Establishing Relationships among Chidamber And Kemerer’s Suite of Metrics using
Field Experiments (pp. 175-182)

Ezekiel U Okike, Department of computer Science, University of Ibadan, Ibadan, Nigeria
Adenike O. Osofisan, Department of computer Science, University of Ibadan, Ibadan, Nigeria


28. Paper 31051143: Risk Assessment of Authentication Protocol: Kerberos (pp. 183-187)

Pathan Mohd. Shafi, Smtg Kashibai Navale College of Engineering,Pune
Dr Abdul sattar, Royal Institute of Technology and Science R. R. Dist.
Dr. P. chenna Reddy, JNTU College of Engineering, Pulivendula.
29. Paper 31051144: TOTN: Development of A Tourism – Specific Ontology For Information Retrieval In
Tamilnadu Tourism (pp. 188-193)

K. R. Ananthapadmanaban, Research Scholar, Sri Chandrasekarendra SaraswathiViswa Mahavidyalaya University,
Enathur, Kanchipuram-631 561
Dr. S. K. Srivatsa, Senior Professor, St.Joseph‘s College of Engg., Jeppiaar Nagar, Chennai-600 064


30. Paper 31051145: Unified Fast Algorithm for Most Commonly used Transforms using Mixed Radix and
Kronecker Product (pp. 194-202)

Dr. H.B. Kekre, Senior Professor, Department of Computer Science, Mukesh Patel School of Technology
Management and Engineering, Mumbai, India
Dr. Tanuja Sarode, Associate Professor, Department of Computer Science, Thadomal Shahani College of
Engineering, Mumbai, India
Rekha Vig, Asst. Prof. and Research Scholar, Dept. of Elec. and Telecom., Mukesh Patel School of Technology
Management and Engineering, Mumbai, India


31. Paper 31051147: A Framework for Identifying Software Vulnerabilities within SDLC Phases (pp. 203-
207)

Zeinab Moghbel, Department of Computer Engineering, Science and Research Branch, Islamic Azad University,
Tehran, Iran
Nasser Modiri, Department of Computer Engineering, Zanjan Branch, Islamic Azad University, Zanjan, Iran


32. Paper 31051148: Text Clustering Based on Frequent Items Using Zoning and Ranking (pp. 208-214)

S. Suneetha, Dr. M. Usha Rani, Department of Computer Science, SPMVV, Tirupati
Yaswanth Kumar.Avulapati, Dept of Computer Science, S.V.University, Tirupati


33. Paper 31051162: Steganography based on Contourlet Transform (pp. 215-220)

Sushil Kumar, Department of Mathematics, Rajdhani college, University of Delhi, New Delhi, India
S.K. Muttoo, Department of Computer Science, University of Delhi, Delhi, India


34. Paper 31051164: A Comparative Study of Proposed Improved PSO Algorithm with Proposed Hybrid
Algorithm for Multiprocessor Job Scheduling (pp. 221-228)

K. Thanushkodi, Akshaya College of Engineering and Technology Coimbatore, India
K. Deeba, Department of Computer Science and Engineering, Kalaignar Karunanidhi Institute of Technology,
Coimbatore, India


35. Paper 31051165: SCAM – Software Component Assessment Model (pp. 229-234)

Hasan Tahir, Aasia Khannum, Ruhma Tahir
Department of Computer Engineering, College of Electrical & Mechanical Engineering, National University of
Sciences and Technology (NUST), Islamabad, Pakistan
36. Paper 31051170: Selected Problems on Mobile Agent Communication (pp. 235-239)

Yinka A. Adekunle and Sola S. Maitanmi
Department of Computer Science & Mathematics, Babcock University, Ilisan Remo, Ogun State, Nigeria.


37. Paper 31051175: KGS Based Control for Parking Brake Cable Manufacturing System (pp. 240-248)

Geeta Khare, S.S.J.P, Asangaon
Dr. R.S. Prasad, R.R.I.M.T., Lucknow


38. Paper 31051183: Performance Appraise of Assorted Thresholding Methods in CBIR using Block
Truncation Coding (pp. 249-255)

Dr. H.B. Kekre, Sudeep D. Thepade, Shrikant Sanas
Computer Engineering Department, MPSTME, SVKM’s NMIMS (Deemed-to-be University), Mumbai, India


39. Paper 31051193: Performance Analysis of Cryptographic Algorithms Like ElGamal, RSA, and ECC for
Routing Protocols in Distributed Sensor Networks (pp. 256-263)

Suresha , Department of CSE, Reva Institute of Technology and Management, Bangalore ,Karnataka, India
Dr.Nalini.N , Prof. and Head, Department of CSE, Nitte Meenakshi Institute of Technology, Bangalore, Karnataka,
India


40. Paper 310511100: Exaggerate Self Quotient Image Model For Face Recognition Enlist Subspace Method
(pp. 264-269)

S. Muruganantham, Assistant professor, S.T.Hindu college, Nagercoil, India -629003
T. Jebarajan, Principal, Kings College of Engineering, Chennai, India - 600105


41. Paper 10000000: PHCC: Predictive Hop-by-Hop Congestion Control Protocol for Wireless Sensor
Networks (pp. 270-274)

Shahram Babaie, Eslam Mohammadi, Saeed Rasouli Heikalabad, Hossein Rasouli
Technical and Engineering Dept., Tabriz Branch, Islamic Azad University, Tabriz, Iran


42. Paper 31051158: Secured Right Angled or Ant Search Protocol for Reducing Congestion Effects and
Detecting Malicious Node in Mobile Ad hoc Networks by Multipath Routing (pp. 275-283)

Lt. Dr. S Santhosh Baboo, P.G. Research Dept of Com. Science, D G Vaishnav College, Arumbakkam,
Chennai – 106
V J Chakravarthy, Research Scholar, Dravidian University
43. Paper 31051190: Side Lobe Reduction Of A Planar Array Antenna By Complex Weight Control Using
SQP Algorithm And Tchebychev Method (pp. 284-289)

A. Hammami, R. Ghayoula, and A. Gharsallah
Research unit: HF electronic circuits and systems (Circuits et systèmes électroniques HF), Faculté des
Sciences de Tunis, Campus Universitaire Tunis El-Manar, 2092, Tunisia


44. Paper 22051119: Image Compression Algorithm- A Review (pp. 290-295)

Marcus Karnan, Tamilnadu College of Engineering, Coimbatore, India
M.S.Tamilselvi, Research Scholar, Dept of Computer Science & Engg., ANNA University, Coimbatore, India


45. Paper 31051189: Compensation of Nonlinear Distortion in OFDM Systems Using an Efficient Evaluation
Technique (pp. 296-299)

Dr. (Mrs.). R. Sukanesh, Professor / Department of ECE / TCE, Madurai – 15, India.
R. Sundaraguru, Research Scholar, Anna University, Chennai-25, India.


46. Paper 31051191: Performance Prediction of Single Static String Algorithms on Cluster Configurations
(pp. 300-306)

Prasad J. C., Research Scholar, Dept. of CSE, Dr.MG.R University, Chennai
cum Asst. Professor, Dept of CSE, FISAT, Angamaly, India
K. S. M. Panicker, Professor, Dept of CSE, Federal Institute of Science and Technology [FISAT], Angamaly, India


47. Paper 31051181: Explicit Solution of Hyperbolic Partial Differential Equations by an Iterative
Decomposition Method (pp. 307-309)

Adekunle, Y.A., Department of Computer Science and Mathematics, Babcock University, Ilisan-Remo Ogun State,
Nigeria
Kadiri, K.O., Department Electrical/Electronics Engineering, Federal Polytechnic, Offa, Kwara State, Nigeria
Odetunde, O.S., Department of Mathematical Science, Olabisi Onabanjo University, Ago-Iwoye, Ogun State,Nigeria


48. Paper 31051182: Numerical Approximation of Generalised Riccati Equations By Iterative Decomposition
Method (pp. 310-312)

Adekunle, Y.A., Department of Computer Science and Mathematics, Babcock University, Ilisan-Remo Ogun State,
Nigeria
Kadiri, K.O., Department Electrical/Electronics Engineering, Federal Polytechnic, Offa, Kwara State, Nigeria
Odetunde, O.S., Department of Mathematical Science, Olabisi Onabanjo University, Ago-Iwoye, Ogun State,Nigeria


49. Paper 26051128: Automated Face Detection and Feature Extraction Using Color FERET Image Database
(pp. 313-318)

Dewi Agushinta R. (1), Fitria Handayani S. (2)
(1) Information System, (2) Informatics
Gunadarma University, Jl. Margonda Raya 100 Pondok Cina, Depok 16424, Indonesia
50. Paper 31051186: Simplified Neural Network Design for Hand Written Digit Recognition (pp. 319-322)

Muhammad Zubair Asghar (1), Hussain Ahmad (1), Shakeel Ahmad (1), Sheikh Muhammad Saqib (1), Bashir
Ahmad (1) and Muhammad Junaid Asghar (2)
(1) Institute of Computing and Information Technology, Gomal University, D.I.Khan, Pakistan
(2) Faculty of Pharmacy, Gomal University, D.I.Khan, Pakistan


51. Paper 31051186: Diagnosis of Skin Diseases using Online Expert System (pp. 323-325)

Muhammad Zubair Asghar (1), Muhammad Junaid Asghar (2), Sheikh Muhammad Saqib (1), Bashir Ahmad (1),
Shakeel Ahmad (1) and Hussain Ahmad (1)
(1) Institute of Computing and Information Technology, Gomal University, D.I.Khan, Pakistan
(2) Faculty of Pharmacy, Gomal University, D.I.Khan, Pakistan


52. Paper 31051194: Cloud based Data Warehouse 2.0 Storage Architecture: An extension to traditional
Data Warehousing (pp. 326-329)

Kifayat Ullah Khan, Sheikh Muhammad Saqib, Bashir Ahmad, Shakeel Ahmad and Muhammad Ahmad Jan
Institute of Computing and Information Technology Gomal University, D.I.Khan, Pakistan
                                                             (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                                                 Vol. 9, No. 6, June 2011

Additive Model of Reliability of Biometric Systems with Exponential Distribution of Failure Probability

Zoran Ćosić, Director, Statheros d.o.o., Kaštel Stari, Croatia (zoran.cosic@statheros.hr)
Jasmin Ćosić, IT Section of Police Administration, Ministry of Interior of Una-sana canton, Bihać,
Bosnia and Hercegovina (jacosic@gmail.com)
Miroslav Bača, Professor, Faculty of Organisational and Informational science, Varaždin, Croatia
(miroslav.baca@foi.hr)

Abstract— Approaches to the reliability analysis of biometric systems have been reviewed in numerous
scientific papers. Most of them consider the reliability of the component software applications.
System reliability, covering both the technical and the software part, is of crucial importance both
for users and for manufacturers of biometric systems. In this paper, the authors develop a
mathematical model to analyse the reliability of biometric systems with respect to the dependence of
components with an exponential distribution of failure probability.

Keywords- Additive model, Biometric system, reliability, exponential distribution, UML

I. INTRODUCTION

The general biometric system, according to Wyman [1] and shown in Figure 1, consists of 5 elements
which are found in all biometric systems today.

Figure 1. [1]

Each subsystem consists of the elements that contribute to the overall system quality. The data
collection subsystem consists of a biometric sample, the method of sampling, and the sensors that
take the samples. The signal processing subsystem consists of drainage structures, quality control
and comparison of samples. The decision subsystem consists of the decision mechanisms and the
storage subsystems.

The schematisation of the model described in Figure 1 can be shown in Figure 2.

Figure 2

The schematic presentation of the biometric system [1] is a simplified representation of the system
in Figure 1 and shows the serial configuration of the dependence of the system components.

II. THE DEFINITION OF THE RELIABILITY OF BIOMETRIC SYSTEMS

Biometric system designers and producers are motivated to use already constructed components and
modules. A component system is expected to have high reliability, no matter who produced its parts.
Most existing reliability models are described so generally that they do not consider the
particularities of components and modules. In this paper the authors describe a methodology based on
a mathematical model which takes into account both component reliability and connection reliability.
UML methodology, in describing the system and its inner interactions, simplifies the approach for
researchers. UML [2] is also becoming a standard in the process of designing and manufacturing
systems, so the production of component systems benefits from the UML representation.
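The serial configuration of components mentioned above pairs naturally with the exponential failure model named in the paper's title. As a minimal sketch (the function names and rate values below are illustrative assumptions, not taken from the paper): for independent components in series, component reliabilities multiply and failure rates add, which is the sense in which such a model is additive.

```python
import math

def component_reliability(rate: float, t: float) -> float:
    """Survival probability at time t of one component whose time to
    failure is exponentially distributed with the given rate."""
    return math.exp(-rate * t)

def serial_system_reliability(rates: list[float], t: float) -> float:
    """A serially configured system works only while every component
    works, so with independent failures the component reliabilities
    multiply and, equivalently, the exponential failure rates add."""
    return math.exp(-sum(rates) * t)

# Five subsystems of a generic biometric system; rates (per hour) are invented
rates = [0.001, 0.002, 0.0005, 0.001, 0.0015]
r = serial_system_reliability(rates, t=100.0)
print(round(r, 4))  # equals the product of the individual reliabilities
```

Because the rates simply sum, extending the system with a sixth subsystem only requires appending its rate to the list.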




Assessment of generic biometric system reliability in UML [3] is given in Figure 3, which uses a Use
Case diagram:

Figure 3 [3]

q1 and q2 represent the probability that users u1 and u2 will access the system using some of its
functionality. P11 and P12 represent the probability that user u1 will use the functionality of f1
and f2, and P21 and P22 represent the same probabilities for user u2.

The probability of execution of use case x is defined by the expression:

(1)

where m is the number of users.

If we can attach a non-uniform distribution to the sequence diagram of a given use case, then (1)
can be expressed as:

(2)

where fj(k) is the frequency of the k-th transition of the sequence diagram in the j-th case and
P(kj) is the probability of the default scenario.

Figure 4. Sequence diagram [3]

Sequence diagrams play an important role in assessing the reliability of the system because they
give information on how many components are involved in the execution of a scenario. Through
sequence diagrams it is simple to count the periods of availability of components in the given
scenarios, as shown in Figure 3. The probability of failure of components with known busy periods
can be given by the following expression:

(3)

where the first symbol is the probability of failure of component i in scenario j and the second is
the occupancy time of component i in scenario j.

Expression (3) applies only if the following conditions are met:
- Independence of failure: the probability of failure of one component does not depend on the other
components.
- Regularity of failure: the probability of failure of one component is equal throughout the
execution of the component's occupation period.

Every moment of occupancy of any component of the system can also be shown, according to the method
being executed at that moment in the scenario. If we replace it with a set of per-method failure
probabilities, then equation (3) becomes:

(4)

During system operation, components interact and exchange information. It is then necessary to take
the occupation period of the component into consideration:

(5)

where θij is the probability of failure of the system components and bp is the busy period of the
system.

(6)

where Ψlmj is the probability of failure of the connections between the components and the remaining
symbol is the number of interactions between the system components.

(7)

The reliability of the system, taking into account the probability of failure, can be expressed as:


                                                                             2                              http://sites.google.com/site/ijcsis/
                                                                                                            ISSN 1947-5500
                                                               (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                                                   Vol. 9, No. 6, June 2011
                                                                    (7a)

                                                                    (7b)

III. RELIABILITY PREDICTION OF COMPONENTS IN SERIAL DEPENDENCE WITH EXPONENTIAL DISTRIBUTION

Based on the above assumptions [4], [5], system reliability can be calculated according to the law of exponential distribution:

Rs(t) = e^(−λ·t)                                                    (8)

in which:
Rs – system reliability
λ – failure intensity of the system
t – required time of reliable operation of the system

Proof:

In a serial dependence of all parts of the system, the failure of any part may cause the failure of the entire system. The failure intensity function λs in the case of a serial dependence of the system elements is calculated by the expression:

λs = λ1 + λ2 + … + λn = Σi λi                                       (9)

where λi is the failure intensity of the i-th part of the system.

The failure intensity function λs is equal to the ratio between the number of failures in a time interval and the number of correctly working elements in the system at the beginning of that interval:

λs = n2(Δt) / (n1(t − Δt) · Δt)                                     (10)

where:
λs – failure intensity function of the system
Δt – failure time of a system element
n2(Δt) – number of failures in a certain time interval Δt
n1(t − Δt) – number of non-failed elements at the end of the interval Δt, i.e. until t − Δt

The failure intensity of an element is calculated by the expression:

λEL = (1/n) · (1/Θ̂) = 1/(Θ̂ · n)                                     (11)

where:
n – number of usable parts for the confidence interval (1 − α) = 0.75
Θ̂ – lower limit of confidence for the mean time between failures

The total failure intensity, taking into consideration the number of elements that have not failed in a given time, is calculated by the formula:

λ = nEL · λEL                                                       (12)

where:
nEL – number of elements of the subsystems that have not failed

The mean time between failures MTBFs can be calculated using the expression:

MTBFs = 1/λs                                                        (13)

The calculation of the reliability [6] of an element is based on empirical data on its time of functioning and eventual failure. The problem [7], [8] becomes more complex when information about failures does not exist, i.e. when the part has worked perfectly and only information about its exploitation is available. If one assumes that the exponential distribution applies to the given part, it is possible to determine the upper confidence limit for the failure intensity, both in the case of continuous operation and in the case of one or more failures.

The lower limit of confidence for the mean time between failures, Θ̂, for the confidence interval (1 − α), is calculated using the formula:



Θ̂ ≥ 2tr / χ²(α, 2r+2) = 2tr / χ²(0.25, 2)                          (14)

where:
tr – total time of system operation
r – number of elements that have failed
χ²(α, 2r+2) – random variable which has the chi-squared distribution with 2r + 2 degrees of freedom

IV. SPECIAL CASE OF A NO-FAILURE SYSTEM

Considering (8), (10) and (14), we obtain an expression for the reliability of the whole system:

                                                                    (15)

In further calculations:

                                                                    (16)

Taking into account (11), we obtain an expression for the reliability of each element of the system:

                                                                    (17)

V. CONCLUSION AND FURTHER RESEARCH

The reliability of technical systems is a subject of research for many scientists, and the available analyses cover different reliability models of biometric systems that, in most cases, take the software into account as one of the components of the system. This study developed a mathematical model of a generic biometric system that considers user, hardware and software influence and assumes serial dependence of the components and an exponential distribution of the failure probability of the system components. The scientific contribution is the definition of a mathematical model for biometric system reliability prediction in the special case where it is not possible to obtain failure data, or where the system works perfectly. The results of the calculations can prevent real failure events by informing preventive maintenance. This mathematical model allows the prediction of system reliability in the early phase of the design of its components.

The subject of further research will be the creation of an integrated mathematical model of the reliability of complex systems that considers parallel and combined dependency of system components with different distributions of failure probability. In accordance with this, the authors will define a mathematical model for the recovery probability of a system and then for its readiness for use.

                          REFERENCES
[1]   M. Schatten, "Zasnivanje otvorene ontologije odabranih segmenata biometrijske znanosti" (Founding an Open Ontology of Selected Segments of Biometric Science), Master's thesis, FOI Varaždin, 2007.
[2]   M. Bača, M. Schatten, B. Golenja, "Modelling Biometric Systems in UML", JIOS, FOI Varaždin, 2007.
[3]   K. Singh, V. Cortellessa, B. Cukic, E. Gunel, V. Bharadwaj, "A Bayesian Approach to Reliability Prediction and Assessment of Component-Based Systems",
[4]   Department of Statistics and Lane Department of Computer Science and Electrical Engineering, West Virginia University,
[5]   Proceedings of the 12th International Symposium on Software Reliability Engineering (ISSRE'01), 1071-9458/01, IEEE, 2001.
[6]   Z. Ćosić, "Simulacijsko modeliranje pouzdanosti tehničkog sustava brodskog kompresora" (Simulation Modelling of the Reliability of a Ship Compressor Technical System), Master's thesis, 2007.
[7]   N. Vujanović, "Teorija pouzdanosti tehničkih sistema" (Theory of Reliability of Technical Systems), Vojnoizdavački novinski centar, Beograd, 2005.
[8]   P. Popic, "The Impact of Error Propagation on Software Reliability Analysis of Component-based Systems", Master's thesis, West Virginia, 2005.

                        AUTHORS PROFILE
Zoran Ćosić is CEO at Statheros Ltd. and a business consultant in the field of business process standardization. He received his BEng degree at the Faculty of Nautical Science, Split (HR), in 1990 and his MSc degree at the same faculty in 2007; he is currently a PhD candidate at the Faculty of Organization and Informatics, Varaždin, Croatia. He is a member of various professional societies and program committees. He is the author or co-author of more than 20 scientific and professional papers. His main fields of interest are information security, biometrics and privacy, and business process reengineering.
Jasmin Ćosić received his BE (Economics) degree from the University of Bihać, B&H, in 1997. He completed his study in the Information Technology field (dipl.ing. Information Technology) in Mostar, at Džemal Bijedić University, B&H. Currently, he is a PhD candidate at the Faculty of Organization and Informatics in Varaždin, University of Zagreb, Croatia. He works in the Ministry of the Interior of Una-Sana canton, B&H. He is an ICT expert witness and a member of the Association of Informatics of B&H, IEEE and ACM. His areas of interest are digital forensics, computer crime, information security and DBM systems. He has presented and published over 20 conference proceedings and journal articles in his research area.
Miroslav Bača is currently an Associate Professor at the University of Zagreb, Faculty of Organization and Informatics. He is a member of various professional societies and program committees, and he is a reviewer for several international journals and conferences. He is also the head of the Biometrics Centre in Varaždin, Croatia. He is the author or co-author of more than 70 scientific and professional papers and two books. His main research fields are computer forensics, biometrics and privacy.
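As a numerical illustration of the model above, the sketch below combines the exponential serial-system law of equations (8) and (9) with the zero-failure MTBF confidence bound of equation (14). All values are illustrative assumptions, not data from the paper; for the zero-failure case (r = 0, two degrees of freedom) the chi-squared quantile has a closed form, so no statistics library is needed:

```python
import math

def mtbf_lower_bound_no_failures(total_time: float, alpha: float = 0.25) -> float:
    """Zero-failure case of equation (14): theta >= 2*t_r / chi2(alpha, 2).
    For 2 degrees of freedom the upper-tail chi-squared quantile has the
    closed form chi2(alpha, 2) = -2*ln(alpha); for r > 0 failures a general
    quantile function (e.g. scipy.stats.chi2.ppf) would be needed."""
    return 2.0 * total_time / (-2.0 * math.log(alpha))

def serial_reliability(lambdas, t: float) -> float:
    """Equations (8) and (9): R_s(t) = exp(-(sum of lambda_i) * t)
    for serially dependent components."""
    return math.exp(-sum(lambdas) * t)

# Illustrative (assumed) data: 10,000 h of failure-free operation.
theta = mtbf_lower_bound_no_failures(10_000.0)   # conservative MTBF estimate
lam = 1.0 / theta                                # element failure intensity
print(f"MTBF lower bound: {theta:.1f} h")
print(f"R(1000 h), three such serial elements: {serial_reliability([lam] * 3, 1000.0):.4f}")
```

With α = 0.25 and r = 0 the quantile is χ²(0.25, 2) = −2·ln(0.25) ≈ 2.77, so 10,000 failure-free hours give an MTBF lower bound of roughly 7,213 h; the element count, times and confidence level are assumptions chosen only to exercise the formulas.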





             IP Private Branch eXchange of Saint Joseph
               University, Macao: a Design Case Study

                                                 A. Cotão, R. Whitfield, J. Negreiros
                                                   Information Technology Department
                                                        University of Saint Joseph
                                                              Macau, China

Abstract— The main goal of this research is to present the specification project of a digital telephone system, an IP PBX, for the new campus of Saint Joseph University (USJ), Macao. Given that the new USJ campus at Green Island, Macao, is planned for September 2012 with the latest technologies available, to achieve energy savings and to contribute to environmental sustainability, the available prototype was designed using VoIP (Voice over IP) to follow this trend. It is also expected to conclude that there is a financial reason for preferring a VoIP phone system over a conventional one. Choosing this platform eliminates the need for conventional telephone wiring for the new campus, which represents considerable cost and logistics savings. Further, good VoIP open-source software is already available, such as AsteriskNOW©. Finally, the internal USJ connection to the Catholic University of Lisbon, Portugal, adds another financial reason for this project, since international calls can be quite expensive. The aim of this paper is therefore to analyze the setup, testing and implementation of a prototype, including the explanation of the difficulties, technical lessons and recommendations on the tested IP PBX.

   Keywords— Voice Over IP (VoIP), Private Branch eXchange (PBX), AsteriskNOW©, PSTN, IP handsets, IP providers.

                      I.   INTRODUCTION

USJ is a joint venture of the Catholic University of Portugal and the Diocese of Macao [1]. It was created in the NAPE area of Macao but, with the expansion of new courses and a significant increase in student numbers, the available facilities rapidly became a serious concern, affecting future growth and the overall environment quality. To overcome this issue, a new campus is being developed and is under construction at Green Island, Macao. This new campus has been designed to use all available new technologies with the purpose of minimizing running costs as well as maintaining future environmental sustainability. It is expected that the new technologies used will take full advantage of solar power for heating, cooling and natural lighting, as well as rainwater collection for re-use.

    Historically, some organizations and companies have used separate telephone and computer data communication networks [2]. VoIP combines both networks to greatly reduce capital and operating costs. USJ wants to adopt this approach and, thus, the technical development of an IP PBX becomes the ambition. It is believed that VoIP is the future of corporate telephone systems because it can significantly lower telecommunications costs (a significant operating cost) while the business voice network is integrated with the data one [3]. The required software will be installed on standard computer hardware, according to the following items: (A) understanding of the general benefits and technical trends of small and medium IP PBXs; (B) configuration of an appropriate hardware and software combination regarding IP PBXs; (C) evaluation of the final prototype; (D) design and specification proposal for a production system suitable for use at the new USJ campus.

    Concerning the structure of this paper, section two summarizes the basic concepts and relationships of the PSTN (Public Switched Telephone Network), VoIP (Voice over Internet Protocol), PBX (Private Branch eXchange) and IP PBX for non-technological readers who would nevertheless like to understand them. This includes business managers who are not aware of the possibility of lowering communication costs within their organizations. The following section describes the specification and design of the IP PBX server, the IP phones connected to it, and the VoIP gateway that connects the IP PBX to the Public Switched Telephone Network (PSTN) and other VoIP providers. In section four, the setup and implementation are shown, while section five illustrates the testing phase, including the evaluation of its efficiency and reliability. Inevitably, the last section recommends the main technical specifications of an IP PBX for the future campus of USJ.

                  II.  PSTN, VOIP AND IP PBXS

PSTN stands for Public Switched Telephone Network and is also known, according to [4], as POTS (Plain Old Telephone System). Born in 1876 with Alexander Graham Bell, it is a circuit-switched network in which telephone handsets are interconnected through single or multiple hub exchanges (cross-connect switches). Including the mobile system, it is the main telecommunication network worldwide, with 5.4 billion subscribers (800 million fixed-line and 4.6 billion mobile).

    At first, all phones were connected among themselves in a meshed network but, with the growth in the number of users, this layout became impractical. Henceforth, a new layout (hierarchical and star topologies) was developed to allow this steady growth. Once a connection is established, the human voice is converted into analog form and sent through copper twisted-pair cables to the central office (CO), where it is converted to digital




signals at the Digital Cross-Connect Switches [5, 6]. These digital signals can take different paths depending on whether it is a local, national or international call. If it is a local call connected to the same switch, the analog signal is routed directly to the destination number. Otherwise, the very first switch converts the analog signal to digital for subsequent routing. Later, it is converted back to an analog form to be sent to the destination call number.

    VoIP stands for Voice over IP, or Voice over Internet Protocol. It is another way to make phone calls because, instead of using the conventional PSTN network, calls are carried over the Internet infrastructure. VoIP started in 1995, when Vocaltec© released its first Internet phone software. It used the H.323 protocol and ran on any PC with the usual microphones and speakers. Vocaltec© enjoyed initial success but, due to the lack of broadband, it did not survive for long. In 2003, a major step was taken by Skype©. Unlike the previous Internet experiences, Skype© used its own protocol. Since it offered good voice quality, it became a major commercial reference for VoIP. In the first half of 2010, according to the Skype© Web site, users made a total of 6.4 billion calls to landlines and mobile phones.

    A VoIP call usually starts with a typical PC or with an IP/analog telephone connected to an Analog Telephone Adapter (ATA). The analog voice is converted into a succession of 0s and 1s and, afterwards, the digital signal is broken into smaller chunks called packets, which are sent to their destination through the Internet [7, 8]. At the destination, the process is reversed: the packets are reassembled in the proper order and converted to an analog signal that is understandable to the human ear.

    VoIP technology uses three protocols: SIP (Session Initiation Protocol, an IP signaling procedure to establish, modify and terminate VoIP calls), RTP (Real-time Transport Protocol, a standardized packet format for delivering audio/video over the Internet) and RTCP, which monitors transmission statistics and Quality of Service (QoS) status.

    The connection between VoIP and the PSTN is made through a protocol named ENUM, which stands for telephone number mapping [9]. Basically, ENUM translates telephone numbers into a format that can be used by the Internet. Bear in mind that the PSTN is a circuit-switched network that uses telephone numbers, while the Internet is a packet-switched network that uses Uniform Resource Identifiers (URIs) for addressing each device. The ENUM protocol enables circuit-switched traffic to be carried on a packet-switched network by matching a circuit address (a telephone number) to a network address (a URI). Hence, ENUM links the PSTN and the Internet, providing a means for Internet-connected phones to receive and make calls to the PSTN network.

    Calls between subscribers of the same VoIP provider (VoipBuster© or Vonage©, for instance) are usually free. Calls between subscribers of different VoIP providers should also have no associated costs. On the other hand, when calls originate from VoIP providers and terminate at the PSTN, the costs involved equal the interconnection costs charged by the VoIP provider for using the gateway that connects the Internet network to the local PSTN. To minimize these interconnection charges, as expected, VoIP subscribers use the Internet network to the nearest PSTN termination point.

    Once again, a key distinction between the PSTN and the Internet is the identification of the destination. Within the PSTN, lines, rather than devices, are identified; that is, your home phone number is the phone line to your place (in fact, you can attach different devices to this line, such as faxes and answering machines). By contrast, the Internet identifies each specific device by its IP address.

    IP PBX stands for Internet Protocol Private Branch eXchange and, basically, is an IT framework which uses the common LAN (Local Area Network), the Internet, the PSTN and VoIP providers for communication purposes (see Figure 1).

                Figure 1. IP PBX and Internet integration.

From the point of view of the end-user, there are no differences. For organizations, the change lies in the system setup and the cost of phone calls, which is lower. Basically, an IP PBX consists of a server with special hardware for PSTN interfaces, SIP phones and VoIP gateways. Indeed, SIP phone, VoIP phone and IP phone are different names for the same device [10]. It is this device that connects the user with the IP PBX. All received calls are then routed to their destination automatically, according to a set of rules (the dial plan) that must be programmed beforehand.

    An IP PBX server operates similarly to a proxy server: the SIP clients (soft phones or handset IP phones) register with it and, when they wish to make a phone call, they request the IP PBX to establish the connection. Internally, the IP PBX has a directory of all telephones/users with their corresponding SIP addresses. It is therefore possible to connect an internal call (located in the same LAN) or route an external call through either a VoIP gateway or a VoIP service provider. Connections with external parties are made through SIP trunks and, optionally, via a VoIP gateway, where the SIP trunk bonds the IP PBX to an ITSP (Internet Telephony Service Provider). Notably, physically moving a SIP phone does not affect its relationship to the IP PBX.

            III.  SYSTEM SPECIFICATION AND DESIGN

According to Figure 2, the IP PBX prototype will run Asterisk©, whose hardware equipment and software package specifications are as follows: (A) computer hardware (Intel



Pentium© Dual CPU, RAM of 3GB, NVIDIA© GeForce 9500
GT graphics card, Realtek© RTL8168C network card, 120GB
Maxtor© 6Y120PO ATA HD with a DVD drive); (B) Software
(AsteriskNOW© 1.5 based on Linux CentOS©, MySQL© and
Apache Web server); (C) Connections (local extensions for the
same IP PBX LAN, four remote extensions, links to Macao
PSTN through a VoIP gateway and, as expected, connection to
the Portugal PSTN network through four different VoIP
providers to minimize call costs).
    In a more detailed view, the three local extensions follow this pattern: 1001 (ATA Linksys© SPA3102), 1002 (soft-phone CounterPath’s X-Lite©) and 1005 (Grandstream© GXP2000 IP phone) for Macao calls. Regarding the external ones, there is one VoIP gateway to access the Macao PSTN landline and four VoIP providers to minimize international call charges: G9 Telecom© to receive and make phone calls from and to Portuguese nomadic numbers, SMSDiscount© to connect to the Portuguese PSTN land lines, VoIPCheap© to connect to the Macanese and Hong Kong PSTN land lines and, at last, SmartVoIP© to connect to the Portuguese mobile network (see Figure 2).

Figure 2. The IP PBX layout (512 Kbps is the minimal recommended Internet bandwidth).

    Regarding the IP PBX client specifications, there are different brands, models and types of IP phones available in the market, such as soft-phones, ATAs (Analog Telephone Adapters) and handset IP phones. With the exception of the first option, which can be downloaded free of charge, the remaining ones are not available in Macao. For the purpose of testing, one Grandstream© GXP2000 (a standard desktop IP phone handset) and one ATA Linksys© SPA3102 (which works as an analog handset extension and also as a VoIP gateway connecting the IP PBX to the Macao PSTN) were purchased from an online USA site. The ATA Linksys© PAP2T and the Siemens© Gigaset C470 IP phone (used as remote extensions located in Portugal) were purchased from an online Portuguese site (see Figures 3 and 4); both can work with more than one VoIP provider, as long as they are properly configured, and both are able to work as extensions of the IP PBX prototype.

Figure 3. On the left, the GXP2000© is an IP handset suitable for both small and large business organizations; it can be connected directly to any LAN and handles up to four simultaneous VoIP calls. It has a dual 10/100 Mbps Ethernet port, an intuitive user interface, a large back-lit LCD display with multiple language support and privacy protection. On the right, the Siemens© Gigaset C470 IP is a cordless IP phone which can connect to any IP PBX, via a LAN, as well as to a local PSTN.

Figure 4. On the left, the IP soft-phone CounterPath’s X-Lite© can be used to make and receive voice/video calls [11]. Its minimum specifications to connect and operate with the IP PBX are the G.711 audio codec, the SIP protocol and caller ID/voicemail. In the center, the ATA Linksys© SPA3102 is an adapter that connects analog telephones and fax machines to the IP PBX through a computer data network; curiously, it also has the ability to bond to any local PSTN. On the right, the ATA Linksys© PAP2T links one or two analog phones to the IP PBX in order to create one or two extra extensions.

               IV.    SYSTEM SETUP AND IMPLEMENTATION

    To start, the Asterisk©NOW IP PBX prototype must be integrated both with the Macao CTM provider (through a broadband Internet connection) and with USJ’s LAN. The ATA Linksys© SPA3102 provides the link between the PSTN network and all USJ analog phones; basically, it works as a VoIP Internet gateway as well as an extension of the IP PBX. Hence, the bond between the IP PBX and the Macao residential PSTN is established through this VoIP gateway. Secondly, the ATA Linksys© PAP2T connects two analog handsets, working as two different extensions of the IP PBX. Third, the IP PBX server and all its local extensions (ATA, handset IP phone and soft-phone) are integrated with the CTM Macao network, underpinned by an ADSL modem and a broadband router (see Figure 5). As expected, this router implements NAT (Network Address Translation) and firewall functionality in order to protect the local USJ LAN from Internet intrusion. It also provides a DHCP service to allocate private IP addresses to the local LAN equipment. Fourth, the IP PBX LAN router has to be configured to allow UDP (User Datagram Protocol) data packets to pass through it and to be forwarded to the right IP address. With this purpose, several UDP ports were set up: 5004 to 5037 (Real-time Transport Protocol, RTP), 5039 to 5082 (Session Initiation Protocol, SIP) and 10000 to 20000 (extra RTP ports). In brief, SIP ports are used for signaling the




                                                                   7                                    http://sites.google.com/site/ijcsis/
                                                                                                        ISSN 1947-5500
                                                                       (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                                                                 Vol. 9, No. 6, 2011
connection between two IP phones (the telephone ring) and, when the called IP phone answers, the RTP protocol starts being used, since it is the protocol mainly responsible for transporting the audio packets [12].

Figure 5. The overall IP PBX network diagram.

    Notice that the IP PBX server computer must be configured with a private static IP address to simplify the configuration of the local VoIP extensions and of the forwarding ports. This configuration is accomplished at the router level and is based on the Port Range Forwarding procedure, as Figure 6 shows.

Figure 6. Example of the port range forwarding router configuration.

    The Asterisk©NOW download (version 1.5) comes as an integrated distribution that includes a Linux distribution (a stripped-down version of CentOS©), the MySQL© database, the Apache© Web server, the PHP© Web programming language, the Asterisk© IP PBX server and the FreePBX Asterisk© administration packages. The installation sequence is summarized in Table 1. On the configuration side, this includes the creation of Asterisk©NOW extensions, trunks, inbound/outbound routes, the Follow me, Disa and conference functions, Backup/Restore and DID (Direct Inward Dial).
    Extensions are used for internal calls that only involve the IP PBX. Trunks are used for external calls that are routed through VoIP gateways or VoIP providers [13]; they cover calls from and to outside parties, that is, PSTN numbers, nomadic numbers (VoIP numbers) or other IP PBXs. Once again, extensions are all those numbers assigned to soft-phones, ATAs or IP phones directly connected to the IP PBX (configured both in the IP PBX itself and in each IP handset). For instance, the common telephone number that people have on their office desk usually holds a sub-number with three or four digits; it is this sub-number that allows them to call their office colleagues or third PSTN parties, as long as these extensions are defined in the office PBX.

  TABLE I.        INSTALLATION OF THE ASTERISK©NOW SOFTWARE PACKAGE.

   Step    Action
    1      Splash screen: choose install.
    2      Format the hard drive.
    3      Accept the default disk partitioning.
    4      Choose the time zone.
    5      Input the “root” password.
    6      When the installation finishes, remove the bootable CD and reboot the PC.
    7      Configure the firewall setting to “Enabled” and “Permissive”.
    8      Choose a static IP instead of DHCP in the network configuration.
    9      The IP PBX configuration finishes by rebooting until the login screen is displayed.

    The trunk lines allow the IP PBX to connect with external parties, that is, they link it to the PSTN and to the VoIP providers. In this case, the available trunk connects to the Macao CTM PSTN through a VoIP gateway (the ATA Linksys© SPA3102) and can be used for both outgoing and incoming calls. Alternatively, the VoIP trunks allow the system to call external parties (VoIP and PSTN telephone numbers in other countries) through local VoIP providers using the following Internet infrastructure: (A) G9 Telecom© for incoming/outgoing phone calls to Portuguese nomadic numbers; (B) VoIPCheap© to make land line and mobile phone calls to Hong Kong and Macao; (C) SMSDiscount© to make land line calls to Portugal; (D) SmartVoIP© for outgoing mobile calls to Portugal. For instance, if someone is using one of the extensions connected to the IP PBX and needs to phone a CTM© number or a Macao mobile one, the call will be established through the Macao PSTN trunk (or through the VoIPCheap© trunk, if the PSTN trunk is already in use). Nevertheless, if he/she wants to call a PSTN land line in Portugal, the call will be established through the SMSDiscount© trunk.
    Regarding trunk decisions, the IP PBX makes them through the Outbound Routes, which define the sequence path regarding what to do when an external telephone call arrives at the IP PBX or when someone dials an external phone number (the strategy definition of which trunk should be used to establish any particular connection).
    With the present prototype, there are only two trunk lines with inbound routes to be configured: the Macao PSTN line and the Portuguese nomadic number. The remaining trunks have no associated telephone number and cannot receive telephone calls (they are used only for outbound calls). According to Table 2, all the phone calls received from the Macao PSTN trunk line were set to be forwarded to extension 1005 (the Grandstream© GXP2000 IP phone handset). Similarly, all incoming calls from Portuguese nomadic numbers were routed to the same extension.

  TABLE II.       CREATION AND CONFIGURATION OF INBOUND ROUTES WITHIN ASTERISK©NOW.

   Step    Action
    1      Within the Inbound Routes menu, choose the Add Incoming Route option.
    2      Add the description (for instance, the device model SPA3102) and the CTM PSTN number.
    3      Choose the destination (Set Destination menu) for incoming phone calls on this trunk route. Keep in mind that for this trial product, all
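The split between SIP signaling ports and RTP media ports described above (5004-5037 for RTP, 5039-5082 for SIP and 10000-20000 for extra RTP) can be illustrated with a short sketch. The constant and function names below are purely illustrative, not part of Asterisk©NOW or FreePBX:

```python
# Sketch: classify a forwarded UDP port according to the ranges opened on the
# router for the IP PBX (5004-5037 RTP, 5039-5082 SIP, 10000-20000 extra RTP).
# The names PORT_RANGES and classify_udp_port are illustrative assumptions.

PORT_RANGES = {
    "RTP":       range(5004, 5038),    # media (audio packets)
    "SIP":       range(5039, 5083),    # signaling (call setup, the "telephone ring")
    "RTP-extra": range(10000, 20001),  # additional media ports
}

def classify_udp_port(port):
    """Return the traffic type expected on a forwarded UDP port, or None."""
    for kind, ports in PORT_RANGES.items():
        if port in ports:
            return kind
    return None
```

For example, `classify_udp_port(5060)` returns `"SIP"`, while a port outside all three ranges returns `None` and would be blocked by the router's firewall.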




           the incoming calls will be sent to extension 1005 of the Grandstream© GXP2000 IP phone.
    4      Click the Submit/Apply button.

    The dial plan defined for all the outbound routes is summarized in Table 3. Note that the Macao, Portugal and Hong Kong international country codes are 853, 351 and 852, respectively.

      TABLE III.      DIAL PLAN FOR THE IP PBX PROTOTYPE SYSTEM.

   Trunk                              Inbound route        Outbound route
   Macao-PSTN (in case of failure,    Destination =        008531.
   VoIPCheap© will be used)           Extension 1005       0085328XXXXXX
                                                           008536XXXXXXX
                                                           008538XXXXXXX
                                                           00853999
                                                           00852XXXXXXXX
   G9 Telecom©                                             003513XXXXXXXX
   SMSDiscount©                                            003512XXXXXXXX
                                                           003517XXXXXXXX
                                                           003518XXXXXXXX
   SmartVoIP© (in case of failure,                         003519XXXXXXXX
   SMSDiscount© will be used)

    In line with Table 4, different dial plans with different VoIP provider trunks were configured to minimize call costs.

  TABLE IV.       RATES CHARGED BY THE MAIN VOIP PROVIDERS OF USJ (IN EUROS/MINUTE).

   Destination             G9 Telecom©   VoIPCheap©   SMSDiscount©   SmartVoIP©
   Hong Kong (Land Line)      0.100         0.000         0.010         0.000
   Hong Kong (Mobile)         0.250         0.000         0.005         0.000
   Macao (Land Line)          0.250         0.020         0.030         0.030
   Macao (Mobile)             0.250         0.030         0.030         0.030
   Portugal (Land Line)       0.016         0.000         0.000         0.000
   Portugal (Mobile)          0.106         0.100         0.065         0.060

    The Follow me function is applied whenever the user is not able to receive a phone call at his/her extension and wants to forward that call to another extension or even to an external phone number (either a PSTN or a VoIP number). In this particular case, the IP PBX was set up for every weekday (after working hours) and for those days when someone is out of the office (it forwards all the calls from the given extension to the user’s mobile).
    The Disa function is applied whenever he/she needs to make a costly phone call (an international one to New Zealand, for instance). To avoid being personally charged in this circumstance, the user calls a Disa-active phone number of the office. If the requested password is correct, the IP PBX will set up the line automatically, according to the cheapest defined route (the dial plan of the outbound routes). Unsurprisingly, the traditional password system is required for safety purposes.
    The conference function connects users from different places to minimize phone call costs [14]. There are two ways to set up this procedure [15]: (A) the participants are informed in advance of the date/time of the conference and, beforehand, the users call the phone conference number; (B) the participants are informed in advance of the date/time of the conference and, at the pre-defined date/time, the IP PBX administrator pulls them into the conference through the FreePBX Flash panel (the users’ phones will ring and, after they answer, the conference call is already set up).
    The Direct Inward Dial (DID) feature redirects PSTN phone numbers that share a single prefix while the last two/three/four digits vary; thus, each number of the block maps to a different extension. Implementing DID starts with requesting a special kind of trunk from the PSTN provider. For this particular line, as each call is started, the suffix digits are actually passed to the IP PBX so it can decide which extension to route the call to. Usually, PSTN telephone numbers are obtained in a block of numbers, for instance, from 28831000 to 28831009 (a block of ten numbers, in this case). This block of numbers is then configured to match the respective extensions defined in the IP PBX. Typically, the first number, 28831000, is a direct line to the receptionist, while the remaining ones are DIDs (28831001 signifies organization extension 1001, 28831002 means organization extension 1002 and so on). Hence, any external user who wants to make a direct phone call to extension 1005 just needs to dial the telephone number 28831005.
    Naturally, the soft-phones are configured directly on each laptop, while the handset IP phones and ATAs have to be configured in a different way, depending on the brand and model. Extension soft-phones and other handsets need to register at the IP PBX, each with a different IP address (a password must be supplied as part of the registration process). If an extension is turned off or disconnected from the network, for instance, the IP PBX will divert calls to the voicemail or to another pre-defined function. Extensions on the same LAN can also be hard coded with their IP address in the IP PBX. Yet, outside extensions are different, depending on whether the IP PBX has a public IP address (or not). In this case study, the DNS (Domain Name Service) is used to obtain the required IP address. Even though five IP phones were installed (CounterPath’s X-Lite©, Linksys© ATA SPA3102, Linksys© ATA PAP2T, Grandstream© GXP2000 and Siemens© Gigaset C470 IP), Table 5 only shows the four main steps for the first appliance.

    TABLE V.       CONFIGURATION OF COUNTERPATH’S X-LITE©.

   Step    Action
    1      Download the CounterPath’s X-Lite© software (http://www.counterpath.com/x-lite.html).
    2      In the SIP Account Settings menu, add a new account.
    3      Fill in the following fields: Display name, Extension number, Password, Authorization user name and Domain (the IP address of the IP PBX).
    4      Configuration concludes after the Ready message.
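The outbound-route strategy of Table 3, together with the failover rules described earlier (Macao PSTN falling back to VoIPCheap©, SmartVoIP© falling back to SMSDiscount©), can be sketched as a simple prefix-matching routine. The `DIAL_PLAN` structure and the `select_trunk` helper below are hypothetical simplifications for illustration only, not the actual FreePBX configuration:

```python
# Sketch: outbound trunk selection following the dial plan of Table 3, with the
# failover behaviour described in the text. Names and structure are assumptions.

DIAL_PLAN = [
    # (dialed prefix, primary trunk, fallback trunk when primary is busy)
    ("00853",  "Macao-PSTN",  "VoIPCheap"),    # Macao land lines and mobiles
    ("00852",  "Macao-PSTN",  "VoIPCheap"),    # Hong Kong numbers
    ("003513", "G9 Telecom",  None),           # Portuguese nomadic numbers
    ("003512", "SMSDiscount", None),           # Portuguese land lines
    ("003517", "SMSDiscount", None),
    ("003518", "SMSDiscount", None),
    ("003519", "SmartVoIP",   "SMSDiscount"),  # Portuguese mobile numbers
]

def select_trunk(dialed, busy=()):
    """Pick the trunk for a dialed number; use the fallback if the primary is busy."""
    for prefix, primary, fallback in DIAL_PLAN:
        if dialed.startswith(prefix):
            return primary if primary not in busy else fallback
    return None  # no outbound route defined for this number
```

For instance, a call to a Macao number is routed through the Macao PSTN trunk, but `select_trunk("0085328123456", busy=("Macao-PSTN",))` returns `"VoIPCheap"`, mirroring the failover described in the text.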




                           V.      SYSTEM TESTING

The next step is to evaluate the efficiency and reliability of the IP PBX prototype: telephone calls among extensions located on the same or on different IP PBX LANs, inbound/outbound connections (via SIP and PSTN trunks), and the voicemail, Follow me, Disa and conference capabilities. As shown below, the tests of Asterisk©NOW include its start-up procedure (see Table 6), IP PBX access (see Table 7) and Secure Shell access (see Table 8), client registration (see Figure 7) and the PSTN and SIP trunks (see Table 9).

TABLE VI.       IP PBX SERVER START-UP TEST RESULTS. IT WAS OBSERVED THAT THE SERVER REQUIRED 29 SECONDS TO SHUT DOWN.

    Date         Start Time    Duration (Seconds)    Start-Up Errors?
  2010.10.17      16:13:00             83                   No
  2010.10.17      16:23:00             83                   No
  2010.10.17      16:34:30             83                   No
  2010.10.17      16:44:00             83                   No
  2010.10.17      16:55:00             83                   No

 TABLE VII.      CHECK RESULTS OF THE IP PBX SERVER ACCESS THROUGH THE FREEPBX GUI INTERFACE.

    Date         Access Time    Login        Access to    Test
                 (hh:mm:ss)     accepted?    IP PBX?      Result
  2010.10.17      16:15:00        Yes          Yes         Pass
  2010.10.17      16:25:00        Yes          Yes         Pass
  2010.10.17      16:36:30        Yes          Yes         Pass
  2010.10.17      16:47:00        Yes          Yes         Pass
  2010.10.17      16:58:00        Yes          Yes         Pass

 TABLE VIII.     TEST RESULTS OF IP PBX SERVER ACCESS VIA THE SSH CLIENT INTERFACE.

    Date         Access Time    Login        Access to    Test
                 (hh:mm:ss)     accepted?    IP PBX?      Result
  2010.10.17      16:18:00        Yes          Yes         Pass
  2010.10.17      16:28:00        Yes          Yes         Pass
  2010.10.17      16:39:30        Yes          Yes         Pass
  2010.10.17      16:50:00        Yes          Yes         Pass
  2010.10.17      17:02:00        Yes          Yes         Pass

 TABLE IX.       EVALUATION OF THE PSTN AND SIP TRUNKS. ONCE AGAIN, NO IRREGULARITIES WERE FOUND.

  Time                        Trunks Registered
           PSTN    G9 Telecom    VoIPCheap    SMSDiscount    SmartVoIP
  18:56     Yes       Yes           Yes           Yes           Yes
  19:05     Yes       Yes           Yes           Yes           Yes
  19:13     Yes       Yes           Yes           Yes           Yes
  19:20     Yes       Yes           Yes           Yes           Yes
  19:28     Yes       Yes           Yes           Yes           Yes

Figure 7. Snapshot of the IP phone (IP PBX client) extensions (1001, 1002, 1005, 1007, 1025, 1026 and 1029) registered after the start-up procedure. No abnormalities were found.

    Afterwards, phone calls between extensions of the same/different IP PBX LAN followed. Figures 8 and 9 show the call status information between extensions 1001-1002 and 1029-1025 and, auspiciously, the voice quality was considered excellent in both cases, according to the excellent, good, reasonable and bad scale. The same result was found for other extension calls, such as 1002 to 1005 and 1005 to 1001.

Figure 8. Snapshot of the in-boundary call (Macao-Macao) between extensions 1001 and 1002.

Figure 9. Snapshot of the out-boundary call (Macao-Maputo, Mozambique) between extensions 1029 and 1025.

    At last, Tables 10 and 11 exhibit some trial results of the inbound and outbound calls using the PSTN and SIP trunks. As well, the voicemail (from extension 1001 to 1002 and 1025), Follow me and Disa (see Table 12) tests occurred with no major problems.

  TABLE X.       APPRAISAL RESULTS OF THE OUTBOUND CONNECTION TO MACAO USING PSTN TRUNKS.

   Start Time   Duration         Origin               Destination          Voice
    (hh:mm)     (hh:mm)    Extension    Place     Fixed Line     Place    Quality
                                                  or Mobile
     20:08       02:59        1001      Macao       Mobile       Macao     Good
     20:15       05:23        1005      Macao       Mobile       Macao     Good
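Access checks such as those recorded in Tables 7 and 8 could be automated with a small reachability probe. The sketch below assumes the FreePBX GUI listens on TCP port 80 and the SSH service on TCP port 22, and the host address passed in is a placeholder for the IP PBX private static IP; none of these names come from the Asterisk©NOW distribution itself:

```python
import socket

# Sketch: automate the access checks of Tables 7 and 8 by probing the TCP
# ports assumed for the FreePBX GUI (HTTP, 80) and the SSH client (22).

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def access_checks(host):
    """Run the GUI and SSH reachability checks; mirror the Pass/Fail column."""
    return {name: ("Pass" if port_open(host, port) else "Fail")
            for name, port in (("FreePBX GUI", 80), ("SSH", 22))}
```

A scheduled run of `access_checks("192.168.1.100")` (with the real IP PBX address substituted) would reproduce the Login accepted / Access to IP PBX columns of the manual test tables.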



  TABLE XI.      ASSESSMENT RESULTS OF THE OUTBOUND CONNECTION TO OTHER COUNTRIES USING SIP TRUNKS.




   Start Time   Duration         Origin               Destination          Voice
    (hh:mm)     (hh:mm)    Extension    Place     Fixed Line     Place    Quality
                                                  or Mobile
     20:47       02:23        1001      Macao     Fixed Line     Macao     Good
     20:29       02:51        1002      Macao       Mobile       Macao     Good

  TABLE XII.     EXAMINATION RESULTS OF THE DISA FUNCTION.

   Access     Disa       Line given     Call        Test Result
   (hh:mm)    Active?    by IP PBX?     Success?
    22:27      No           No            No           Pass
    22:30      Yes          Yes                        Pass

    The conference function test was done with extensions 1001 (Macao), 1025 and 1026 (Portugal) and 1029 (Mozambique). According to Figure 10, the overall quality was considered quite good.

                                REFERENCES

[1]  USJ (University of Saint Joseph), “About University of Saint Joseph History”, available at http://www.usj.edu.mo/?content_left&col=1&id=15 [accessed Mar 09].
[2]  Boavida, F., Monteiro, E., “Computer Networks Engineering”, 4th Edition, ISBN 972-722-203-x, FCA-Lidel, 2007, 554 pp.
[3]  Vestias, M., “CISCO Networking”, 4th Edition, FCA-Lidel, ISBN 978-972-722-506-4, 2010, 648 pp.
[4]  Hamdi, M., Verscheure, O., Hubaux, J.-P., Dalgic, I., Wang, P., “Voice Service Interworking for PSTN and IP Networks”, IEEE Communications Magazine, Vol. 37 (5), ISSN 0163-6804, 1999, pp. 104-111.
[5]  Obara, H., Yasushi, T., “An Efficient Contention Resolution Algorithm for Input Queuing ATM Cross-Connect Switches”, International Journal of Digital & Analog Cabled Systems, Vol. 2 (4), 1989, pp. 261-267.
[6]  Holtmanns, S., Horn, G., Moeller, W., “Identity Management in Mobile Communication Systems”, in Selected Topics in Communication Networks and Distributed Systems, Sudip Misra & Isaac Woungang (Eds.), World Scientific Publishing, 2010, pp. 709-730.
[7]  Gratz, J., “Voice Over Internet Protocol”, Science & Technology, 6 Minn. J.L., 2004, 443 pp.
[8]  Prabhakaran, K., “Advanced Link State Protocol”, in Computer and Network Technology: Proceedings of the International Conference on ICCNT 2009, Zhou & Mahadevan (Eds.), World Scientific Publishing, 2009, pp. 89-92.
[9]  Neustar, “What Is ENUM?”, available at http://www.enum.org/what.html [accessed Oct 10].
[10] Barbeau, M., Boone, P., Kranakis, E., “Wimax/802.16 Broadband
                                                                                         Wireless Netwworks”, in Selected Topics in Communication Networks
                                                                                         and Distributed Systems, Sudip Misra & Isaac Woungang (Eds), World
                                                                                         Scientific Publishing, 2010, pp. 79-111.
Figure 10. Snapshot of the conference calls among 1001, 1025, 1026 and
1029 extensions.                                                                    [11] Blueface, CounterPath X-Lite Softphone Specifications. [Online]
                                                                                         Available       at    http://www.blueface.ie/helpandadvice/specification/
                                                                                         xlite.aspx [accessed Sep 10].
                       VI.     FINAL THOUGHTS                                       [12] Sharma, S., “Hello Expired Time Based Greedy Routing Scheme for
                                                                                         Mobile Ad Hoc Networks”, in Computer and Network Technology:
For personal use, a well known VoIP application is Skype©.                               Proceedings of the International Conference on ICCNT 2009, Zhou &
This application allows audio and video communications at                                Mahadevan (Eds), World Scientific Publishing,, 2009, pp. 45-49.
very low costs (from Skype© to PSTN telephones) or even at                          [13] Smarter, “Linksys SPA3102 Voice Gateway with Router - VoIP
no cost at all (from Skype© to Skype©). Therefore, the incentive                         gateway”, available at http://www.smarter.com/bridges-routers/linksys-
of this project is to assemble, implement and configure a VoIP                           spa3102-voice-gateway-with-router-voip-gateway/pd--ch-2--pi-
                                                                                         770317.html [accessed Sep 10].
phone system for USJ needs based on Linux© and other FOSS
                                                                                    [14] Hallberg, B., “Networking”, 5th Edition, McGraw-Hill Professional
(Free Open Source Software) technologies [16]. According to                              Publishing, p. 415, 2009.
the previous results, it seems it is possible to setup an IP PBX                    [15] Asterisk, “Forum-AsteriskNOW Support”, [Online] Available at
for USJ, including IP phones.                                                            http://forums.digium.com/viewforum.php?f=14&sid=feeaa4f3fbe8e9fc1
    Regarding future work, the productive equipment depends,                             1706bb68efd5cf1&start=2200 [accessed Sep 10].
nowadays, by the end of the construction of the new campus.                         [16] Chava, K., How, J., “Integration of Open Source and Enterprise IP
Still, two technical lessons should be highlighted from the past:                        PBXs, Testbeds and Research Infrastructure for the Development of
(A) To have two Internet broadband lines, one for the IP PBX                             Networks and Communities” (3rd International Conference), ISBN 978-
                                                                                         1-4244-0739-2, 2007, pp. 1-6.
server and another for the remainder Internet data traffic
network; (B) To design carefully the SIP and RTP ports for                                                       AUTHORS PROFILE
both protocols work all together without any conflicts.                             António Cotão holds a Master degree in Information Technology from the
                                                                                         University of Saint Joseph, Macau, China, and, at the moment, he works
    Finally and based on the planning projections of all                                 in the logistics and financial department of a Portuguese pharmaceutical
staff/student number of USJ by 2012, it is recommended the                               (Hovione).
following hardware components: (A) IP PBX Server (2 Intel                           Richard Whitfield is a full professor at Saint Joseph University, Macau,
Xeon© processor 7500 series, 16GB RAM, Motherboard with                                  China, and, currently, he is one of the responsibles for the construction
standard graphics and dual 100/1000 Ethernet NIC cards,                                  of the new USJ campus. He holds a doctoral degree in Manufacturing
RAID 5 redundancy with 1TB for each HD, UPS and                                          from the University of Melbourne, Australia.
Digium/ATA’s gateway to connect the Macao PSTN. (B) For                             João Negreiros is an associate professor at University of Saint Joseph from
                                                                                         2011 and he holds a doctoral degree in Information Technology from the
the clients, Polycom©, Cisco© and Linksys© brands are highly                             New University of Lisbon, Portugal.
recommended appliances. (C) Asterisk©NOW as the main core
software.




                                                                               11                                    http://sites.google.com/site/ijcsis/
                                                                                                                     ISSN 1947-5500
                                                              (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                                                  Vol. 9, No. 6, June 2011


    A Novel and Secure Data Sharing Model with Full
       Owner Control in the Cloud Environment
                                                 Mohamed Meky and Amjad Ali
                                                    Center of Security Studies
                                            University of Maryland University College
                                                     Adelphi, Maryland, USA
                                           mmeky@faculty.umuc.edu and aali@umuc.edu


Abstract— Cloud computing is a rapidly growing segment of the
IT industry that will bring new service opportunities, with
significant cost reductions in IT capital expenditures and
operating costs, on-demand capacity, and pay-per-use pricing
models for IT service providers. Among these services are
Software-as-a-Service, Platform-as-a-Service, Infrastructure-as-
a-Service, Communication-as-a-Service, Monitoring-as-a-Service,
and Storage-as-a-Service. Storage-as-a-Service provides data
owners a cost-effective service to store massive amounts of data
and handles efficient routine data backup by utilizing the vast
storage capacity offered by a cloud computing infrastructure.
However, shifting data storage to a cloud computing
infrastructure introduces several security threats to data, as cloud
providers may have complete control over the computing
infrastructure that underpins the services. These security threats
include unauthorized data access, compromised data integrity
and confidentiality, and less direct control over data for the data
owner. The current literature proposes several approaches for
storing and sharing data in cloud environments. However, these
approaches are applicable only to specific data formats or
encryption techniques. In this paper, unlike previous studies, we
introduce a secure and efficient model that allows data owners to
have full control over data sharing in the cloud environment. In
addition, it prevents cloud providers from revealing data to
unauthorized users. The proposed model can be used in different
IT areas, with different data and encryption techniques, to
provide secure data sharing for fixed and mobile computing
devices.

    Keywords- cloud computing; cloud storage; data sharing
model; data access control; data owner full control; cloud storage
as a service; data encryption

                       I.    INTRODUCTION

    Cloud computing is a rapidly growing segment of the IT
industry that will bring new service opportunities with
significant cost reduction and increased operating efficiency for
IT vendors. Cloud computing includes three major models:
Software-as-a-Service, Platform-as-a-Service, and
Infrastructure-as-a-Service [1]. Additional models are evolving
as the concept of cloud computing develops new services such
as Storage-as-a-Service, Communication-as-a-Service, and
Monitoring-as-a-Service. An important characteristic of cloud
computing is pay-per-use [2]: customers pay for cloud services
only when they use them. Several cloud services are available
to the public, such as the Google App Engine [3] and Microsoft
Live Mesh [4]. Storage-as-a-Service, such as the Amazon Simple
Storage Service [5], gives data owners a cost-effective service to
store massive amounts of data and handles efficient routine data
backup by utilizing the vast storage capacity offered by a cloud
computing infrastructure. In addition, it gives customers the
ability to expand and reduce IT resources as needed. However,
with the development of cloud computing, the deployment of IT
systems and data storage is shifted to off-premises third-party IT
infrastructures, i.e., cloud computing platforms. Shifting data
storage to a cloud computing infrastructure introduces several
security threats to data, as cloud providers may have complete
control over the computing infrastructure that underpins the
services. These security threats include unauthorized data
access, compromised data integrity and confidentiality, and less
direct control over data for data owners. To overcome these
threats, we present a secure and efficient model that allows the
data owners to have full control to grant or deny data sharing in
the cloud environment. In addition, the proposed model ensures
data integrity and confidentiality, and prevents cloud providers
from revealing data to unauthorized users. The proposed model
can be used in several applications such as remote file storage,
data publication, on-demand data access, and online
educational programs. Each application can use its own data
format and encryption technique to provide secure data sharing
in the cloud. In addition, the proposed model uses low-
computing-power operations (e.g., symmetric encryption) and a
single authentication step to accept or deny a data access
request. Therefore, it can be used with low-computing-power
devices such as mobile devices. The remainder of this paper is
organized as follows. In Section II, we survey and analyze the
related work. Section III describes the details of our proposed
model, followed by the security analysis in Section IV; finally,
Section V concludes the paper.

                     II.   RELATED WORK

    Deployment of storage as a cloud computing service,
where data storage is shifted to off-premises third-party
infrastructure, introduces special security threats. Therefore,
data owners have to establish the following special security
requirements to safeguard the data in the midst of un-trusted
cloud environments:




A. Ensuring Data Integrity and Confidentiality
    The cloud storage providers should not have the capability
of compromising the integrity and confidentiality of the data
stored in the cloud. Confidentiality means keeping users' data
secret in the cloud systems, while data integrity means
preserving information integrity, i.e., no data loss or
modification by unauthorized users [6].

B. Controlling Data Access and Sharing
   The data owner should be the only authority that grants
data access to authorized users.

C. Authentication
    Authentication is used to verify the claimed identity of the
data owner, user, or other entity [7], such as the cloud provider.

    To meet these security requirements, data owners have to
enforce authorization access policies that prevent revealing
data information to cloud service providers or unauthorized
users. Previous studies proposed several approaches for storing
and sharing data in cloud environments. However, these
approaches are applicable only to specific data formats or
encryption techniques. For example, the model introduced in
[8] applies the publisher policy model presented in [9] to secure
storage of Extensible Markup Language (XML) data in the
cloud by adding a special secure co-processor to the storage
machine, as part of the cloud infrastructure, to enable efficient
encryption of the stored XML documents. Although the
mechanism published in [8] may enforce the owner's policies
on XML documents, the cloud providers have access to the
plain XML data. Reference [10] introduced a model for
securing data sharing on the cloud. In that model, data sharing
is achieved by the cloud provider re-encrypting the data for the
authorized users. Although the model illustrated in [10] can
enforce sharing policies specified by data owners and prevent
unauthorized access to data, the model works only with one
encryption technique (progressive elliptic curve encryption)
and requires the cloud provider to re-encrypt the encrypted data
before forwarding it to authorized users. Reference [11]
introduced a model to outsource very large blocks of data by
encrypting each block of data with a different encryption key.
However, the model published in [11] fails to demonstrate how
a user will ensure data confidentiality after receiving data from
the cloud. In addition, whenever a user's access right is
revoked, the data block group needs to be fragmented and
several data blocks need to be re-encrypted. Our model is more
secure and more efficient than the model presented in [11] and
is immune to eavesdropping attacks since, in our model, a user
is not allowed to communicate with the cloud provider. In
summary, our model gives the data owner full control to grant
or deny data sharing in the cloud using efficient and secure
procedures. In addition, it prevents cloud providers from
revealing data contents to unauthorized users. The proposed
model can be used in several applications (e.g., remote file
storage, data publication, online educational programs), with
different data and encryption techniques, to provide secure data
sharing for both fixed and mobile computing devices.

                  III.   THE PROPOSED MODEL

    In this section, we explain our proposed access model
based on the scenario illustrated in Figure 1 and the notations
listed in Table I. As shown in Figure 1, a data owner, who
stores his encrypted data in the cloud, receives a data access
request from a user. After successfully authenticating the user
and checking the policies relevant to the user, the data owner
sends a control message to the user and a data access permit to
the cloud storage provider. The data access permit has relevant
information that allows the cloud storage provider to apply the
data owner's policy and provide the specific data to the user.
Meanwhile, the control message sent by the data owner will
allow the user to decrypt and authenticate the data that will be
granted by the cloud storage provider. As shown in step 4 in
Figure 1, the user compares the information received from the
data owner with the information received from the cloud
provider. If there is a match, the user is assured that the
received information is valid and authentic.
    In the proposed model, the cloud storage provider has no
knowledge of the data encryption algorithm and decryption
key. This way, data owners keep control over data integrity
and confidentiality in the cloud. Meanwhile, data owners
control user policy access and reveal the relevant information
that grants users access and protects data against any
modification.

      Figure 1. Secure Data Sharing Model with Full Control in the Cloud

                  TABLE I.        MODEL'S NOTATIONS

 Notation     Description                                     Comments
 O-ID         Data owner ID
 C-ID         Cloud storage provider ID
 U-ID         User ID
 D-ID         Shared data ID
 SU           User secret anonymity                          Published by
                                                             data owner
 SC           Cloud provider secret anonymity                Published by
                                                             data owner
 du           Secret encryption key for exchanging           Published by
              messages between data owner and the user       data owner
 dc           Secret encryption key for exchanging           Published by
              messages between data owner and the            data owner
              cloud provider
 XOR          Logical exclusive-or operation
 ks           A one-time session key to be used with the     Generated by
              XOR operation when transferring messages       data owner
              from the cloud provider to the user





 h (.)        A one-way secure hash function such as
              SHA-1
 //           A concatenation operator
 {.}k         Encryption operator using encryption key, k
 EN           Encryption algorithm used for encrypting      Chosen by the
              the shared data                               data owner
                                                            based on data
                                                            type
 ENC{data}    Encrypted data                                Sent by cloud
                                                            provider
 kd           Encryption key used for encrypting the        Chosen by the
              shared data                                   data owner
 h(data)      Hash value of the shared data                 Calculated at
                                                            the data owner

For the execution of the proposed model, the data owner first
needs to complete the following tasks:
a) Issue two secret anonymities, SC and SU, for the cloud
   service provider and the user.
b) Issue two secret symmetric encryption keys, dc and du, for
   the cloud service provider and the user.
c) Use a secure channel, such as Diffie-Hellman key
   agreement [12], to exchange SC and dc with the cloud
   provider, and submit SU and du to the user.

    In addition, we assume that the data owner encrypts the
data with a suitable encryption algorithm, relevant to the data
type, and submits the encrypted data to the cloud service
provider through a secure channel. The proposed model has the
following five steps:

     1. A User Requests Data Access from the Data Owner
    A user who would like to access data, defined by D-ID,
generates a nonce, Nu, and prepares a message m1 = {U-ID //
D-ID // Nu} to be sent to the data owner. The user then sends a
request data access message = {U-ID, {m1 // h (m1 // SU)}du} to
the data owner.

     2. Data Owner Authenticates and Sends a Control
        Message to the User
    Upon receiving the data access request from the user, the
data owner executes the following steps:
a) Decrypt the received message, using the symmetric secret
   key, du (that is relevant to U-ID), and obtain m1 = {U-ID //
   D-ID // Nu} and h (m1 // SU).
b) Verify the format of U-ID and D-ID from the decrypted
   message m1. If there is no match, the data owner terminates
   the connection. Otherwise, the data owner continues.
c) Compute h (m1 // SU) and check whether it equals the
   received h (m1 // SU). If there is a match, the data owner
   determines the authenticity of the user.

    After authenticating the user, the data owner generates a
nonce, Nd, and a one-time session key, ks, and prepares two
special messages, m2 and m3, to be sent to the user and the
cloud provider, respectively. The message m2 = {C-ID // D-ID //
Nu // Nd // EN // kd // h (data) // ks // OP} contains the following
parameters: the cloud provider identification, C-ID; the shared
data identification, D-ID; the message nonces, Nu and Nd; the
encryption algorithm, EN, encryption key, kd, and data hash
value, h (data), that are relevant to the data (D-ID); a one-time
session key, ks; and an optional field, OP. The optional field,
OP, could be used to extend the capability of the proposed
model. For example, the optional field could carry the time at
which the data should be accessed (e.g., for downloading a test
in an online educational program) or a special access policy
that could be related to Mandatory Access Control (MAC) or
Role-Based Access Control (RBAC) [13]. After preparing the
message m2, the data owner sends the control message = {O-ID,
{m2, h (m2 // SU)}du} to the user. Upon receiving the control
message, {O-ID, {m2, h (m2 // SU)}du}, the user will authenticate
and check the integrity of the received message as follows:
a) Decrypt the received message, using the symmetric secret
   key, du, and obtain m2 = {C-ID // D-ID // Nu // Nd // EN // kd
   // h (data) // ks // OP} and h (m2 // SU).
b) Compare the values of D-ID and Nu, obtained from m2, to
   those values sent in message m1. If there is a match, the
   user continues.
c) Compute h (m2 // SU) and check whether it equals the
   received h (m2 // SU). If there is a match, the user
   authenticates the data owner.
d) Keep C-ID, ks, and Nd for processing the cloud provider
   message, m4, in step 5.

     3. Data Owner Sends a Data Access Permit to the Cloud
        Provider
    In addition to sending the control message to the user, the
data owner prepares a message m3 = {D-ID // U-ID // Nu // Nd
// ks // OP} and sends a permit data access message = {O-ID,
{m3 // h (m3 // SC)}dc} to the cloud provider.

     4. Cloud Provider Sends the Encrypted Data to the User
    Upon receiving the grant data access message, {O-ID, {m3
// h (m3 // SC)}dc}, the cloud provider executes the following
steps:
a) Decrypt the received message, using the symmetric secret
   key, dc (that is relevant to O-ID), and obtain m3 = {D-ID //
   U-ID // Nu // Nd // ks // OP} and h (m3 // SC).
b) Verify the format of D-ID from the decrypted message m3.
   If there is no match, the cloud provider terminates the
   connection. Otherwise, the cloud provider continues.
c) Compute h (m3 // SC) and check whether it equals the
   received h (m3 // SC). If there is a match, the cloud
   provider ensures the authenticity of the data owner.
d) Extract ks from m3 and prepare a message m4 = {D-ID //
   U-ID // Nu // Nd // OP // ENC {data}} XOR ks.
e) Send a message = {C-ID, m4 // h (m4 // ks)} to the user
   defined by U-ID, obtained from message m3, as shown in
   Figure 1.
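The message constructions in steps 1 and 4 can be sketched in code. This is a minimal illustrative sketch only: the byte encodings, the use of `//` as a literal separator, the nonce and key sizes, and the omission of the outer encryption under du are assumptions for illustration; the paper leaves the concrete encodings and the cipher EN to the data owner. SHA-1 is used for h(.) as stated in Table I.

```python
import hashlib
import os

def h(data: bytes) -> bytes:
    """One-way hash h(.) from Table I (SHA-1 in the paper)."""
    return hashlib.sha1(data).digest()

def xor_mask(msg: bytes, ks: bytes) -> bytes:
    """XOR a message with a keystream derived by repeating the
    one-time session key ks (step 4d). Applying it twice with the
    same ks recovers the original message."""
    keystream = (ks * (len(msg) // len(ks) + 1))[:len(msg)]
    return bytes(a ^ b for a, b in zip(msg, keystream))

def make_request(u_id: bytes, d_id: bytes, su: bytes):
    """Step 1: build m1 = {U-ID // D-ID // Nu} and the keyed hash
    h(m1 // SU). The outer encryption under du is omitted here."""
    nu = os.urandom(16)  # nonce Nu (size is an assumption)
    m1 = b"//".join([u_id, d_id, nu])
    return m1, h(m1 + b"//" + su), nu

def make_provider_reply(d_id, u_id, nu, nd, enc_data, ks):
    """Step 4d/4e: mask {D-ID // U-ID // Nu // Nd // ENC{data}}
    with ks and append h(m4 // ks)."""
    m4_plain = b"//".join([d_id, u_id, nu, nd, enc_data])
    m4 = xor_mask(m4_plain, ks)
    return m4, h(m4 + b"//" + ks)
```

Because the XOR mask is an involution, the user in step 5 recovers m4 simply by applying `xor_mask` again with the ks received from the data owner in m2, then checks the keyed hash before decrypting ENC{data} with kd.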




    5. User Verifies the Received Data from the Cloud Provider

    Upon receiving the message {C-ID, m4 // h (m4 // ks)} from the cloud
provider, the user retrieves the one-time session key, ks, received from the
data owner in m2, and executes the following steps:

a) Compute m4 XOR ks and obtain m4 = {D-ID, U-ID // Nu //
   Nd // OP // ENC {data}}.
b) Compute h (m4 // ks) and compare it with the received h (m4
   // ks). If there is a match, the user continues.
c) Compare the values of C-ID, D-ID, Nu, and Nd received from the
   cloud provider to the values obtained in message m2 from the data
   owner. If there is a match, the user authenticates the received message.
d) Decode the received encrypted data, ENC {data}, with the
   encoding key, kd, received from the data owner in m2.
e) Compute h (data) and compare it with the h (data) obtained
   from the data owner in message m2. If there is a match, the user
   is assured of the integrity and confidentiality of the received data.

          IV.    SECURITY ANALYSIS OF THE PROPOSED MODEL

    This section illustrates how the proposed model achieves the security
requirements for storing data in cloud environments and how it offers
enhanced resiliency to security threats.

A. Security Requirements Achieved

    1) Ensuring data integrity and confidentiality
    In the proposed model, since the data is stored on the cloud in encrypted
form and the data owner keeps the encryption key and algorithm information,
the cloud storage provider is not capable of compromising the integrity or
confidentiality of the data stored in the cloud infrastructure.

    2) Controlling data access and sharing
    In the proposed model, since the data owner is the only authority that
authenticates users and issues the data encryption information (algorithm and
key) to authorized users, cloud providers cannot grant data access to
unauthorized users.

    3) Authentication
    Authentication is the act of establishing or confirming that claims made
by or about a subject are true and authentic [14]. In the proposed model,
authentication is achieved by using a hash code that incorporates a secret
anonymity, SU or SC, and is encrypted with a secret encryption key (du or
dc), as shown in Figure 1. For example, the data owner appends the secret
user anonymity, SU, to the exchanged message, m2, before computing its hash
code, h (m2 // SU). The data owner then encrypts the exchanged message,
{m2 // h (m2 // SU)}, with the secret symmetric key, du, and sends it to the
user.

B. Resilience Against Security Threats
    This subsection shows how the proposed model is resilient to security
threats such as unauthorized data access, information disclosure during
sharing, and other security attacks.

    1) Unauthorized data access attack
    Since data owners keep the encryption information (key and algorithm) and
check the identity of users, unauthorized data access is not possible in our
model. In general, unauthorized data access attacks occur by one of the
following methods:
    1. The attacker acquires data from the cloud storage provider. In our
model, the user does not initiate any messages with the cloud provider to
gain data access. Even if the cloud provider sends data to an unauthorized
user, that user cannot decrypt the received message, since the encryption
information (key and algorithm) is known neither to unauthorized users nor to
the cloud providers. Therefore, it is not possible for unauthorized users to
learn the encryption information without the help of the data owner.
    2. The attacker acquires data access from the data owner. To get data
access permission from the data owner, the attacker must know the user
anonymity, SU, and the encryption key, du. It is not feasible for the
attacker to guess both parameters and access the data.

    2) Information disclosure during sharing attack
    Since the data is always in encrypted form, it cannot be decrypted before
it is delivered to authorized users. This ensures that the sharing process
discloses no information to cloud providers or unauthorized users. To acquire
data during sharing, an attacker must have the decryption key and algorithm.
Since this information is kept by the data owner, cloud storage providers and
unauthorized users cannot decrypt the data.

    3) Data owner/user's identity guessing attack
    As shown in Figure 1 and Figure 2, the user/data owner appends a secret
user anonymity to the exchanged message (m1/m2) before computing its hash
code, and then encrypts the exchanged message with the secret symmetric key,
du. Both secrets (SU and du) are known only to the data owner and the
authorized user. At the receiving side, the data owner/user decrypts the
message and appends the same secret anonymity, SU, to the message before
calculating its hash code to check the message's authenticity. Since the hash
code provides authentication and the encryption provides confidentiality for
the messages exchanged between the data owner and the user, an adversary
cannot guess the user's anonymity from the exchanged messages and therefore
cannot imitate a user's identity to create a new data access request.
Similarly, an adversary cannot imitate a data owner and send fake data access
grants to a user.
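The verification steps a) to e) above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes byte-string messages, uses SHA-256 to stand in for the unspecified hash function h, and uses a toy XOR keystream to stand in for both the session encryption with ks and the data encoding with kd; the field layout of m4 is hypothetical.

```python
import hashlib

def xor_bytes(data: bytes, key: bytes) -> bytes:
    # Toy XOR keystream: a placeholder for the model's "m4 XOR ks"
    # operation and for the encoding with kd (both unspecified in the paper).
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def h(*parts: bytes) -> bytes:
    # SHA-256 over the concatenation, standing in for h(x // y).
    return hashlib.sha256(b"".join(parts)).digest()

def verify_m4(enc_m4: bytes, recv_tag: bytes, ks: bytes,
              expected_ids: bytes, kd: bytes,
              expected_data_hash: bytes) -> bytes:
    # a) recover m4 = enc_m4 XOR ks
    m4 = xor_bytes(enc_m4, ks)
    # b) recompute h(m4 // ks) and compare with the received hash code
    if h(m4, ks) != recv_tag:
        raise ValueError("hash mismatch: message not authentic")
    # c) compare the identifiers and nonces with the values from m2
    ids, enc_data = m4[:len(expected_ids)], m4[len(expected_ids):]
    if ids != expected_ids:
        raise ValueError("identifier/nonce mismatch")
    # d) decode ENC{data} with the encoding key kd
    data = xor_bytes(enc_data, kd)
    # e) check h(data) against the value received from the owner in m2
    if h(data) != expected_data_hash:
        raise ValueError("data integrity check failed")
    return data
```

Any failed comparison aborts the procedure, mirroring the "if there is a match, the user continues" wording of the steps.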
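The hash-then-encrypt construction just described can be sketched as follows; SHA-256 stands in for the unspecified hash h, and a toy XOR keystream stands in for the symmetric cipher keyed with du (labels and key handling are illustrative assumptions, not the paper's implementation).

```python
import hashlib

def xor_keystream(data: bytes, key: bytes) -> bytes:
    # Placeholder for the secret symmetric cipher keyed with du.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def protect(m2: bytes, su: bytes, du: bytes) -> bytes:
    """Owner side: append h(m2 // SU), then encrypt {m2 // h(m2 // SU)} with du."""
    tag = hashlib.sha256(m2 + su).digest()
    return xor_keystream(m2 + tag, du)

def unprotect(blob: bytes, su: bytes, du: bytes) -> bytes:
    """User side: decrypt with du, split off the 32-byte tag, recompute it."""
    plain = xor_keystream(blob, du)
    m2, tag = plain[:-32], plain[-32:]
    if hashlib.sha256(m2 + su).digest() != tag:
        raise ValueError("authentication failed: wrong SU or tampered message")
    return m2
```

Only a party that knows both secrets (SU and du) can produce a message that passes `unprotect`, which is exactly the authentication property the model relies on.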





   Figure 2. Securing transmission between the data owner and the user

    4) Cloud provider's identity guessing attack
    As shown in Figure 1 and Figure 3, the data owner uses the cloud
provider's anonymity, SC, and encryption key, dc, to provide authentication
(by hash code) and confidentiality (by encryption) when sending messages to
the cloud provider. Therefore, an adversary cannot guess the cloud provider's
anonymity from the exchanged messages. Similarly, an adversary cannot imitate
a data owner and send fake data access permit messages, m3, to the cloud
provider.

   Figure 3. Securing transmission between the data owner and the cloud
                              service provider

    5) Impersonation attack
    An impersonation attack involves an adversary who attempts to impersonate
a data owner, a user, or a cloud provider.
    a) An adversary cannot imitate a data owner to grant a user data access
       without knowing the user secrets (SU, du), the cloud provider secrets
       (SC, dc), and the data encryption information (encryption algorithm
       and data encryption key).
    b) Without knowing the secrets (SU, du), an adversary cannot imitate a
       user to decrypt the message m2 and then gain data access.
    c) Since the cloud provider does not know the data encryption algorithm,
       EN, the data encryption key, kd, or the message encryption key, ks
       (issued by the data owner to the authorized user), an adversary cannot
       imitate a cloud provider to supply users with fake data.

    6) Replay attack
    A replay attack is a method in which an adversary tries to replay
messages obtained in previous communications. For example, an adversary might
replay the used message m1 to the data owner, requesting data access, and
then receive the message m2 from the data owner. However, the adversary
cannot derive correct data information (data ID, data encryption algorithm,
and data encryption key) from m2, since he or she cannot decrypt m2 without
knowing the secrets SU and du. In addition, the adversary will not be able to
decrypt m4, received from the cloud service provider, since he or she cannot
reveal the one-time encryption key, ks, issued by the data owner in
message m2.

                          V.     CONCLUSION
    This paper has introduced a secure and efficient model that gives the
data owner full control to grant or deny data sharing in the cloud
environment. In addition, it prevents cloud providers from revealing data to
unauthorized users. The proposed model can be used in several applications,
such as remote file storage, data publication, on-demand music access, and
online educational programs. Each application can use its own data format and
encryption technique to provide secure data sharing in the cloud. Moreover,
since the proposed model uses low computing power (e.g., symmetric
encryption) and a single authentication step to accept or deny a data access
request, it can be used with mobile as well as fixed devices. Security
analysis has demonstrated that the proposed model meets cloud security
requirements and is resilient to several security threats.

                             REFERENCES
[1]  T. Sridhar, "Cloud computing - a primer, part 1: models and
     technologies," The Internet Protocol Journal, vol. 12 (3), pp. 2-19,
     September 2009.
[2]  J. W. Rittinghouse and J. F. Ransome, "Cloud Computing: Implementation,
     Management, and Security," CRC Press, Boca Raton, 2010.
[3]  Google Inc., "Google App Engine," 2011, retrieved in March 2011 from
     http://appengine.google.com
[4]  Microsoft Inc., "Microsoft Live Mesh," 2011, retrieved in March 2011
     from http://www.mesh.com
[5]  Amazon Inc., "Simple Storage Service," 2011, retrieved in March 2011
     from http://aws.amazon.com/s3
[6]  M. Zhou, R. Zhang, W. Xie, W. Qian, and A. Zhou, "Security and privacy
     in cloud computing: a survey," Sixth International Conference on
     Semantics, Knowledge and Grids, pp. 105-112, 2010.
[7]  C. Kaufman, R. Perlman, and M. Speciner, "Network Security: Private
     Communication in a Public World," Upper Saddle River, New Jersey:
     Prentice Hall Press, 2002.
[8]  K. Hamlen, M. Kantarcioglu, L. Khan, and B. Thuraisingham, "Security
     issues for cloud computing," International Journal of Information
     Security and Privacy, vol. 4 (2), pp. 39-51, 2010.
[9]  E. Bertino, B. Carminati, E. Ferrari, B. Thuraisingham, and A. Gupta,
     "Selective and authentic third party distribution of XML documents,"
     IEEE Transactions on Knowledge and Data Engineering, vol. 16 (10),
     pp. 1263-1278, 2004.
[10] G. Zhao, C. Rong, J. Li, F. Zhang, and Y. Tang, "Trusted data sharing
     over untrusted cloud storage providers," 2nd IEEE International
     Conference on Cloud Computing Technology and Science, pp. 97-103, 2010.
[11] W. Wan and Z. Li, "Secure and efficient access to outsourced data,"
     16th ACM Conference on Computer and Communications Security, 2009.
[12] W. Diffie and M. Hellman, "New directions in cryptography," IEEE
     Transactions on Information Theory, vol. 22 (6), pp. 644-654, 1976.
[13] M. Ciampa, "Security+ Guide to Network Security Fundamentals," Boston,
     MA: Course Technology, Cengage Learning, 2009.
[14] R. Zhang and L. Liu, "Security models and requirements for healthcare
     application clouds," IEEE 3rd International Conference on Cloud
     Computing, 2010.
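The model's replay resistance rests on the adversary's inability to decrypt m2 and m4. A receiver can additionally detect replays of m1 with a generic nonce-freshness check; the sketch below is our own illustration of that standard technique, not something specified in the paper, and all names are hypothetical.

```python
import os

class NonceRegistry:
    """Rejects replayed messages by remembering nonces already accepted."""

    def __init__(self):
        self._seen = set()

    def fresh(self) -> bytes:
        # Generate a new random nonce, e.g. Nu or Nd in the model's messages.
        return os.urandom(16)

    def accept(self, nonce: bytes) -> bool:
        # A nonce seen before indicates a replayed message.
        if nonce in self._seen:
            return False
        self._seen.add(nonce)
        return True
```

A replayed m1 would carry a nonce the data owner has already recorded and would be rejected before any m2 is issued.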






                              AUTHORS PROFILE
Mohamed Meky is an IT professional with a unique combination of teaching,
research, leadership, and industrial experience. He has published several
articles, developed many courses, and led various industrial projects in the
IT field. His current research interest is in the security area.


Amjad Ali is the Director of the Center for Security Studies and a Professor
of Cybersecurity at University of Maryland University College. He played a
significant role in the design and launch of UMUC's cybersecurity programs.
He teaches graduate-level courses in the areas of cybersecurity and
technology management. He has served as a panelist and presenter at major
conferences and seminars on the topics of cybersecurity and innovation
management. In addition, he has published articles in the cybersecurity area.





        Performance Evaluation of Inter-System Handover
         between IEEE 802.16e and IEEE 802.11 Networks
                                Abderrezak Djemai1, Mourad Hadjila2, Mohammed Feham3
                                     STIC laboratory, University of Tlemcen, Algeria
                      1
                       djemai_tlm@yahoo.fr, 2mhadjila_2002@yahoo.fr, 3m_feham@mail.univ-tlemcen.dz


Abstract— This article presents the mechanisms to be implemented for
analyzing the performance of inter-system handover between WiFi and WiMAX
networks. A handover entity is essential so that a mobile terminal supporting
both technologies can perform heterogeneous transfers. In this paper, we
propose the development of a software platform able to manage the
interoperability between WiMAX and WiFi with uninterrupted communication.

   Keywords- Networks, Wireless, WiFi, WiMAX, Handover, Packets.

                      I.    INTRODUCTION
    Wireless data networks have experienced an explosion since the end of the
1990s as a means of connecting to the Internet. The wireless environment
differs from the world of wired networks in many respects, particularly at
the lower layers of the communication stack, namely the physical and data
link layers.

    Routing data to and from wireless mobile equipment is a crucial problem,
especially between two different networks. Interruptions of communication can
make applications unusable or hard to follow (for example, in the case of a
videoconference). This work therefore consists in defining new protocols and
network mechanisms to minimize or suppress interruption times.

    The last decade was marked by the emergence of many wireless technologies
such as Bluetooth (IEEE 802.15) and WiFi (Wireless Fidelity, IEEE 802.11).

    The most recent technology undergoing great development in the field of
wireless transmission is WiMAX (Worldwide Interoperability for Microwave
Access) [1]. Introduced in June 2001, WiMAX is now the most requested
broadband access network thanks to its improved data rate and range.

    The remainder of this paper is organized as follows: Section II presents
a brief description of the WiFi and WiMAX technologies. Section III is
devoted to the concepts of WiFi-to-WiMAX and WiMAX-to-WiFi handover. Section
IV presents the simulation results, and finally we conclude the paper.

           II.      WIFI AND WIMAX TECHNOLOGIES DESCRIPTION

A. WiFi Overview
    WiFi is a high-rate wireless transmission technology used to connect
laptops or any type of peripheral within a range of several tens of meters in
indoor applications to several hundreds of meters in open space.

    WiFi networks offer a multitude of functionalities inherited from the
fixed and mobile communications world. These functionalities make them more
reliable and allow them to provide several services to users.

    The principal functionalities of a WiFi network are:

          •    Fragmentation and re-assembly, which avoid the problems of
               transmitting large volumes of data and thus decrease the
               error rate.
          •    Mobility management.
          •    Variation of the transmission rate according to the radio
               environment.
          •    Assurance of a good quality of service.

    Figure 1 illustrates the WiFi network topology.

                    Figure 1. WiFi network topology

B. WiMAX Overview
    WiMAX (Worldwide Interoperability for Microwave Access) is a radio
solution for WMAN networks. It is based on the IEEE 802.16 standard,
validated in 2001 by the IEEE international standardization body.

    The initial version of the standard works in the 10-66 GHz band and
requires a line of sight (LOS) between the transmitter and the receiver.
However, the 802.16a extension works in the 2-11 GHz band, is better adapted to the




regulations, and allows transmission in non-line-of-sight (NLOS) conditions.

    WiMAX is an alternative to wired broadband technologies, reinforcing the
connection in terms of capacity, rate, and coverage. Its transmission
capacity is theoretically 70 Mbps over a range of 50 km. In practice, it
allows a transmission rate of 10 Mbps over a range of 20 km.

    Figure 2 shows the WiMAX network architecture.

                  Figure 2. WiMAX network architecture

C. Comparison between WiMAX and WiFi
    Table I recapitulates the differences between the WiFi and WiMAX
technologies.

       TABLE I.    TECHNICAL SPECIFICITIES OF THE WIFI AND WIMAX
                            TECHNOLOGIES

 Range
    WiFi 802.11: about 300 meters maximum.
    WiMAX 802.16: up to 45 km (cells of 5 to 10 km).
    Difference: the 802.16 physical layer tolerates delays (reflections)
    through the implementation of a 256-point FFT (Fast Fourier Transform),
    as against 64 points for 802.11.

 Coverage
    WiFi 802.11: optimized for short range, indoors.
    WiMAX 802.16: long range, optimized for outdoor use.
    Difference: 802.16 has better penetration through obstacles at longer
    distances.

 Adaptability
    WiFi 802.11: designed for LANs and a dozen users; fixed band sizes
    (20 MHz).
    WiMAX 802.16: designed to support up to 100 users; band sizes varying
    from 15 to 20 MHz.
    Difference: the 802.11 MAC protocol uses CSMA/CA while 802.16 uses TDMA;
    802.16 can use all the available frequencies whereas 802.11 is limited.

 Bit rate
    WiFi 802.11: 2.7 bps/Hz, or up to 54 Mbps in 20 MHz.
    WiMAX 802.16: 5 bps/Hz, or up to 100 Mbps in 20 MHz.
    Difference: higher frequency coupled with error correction, providing
    better use of the spectrum.

 Quality of Service (QoS)
    WiFi 802.11: quality of service support (802.11e).
    WiMAX 802.16: integrated in the MAC at different layers.
    Difference: 802.11 avoids collisions of messages via CSMA/CA; 802.16
    uses the same frequency but spread over time (TDMA).

                          III.     HANDOVER
    The handover [2] is the mechanism which ensures the continuity of the
connection of an MSS (Mobile Subscriber Station) as it moves from the
coverage area of one Base Station (BS) to another.

    The 802.16e standard supports three types of handover:

      • Hard Handover,
      • MDHO (Macro Diversity Handover),
      • FBSS (Fast Base Station Switching).

    The Hard Handover is mandatory, while the other two are optional.

A. Hard Handover
    During a Hard Handover, the MSS communicates with only one BS at a time.
The link with the old BS is released before the new one is established. The
handover is carried out from the moment the signal of the neighboring cell
becomes stronger than that of the current BS.

    Figure 3 shows the Hard Handover execution.

                   Figure 3. Hard Handover Execution

B. Macro Diversity Handover (MDHO)
    When Macro Diversity Handover [3] is supported by the MSS and the BS, the
diversity set is maintained at both the MSS and the BS. The diversity set is
the list of base stations participating in the handover procedure whose
signal level is higher than a certain threshold. This list is defined for
each MSS associated with the network. During Macro Diversity Handover, the
MSS taking part in the handover procedure communicates with all the base
stations belonging to the diversity set. In the downlink direction, two or
more base stations transmit data to the MSS, which creates diversity in
reception. In the uplink direction, transmissions from the MSS are received
by several base stations.

    The following figure illustrates the architecture of Macro Diversity
Handover.
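The hard-handover trigger just described (switch once a neighboring cell's signal exceeds the current BS's) can be sketched as follows, assuming per-BS signal measurements are available. The hysteresis margin is our addition, a common practice to avoid ping-pong handovers; the paper only requires the neighbor to be stronger, and all names are illustrative.

```python
def hard_handover_target(current_bs: str, current_rssi: float,
                         neighbours: dict, hysteresis_db: float = 3.0):
    """Return the BS to hand over to, or None to stay on the current BS.

    neighbours maps BS identifiers to measured signal levels (e.g. RSSI
    in dBm). A neighbour must exceed the current signal by the hysteresis
    margin before a (break-before-make) hard handover is triggered.
    """
    best_bs, best_rssi = None, current_rssi + hysteresis_db
    for bs, rssi in neighbours.items():
        if rssi > best_rssi:
            best_bs, best_rssi = bs, rssi
    return best_bs
```

With a 3 dB margin, a neighbor at -70 dBm triggers a handover away from a serving cell at -80 dBm, while one at -79 dBm does not.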
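The diversity set described above, i.e. the base stations whose signal level for a given MSS exceeds a threshold, can be computed with a short sketch; station names, units, and the threshold value are illustrative.

```python
def diversity_set(measurements: dict, threshold_dbm: float) -> set:
    """Base stations whose measured signal level exceeds the threshold.

    measurements maps BS identifiers to signal levels (e.g. in dBm) as
    measured by one MSS; the result is the set of stations that take part
    in the MDHO procedure for that MSS.
    """
    return {bs for bs, level in measurements.items() if level > threshold_dbm}
```

During MDHO, every station in this set transmits to the MSS on the downlink and receives its uplink transmissions.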



                 Figure 4.   Macro Diversity Handover

C. Fast Base Station Switching (FBSS)
    The FBSS principle is broadly similar to that of MDHO, in that it also relies on a diversity set. The difference is that here the mobile subscriber station (MSS) chooses one base station from the diversity set to become its principal base station. The principal base station is the only one with which the MSS exchanges traffic, including management messages, on both the uplink and the downlink. It is also the BS with which the MSS is registered, synchronized, and monitored on the downlink. However, the MSS may change its principal base station with each transmitted frame, as shown in Figure 5.

                 Figure 5. Fast Base Station Switching

           IV.    NECESSARY SIMULATION MODULES
    Neighbor Discovery (ND), the Media Independent Handover (MIH) module, and the mobility management module (MIPv6) are the key elements used in the simulation code.

A. Neighbor Discovery (ND)
    The ND module provides layer-3 movement detection. In the network, the BS periodically sends Router Advertisement (RA) messages to inform the Mobile Nodes (MNs) of the network prefix. The ND agent located in the MN receives these RAs and determines whether a message contains a new prefix, in which case it informs the interface manager. A timer is associated with each prefix; when the prefix expires, a notification is sent to the interface manager. The implementation also supports Router Solicitation (RS), which makes it possible for an MN to discover a new BS after a handover.

B. Media Independent Handover (MIH, IEEE 802.21)
    Performing a handover between heterogeneous access networks transparently from the point of view of the mobile user (without interruption or degradation) requires taking into account concepts such as service continuity, quality of service, and network discovery and selection [4], [5].
    To this end, the IEEE 802.21 working group created a basic architecture that defines a Media Independent Handover Function (MIHF), which helps mobile systems carry out a handover without service interruption between heterogeneous networks such as IEEE 802.3 (wired LAN), IEEE 802.11x (wireless LAN), IEEE 802.16e (mobile WiMAX), and GPRS/UMTS (3G mobile networks).
    The IEEE 802.21 standard [4] specifies an architecture that enables transparent service continuity when the mobile terminal (MN) moves between two heterogeneous networks at the data link level.
    A set of handover-optimization functions is defined in the mobility management protocol stack (MME, Mobility Management Entity) of the network elements, and a new entity called the MIHF (Media Independent Handover Function) is introduced. It operates at layer 3 and can communicate between local and remote interfaces, which may be in contact through another MIHF. This is illustrated in Figure 6.

           Figure 6. Overall picture of the MIH design [6], [7]

C. Mobility Management Module (MIPv6)
    MIPv6 describes mobility management for IPv6 terminals. This mobility allows an IPv6 terminal to remain reachable whatever its location in the Internet, and to keep its connections active in spite of its movements.
    Figure 7 presents the actors involved in this mechanism.



                                                                      20                                  http://sites.google.com/site/ijcsis/
                                                                                                          ISSN 1947-5500
  • The Mobile Node (MN): the IPv6 terminal that can move.
  • The Home Agent (HA): a piece of network equipment that manages mobility, in the manner of an HLR in cellular networks.
  • The Correspondent Node (CN): an IPv6 terminal with which the MN has, or will have, an active connection.

   Two types of networks can host the MN:
  • The home network is the MN's network of origin, in which it is reachable through its home address (HoA).
  • A visited network is a network into which the MN moves. On arrival in such a network, the MN obtains, thanks to the IPv6 address autoconfiguration mechanism [8], [9], a topologically correct IPv6 address called the Care-of Address (CoA).

   The basic principle of Mobile IPv6 is that the MN is always reachable through its home address, whether it is in its home network or in a visited network.
    If the MN is in its home network, packets are routed in the standard way, based on the routers' tables; the MN is then nothing more than a "fixed" IPv6 terminal.
    If the MN moves to a visited network, it obtains a care-of address on that network, i.e. an address belonging to the prefix used on that link. It registers its new position with the Home Agent by means of a Binding Update (BU) message, carrying both its home address and its care-of address, and waits for a confirmation in the form of a Binding Acknowledgment (BA) message. The Home Agent then plays the role of a proxy: it intercepts all packets intended for the home address and redirects them towards the MN's new position, i.e. its primary care-of address.
    The MN also announces its new position to the correspondent with which it was communicating, again through BU and BA messages, in order to optimize the communications: packets are no longer sent to the home address and then forwarded by the Home Agent towards the primary care-of address, but are sent directly from the correspondent terminal to the mobile node.

              Figure 7. Basic mechanism of IPv6 mobility

    Figure 8 shows the optimization of the routing between the correspondent and the mobile. If another correspondent CN wants to communicate with the MN, it sends its first packet to the MN's home address, where the HA plays its proxy role and forwards the packet towards the MN. After receiving a forwarded packet, the MN can choose to announce its current location to the correspondent, thus allowing direct communication between CN and MN.

  Figure 8. Optimization of the routing between the Correspondent and the Mobile

                V.     SIMULATION AND RESULTS
    The results shown in this part were obtained with the NS2 simulator. NS2 is a software tool for simulating computer networks and has become a de facto reference in this field. It runs under both Unix and Windows. The simulator consists of an application programming interface in TCL and a core written in C++, in which most network protocols are implemented.

A. Scenario of Simulation
    In this part we consider a simple topology including a multi-interface node supporting two technologies, WiFi and WiMAX. The mobile node (MN) establishes a connection with the CN (Correspondent Node).
    We suppose that the MN initially uses the WiMAX interface; the traffic is switched to the WiFi interface when it becomes available.
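During such an interface switch, the ND and MIPv6 modules described in Section IV cooperate: a Router Advertisement carrying a new prefix signals movement, the MN registers its new care-of address with its Home Agent through a BU/BA exchange, and the Home Agent then redirects traffic. The chain can be sketched as follows in illustrative Python (class and message names such as `HomeAgent` and the `prefix + "::mn"` care-of address format are hypothetical, not the NS2 module API):

```python
class HomeAgent:
    """Minimal sketch of the MIPv6 Home Agent's binding and proxy roles."""
    def __init__(self):
        self.binding_cache = {}          # home address -> care-of address

    def binding_update(self, hoa, coa):
        self.binding_cache[hoa] = coa    # register the MN's new position (BU)
        return "BA"                      # confirm with a Binding Acknowledgment

    def forward(self, dst_hoa):
        # Proxy role: packets for the home address are redirected to the
        # registered care-of address; with no binding, standard routing applies.
        return self.binding_cache.get(dst_hoa, dst_hoa)

class MobileNode:
    """Minimal sketch of ND-driven movement detection on the MN side."""
    def __init__(self, hoa, home_agent):
        self.hoa, self.home_agent = hoa, home_agent
        self.known_prefixes = set()

    def on_router_advertisement(self, prefix):
        # A previously unseen prefix indicates a new point of attachment,
        # so the MN forms a care-of address and sends a Binding Update.
        if prefix not in self.known_prefixes:
            self.known_prefixes.add(prefix)
            coa = prefix + "::mn"        # hypothetical care-of address format
            return self.home_agent.binding_update(self.hoa, coa)

ha = HomeAgent()
mn = MobileNode("home::mn", ha)
print(ha.forward("home::mn"))              # home::mn (at home: standard routing)
print(mn.on_router_advertisement("wifi"))  # BA (new prefix -> BU/BA exchange)
print(ha.forward("home::mn"))              # wifi::mn (HA now proxies to the CoA)
```

A repeated RA for the same prefix triggers no new registration, mirroring the prefix-timer refresh behaviour of the ND module.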




    Figure 9 shows the essential components of our scenario:
      •   Router 0 (CN)
      •   Router 1 (Gateway)
      •   WiMAX base station (BS 802.16)
      •   WiFi access point (AP 802.11)
      •   Mobile node (MN)

          Figure 9. Topology of the scenario (2000 m x 2000 m)

B. Parameter Setting and Configuration of the Networks
    Before the simulator can be used, the network topology and the requirements of each node must be described in a TCL file, which is then read by the simulator. The parameters and configurations defined in this file are the following:

  1) Simulation parameters: Table II gives the configuration of the simulation parameters.

              TABLE II.       SIMULATION PARAMETERS

    Parameters           Signification
    Trafic_start         = 05 s: traffic start
    Trafic_stop          = 70 s: traffic end
    Simulation_stop      = 70 s: simulation end
    Seed                 RNG (Random Number Generator) fixed at 1 for all simulated scenarios

  2) WiMAX network parameters: Table III describes the configuration of the WiMAX network parameters.

              TABLE III.       WIMAX NETWORK PARAMETERS

    Parameters                    Signification
    Channel/WirelessChannel       Channel type: wireless
    Propagation/TwoRayGround      Radio propagation model (802.16)
    Phy/WirelessPhy/OFDM          Network interface type (802.16)
    Mac/802_16                    MAC layer type (802.16)
    Queue/DropTail/PriQueue       Queue interface type
    LL                            Link layer type
    Antenna/OmniAntenna           Antenna model
    50                            Maximum queue size
    adhocrouting                  Routing protocol used, in this case DSDV

    Table IV gives the WiMAX base station parameters.

              TABLE IV.        PARAMETERS OF THE WIMAX BASE STATION

    Parameters               Signification
    WiMAX cell coverage      1000 m
    Pt                       0.025 W
    RXThresh                 1.26562 x 10^-13 W
    CSThresh                 0.8 x [1.26562 x 10^-13] W

  3) WiFi network parameters: Table V describes the configuration of the WiFi network parameters.

              TABLE V.        WIFI NETWORK PARAMETERS

    Parameters                    Signification
    Channel/WirelessChannel       Channel type: wireless
    Propagation/TwoRayGround      Radio propagation model (802.11)
    Phy/WirelessPhy               Network interface type (802.11)
    Mac/802_11                    MAC layer type (802.11)

    Table VI gives the configuration of the WiFi access point.

              TABLE VI.        PARAMETERS OF THE WIFI ACCESS POINT

    Parameters         Signification
    WiFi coverage      20 m
    Pt                 0.025 W
    freq               2.412 GHz
    RXThresh           6.12277 x 10^-9 W
    CSThresh           0.9 x [6.12277 x 10^-9] W

C. Performance Evaluation of the Handover
    This part contains the results of the simulated scenarios and an analysis of the influence of the metric used on the execution of the vertical handover between WiFi and WiMAX in both directions: WFWXHO (handover from WiFi towards WiMAX) and WXWFHO (handover from WiMAX towards WiFi).
    The metric is the packet loss rate, given by:

    Packet Loss Rate = (number of lost packets) / (total number of generated packets)        (1)
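The loss-rate metric of equation (1) is computed from the counts of generated and delivered packets. A minimal illustrative Python sketch (the per-event trace here is a simplified toy, not the actual NS2 trace format):

```python
def packet_loss_rate(lost: int, generated: int) -> float:
    """Equation (1): lost packets over total generated packets."""
    return lost / generated if generated else 0.0

# Toy per-event trace: 's' = packet generated (sent), 'r' = packet delivered.
events = ["s", "r", "s", "r", "s", "s", "r"]   # 4 sent, 3 received -> 1 lost
generated = events.count("s")
lost = generated - events.count("r")
print(packet_loss_rate(lost, generated))        # 0.25
```

In the simulations below, the same counting is performed over the trace files produced by NS2 for each scenario.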




  1) Handover WiFi-WiMAX (WFWXHO): The simulated scenario transfers traffic between Router 0 (CN) and the MN, which moves linearly from the WiFi network to the WiMAX network.
    Figure 10 shows the simulation model of the WiFi-WiMAX handover.

              Figure 10. Simulation model (WFWXHO)

    Figure 11 depicts the evolution of the packet loss rate as a function of the simulation time.

          Figure 11. Evolution of the packet loss rate (WFWXHO)

    From this figure we deduce that:
  • For short simulation times, the number of lost packets is very high.
  • As the simulation time increases, the number of lost packets falls. This means that when the MN moves from its home network (WiFi) towards a visited network (WiMAX), it first communicates with its Home Agent (through BU and BA messages) to register its new position and thus ensure the redirection of packets towards it.
  • Examining the generated trace files, we find that the packet losses are due to the time needed to establish the new location, during which the mobile no longer receives packets from the old base station.

  2) Handover WiMAX-WiFi (WXWFHO): In this part, we suppose that the mobile is initially connected to the WiMAX network; when it leaves the coverage area, the traffic is switched to the WiFi interface.
    Figure 12 illustrates the simulation model of the WiMAX-WiFi handover.

              Figure 12. Simulation model (WXWFHO)

    Figure 13 shows the evolution of the packet loss rate as a function of the simulation time.

          Figure 13. Evolution of the packet loss rate (WXWFHO)

  3) Comparative Study: In this part, we compare the two handover types, WFWXHO and WXWFHO, with respect to the simulation time and the speed of the MN.
    Figure 14 presents the evolution of the packet loss rate as a function of the simulation time.




  Figure 14. Comparison between WFWXHO and WXWFHO according to time

    For the various simulation times, the number of packets lost during the WFWX handover is higher than during the WXWF handover.
    Figure 15 shows the evolution of the packet loss rate as a function of the speed of the mobile.

  Figure 15. Comparison between WFWXHO and WXWFHO according to speed

    When the mobility is low (10-11 m/s), the number of lost packets in the WiFi-WiMAX scenario is lower than in the WiMAX-WiFi scenario, but the opposite holds as the speed increases.

                      VI.   CONCLUSION
    In this paper, we used the packet loss rate to evaluate the performance of the inter-system handover between the WiFi and WiMAX wireless networks in both directions, WFWXHO and WXWFHO.
    The implementation of modules such as MIH (developed by the IEEE 802.21 group) and MIPv6 (mobility management) is necessary to support the vertical handover.
    The IEEE 802.21 standard is still under development; it aims to provide handover support and to ensure interoperability between heterogeneous networks.
    Handover for Mobile IPv6 (MIPv6) has been accepted as a reasonably effective handover solution for UDP-type applications such as voice, addressing the problem of lost packets.
    Following the analysis of these practical challenges, a development architecture was proposed in order to simulate a scenario supporting various types of applications between a WiFi access point and an 802.16e access point.
    As a perspective for this work, it would be interesting to consider other simulation scenarios, which could illustrate the effect of the load of the mobile nodes on the performance of the vertical handover between WiFi and WiMAX, or other types of applications such as FTP and TELNET.

                             REFERENCES
[1]   J. G. Andrews, A. Ghosh, R. Muhamed, "Fundamentals of WiMAX: Understanding Broadband Wireless Networking," Prentice Hall PTR, Feb. 2007.
[2]   S. Choi, G.-H. Hwang, T. Kwon, A.-R. Lim, and D.-H. Cho, "Fast Handover Scheme for Real-Time Downlink Services in IEEE 802.16e BWA System," May 2005.
[3]   Institute of Electrical and Electronics Engineers, IEEE Standard for Local and Metropolitan Area Networks, Part 21: Media Independent Handover, IEEE Std 802.21-2008, vol. 21, 2009, pp. C1-C301.
[4]   C. Baudoin, R. Dhaou, F. Arnal, M. Salhani, A.-L. Beylot, "Analyse d'applicabilité de standards de télécoms terrestres aux systèmes de télécommunications par satellite, scénario 4G," Rapport de contrat IRT-06-09-01, Institut National Polytechnique de Toulouse, Sep. 2006.
[5]   IEEE 802.21, DCN 2105-0240-01-0000-Joint_Harmonized_MIH_Proposal_Draft_Text.doc, May 2005.
[6]   NIST, The Network Simulator NS-2 NIST add-on: IEEE 802.21 model (based on IEEE P802.21/D03.00), Draft 1.0, http://w3.antd.nist.gov/seamlessandsecure/files/mobility/doc/MIH-module.pdf, January 2007.
[7]   Y. Y. An, B. H. Yae, K. W. Lee, Y. Z. Cho, and W. Y. Jung, "Reduction of Handover Latency Using MIH Services in MIPv6," Proc. 20th International Conference on Advanced Information Networking and Applications (AINA'06), June 2006.
[8]   G. Cizault, "IPv6: Théorie et pratique," Paris, O'Reilly, 1998.
[9]   R. Koodli (Ed.), "Fast Handovers for Mobile IPv6," IETF RFC 4068, Jul. 2005.








Recognizing The Electronic Medical Record Data
 From Unstructured Medical Data Using Visual
            Text Mining Techniques
       Prof. Hussain Bushinak                         Dr. Sayed AbdelGaber                             Mr. Fahad Kamal AlSharif
       Prof. Hussain Bushinak                         Dr. Sayed AbdelGaber                             Mr. Fahad Kamal AlSharif
         Faculty of Medicine                  Faculty of Computers and Information                    College of Computer Science
        Ain Shams University                            Helwan University                                  Modern Academy
             Cairo, Egypt                                  Cairo, Egypt                                       Cairo, Egypt

Abstract: Computer systems and communication technologies have made a strong and influential presence in the different fields of medicine. The cornerstone of a functional medical information system is the Electronic Health Record (EHR) management system. EHR implementation and adoption face different barriers that slow down its deployment in organizations. This research focuses on resolving the most common of these barriers, namely data entry, unstructured clinical data, and changes to the physician's workflow. It proposes a solution that uses text mining and natural language processing techniques. This solution was tested and verified in four real-world clinical organizations, and achieved a correctness and preciseness of 91.88%.

Keywords: Electronic Health Record, Text Mining, Unstructured Medical Data, Medical Data Entry, Health Information Technology.

                      I.INTRODUCTION

    The paper-based medical record is woefully inadequate for meeting the needs of modern medicine. It arose in the 19th century as a highly personalized "lab notebook" that clinicians could use to record their observations and plans, so that they could be reminded of pertinent details when they next saw that same patient. There were no bureaucratic requirements, no assumptions that the record would be used to support communication among varied providers of care, and remarkably few data or test results to fill up the record's pages. The record that met the needs of clinicians a century ago has struggled mightily to adjust over the decades so as to accommodate new requirements as health care and medicine have changed, which led to the emergence of Health Information Technology (HIT) [1].

    HIT allows comprehensive management of medical knowledge and its secure exchange among health care consumers and providers. Broad use of HIT will:

1.   Help to eliminate the manual tasks of extracting data from charts or filling out specialized datasheets.

2.   Help to derive data directly from the electronic record, making research-data collection a by-product of routine clinical record keeping.

3.   Help to move from a paper-based health care system to secure electronic medical records, which will save lives and reduce health care costs.

4.   Help in the early detection of infectious diseases through advanced data collection, fusion, and processing techniques, which would be at the forefront in spotting the emergence of new diseases and crucial to tracking the spread of known diseases [2].

          II.ELECTRONIC HEALTH RECORD: DEFINITION AND MODELS

    The EHR is defined as a longitudinal electronic record of a patient's health information generated by one or more encounters in any care delivery setting. This information includes, but is not limited to, patient demographics, progress notes, examination details such as symptoms and findings, medications, vital signs, past medical history, immunizations, laboratory data, and radiology reports. The EHR automates and streamlines the clinician's workflow. It can generate a complete record of a clinical patient encounter, and it supports other directly or indirectly care-related activities via interfaces, including evidence-based decision support, quality management, and outcomes reporting. The EHR is a repository of patient data in digital form, stored and exchanged securely and accessible by multiple authorized users. [2][3][4]

    There are many EHR architectural models in use all over the world. The two most popular EHR models are:

1.   Central Repository Model

    The center of this EHR model is the repository, which is fed by the existing applications in different care locations such as hospitals, clinics, and family physician practices. The feed from these applications is messaging based on pre-agreed standards. The messaging needs to be based on well-defined standards, for




                                                                 25                               http://sites.google.com/site/ijcsis/
                                                                                                  ISSN 1947-5500
                                                      (IJCSIS) International Journal of Computer Science and Information Security,
                                                      Vol. 9, No. 6, June 2011




example, the HL7 Reference Information Model (RIM), for which XML could be used as the recommended Implementation Technology Specification (ITS). [5]

             Figure 1. EHR Central Repository Model

    The event-driven messages that are sent and stored in the repository are essentially event-based summaries, as shown in figure (2). The event-based summaries stored in the repository can be queried and retrieved by the different clinicians who are treating the patients, in different scenarios and in different clinical settings. Retrieval of and access to data from the repository are subject to establishing that the clinicians legitimately access the data for treating only those patients who are in their care. Retrieval is done through messaging, which can be synchronous or asynchronous depending on the urgency, complexity, and importance of the data being retrieved. [5]

                 Figure 2. EHR Message Events

2.  Managed Services Model

    The managed services model is based on hosting applications for different care providers and care settings in a data center run by a consortium, which may consist of a group of infrastructure providers, system integrators, and application providers. The hosted applications can be used to provide an effective EHR either by building a common repository using a shared database, or by providing a common user interface to all hosted applications and extracting data from these systems using a portal whose authentication and authorization mechanisms can also be controlled at the data center level, as shown in figure 3. [5]

                  Figure 3. Shared Services Model

    III. BARRIERS TO ELECTRONIC HEALTH RECORD IMPLEMENTATION

    Implementation of an EHR faces different barriers, and these barriers vary from one environment to another. Hereafter, the main focus will be on the general barriers that exist in most EHR implementation attempts:

1.  Financial Barriers
    Financial barriers are divided into the following points:
         •   High costs: These costs are divided into two main parts, initial cost and ongoing cost. [6]
         •   Under-developed business case: This barrier arises for the following reasons: uncertainty about the EHR's return on investment; the fact that financial benefits are only achieved in the long run; and the fact that the main objective and benefit of an EHR is to provide a high-quality medical service for citizens. [6]

2.  Technological Barriers
    Technological barriers are divided into four points: [7]
         •   Inadequate technical support
         •   Inadequate data exchange
         •   Security and privacy
         •   Lack of standards

3.  Physicians' Attitudinal and Behavioral Barriers in Data Entry

    Many health information system projects fail due to attitudes, behaviors, barriers in data entry, and a lack of systematic consideration of human-centered computing issues such as usability, workflow, organizational change, and process reengineering. There are two major factors that








lead to sluggish performance of the EHR system: complexity of the Graphical User Interface (GUI) and system response time. These force clinicians to see fewer patients and to work longer days, largely because of the extra time needed to use the system. [8]

    In 2004, Lisa Pizziferri and others concluded that the benefits of an EHR system can be achieved and accepted by physicians only if physicians do not need to sacrifice time with patients or other activities during clinic sessions. Physicians recognize the quality improvements achieved by EHRs, but their time should be protected by decreasing the time required for data entry in EHR systems. [9]

4.  Organizational Change Barriers

    This category contains several points:

         •   Design of and alignment with workflow and office integration: 54.2 percent of the 5,000 respondents reported that they were worried about slower workflow and lower productivity, according to the American Academy of Family Physicians survey results (American Academy of Family Physicians 2004). [10]
         •   Migration from paper-based systems
         •   Staff training

5.  The Format of Clinical Data Stored in EHR Systems

    Generally speaking, there are two main forms of data store: structured data and unstructured data.

         •   Structured data: Structured data has a relational data model and enforces composition into atomic data types. Structured data is managed by technology that allows querying and reporting against predetermined data types and understood relationships, such as patient demographics, laboratory tests, etc. [11]
         •   Unstructured data: Unstructured data consists of any data stored in an unstructured format at an atomic level. That is, in unstructured content there is no conceptual definition and no data type definition; in textual documents, a word is simply a word. [11]

    Unstructured data consists of two basic categories:
         •   Bitmap objects: Inherently non-language based, such as X-rays, radiology images, video, or audio files.
         •   Textual objects: Based on a written or printed language, such as clinical reports, nursing notes, and examination sheets. [11]

    Using unstructured data to store clinical data has the following limitations:
         •   The data is not consumable at a semantic level without a compatible interface or application.
         •   No technology can gain insight into the context of the information unless it can actually be read.

6.  Barriers to Using Unstructured Data in the Electronic Health Record

    Aggregation of information across all the records in a large repository could bring benefits for clinical research. When physicians work with structured data, they can receive alerts about drugs that interact badly with one another, which enables them to enhance the treatment process and avoid medication errors; this cannot be done with unstructured data [12].

      IV. SURVEYING SOLUTIONS TO EHR DATA ENTRY BARRIERS

    In October 2010, Ergin Soysal, Ilyas Cicekli, and Nazife Baykal designed and developed an ontology-based information extraction system for radiological reports. [15]

    The main goal of this technique is to extract the information available in free-text Turkish radiology reports and convert it into a structured information model using manually created extraction rules and a domain ontology. The technique extracts data from radiological reports, which are free text written by physicians, and inserts it as structured data into the EHR. [13]

    However, this technique has the following drawbacks:
         •   It concentrates mainly on abdominal radiology reports.
         •   It does not use a large, trusted repository of medical expressions, which may reduce the quality of the information extraction process; consequently, wrong clinical information may be recorded.

    In September 2010, Adam Wright, Elizabeth S. Chen, and Francine L. Maloney developed a technique for identifying associations between medications, laboratory results, and problems. They developed a




knowledge base of medication and laboratory-result problem associations in an automated fashion. It was based on two data mining techniques: frequent itemset mining and association rule mining. The technique successfully identified a large number of clinically accurate associations. A high proportion of high-scoring associations were judged clinically accurate when evaluated against the gold standard (89.2% for medications with the best-performing statistic, chi square, and 55.6% for laboratory results using interest) [14]. However, this technique has the following drawbacks:

         •   The researchers assumed that patients' data was structured.
         •   Building the knowledge base concentrated only on patients' problems, medications, and laboratory results, which means that other data, such as the patient's history, diagnoses, and procedures, were not taken into account.
         •   Data entry is done through a traditional GUI, so this solution did not enhance the physician's workflow.

    In September 2010, a system for handling misspellings in drug information system queries was developed by Christian Senger, Jens Kaltschmidt, Simon P.W. Schmitt, Markus G. Pruszydlo, and Walter E. Haefeli. This system attempted to solve the problem of drug data entry in a Drug Information System (DIS). The researchers evaluated correctly spelled and misspelled drug names from all queries of the University Hospital of Heidelberg. The results showed that the search engines of a DIS should be equipped with error-tolerant search capabilities. Auto-completion lists might expedite searches but might fail regularly due to the high frequency of typographic errors already in the initial letters. The system improved DIS data entry by using spelling-correction tools to make drug information understandable and available, but it concentrated only on the DIS, without examination, history, and procedure data [16].

    In August 2010, a technique was developed by Yong-gang Cao, James J. Cimino, John Ely, and Hong Yu for automated identification of diseases and diagnoses in clinical records. The technique presents an approach for prototyping a diagnosis classifier based on a popular computational linguistics platform [18]. This technique has the following limitations:
         •   It focuses only on disease keywords to be extracted and ignores other important parts such as operations, symptoms, findings, etc.
         •   It does not use spelling correction.
         •   There is no clear structured data model to store the data extracted from the clinical report.
         •   It does not use a large, trusted data source for medical expressions such as the Unified Medical Language System (UMLS).

    In July 2010, another technique, for automatically extracting the information needed from complex clinical questions, was developed by Yong-gang Cao, James J. Cimino, John Ely, and Hong Yu. They built a fully automated system, AskHERMES, to help clinicians extract and articulate multimedia information from the literature to answer their ad hoc clinical questions. The system automatically retrieves, extracts, and integrates information from the literature and other information resources and attempts to formulate this information as answers to ad hoc medical questions posed by clinicians, all within a time frame that meets their demands [17]. This technique succeeds in clinical question answering and in identifying the category of the question, but for EHR system adoption it faces the following limitations:
         •   It extracts clinical information to identify the question category, not to store this information in the EHR repository.
         •   It works only on question answering, not on the data entry process.
         •   It does not enhance the physician's workflow during the examination process.

    Although the previous techniques attempted to solve the EHR data entry barrier, they have the following limitations:
         •   They concentrate on specific parts of the data, such as diseases, and leave out the rest.
         •   The medical expression repositories used do not contain all the expressions or the semantic relations between them.
         •   Some of these techniques store the EHR data as free text (in unstructured form).
         •   The physician's workflow undergoes modifications which, in turn, demand more physical and mental effort and reduce the physician's productivity.

    V. BRIDGING UNSTRUCTURED DATA TO THE STRUCTURED EHR

    The suggested idea is to convert unstructured free-text clinical data into structured EHR data without modifying the workflow of physicians or adding any physical or mental effort for them. Figure (4) shows the algorithm of the suggested technique.








              Figure 4. Objective Technique Steps

Step 1: Optical Character Recognition (OCR)
   The physician writes his or her diagnoses as usual on a pen-pad, on paper, or on a tablet PC. If the clinical report was written on paper, it must be scanned. The clinical report data is stored as an image of freehand text, which can then be processed: the freehand text image is passed through an OCR tool that converts it to machine-encoded text. The details of this step are represented in figure (5).

        Figure 5. OCR and Handwriting input and output

Step 2: Spelling Corrector
   Machine-encoded text may include spelling errors, which may yield wrong information during the extraction process. So, all incorrectly spelled words are corrected before moving to the next step. This step requires a medical dictionary that contains most medical expressions in their different forms, such as verbs, adjectives, and nouns. Figure (6) represents the details of this step.

           Figure 6. Spell Check input and output

Step 3: Text Mining with Natural Language Processing Techniques
   In this step, the resulting data is cleaned and partitioned into statements to be classified and coded. Using text mining and NLP, all medical data is classified and coded in the form of multiple statements, and unwanted words are removed. This step consists of: [19]
        •   Text preprocessing,
        •   Part-of-speech tagging,
        •   Statement segmentation,
        •   Noun phrase extraction.
   Each of these components is described below.
   1. Text preprocessing: Also called tokenization or text normalization, this includes the following steps: [19]
        •   Throwing away unwanted material (e.g., unwanted brackets and tags).
        •   Detecting word boundaries: white space and punctuation.
        •   Stemming (lemmatization): This is optional. English words like 'look' can be inflected with morphological suffixes to produce 'looks', 'looking', 'looked'. They share the same stem 'look'. Often (but not always) it is beneficial to map all inflected forms onto the stem. This is a complex process, since there can be many exceptional cases (e.g., department vs. depart, be vs. were). The most commonly used stemmer is the Porter Stemmer, but there are many others.
        •   Stop word removal: the most frequent words often do not carry much meaning.
        •   Capitalization and case folding: it is often convenient to lower-case every character.
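The spelling-correction step can be sketched with a simple closest-match lookup. The mini word list below is purely illustrative; a real system would load a full medical dictionary of the kind described above:

```python
import difflib

# Illustrative mini-dictionary; a real system would load a full
# medical vocabulary (e.g. a UMLS-derived term list).
MEDICAL_DICTIONARY = {"prostate", "bladder", "enlarged", "tender",
                      "diagnosis", "symptom", "radiology"}

def correct_word(word, dictionary=MEDICAL_DICTIONARY):
    """Return the word unchanged if it is known, else the closest
    dictionary entry (or the original word if nothing is close)."""
    if word.lower() in dictionary:
        return word
    matches = difflib.get_close_matches(word.lower(), dictionary,
                                        n=1, cutoff=0.8)
    return matches[0] if matches else word

def correct_text(text):
    # Correct each whitespace-delimited token independently.
    return " ".join(correct_word(w) for w in text.split())
```

The `cutoff` threshold trades recall against the risk of "correcting" a valid but unknown term; clinical text favors a conservative (high) value.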
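The text preprocessing sub-steps (case folding, word-boundary tokenization, stop word removal, stemming) can be sketched as follows. The stop word list is illustrative, and the suffix stripper is a deliberately simplified stand-in for a real stemmer such as the Porter Stemmer:

```python
import re

# Illustrative stop word list; real pipelines use much fuller lists.
STOP_WORDS = {"the", "is", "a", "an", "of", "with", "there"}

def naive_stem(word):
    """Very simplified suffix stripping -- a stand-in for a real
    stemmer such as the Porter Stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Case folding, then tokenization on word boundaries,
    # then stop word removal, then stemming.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]
```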
   2. Part-of-speech tagging: A part-of-speech tagger (POS tagger) is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb, or adjective, to each word (and other tokens). [19]
   3. Statement segmentation: The output of this part divides the clinical text into several statements. [19]
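A minimal sketch of statement segmentation, splitting on sentence-ending punctuation; a real clinical segmenter must also handle abbreviations (e.g. 'Dr.', 'b.i.d.'), which this sketch ignores:

```python
import re

def segment_statements(text):
    """Split clinical text into statements after ., ! or ?
    followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```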








   4. Noun phrase extraction: In this part, all noun phrases are extracted, and complex noun phrases are decomposed into smaller noun phrases.

               Figure 7. Text mining and NLP tasks

Step 4: Unified Medical Language System (UMLS) Coding
    To identify the clinical information, a large repository of clinical expressions is needed from which matching clinical expressions can be extracted. The UMLS is used for this purpose. The UMLS, created in 1986, is a compendium of many controlled vocabularies in the biomedical sciences. It provides a mapping structure among these vocabularies and allows translation among the various terminology systems. It may be viewed as a comprehensive thesaurus and ontology of biomedical concepts. [20]

    The UMLS consists of the following components: [20]
         •   The Metathesaurus, the core database of the UMLS: a collection of concepts and terms from the various controlled vocabularies and their relationships.
         •   The Semantic Network, a set of categories and relationships used to classify and relate the entries in the Metathesaurus.
         •   The SPECIALIST Lexicon, a database of lexicographic information for use in natural language processing.
         •   A number of supporting software tools.
    Morphologically analyzed words are compared to the UMLS entries to find the best-matched expression according to their morphological position. Each noun phrase that matches a clinical expression entry in the UMLS is stored as a pair containing the noun phrase and its UMLS clinical code.

               Figure 8. UMLS expressions coding

    The pseudocode of the UMLS coding algorithm is:
         For each statement S in Statements    // in the physician's sheet
         Begin
             For each noun phrase N in S
             Begin
                 If N exists in UMLS then
                     Extract N and C           // where C is the UMLS code
                     Put N with C as the pair <N, C>
                 End if
             End
         End

Step 5: Classify EHR Components
    The suggested technique is applied to the physician's examination sheet. The examination sheet contains the following classes:
         •   History
         •   Examination
         •   Diagnosis
         •   Procedure
    Each part is treated as a class, and all the coded clinical data produced by the previous steps is classified into one of these classes.

    The first step in the classification process is building a collective set of features, typically called a dictionary. The UMLS clinical expressions in dictionary form are the basis for creating a spreadsheet of numeric data corresponding to the previously defined classes.

              TABLE (1): CLASSES DICTIONARY







Each row defines a class and each column represents a
UMLS code. A cell in the spreadsheet holds the
measurement of the feature corresponding to the
column for the class corresponding to the row. The
dictionary of words covers all the possibilities, and the
number of words corresponds to the number of columns.
All cell values range between zero and one, depending
on whether the words were encountered in the class or
not. The form of the classes' dictionary is shown in
table (1).

The second step is measuring the similarity between
the extracted expressions and the defined classes, then
classifying each expression into the most similar class.
The cosine algorithm was selected to calculate the
similarity between the extracted clinical phrases and the
predefined classes. The steps of the cosine similarity
algorithm are:
            Compute the similarity of the new clinical
            phrase to all classes in the dictionary.
            Select the class that is most similar to the
            new clinical phrase.
            The class which occurs most frequently is
            the winning one.

Figure 9: Computing similarity scores for New Clinical Phrase

For cosine similarity, only positive words shared by
the compared phrases are considered. The frequency of
word occurrence is also taken into account. The clinical
phrase is compared with each class using the following
equations [21]:

    Norm(P) = sqrt( Σj wP(j)² )
    Cosine(P1, P2) = ( Σj wP1(j) · wP2(j) ) / ( Norm(P1) · Norm(P2) )

where wPi(j) is the weight of word j in phrase or class Pi.

The cosine similarity of two vectors ranges from 0 to 1.
The angle between two term-frequency vectors cannot be
greater than 90°; consequently, when the cosine value is
close to 1, the clinical phrase is more similar to the
compared class.

Step 6: Storing data in the EHR repository
    The classified clinical phrase is stored in its class
    inside the EHR database with its matched UMLS code.
    For example, a physician wrote the following:

There is enlarged prostate with tender base of the bladder.

    This statement contains two findings; the statement
    is then compared with each class. The cosine vector
    scores for this statement against each defined class
    are calculated according to the previous equations.
    The winning class is the one with the highest score.
    The data is stored in the winning class with its UMLS
    codes as pairs inside the EHR repository:
                      < enlarged prostate, Finding>
                  < tender base of the bladder, Finding>
    The EHR is thus put in a structured form for analysis
    and data mining operations, or as a resource for a
    decision support system.

           VI.    THE EXPERIMENTAL STUDY
  The aim of the experiment is to prove the success of
the suggested technique in real-world cases. For any
experiment there are some hypotheses; the hypotheses
of this experiment are:
            The physician has little experience with
            computers.
            The physician's handwriting is readable.
            The medical abbreviations used are
            standard.
            The experiment is applied during the
            examination session.
  The equipment required to implement the
experiment is:
           An electronic pen pad.
           A laptop or personal computer.
           Windows Vista or later
           SQL Server 2008
           Microsoft Office 2007 or later (for
           applying OCR to the pen pad input)
           .NET Framework 4
           UMLS database system
           Medical dictionary (for spelling correction)
  The implementation of the experimental study goes
through the following steps:

     Step 1: At the nurse's office the patient
     demographic data is recorded using the following
     screen.
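The cosine comparison between a clinical phrase and a class row can be sketched as follows. This is a minimal illustration with binary word weights (1 if a word occurs, 0 otherwise); the class names and word lists are invented for the example and are not the paper's actual dictionary data.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy cosine-similarity classifier: a phrase is assigned to the class
// row it is most similar to.
public class CosineClassifier {

    // Norm(P) = sqrt(sum_j w_P(j)^2); with binary weights this is the
    // square root of the number of distinct words.
    static double norm(Set<String> words) {
        return Math.sqrt(words.size());
    }

    // Cosine(P1, P2) = sum_j w_P1(j)*w_P2(j) / (Norm(P1)*Norm(P2));
    // only words shared by both phrases contribute to the dot product.
    public static double cosine(String[] p1, String[] p2) {
        Set<String> s1 = new HashSet<>(Arrays.asList(p1));
        Set<String> s2 = new HashSet<>(Arrays.asList(p2));
        double dot = 0;
        for (String w : s1) if (s2.contains(w)) dot++;
        return dot / (norm(s1) * norm(s2));
    }

    public static void main(String[] args) {
        String[] phrase  = {"enlarged", "prostate"};
        String[] finding = {"enlarged", "prostate", "tender"};  // illustrative "Finding" row
        String[] history = {"since", "birth", "complains"};     // illustrative "History" row
        // The phrase is classified into the class with the highest score.
        System.out.println(cosine(phrase, finding) > cosine(phrase, history)); // prints true
    }
}
```

Because the weights are non-negative, the score always falls in [0, 1], matching the range described above.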








          Figure 10: EHR demographics form

Step 2: The physician uses the pen pad to write
the diagnosis.

          Figure 11: Pen pad to Computer Form

The physician has the freedom to erase, add, or
modify any part of his/her diagnosis. This step
lets him/her work as usual without any additional
effort. The data is recorded directly on the
computer, which helps the physician retrieve it
easily in its original form or as structured data.

Step 3: After the physician finishes his/her
handwriting, he/she presses the OCR button to
convert the diagnosis from image form to
machine-coded text, as shown in the following
figure:

          Figure 12: Applying OCR on the diagnosis sheet

Step 4: After the OCR is done, the system checks
and corrects the spelling errors of the examination
data according to the installed medical dictionary,
through an interactive session with the physician.

          Figure 13: Applying spell check on the examination text

Step 5: After the spelling correction is done, the
physician presses the "insert into EHR" button to
convert the diagnosis data from the unstructured to
the structured form. Conversion is done through
the following steps:
       Text preprocessing: All brackets, unwanted
       characters, and word boundaries are removed.
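The dictionary-based spelling correction of Step 4 can be sketched as a nearest-word lookup by edit distance. This is a hypothetical sketch only: the dictionary entries and the misspelling below are made up, and the paper does not specify the actual matching strategy of its medical dictionary.

```java
// Suggest the closest dictionary word by Levenshtein edit distance.
public class SpellCheck {

    // Classic dynamic-programming edit distance between two strings.
    public static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        return d[a.length()][b.length()];
    }

    // Return the dictionary entry with the smallest edit distance.
    public static String suggest(String word, String[] dictionary) {
        String best = word;
        int bestDist = Integer.MAX_VALUE;
        for (String entry : dictionary) {
            int dist = editDistance(word.toLowerCase(), entry.toLowerCase());
            if (dist < bestDist) { bestDist = dist; best = entry; }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] dict = {"enuresis", "prostate", "abdomen", "sonography"};
        System.out.println(suggest("enuressis", dict)); // prints enuresis
    }
}
```

In an interactive session like the one described, the top suggestions would be shown to the physician for confirmation rather than applied automatically.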








       Parts of speech tagging: A part of speech is
       assigned to each word.
       Statements segmentation: The examination
       text is split into multiple statements.
       Phrase tagging: Each phrase is tagged with
       the suitable code to identify all phrases
       contained in the diagnosis sheet.
The output of this step is the examination text with
part-of-speech tags, in the following format:

(TOP (S (NP (DT A) (ADJP (NP (CD 15) (NNS years)) (JJ
old)) (JJ female) (NN patient)) (VP (VBZ complains) (PP (IN
from) (NP (JJ nocturnal) (NN enuresis))) (PP (IN since) (NP
(NN birth)))) (. . .)))
(TOP (S (NP (NP (JJ Plain) (NN X-ray)) (PP (IN of) (NP (DT
the) (NN abdomen)))) (VP (VBD was) (ADJP (JJ free))) (. .)))
(TOP (S (NP (JJ Abdominal) (NN ultra) (NN sonography))
(VP (VBD was) (ADJP (JJ free))) (. .)))
(TOP (S (NP (PRP he)) (VP (VBZ has) (NP (NP (NNP
Enuresis)) (SBAR (S (NP (DT The) (NN patient)) (VP (MD
should) (VP (VB receive))))) (: :) (NP (NP (NNP R1) (NNP
Uipam) (NN tablet)) (NP (NP (CD one) (NN tablet)) (NP (RB
twice) (RB daily)) (PP (IN for) (NP (CD three) (NNS
months))))))) (. .)))
(TOP (S (PP (IN R2) (NP (NNP Dipripam) (CD 20) (NN mg)
(NN capsule))) (NP (NP (CD one) (NN tablet)) (NP (RB
twice) (RB daily)) (PP (IN for) (NP (CD three) (NNS
months)))) (. .))) (TOP (S (NP (DT R3) (NNP Depavit) (NNP
B12) (NN ampule)) (. .)))

      Figure 14: Output of Text mining technique

       Noun Phrase Extraction: All noun phrases are
       extracted and compounded. Noun phrases are
       divided into smaller noun phrases, such as the
       following:
            o A 15 years old female patient
            o 15 years
            o Nocturnal enuresis since birth
            o Birth
            o Plain X-ray of the abdomen
            o Plain X-ray
            o The abdomen
            o Abdominal ultra sonography
            o Enuresis
            o The patient
            o R1 Uipam tablet
            o One tablet twice daily for three months
            o One tablet
            o Twice daily
            o Three months
            o Dipripam 20 mg capsule
            o One tablet twice daily for three months
            o One tablet
            o Twice daily
            o Three months
            o R3 Depavit B12 ampule

Step 7: All noun phrases are coded with UMLS
codes. The output of this step is shown in table (2).

     TABLE (2): NOUN PHRASES WITH THEIR UMLS CODES.

Each statement gets a score according to its UMLS
codes and the classes' dictionary declared in table
(1). Table (3) shows the statements and their
scores.

          TABLE (3): STATEMENTS' SCORE.

Step 8: According to the scores shown in table (3),
the statements are classified into their classes. The
predefined classes are:
       History
       Examination
       Diagnosis
       Procedure
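The per-statement scoring against the class dictionary can be sketched as a simple overlap count. The class names below match the paper's, but the UMLS-style codes are invented placeholders, not real concept identifiers.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy statement scorer: each statement carries the codes of its noun
// phrases; each class row lists the codes it covers. The statement is
// assigned to the highest-scoring class.
public class StatementScorer {

    static final Map<String, Set<String>> CLASS_DICT = new LinkedHashMap<>();
    static {
        CLASS_DICT.put("History",   new HashSet<>(Arrays.asList("C-HIST-1", "C-HIST-2")));
        CLASS_DICT.put("Procedure", new HashSet<>(Arrays.asList("C-PROC-1")));
    }

    // Score = number of the statement's codes found in the class row.
    public static String classify(List<String> codes) {
        String winner = null;
        int best = -1;
        for (Map.Entry<String, Set<String>> row : CLASS_DICT.entrySet()) {
            int score = 0;
            for (String c : codes) if (row.getValue().contains(c)) score++;
            if (score > best) { best = score; winner = row.getKey(); }
        }
        return winner;
    }

    public static void main(String[] args) {
        System.out.println(classify(Arrays.asList("C-HIST-1", "C-HIST-2"))); // prints History
        System.out.println(classify(Arrays.asList("C-PROC-1")));             // prints Procedure
    }
}
```

The real system weights this count with the cosine formula from the previous section rather than using a raw overlap.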








The classifier uses the cosine similarity algorithm
to classify each statement according to the class
dictionary. Table (4) shows the score of each
statement relative to the nearest class.

  TABLE (4): COSINE SIMILARITY SCORES FOR EACH CLASS.

Step 9: After determining the winning class for
each statement, each noun phrase with its UMLS
code is saved inside the EHR in the winning class
as a paired tag. Table (5) shows this format.

     TABLE (5): DATA INSERTED INSIDE THE EHR

Step 10: The extracted information is compared
with the physician's manual results to determine
the precision of the suggested technique.

     VII.     RESULTS DISCUSSION

The experimental study was conducted on four
medical departments. In each department, 10
diagnosis sheets were tested. The tested
departments are:
      Surgical Oncology
      Surgery Urology
      Cardiology
      General Surgery

Table (6) shows the overall precision percentage
for each tested department.

  TABLE (6): RESULTS OF THE EXPERIMENTAL STUDY.

     Department               Overall Precision
   Surgical Oncology                92.96%
    Surgery Urology                 91.55%
       Cardiology                   92.33%
     General Surgery                88.61%
   Overall precision                91.36%

Some factors affect the results, such as the quality of the
physician's handwriting. The effect of this factor is clear
in the result of the fourth experiment, which has the
lowest precision percentage (88.61%). A high-precision
OCR tool could minimize the effect of this factor, but it
may be expensive. The results indicate that the suggested
technique succeeds with a high percentage in a real-world
experiment, which means the technique could be applied
in real life in the future.

               VIII.      CONCLUSION

The suggested technique succeeded in working as a
bridge between unstructured and structured medical
data. The medical data is stored inside the EHR system
in its right position without any additional physical or
mental effort by the physician, which in turn satisfies
the main objective of this research.

                    REFERENCES

[1] Institute of Medicine, "Review of the Adoption and
    Implementation of Health IT Standards by the DHHS
    Office of the National Coordinator for Health
    Information Technology",
    http://www.iom.edu/Activities/Workforce/HealthITStandards.aspx

[2] Richard Dick, Elaine B. Steen, and Don Detmer, "The
    Computer-Based Patient Record: An Essential
    Technology for Health Care", National Academy
    Press, 1997.

[3] See the HIMSS web page for the consensus definition
    of an electronic health record:
    http://www.himss.org/ASP/topics_ehr.asp.

[4] J.H. van Bemmel and M.A. Musen, "Handbook of
    Medical Informatics", Springer, 1997.
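The Step 10 evaluation reduces to computing precision as the percentage of statements whose automatic class matches the physician's manual classification. The sample labels below are illustrative, not the experiment's data.

```java
// Precision as percentage of predictions agreeing with the manual labels.
public class PrecisionCheck {

    public static double precision(String[] predicted, String[] manual) {
        int correct = 0;
        for (int i = 0; i < predicted.length; i++)
            if (predicted[i].equals(manual[i])) correct++;
        return 100.0 * correct / predicted.length;
    }

    public static void main(String[] args) {
        String[] predicted = {"Finding", "Finding", "Procedure", "History"};
        String[] manual    = {"Finding", "Finding", "Procedure", "Finding"};
        System.out.println(precision(predicted, manual)); // prints 75.0
    }
}
```

Averaging this figure over the 10 sheets of each department yields the per-department values reported in table (6).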








[5] K. Ananda Mohan, "National Electronic Health
    Record Models", Tata Consultancy Services (TCS),
    2004.

[6] Miller, R. H. and Sim, Ida, "Physicians' Use of
    Electronic Medical Records: Barriers and Solutions",
    Health Affairs, 2004.

[7] Waegemann, "EHR vs. CPR vs. EMR", Healthcare
    Informatics, 2003.

[8] Himali Saitwal, Xuan Feng, Muhammad Walji,
    Vimla Patel, Jiajie Zhang, "Assessing performance of
    an Electronic Health Record (EHR) using Cognitive
    Task Analysis", Elsevier Health, 2010.

[9] Lisa Pizziferri, Anne F. Kittler, Lynn A. Volk,
    Melissa M. Honour, Sameer Gupta, Samuel Wang,
    Tiffany Wang, Margaret Lippincott, Qi Li, and David
    W. Bates, "Primary care physician time utilization
    before and after implementation of an electronic
    health record: A time-motion study", Elsevier Health,
    2004.

[10] American Academy of Family Physicians, "Family
     Practice Management Monitor", AAFP pushes for
     affordable EMR system, 2004.

[11] Oleh Hrycko, "Electronic Discovery in Canada: Best
     Practices and Guidelines", CCH, 2007.

[12] Angus Roberts, Robert Gaizauskas, Mark Hepple,
     George Demetriou, Yikun Guo, Ian Roberts, Andrea
     Setzer, "Building a semantically annotated corpus of
     clinical texts", Elsevier Health, 2009.

[13] Hanna M. Seidling, Marilyn D. Paterno, Walter E.
     Haefeli, David W. Bates, "Coded entry versus free-
     text and alert overrides: What you get depends on
     how you ask", Elsevier Health, 2010.

[14] Adam Wright, Elizabeth S. Chen, and Francine L.
     Maloney, "An automated technique for identifying
     associations between medications, laboratory results
     and problems", Elsevier Health, 2010.

[15] Ergin Soysal, Ilyas Cicekli, Nazife Baykal, "An
     ontology based information extraction system for
     radiological reports", Elsevier Health, 2010.

[16] Christian Senger, Jens Kaltschmidt, Simon P.W.
     Schmitt, Markus G. Pruszydlo, Walter E. Haefeli,
     "Misspellings in drug information system queries:
     Characteristics of drug name spelling errors and
     strategies for their prevention", Elsevier Health,
     2010.

[17] Yong-gang Cao, James J. Cimino, John Ely, Hong
     Yu, "Automatically extracting information needs
     from complex clinical questions", Elsevier Health,
     2010.

[18] Dina Demner-Fushman, James G. Mork, Sonya E.
     Shooshan, Alan R. Aronson, "UMLS content views
     appropriate for NLP processing of the biomedical
     literature vs. clinical text", Elsevier Health, 2009.

[19] Malgorzata Marciniak, Agnieszka Mykowiecka,
     "Aspects of Natural Language Processing", Springer,
     2009.

[20] Catherine R. Selden, Betsy L. Humphreys, "Unified
     Medical Language System: Current Bibliographies in
     Medicine", National Institutes of Health, 1990.

[21] Jiawei Han, Micheline Kamber, "Data Mining:
     Concepts and Techniques", Morgan Kaufmann, 2006.





Creating an Appropriate Programming Language for
             Student Compiler Project
                                                         Elinda Kajo Mece
                                               Department of Informatics Engineering
                                                 Polytechnic University of Tirana
                                                         Tirana, Albania
                                                         ekajo@fti.edu.al


Abstract— Finding an appropriate and simple source language to
be used in implementing the student compiler project is one of the
challenges of teaching compiler design, especially when the
students are not familiar with high-level programming languages.
This paper presents a new programming language intended
principally for beginners and for didactic purposes in a compiler
design course. SimJ, a reduced form of the Java programming
language, is designed for simpler and faster programming. More
readable code, low complexity, and basic functionality are the
primary goals of SimJ. The language includes the most important
functions and data structures needed for creating the simple
programs generally found in beginners' programming textbooks.
The Polyglot compiler framework is used for the implementation
of SimJ.

Keywords- compiler design; new programming language; polyglot
framework

       I.    INTRODUCTION
A compiler course takes a significant place in computer
science curricula. This course is always associated with an
implementation project. Being a multidimensional course, it
requires the students to be familiar with high-level
programming languages, among other things. The first contact
with these high-level languages is almost always confusing
because of their complexity. This becomes more obvious in
object-oriented languages like Java [8]. Object orientation [15]
hinders learning Java step by step from basic principles,
because right from the beginning the learner has to define at
least one public class with a method with the signature public
static void main(String[] args). So the teacher has two choices
here: trying to explain most of the concepts involved (classes,
methods, types, arrays, etc.) or just providing the surrounding
program text and letting the learner add code to the body of
the method main.
SimJ is a simple, Java-based programming language. It is
conceived and designed to ease the teaching of basic
programming to beginners. We believe that they should learn
the basic concepts easily before they are exposed to more
complex programming issues. It is much simpler for a new
programmer to write println("Hello world") instead of a
confusing line like System.out.println("Hello world"). This
simple but concise example shows the importance of the first
contact with programming languages. The role of SimJ is to
make this contact less "painful".
Compiler frameworks are widely used as a simple tool for
implementing new languages based on existing ones. The
complexity begins to increase if the differences between the
existing language and the new one become significant [4].
That is why we used Java as the base language for SimJ. For
this purpose we have chosen Polyglot [4,5] as a compiler
framework for creating compilers for languages similar to
Java.

       II.    THE POLYGLOT FRAMEWORK
Polyglot is an extensible Java compiler toolkit designed for
experimentation with new language extensions. The base
Polyglot compiler, jlc ("Java language compiler"), is a mostly
complete Java front end [1]; that is, it parses [1,2] and
performs semantic checking on Java source code. The
compiler outputs Java source code; thus, the base compiler
implements the identity translation. Language extensions are
implemented on top of the base compiler by extending the
concrete and abstract syntax and the type system [4].
After type checking the language extension, the abstract
syntax tree (AST) [1,14] is translated into a Java AST and the
resulting code is output into a Java source file, which can then
be compiled with javac.
Polyglot supports the easy creation of compilers for languages
similar to Java. The Polyglot framework is useful for domain-
specific languages, exploration of language design, and
simplified versions of Java for pedagogical use. As mentioned
above, the last use case is the focus of this paper.
A Polyglot extension is a source-to-source compiler that
accepts a program written in a language extension and
translates it to Java source code [4,5]. It may also invoke a
Java compiler such as javac to convert its output to bytecode
[13]. A SimJ-oriented view of this process, including the
eventual compilation to Java bytecode, is shown in figure 1.

        Figure 1. The Polyglot Compiler Framework Architecture
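The source-to-source idea behind such an extension — rewriting an extension AST node into its plain-Java form without mutating the original tree — can be sketched in miniature. These tiny classes are illustrative only and are not the Polyglot API; they merely mimic translating SimJ's bare println(...) into Java's System.out.println(...).

```java
// A functional rewrite pass in the spirit of Polyglot: passes return
// new nodes instead of destructively modifying the AST.
abstract class Node {
    abstract Node rewrite();
}

class Call extends Node {
    final String target;   // null models the extension's bare call
    final String method;
    final String arg;

    Call(String target, String method, String arg) {
        this.target = target;
        this.method = method;
        this.arg = arg;
    }

    @Override
    Node rewrite() {
        // Translation: bare println(...) becomes System.out.println(...).
        if (target == null && method.equals("println"))
            return new Call("System.out", "println", arg); // a new node; original untouched
        return this;
    }

    @Override
    public String toString() {
        return (target == null ? "" : target + ".") + method + "(" + arg + ")";
    }
}

public class RewritePassDemo {
    // Helper exposing the pass on a bare println call.
    public static String rewriteBarePrintln(String arg) {
        return new Call(null, "println", arg).rewrite().toString();
    }

    public static void main(String[] args) {
        // prints: System.out.println("Hello world")
        System.out.println(rewriteBarePrintln("\"Hello world\""));
    }
}
```

A real Polyglot extension expresses the same idea through its node factories and scheduled compiler passes rather than a hand-rolled class hierarchy.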




The first step in compilation is parsing the input source code
to produce an AST. Polyglot includes an extensible parser
generator, PPG [5], which allows the implementer to define
the syntax of the language extension (SimJ in our case) as a
set of changes to the base grammar of Java [7]. The extended
AST may contain new kinds of nodes, either to represent
syntax added to the base language or to record new
information in the AST.
The core of the compilation process is a series of compilation
passes applied to the abstract syntax tree. Both semantic
analysis and translation [1] to Java may comprise several such
passes. The pass scheduler selects passes to run over the AST
of a single source file, in an order defined by the extension,
ensuring that dependencies between source files are not
violated. Each compilation pass, if successful, rewrites the
AST, producing a new AST that is the input to the next pass.
A language extension may modify the base language pass
schedule by adding, replacing, reordering, or removing
compiler passes. The rewriting process is entirely functional;
compilation passes do not destructively modify the AST.
Compilation passes do their work using objects that define
important characteristics of the source and target languages. A
type system object acts as a factory for objects representing
types and related constructs such as method signatures [4,5].
The type system object also provides some type-checking
functionality. A node factory [4] constructs AST nodes for its
extension. In extensions that rely on an intermediate language,
multiple type systems and node factories may be used during
compilation. After all compilation passes complete, the usual
result is a Java AST. A Java compiler such as javac is invoked
to compile the Java code to bytecode.

           III.    SIMJ PROGRAMMING LANGUAGE
SimJ (which stands for Simple Java) is a simplified version of
the Java programming language conceived especially for
beginners. The language is very simple, easy to learn, and
very similar to Java. Previous work has been done in this field
(e.g. the J0 programming language [5]), but these languages
differ considerably from Java syntax [7]. We think that
similarity with Java is very important in order to allow
programmers to switch to Java without any syntax problems
when they feel ready to explore its full potential and advanced
features.
Figure 2 shows an example of the same code written in Java
and in SimJ. This example shows, as mentioned above, that
the code in SimJ is clearly more readable than the one in Java.
Generally, programming courses and textbooks for beginners
include many programs that require input from the user during
their execution. In Java this is definitely neither simple nor
easy to implement at the beginner level. We address this
problem by removing the complex part and

public class A {
  public static void main(String[] args) {
    try {
      BufferedReader reader = new BufferedReader(
          new InputStreamReader(System.in));
      System.out.print("Your name:");
      String name = reader.readLine();
      System.out.print("\nHello, " + name + "!");
    } catch (IOException ioexception) {
      System.out.println(ioexception);
    }
  }
}

class A {
  main() {
    print("Your name:");
    String name = readLine();
    print("\nHello, " + name + "!");
  }
}

             Figure 2. Example code written in Java and SimJ

The simplified versions of the printing methods are quite
obvious, since they are almost always used in simple
programs. It is also important to mention that, compared to
Java, the structure of the program is unchanged, thus
preserving its object-oriented character.
Another important goal of this language is to help the teaching
of compiler design [1].
The SimJ language specification [3,10,11], shown in figure 3,
is very simple and short, equipped with the fundamental and
most-used parts of a programming language at the beginner
level [9,7]. Related work (e.g. MiniJava [1]) shows that
simplicity is the primary characteristic of these languages.
As mentioned previously, we think that similarities with Java
are important, but such languages should not lose their
identity. In MiniJava, for example, System.out.println(), the
same as in Java, is defined to do the printing, but System.out
itself has no meaning in the language. With SimJ we try to
address these problems by creating a simple but well-defined
language that, syntactically speaking, is not a reduced exact
copy of the parent language but has its own identity.

 Program ::= MainClass ( Class )*
 MainClass ::= "class" Identifier "{" "main" "(" ")" "{" Statement "}" "}"
 Class      ::= "class" Identifier "{" (Variable)* (Method)* "}"
 Variable ::= Type Identifier ";"
leaving only the “understandable” one (i.e. readLine()).                 Method ::= Type Identifier "(" (Type Identifier ("," Type Identifier)*)?
                                                                         ")" "{" (Variable)* (Statement)* "return" Expression ";" "}"
                                                                         Type ::= "boolean"
                                                                         | "int"
                                                                         | "char"
                                                                         | "string"




 | "int" "[" "]"
 | Identifier

 Statement ::= "{" ( Statement )* "}"
 | "if" "(" Expression ")" Statement "else" Statement
 | "while" "(" Expression ")" Statement
 | "for" "(" Expression ";" Expression ";" Expression ")" Statement
 | "switch" "(" Expression ")" "{" ("case" Expression ":" Statement "break" ";")* "default" ":" Statement "}"
 | "print" "(" Expression ")" ";"
 | "println" "(" Expression ")" ";"
 | "readLine" "(" ")" ";"
 | "readInt" "(" ")" ";"
 | Identifier "=" Expression ";"
 | Identifier "[" Expression "]" "=" Expression ";"

 Expression ::= Expression ( "||" | "&&" | "<" | ">" | "!=" | "==" | "+" | "-" | "*" | "/" ) Expression
 | Expression "[" Expression "]"
 | Expression "." Identifier "(" (Expression ("," Expression)*)? ")"
 | <INTEGER>
 | <STRING>
 | <CHARACTER>
 | "true"
 | "false"
 | Identifier
 | "this"
 | "new" "int" "[" Expression "]"
 | "new" Identifier "(" ")"
 | "!" Expression
 | "(" Expression ")"

 Identifier ::= <IDENTIFIER>

Figure 3: SimJ language specification

This is an important point that helps reduce possible ambiguities and makes the language more understandable. SimJ includes the basic building blocks of a programming language; from this point of view it is quite similar to Java [8,7]. We have implemented the basic primitive data types (Figure 2):

     •      boolean – true or false
     •      int – integers
     •      char – characters
     •      string – sequence of characters (for simplicity, string is considered a primitive data type in SimJ)
     •      int[] – array of integers

The most commonly used control-flow statements [9,8] are implemented in SimJ (Figure 2). Their syntax is the same as in Java, given that they carry no redundant complexity to be removed:

     •      if else
     •      for
     •      while
     •      switch

The principal operators [9,8] are also present in SimJ. These include: addition, subtraction, multiplication, division, logical and, logical or, logical not, smaller than, greater than, not equal, equal.

                     IV.       IMPLEMENTATION

For the implementation of SimJ we have used Polyglot, a framework that improves and simplifies compiler design for languages similar to Java. The process consists in creating a new language extension. Extensions (in our case SimJ) usually have the following subpackages [5]:

     •      ext.simj.ast – AST nodes specific to the SimJ language.
     •      ext.simj.extension – New extension and delegate objects specific to SimJ.
     •      ext.simj.types – Type objects and typing judgments specific to SimJ.
     •      ext.simj.visit – Visitors specific to SimJ.
     •      ext.simj.parse – The parser and lexer for the SimJ language.

In addition, our extension defines the class ext.simj.ExtensionInfo [5], which contains the objects that define how the language is to be parsed and type checked. There is also a class ext.simj.Version [5], which specifies the version number of SimJ. The Version class is used as a check when extracting extension-specific type information from .class files.

The design process of SimJ includes the following tasks [5]:

     •      Syntactic differences between SimJ and Java are defined, based on the Java grammar found in polyglot/ext/jl/parse/java12.cup.
     •      Any new AST nodes that SimJ requires are defined, based on the existing Java nodes found in polyglot.ast (interfaces) and polyglot.ext.jl.ast (implementations).
     •      Semantic differences between SimJ and Java are defined. The Polyglot base compiler (jlc) implements most of the static semantics of Java as defined in the Java Language Specification [7].
     •      The translation from SimJ to Java is defined. The translation produces a legal Java program that can be compiled by javac.

We implement SimJ by creating a Polyglot extension with the characteristics described above. The implementation follows these steps [5]:

     •      build.xml is modified and a target for SimJ is added. This is done based on the skeleton extension found in polyglot/ext/skel. Running the customization script polyglot/ext/newext copies the skeleton to polyglot/ext/simj and substitutes our language's name at all the appropriate places in the skeleton.
     •      A new parser is implemented using PPG. This is done by modifying




            polyglot/ext/simj/parse/simj.ppg using the SimJ syntax.
     •      The required new AST nodes are implemented. The node factory polyglot/ext/simj/ast/SimJNodeFactory_c.java is modified in order to produce these nodes.
     •      Semantic checking for SimJ is implemented based on its rules.
     •      The translation from SimJ to Java is implemented based on the translation defined above. This is implemented as a visitor pass that rewrites the AST into an AST representing a legal Java program.

                     V.       CONCLUSIONS

Our motivation for creating SimJ was to provide a simple, understandable and easy-to-learn programming language, similar to Java, that improves the learning of basic programming structures and serves as a source-language exemplar for implementing a student compiler project. We found that the existing approaches did not fully address the problem of a simplified Java-like structured language that is not merely a reduced copy of Java. Our language is simple but improves on existing solutions by merging their advantages and trying to avoid their weak points.

Having used the Polyglot framework to build the compiler, we conclude that it is an effective and easy way to produce compilers for Java-like languages such as SimJ. It is simple and has a well-defined structure, offering the possibility to generate a base skeleton for new language extensions to which the desired specifications can be added.

Our language, SimJ, is a well-structured, simplified version of the Java programming language that is not merely a reduced copy of it. SimJ can be used by beginners who want to learn Java but know nothing about object-oriented programming. It is also a good choice for learning compiler design because of its well-defined and easy-to-implement structure.

                        REFERENCES

[1]   Appel, A.W., Palsberg, J. (2002). Modern Compiler Implementation in Java (2nd ed.). Cambridge University Press.
[2]   Metsker, S.J. (2001). Building Parsers with Java. Addison Wesley.
[3]   Slonneger, K., Kurtz, B.L. (1995). Formal Syntax and Semantics of Programming Languages: A Laboratory Based Approach. Addison Wesley.
[4]   Nystrom, N., Clarkson, M.R., Myers, A.C. (2003). Polyglot: An Extensible Compiler Framework for Java. Retrieved January 20, 2007, from http://techreports.library.cornell.edu:8081/Dienst/UI/1.0/Display/cul.cs/TR2002-1883.
[5]   Cornell University, Department of Computer Science. (2003). How to Use Polyglot. Retrieved January 20, 2007, from http://www.cs.cornell.edu/projects/polyglot/.
[6]   Cornell University, Department of Computer Science. (2003). PPG: A Parser Generator for Extensible Grammars. Retrieved January 20, 2007, from http://www.cs.cornell.edu/projects/polyglot/.
[7]   Gosling, J., Joy, B., Steele, G., Bracha, G. (2005). The Java Language Specification (3rd ed.). Addison Wesley.
[8]   Arnold, K., Gosling, J., Holmes, D. (2005). The Java Programming Language (4th ed.). Addison Wesley Professional.
[9]   Kernighan, B.W., Ritchie, D.M. (1988). The C Programming Language (2nd ed.). Prentice Hall.
[10]  Clinger, W., Rees, J. (2001). Report on the Algorithmic Language Scheme. Retrieved January 24, 2007, from http://www-swiss.ai.mit.edu/~jaffer/r4rs_toc.html.
[11]  Krishnamurthi, Sh. (2006). Programming Languages: Application and Interpretation. Retrieved January 28, 2007, from http://www.cs.brown.edu/~sk/Publications/Books/ProgLangs/.
[12]  Cornell University, Department of Computer Science. (2003). J0: A Java Extension for Beginning (and Advanced) Programmers. Retrieved January 20, 2007, from http://www.cs.cornell.edu/Projects/j0/.
[13]  Lindholm, T., Yellin, F. (1999). The Java Virtual Machine Specification (2nd ed.). Addison Wesley.
[14]  Jones, J. (2003). Abstract Syntax Tree Implementation Idioms. Retrieved February 6, 2007, from http://jerry.cs.uiuc.edu/~plop/plop2003/Papers/.
[15]  Ambler, S.J. (2006). Introduction to Object-Orientation and UML. Retrieved February 11, 2007, from http://www.agiledata.org/essays/objectOrientation101.html.
[16]  O'Docherty, M. (2005). Object-Oriented Analysis and Design: Understanding System Development with UML 2.0. John Wiley & Sons.
[17]  Graver, J.O. (1992). The Evolution of an Object-Oriented Compiler Framework. Retrieved January 30, 2007, from http://cs.ubc.ca/rr/proceedings/spe91-95/spe/vol22/issue7/spe767jg.pdf.








        The History of Web Application Security Risks
                      Fahad Alanazi                                                            Mohamed Sarrab
        Software Technology Research Laboratory                                    Software Technology Research Laboratory
                De Montfort University                                                      De Montfort University
                Leicester, LE1 9BH UK                                                       Leicester, LE1 9BH UK
               P0800238x@mydmu.ac.uk                                                         msarrab@dmu.ac.uk


Abstract—This article reviews current web application risks that are causing public concern and piquing the interest of many scientists and organizations as a result of an increase in attacks. The primary concern of many governments, organizations and companies is data loss and theft; thus, these organizations are seeking to secure their web applications against vulnerabilities. Awareness of the vulnerabilities of web applications leads to recognition of the need for improvement. The three main facets of web security are: confidentiality, integrity and safety of content, and continuity. This paper identifies and discusses ten web application vulnerabilities, detailing the opinions of researchers and OWASP regarding risk assessment and protection.

                     I.     INTRODUCTION

The Internet is a fascinating and multi-faceted technology, opening a window on the world by allowing people across the globe to access information simply and quickly, to broadcast their ideas and culture, and to communicate and access research data from anywhere. It is now even seen as a form of e-government, based on its achievements in the last four years and the acquisition of 300 million users.

However, the Internet lacks geographic borders or national controls, and this has led to concerns about the security of conducting business online. Indeed, there are those who expend considerable effort in seeking to penetrate websites and steal important information from them, justifying apprehension amongst the owners of this information and electronic service providers. Therefore, companies are doing their utmost to maintain the confidentiality, privacy and accuracy (integrity) of the information they hold; systems can now be protected in a number of ways, and some of the programs that have helped in intrusion detection and virus reduction have somewhat eased the trepidation of network users.

Recently, attackers have turned their focus to web applications, which allow surfing, shopping, communication with companies in other countries, etc. This is because such applications rely on databases to facilitate information exchange and distribution. These applications have an increasing number of users, increasing their attractiveness to attackers, despite the numerous programmers and developers employed to protect them. This paper will identify and discuss ten web application vulnerabilities which constitute a threat to web application security, assessing information provided by researchers and OWASP regarding risk assessment and protection.

                  II.   INJECTION FLAWS

In 2007, OWASP [30] mentioned numerous injection flaws, including SQL, LDAP, XPath, XSLT, HTML, XML and OS injection, with SQL injection being the most common of these types. In 2004, OWASP [29] cited the main cause of vulnerability in web applications to be their use of operating system features and external programs to implement functions. This enables attackers to exploit information from an HTTP request to inject malicious code as the web application passes the information through.

The attack occurs when data is sent to the interpreter after the user has initiated a command or query. The attacker exploits this situation by injecting malicious code alongside the command or query, which enables full access to the system, bypassing any protection and calling for data from operating systems and databases. In 2010, OWASP [31] described this type of attack as the attacker sending simple text-based input that exploits the syntax of the targeted interpreter. Almost any data source can be an injection vector, including internal sources. This flaw is typically found in SQL queries, LDAP queries and OS commands [21].

Recommendations

    •     Avoid using interpreters if possible.

    •     Validate input.

    •     Avoid detailed error messages that may be useful to an attacker.

    •     Reject all script injection (Gregory, 2009).

SQL Injection

SQL injection is the most common of the injection flaws, and yet applications that are vulnerable to it are used in our daily lives.
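To make the flaw concrete, the sketch below (our illustration, not taken from the paper; the class and method names are hypothetical) shows in Java how concatenating user input into a query string lets a crafted value rewrite the WHERE clause, and the fixed-shape parameterized form that corresponds to JDBC's PreparedStatement placeholders.

```java
public class SqlInjectionDemo {

    // UNSAFE: attacker-controlled input is pasted into the SQL text itself,
    // so a quote character in the input becomes live SQL.
    static String unsafeQuery(String userName) {
        return "SELECT * FROM users WHERE name = '" + userName + "'";
    }

    // SAFER: the query shape is fixed; the value travels separately,
    // as with java.sql.PreparedStatement and its '?' placeholders.
    static String parameterizedQuery() {
        return "SELECT * FROM users WHERE name = ?";
    }

    public static void main(String[] args) {
        String attack = "x' OR '1'='1";
        // The injected quote closes the string literal early; the resulting
        // WHERE clause now matches every row in the table.
        System.out.println(unsafeQuery(attack));
        System.out.println(parameterizedQuery());
    }
}
```

With a real PreparedStatement the database never re-parses the value as SQL, which is why parameterized queries are the standard defense.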








We rely on their safety every day, e.g. for making bookings and paying bills. As the number of such applications increases, so does the sophistication of the attacks that target them. Hackers use many methods to create defects in web applications; of these, SQL injection is one of the easiest and most dangerous, potentially damaging the whole system.

SQL injection is an attack in which SQL code is inserted or appended into application user-input parameters that are later passed to a back-end SQL server for parsing and execution [8]. SQL injection is a serious threat to any site or application that contains a database; by injecting and executing SQL code alongside the application's own code, attackers can gain unauthorized access to private databases containing important and secure information, compromising the integrity of sensitive data by allowing its alteration or deletion [2]. SQL injection attacks also affect authentication processes, impinging on the verification of user identity and allowing attackers to connect to the system without the password by using query-language injection.

Preventing SQL injection

    •    Escape single quotation marks in string input: any single quotation mark should be replaced by two single quotation marks [10].

    •    Check input fields for single quotation marks and remove any that are found.

    •    Verify and remove T-SQL comments such as -- and /**/, because these comments might damage the data.

    •    Detect and verify T-SQL keywords such as SELECT, which might be used to query specific elements.

    •    Validate input on both the client and the server.

    •    Use elaborate SQL constructs that might cause errors in, and impede the execution of, injected code.

    •    Verify system records, comparing the numbers of users with and without accounts in the system, to detect any unauthorized access.

    •    Use a secure policy for the system by determining permissions, for example limiting some permissions to reading and writing only [16].

                   III.    Cross Site Script (XSS)

Cross-site scripting is another intrusion method: it manipulates the web browser into executing malign code, which then runs in the user's session. This can be done in a number of ways, typically in Hypertext Markup Language (HTML) [15]. Cross-site scripting can be used in a number of ways, from theft of a cookie to taking over an entire session; this is referred to as an intruder-guided attack [18]. Insertion of a script into a field can be an efficient attack, but circumventing the filter can be a problem. Cross-site scripting uses an array of methods for abuse and intrusion [15].

According to Ciampa [11], a cross-site scripting (XSS) attack is characterized by the use of social engineering, allowing the attacker, through the use of the JavaScript language, to extract important information from the victim before utilizing it. Lopez and Hammerli [24] argue that XSS targets the web application's site and uses either stored XSS or reflected XSS. The hackers attempt to attack users' browsers and take control with malicious script. When an attack is successful, the attacker can access important resources in the web application, i.e. cookies.

According to Belapurkar et al. [5], these attacks rely on user-supplied input, which means attackers can inject dangerous code whilst inputting data to gain access to the site. XSS often occurs when the web application requires input via a username and password page, as attackers can benefit from this by tricking the user. In addition, any script entered in form fields or in a URL is likely to expose the site to this type of attack. XSS depends on injecting client-side script, leading to account theft and changes to the content of a page. XSS occurs when the web application fails to escape user-submitted content properly before rendering it into HTML [19].

OWASP cites the ability of attackers to use XSS to send malicious code or script to an unsuspecting user, affecting sensitive and important information that the browser maintains, such as cookies and session tokens. The malicious script can rewrite and rephrase the contents of the HTML page because the browser does not know the origin of the script, or whether it can be trusted. OWASP divides this type of attack into two categories:

•   Stored: This attack occurs through injection of malicious code or script into the target server, where it is stored permanently in messages, comment forums, databases, etc. If/when the user requests information, the stored malicious script is transferred with it.

•   Reflected: This is the most common type of attack and is reflected off the web server, as in an error message. It tricks the user into clicking on links where malicious script or code has been entered.
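Both the stored and the reflected variants depend on user content reaching the page unescaped. The following minimal Java sketch (our illustration, not from the paper; production code should prefer a vetted encoder such as the OWASP Java Encoder) shows the kind of output escaping that renders an injected script inert:

```java
public class XssEscapeDemo {

    // Minimal HTML escaping: neutralises the characters that let
    // user-supplied text break out of its context into live markup.
    static String escapeHtml(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String comment = "<script>stealCookies()</script>";
        // Rendered raw, this would execute in the victim's browser;
        // escaped, it displays as harmless text.
        System.out.println(escapeHtml(comment));
    }
}
```

Escaping must match the output context (HTML body, attribute, URL, JavaScript), which is why library encoders are preferred over hand-rolled ones.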








OWASP highlights the dangers of disclosure. When attackers hijack users' sessions, full control is gained and the attacker can access end users' files. The attacker can also redirect the user to other pages or sites, and can modify the presentation of content by installing Trojan programs. Therefore, OWASP recommends verifying and filtering inputs to scripts, because most XSS attacks occur in JavaScript. XSS attacks are dangerous for applications and servers because most of these display simple web pages that contain errors such as 500 "Internal Server Error". These may include information which enables attackers to corrupt the server and the user's browser via a reflected attack.

In 2007, OWASP [30] referenced cross-site scripting as a subset of HTML injection. In this type of attack the victim's browser is exploited by the attacker through script executed in the user's session. Most malicious scripts are written in JavaScript, but any scripting language supported by the victim's browser may be vulnerable to this type of attack. OWASP described three types of XSS attack to which web applications are vulnerable:

    •   Reflected XSS: The easiest to exploit.

    •   Stored XSS: The most dangerous, because it can take hostile data, store it within a file or database, and then at a later time display the data to the user without any filter to detect the input to the website.

    •   DOM-based XSS: JavaScript and variables are manipulated rather than HTML elements.

OWASP did not concentrate only on these three areas, as in addition there is a possibility of risky and unpredictable browser behaviours which may lead to attack. XSS may affect any component that the browser uses.

JavaScript allows for attack due to its strengths as a programming language, which permit manipulation of the rendered page: adding new elements, manipulating the internal DOM, and changing or deleting the page. Additionally, this type of attack permits use of XMLHttpRequest, with which attackers can circumvent the browser and forward the victim's data to hostile sites, then create malicious code to keep the browser open for a long period of time.

Recommendations

    •    Encode sensitive data.

    •    Validate input data for length.

    •    Do not use a blacklist to detect XSS in input.

    •    Remove HTML tags before using any untrusted data [14].

    •    Use an XSS filter to detect any malicious code [23].

    •    Avoid special characters in input boxes, such as <>, " ", %, ; and ), because these characters can help the attacker to acquire sensitive data.

    •    Limit the data that might be part of a scripting attack [17].
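The "do not use a blacklist" recommendation can be contrasted with allow-list validation, which accepts only the characters a field legitimately needs instead of trying to enumerate every dangerous sequence. A minimal illustrative sketch (the field rules are assumptions, not a prescribed policy):

```python
import re

# Allow-list: letters, digits, underscore, dot and hyphen, 3-32 characters.
USERNAME_RE = re.compile(r"^[A-Za-z0-9_.-]{3,32}$")

def is_valid_username(value: str) -> bool:
    # Reject anything outside the allow-list, including < > " ; % characters.
    return USERNAME_RE.fullmatch(value) is not None

print(is_valid_username("alice_01"))       # → True
print(is_valid_username("bob<script>"))    # → False
```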
                                 IV.     Buffer Overflow

Buffer overflow is an attack that occurs when web applications have no control over input that might contain commands, encodings or improper formats. The attacker causes a buffer overflow by supplying input that overruns the memory space used by the operating system [6]. Dubrawsky [12] argued that a buffer overflow happens when the attacker puts more information into the buffer (a holding area for data) than it can handle. Buffer overflow attacks depend on the programming language used, which includes C and C++.

A buffer overflow occurs when data exceeds the memory allocated for a buffer, as a result of a failure to limit the input. Furthermore, it occurs when web applications use low-level programming languages, because these languages do not perform automated bounds checking.

A buffer overflow can also happen if data is not checked for length when it is copied into the buffer from another source, i.e. a network socket [7]. This supports Wells' [35] argument that storage flaws affect web application security. According to Wells, security measures such as data encryption must be employed because web applications may contain sensitive information.

Buffer overflows are in essence a technique in which data written into a fixed-size memory block causes the memory around the destination buffer to become jammed and over capacity. This gives the intruder access to parts of the process memory, allowing the entry of malign code [13]. This involves writing data to places in the memory stack that contain information about the operating system; if this data is accessed and overwritten, the machine usually crashes and the system resets. The intruder can also make the process memory point to his own code, which could result in passwords being accessed or new accounts being created [9]. The best way to overcome this kind of attack is to avoid manual memory management altogether [13].

OWASP [29] referred to web application components in some languages being improperly validated, leading to buffer overflow attacks that access the system. This type of attack is difficult to detect, and difficult to eradicate when discovered. Buffer overflow can be found in the web application itself, or in the web server or application server products that serve the static and dynamic aspects of the site. It can also be found in custom web application code, but detection of buffer overflow flaws is less







likely in custom web applications. If a flaw in a custom application is discovered, the attacker's ability to exploit it is reduced, because the source code and detailed error messages for the application are normally not available to the attacker.

To determine whether server products are vulnerable, there should be a review of all code that accepts input from users via HTTP requests, to ensure that it can properly handle arbitrarily large input and that it performs appropriate size checking on all such inputs.

Buffer overflow was not mentioned in OWASP [30] or [31] because it is detected by an Intrusion Detection System (IDS): software, hardware or a combination of both. There are two types of IDS:

    •   Network intrusion detection systems: These can capture data packets travelling on the network.

    •   Host-based intrusion detection systems: These can look into system and application log files to detect any intruder activity.

Recommendations

    •    Do not use the C and C++ programming languages when building a web application [32].

    •    Limit input data to prevent long input strings that might include malicious code [17].
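In a memory-safe language the overrun is prevented automatically, but the missing bounds check at the root of these attacks can be illustrated with a short sketch; the buffer size and function name are illustrative:

```python
BUFFER_SIZE = 64  # fixed-size holding area, analogous to a C-style buffer

def copy_into_buffer(data: bytes) -> bytearray:
    # The bounds check that unsafe C routines such as strcpy or gets omit:
    # reject input that would overrun the destination instead of writing
    # past its end into adjacent memory.
    if len(data) > BUFFER_SIZE:
        raise ValueError("input exceeds buffer size")
    buf = bytearray(BUFFER_SIZE)
    buf[:len(data)] = data
    return buf

print(len(copy_into_buffer(b"hello")))  # → 64
```

In C or C++ the equivalent discipline is to use length-bounded routines (for example `snprintf` or `strncpy` with an explicit size) on every copy into a fixed buffer.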
                                                                        coding keys; and, storing keys in unprotected stores.OWASP
             V.     Insecure Cryptographic Storage                      [31] again stated that the most common flaws relate to not
                                                                        encrypting data, however, due to limited access precise flaws
Web applications sometimes use cryptographic functions in
                                                                        are difficult to determine.
order to secure data. Unless these functions are coded
properly, this is not an easy thing to do.They can only offer a         Recommendations
weak form of protection. Applications that do not offer a good
level of protection often use inappropriate ciphers. Thus, it is            •    Use only public algorithms.
advisable to ensure that everything is to be encoded is encoded
[21].                                                                       •    Avoid using weak algorithms.
Recommendations:                                                            •    Infrastructure credentials for web application such as
                                                                                 database credentials should be securely encrypted
    •    One should use only approved public algorithms.                         [21].
         These include AES, RSA and public key.
                                                                            •    To protect insecure storage one must use proper
    •    Cryptography stores private keys with care. Try not                     encryption and access control for all data that is
         to submit key over channels that are not guaranteed                     stored [17].
         secure [21].
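As an illustration of the "approved public algorithms" recommendation, stored passwords can be protected with PBKDF2-HMAC-SHA256 from Python's standard library, with a random per-user salt, instead of a home-grown scheme or bare MD5/SHA-1. This is a sketch, not a full credential store:

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None):
    # PBKDF2-HMAC-SHA256: a vetted public construction; the random salt
    # ensures identical passwords produce different stored digests.
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    return salt, digest

def verify_password(password, salt, digest):
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    return hmac.compare_digest(candidate, digest)  # constant-time comparison

salt, digest = hash_password("correct horse")
print(verify_password("correct horse", salt, digest))  # → True
print(verify_password("wrong guess", salt, digest))    # → False
```

The iteration count shown is an assumption; it should be tuned upward as hardware improves.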
                     VI.     Cross Site Request Forgery (CSRF)

Cross Site Request Forgery (CSRF) relies on an XSS attack to input dangerous code into the end user's browser. This type of attack does not target the site in which the malicious code is implemented, but tricks the user into accessing other sites. CSRF affects web applications because it allows the attacker to change the victim's stored information, e.g. a password [13]. Holovaty and Kaplan-Moss [19] show that CSRF occurs







when the attacker tricks users into loading a URL from a site at which they are already authenticated, taking advantage of their sessions. According to Kategorileri [21], broken authentication and session management cause privacy violations. These flaws might lead to hijacking of administrative or user accounts, given the fact that there is no protection for credentials and session tokens throughout a web application's lifecycle.

A cross-site request forgery is an intrusion consisting of a request for a page that appears to be sent by a trusted user. One common example is an image embedded on a page that contains a link to a PHP script [4, 15, 33]. Such intrusions can be used to gain entry to password-protected parts of a website. If an intruder has convinced a user to log onto a web application, this can be used to gain access through malign JavaScript, which can take over the user's session by issuing a forged POST using the user's existing session [22].

A cross-site request forgery intrusion can also be initiated by sending a fake HTTP request from the user's session. This can send information such as the user's session cookie and other authorisation information, which is then passed on to a vulnerable web application that treats the intrusions as genuine requests for access [31].

In 2007, OWASP [30] mentioned that most web applications rely only on automatically submitted credentials, such as session cookies, basic authentication credentials, source IP addresses, SSL certificates or Windows domain credentials; therefore web applications are at risk. Cross-site request forgery also goes by several other names: Session Riding and One-Click Attacks. All web application frameworks in 2007 were vulnerable to cross-site request forgery attacks.

CSRF usually takes place against a forum, where it directs the user to invoke some function, such as the logout page. Attackers can even force the user, without their consent, to make changes to their DSL router. The user's authorisation credentials, typically the session cookie, are the reason these attacks work; if the attacker could not supply credentials then the attack would fail.

OWASP mentioned that Cross Site Scripting (XSS) flaws are not required for Cross Site Request Forgery (CSRF) to work. Nevertheless, any web application with XSS flaws is vulnerable to CSRF attack, because a CSRF attack can exploit XSS flaws to steal any non-automatically submitted credential. Defences against CSRF attack should therefore be built by eliminating XSS vulnerabilities in applications, because XSS flaws can circumvent most CSRF defences.

OWASP recommended protecting a web application from this attack by generating, and then requiring, some type of authorisation token that is not automatically submitted by the browser. OWASP [30] therefore contended that applications failing to use unique tokens in requests were open to this type of attack. In the 2004 and 2007 [29, 30] versions, this type of attack is based on sending forged requests by submitting images, exploiting XSS flaws and other techniques that trick the user. The attacker is thus able to implement and change data whilst the victim is unable to carry out permitted, authorised functions. OWASP remarked that all multistep transactions are unsafe because attackers can replay a series of requests by using JavaScript or multiple tags.

To verify whether an application is vulnerable, it should be checked that each link and form includes tokens that prevent attackers from predicting the particular action details for each user. OWASP therefore recommended that unique tokens be inserted per user session and per request, thus disabling the attacker's ability to predict URLs, HTML requests and user session details for a particular action [27].

The conclusions drawn by OWASP in 2010 [31] indicated that where the token is not unique, JavaScript or multiple tags help attackers to exploit the web application; this helps attackers to predict URLs, HTML requests and user session details, and to acquire sensitive data. In addition, all multistep transactions enabled by JavaScript or multiple tags should be considered unsafe.

Recommendations

    •   Every form should have a special token [22].

    •   Fill variables with good data in order to escape them [25].

    •   Encrypt the session [1].

    •   Use POST rather than GET [34].

    •   Do not click any link you do not recognise, because it might be used to send malicious requests to other applications the user is logged into [13].

    •   Use browser tools, such as TG, to avoid and block any change of user authentication by the website [20].
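The unique per-session token recommended above can be sketched as follows; the key handling is illustrative, and a real application would persist the secret securely rather than generate it at startup:

```python
import hashlib
import hmac
import secrets

SECRET_KEY = secrets.token_bytes(32)  # illustrative server-side secret

def csrf_token(session_id):
    # Derive an unpredictable token bound to the user's session; embed it
    # in every form so a forged cross-site request, which cannot read the
    # page and therefore cannot obtain the token, is rejected.
    return hmac.new(SECRET_KEY, session_id.encode(), hashlib.sha256).hexdigest()

def is_valid_csrf(session_id, submitted_token):
    # Constant-time comparison on the server before performing the action.
    return hmac.compare_digest(csrf_token(session_id), submitted_token)
```

Because the browser does not attach this token automatically (unlike cookies), a request arriving without it cannot have originated from a page the server rendered for that session.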

            VII.    Broken Authentication and Session Management

Another weakness that can make a website vulnerable is improper protection of the certification apparatus, which is described as broken authentication. Broken session management relates to functions such as logout, timeout, etc. If application functions that relate to session management are not implemented properly, they allow intruders to generate passwords and keys, and consequently to assume the identity of a user [17].

Session management restricts the gateway to web applications and information; it is authorised to shield, and should ideally be capable of protecting, administrator privileges such as the username and password details. Organisations can demand








customised authentication, but this can lead to intruder sessions being authorised, although this can be countered by using built-in security systems, such as SSL encryption [28].

In 2007, OWASP [30] identified flaws in the area of authentication and session management related to the lack of session token protection in the web application. These flaws can result in privacy violations through the hijacking of users' administrative accounts. All authentication and session management web application frameworks were found to be vulnerable to this type of flaw at that time. Weaknesses usually occur in ancillary authentication functions such as logout, "remember me" and account update.

In 2010, OWASP [31] stated that flaws within authentication and session management enable external attackers, as well as users who have accounts on the site, to steal information from other accounts and hide their actions. Attackers impersonate users, gaining access to exposed accounts, session IDs and passwords through leaks in the session management or authentication functions.

Recommendations

    •   Do not accept invalid or new session identifiers from the URL or in requests.

    •   Limit or rid your code of custom cookies for authentication or session management purposes.

    •   Use simple and more secure authentication mechanisms.

    •   Use a strong password policy.

    •   Enable the login process only from an encrypted page.

    •   Make sure all client-side cookies and server-side session state are destroyed on logout.

    •   Users should enter their old password when changing to a new password.

    •   Use limited-time-only random numbers to reset access, and send a follow-up e-mail as soon as the password has been reset. Beware of self-registered users changing their e-mail address: send a message to the previous e-mail address before enacting the change [21].

    •   Prevent authentication and session management manipulation by the user to bypass security controls [17].
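The "limited-time-only random numbers" recommendation can be sketched as an expiring, single-use reset token; the in-memory store, TTL and function names are illustrative:

```python
import secrets
import time

RESET_TTL = 15 * 60   # seconds a reset link stays valid (assumed policy)
_pending = {}         # token -> (username, expiry timestamp)

def issue_reset_token(username, now=None):
    # An unguessable random token; the expiry bounds the attack window.
    token = secrets.token_urlsafe(32)
    _pending[token] = (username, (now or time.time()) + RESET_TTL)
    return token

def redeem_reset_token(token, now=None):
    entry = _pending.pop(token, None)   # pop makes the token single-use
    if entry is None:
        return None                     # unknown or already-used token
    username, expiry = entry
    return username if (now or time.time()) <= expiry else None
```

A production system would keep the pending tokens in persistent storage and send the follow-up e-mail described above when a token is redeemed.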
                    VIII.         Insecure Direct Object References

These flaws result from developer errors that expose a direct object reference, such as a database key or a directory. A direct object reference can occur when a developer leaves access open to an object on the server, such as a data file or database key. This can be countered by means of an authorisation check; if no check is performed, intruders are able to alter references to these files, causing havoc in these systems [31]. This vulnerability can appear when authorisation checks have been restricted or even omitted, where programmers use object references directly in the web interface with no validation checks.

An insecure direct object reference allows an attacker to access other objects in the web application without authorization by manipulating direct object references. Furthermore, this type of attack occurs when a reference to an internal implementation object, i.e. a database record, form parameter or URL, is exposed.

OWASP [30] mentioned flaws that can occur when a direct object reference, such as a URL, form parameter or database record, is exposed by a developer. Unless an access control check is in place, an attacker can access the object without authorization through manipulation of direct object references. OWASP also mentioned that many applications expose internal object references to users, enabling attackers, through parameter tampering, to violate access control policy by changing the references.

In 2010, OWASP [31] mentioned flaws that occur when developers expose references to internal implementation objects, such as database keys, directories and files, to the user. The attacker can therefore gain access to unauthorized data through manipulation of references, due to the absence of protection or access control checks. These flaws persist in web applications because many applications that create web pages use the actual name or key of an object and do not verify that the user is authorized for the target object.

Recommendations

    •    Do not expose private object references to users.

    •    Validate any private object references.

    •    Verify authorization to all referenced objects.

    •    Verify input that might include attack patterns [21].
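One common way to satisfy these recommendations is an indirect reference map, which hands out random per-session handles instead of raw database keys; a sketch, with illustrative class and method names:

```python
import secrets

class ReferenceMap:
    """Per-session indirection: pages expose random handles rather than raw
    database keys, so users cannot tamper with IDs to reach other users'
    records. Authorization checks on the server are still required."""

    def __init__(self):
        self._handles = {}

    def expose(self, db_key):
        # Generate an unguessable handle for the URL or form field.
        handle = secrets.token_urlsafe(8)
        self._handles[handle] = db_key
        return handle

    def resolve(self, handle):
        # Returns None for unknown or tampered handles.
        return self._handles.get(handle)

refs = ReferenceMap()
h = refs.expose(42)   # the page links to ?doc=<h>, never ?doc=42
```

The map should be scoped to a single user's session, so a handle issued to one user resolves to nothing for another.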
                          IX.      Insecure Communications

OWASP highlighted the need to protect sensitive communications, because failing to do so allows sensitive data to be exposed. Applications often fail to encrypt network traffic







that exposes an authentication or session token. Therefore encryption should be used for all authenticated connections and accessible web pages. All web application frameworks mentioned by OWASP are vulnerable to this flaw.

Such deficiencies enable an attacker to sniff network traffic and gain access to, or capture, sensitive and important information, including transmitted credentials or conversations, since every single request can contain a session token or authentication credential [30]. Security breaches are also possible when insecure communications occur because the web application does not encrypt all authenticated connections and sensitive data [21].

Recommendations

    •    Use SSL for all connections that are authenticated or that transmit sensitive or valuable data.

    •    Protect communications between infrastructure elements by using protocol-level encryption or transport layer security [21].

    •    Encrypt data.
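On the client side, these recommendations correspond to a TLS context that actually verifies the peer. In Python's standard library the defaults of `ssl.create_default_context` already enforce this:

```python
import ssl

# A client-side TLS context reflecting the recommendations: certificate
# verification against trusted CAs and hostname checking, which causes
# handshakes with untrusted or expired certificates to fail.
context = ssl.create_default_context()

print(context.verify_mode == ssl.CERT_REQUIRED)  # → True
print(context.check_hostname)                    # → True

# context.wrap_socket(sock, server_hostname="example.org") would then
# refuse to complete a handshake with an invalid certificate.
```

The common mistake is disabling these checks (`verify_mode = CERT_NONE`) to silence certificate errors, which reintroduces exactly the flaw described above.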
                         X.     Failure to Restrict URL Access

According to Kategorileri [21], Failure to Restrict URL Access occurs as a result of a lack of access control checks, because the web application often protects a URL only by not presenting links to unauthorized users. Web access to addresses or URLs should be checked before any images or buttons on the page appear; this requires web applications to perform checks every time these pages are viewed, or intruders will be able to gain access by forging URL addresses. Automated tools cannot identify whether a page is accessible to a given user, and it is therefore difficult to identify whether an access issue exists [31].

Scanners are tools that can be used to find hidden URLs, but                                XII.     CONCLUSIONS
they are unable to determine whether these functions or pages
are to be protected by any controls or restrictions. In order to
find these hidden pages they use a number of methods such as            This paper presents and discusses ten web application
                                                                        vulnerabilities, Injection Flaw, Cross-Site Scripting (XSS),
fuzzing directory and file names, directory lists, and also
trying to find backup and file folders.                                 Buffer Overflow, Insecure Cryptographic Storage, Cross Site
                                                                        Request Forgery (CSRF), Broken Authentication and Session
This form of attack is called forced browsing and contained             Managements, Insecure Direct Object References, Insecure
guessing links and brute force techniques to find unprotected           Communications, Failure to Restrict URL Access and
pages [30]. This can result in applications which allow access          Insufficient Transport Layer Protection. Detailing the
for control code to develop into a complex model for                    researcher’s opinions and OWASP regarding risk assessment
developers and security specialists to understand.                      and protection. As aadopting the OWASP Top Ten is perhaps
                                                                        the most effective first step towards changing the software
In 2010 OWASP [31] identified further serious threats to web            development culture within organization into one that
applications being that anyone can send a request to a web              produces secure code the paper provides some
application and therefore gain access to the network. Certain           recommendation for adapting these ten web application
applications do not protect page requests correctly; i.e. no            vulnerabilities.
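To make Section XI's certificate recommendation concrete, here is a minimal sketch (not part of the paper) using Python's standard ssl module; the certificate dictionaries shown are fabricated for illustration:

```python
import ssl
import time

def certificate_is_current(cert):
    """Return True if a certificate dictionary, shaped like the one
    returned by SSLSocket.getpeercert(), has not yet expired.

    ssl.cert_time_to_seconds() parses the 'notAfter' timestamp format
    used by the ssl module (e.g. 'Jan  1 00:00:00 2020 GMT').
    """
    not_after = ssl.cert_time_to_seconds(cert["notAfter"])
    return time.time() < not_after

# Fabricated certificate dicts, for illustration only:
expired = {"notAfter": "Jan  1 00:00:00 2020 GMT"}
current = {"notAfter": "Jan  1 00:00:00 2099 GMT"}
print(certificate_is_current(expired))  # False
print(certificate_is_current(current))  # True
```

In practice, a context created with ssl.create_default_context() already verifies the certificate chain and rejects expired certificates during the TLS handshake, so an explicit check like this is mainly useful when inspecting certificates out of band.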




                                                                   46                              http://sites.google.com/site/ijcsis/
                                                                                                   ISSN 1947-5500
                                                          (IJCSIS) International Journal of Computer Science and Information Security,
                                                          Vol. 9, No. 6, June 2011




                         REFERENCES

[1].    Alameda, A. (2008). Foundation Rails 2. United States of America: Springer-Verlag New York, Inc, pp387-388.
[2].    Alqahtani, A. A. (2010). Security and Protection of Information in Modern Web Application. Available from: http://coeia.edu.sa [Accessed 07/07/2010].
[3].    Auger, R. (2010). Insufficient Transport Layer Protection. Available from: http://projects.webappsec.org/Insufficient-Transport-Layer-Protection [Accessed 3/09/2010].
[4].    AUUGN (2005). The Conference for Unix, Linux and Open Source Professionals. Available from: http://books.google.co.uk/books?id=iJw5zAu7LncC&printsec=frontcover&dq=AUUGN&hl=en&ei=HvKPTNfSE9CH4AbD3oyPDg&sa=X&oi=book_result&ct=result&resnum=1&ved=0CCoQ6AEwAA#v=onepage&q&f=false [Accessed 25/08/2010].
[5].    Belapurkar, A. et al. (2009). Distributed Systems Security: Issues, Processes, and Solutions. United Kingdom: John Wiley & Sons Ltd, pp105-106.
[6].    Boyd, C. and Mao, W. (2003). Information Security: 6th International Conference, ISC 2003, Bristol, UK, October 1-3, 2003: Proceedings, Volume 2851. Germany: Springer-Verlag Berlin Heidelberg New York, p367.
[7].    Carey, M. et al. (2008). Nessus Network Auditing. United States of America: Andrew Williams, p1.
[8].    Clarke, J. (2009). SQL Injection Attacks and Defense. USA: Syngress Publishing, Inc.
[9].    Cole, E. (2002). Hackers Beware. United States of America: New Riders Publishing, p248.
[10].   Cumming, A. and Russell, G. (2007). SQL Hacks. USA: O'Reilly Media, Inc.
[11].   Ciampa, M. (2008). Security+ Guide to Network Security Fundamentals. 3rd ed. Canada: Cengage Learning, p85.
[12].   Dubrawsky, I. (2009). CompTIA Security+: Exam SYO-201, Study Guide and Prep Kit. United States of America: LanraColantoni, pp109-110.
[13].   Dwivedi, H., Clark, C. and Thiel, D. (2010). Mobile Application Security. United States of America: The McGraw-Hill Companies, pp7-266.
[14].   Flanagan, D. (2006). JavaScript: The Definitive Guide. 5th ed. United States of America: O'Reilly Media, Inc, pp267-268.
[15].   Ford, R. (2007). Infosecurity 2008 Threat Analysis. United States of America: Arnorette Pedersen.
[16].   Gama, J. and Naughter, P. (2006). Super System: Turbocharge Database Performance. US: Rampant TechPress, Kittrell, NC, USA.
[17].   Gregory, P. (2009). CISSP Guide to Security Essentials. United States of America: Cengage Learning, p99.
[18].   Grossman, J. and Hansen, R. (2007). XSS Attacks: Cross-Site Scripting Exploits and Defense. United States of America: Syngress Publishing, Inc.
[19].   Holovaty, A. and Kaplan-Moss, J. (2009). The Definitive Guide to Django: Web Development Done Right. United States of America: Springer-Verlag New York, Inc, p345.
[20].   Jakobsson, M. and Ramzan, Z. (2008). Crimeware: Understanding New Attacks and Defenses. United Kingdom: Symantec Press, p156.
[21].   Kategorileri, Y. (2008). OWASP Top Ten 2007 Most Critical Web Application Security Vulnerabilities. Available from: http://serdarbuyuktemiz.blogspot.com/2008/09/owasp-top-ten-2007-most-critical-web.html [Accessed 29/08/2010].
[22].   Laurent, S.S. and Dumbill, E. (2007). Learning Rails. United States of America: O'Reilly Media, Inc.
[23].   Lee, W. (2009). Windows 7: Up and Running: A Quick, Hands-On Introduction. United States of America: O'Reilly Media, Inc, p129.
[24].   López, J. and Hämmerli, B.M. (2008). Critical Information Infrastructures Security: Second International Workshop, CRITIS 2007, Benalmadena-Costa, Spain, October 3-5, 2007. Germany: Springer-Verlag Berlin Heidelberg, p288.
[25].   Makice, K. (2009). Twitter API: Up and Running. United States of America: O'Reilly Media, Inc, pp98-99.
[26].   McClure, S., Scambray, J. and Kurtz, G. (2009). Hacking Exposed 6: Network Security Secrets & Solutions. United States of America: McGraw-Hill Companies, p592.
[27].   Andrews, M. and Whittaker, J.A. (2006). How to Break Web Software: Functional and Security Testing of Web Applications and Web Services, Volume 1. US: Pearson Education, Inc, pp66-67.
[28].   Overby, S. (2007). CIO. Available from: http://books.google.co.uk/books?id=1woAAAAAMBAJ&pg=PA68&dq=prevent+Broken+authentication+ans+session+management&hl=en&ei=_TB7TKCUF5GSswbomOSyDQ&sa=X&oi=book_result&ct=result&resnum=7&ved=0CFYEwBg#v=onepage&q&f=false [Accessed 28/08/2010].
[29].   OWASP (2004). The Ten Most Critical Web Application Security Vulnerabilities. Available from: http://ftp.ipv4.heanet.ie/
[30].   OWASP (2007). The Ten Most Critical Web Application Security Vulnerabilities. Available from: http://www.owasp.org/images/e/e8/OWASP_Top_10_2007.pdf [Accessed 26/06/2010].
[31].   OWASP (2010). The Ten Most Critical Web Application Security Vulnerabilities. Available from: http://owasptop10.googlecode.com/files/OWASP%20Top%2010%20-%202010.pdf [Accessed 26/06/2010].
[32].   Peikari, C. and Chuvakin, A. (2004). Security Warrior. United States of America: O'Reilly Media, Inc, p167.
[33].   Powell, T.A. (2008). Ajax: The Complete Reference. United States of America: The McGraw-Hill Companies, p322.
[34].   Shiflett, C. (2005). Essential PHP Security. United States of America: O'Reilly Media, Inc, pp26-245.
[35].   Wells, C. (2007). Securing Ajax Applications. United States of America: O'Reilly Media, Inc, p51.

                         AUTHORS PROFILE

Fahad Alanazi is a PhD student at De Montfort University, Faculty of Technology, Software Technology Research Laboratory (STRL). He received his B.Sc in Computer Science from Tabouk University in Saudi Arabia and his M.Sc in Computer Security from De Montfort University. His main research interests are computer security and computer forensics.

Dr. Mohamed Sarrab received his Ph.D. degree in Computer Science from De Montfort University in 2011. He received his B.Sc in Computer Science from 7th April University, Libya, and his M.Sc in Computer Science from VSB Technical University of Ostrava, Czech Republic. His main research interests are computer security, runtime verification and computer forensics.





     Improving the Performance of Translation Wavelet
                 Transform using BMICA
                 Janett Walters-Williams                                                              Yan Li
     School of Computing & Information Technology                                 Department of Mathematics & Computing,
           University of Technology, Jamaica                                   Centre for Systems Biology, University of Southern
                Kingston 6, Jamaica W.I.                                               Queensland, Toowoomba, Australia
                 jwalters@utech.edu.jm                                                         liyan@usq.edu.au


Abstract—Research has shown Wavelet Transform to be one of the best methods for denoising biosignals, and the Translation-Invariant form of this method has been found to give the best performance. In this paper we merge this method with our newly created Independent Component Analysis method, BMICA. Different EEG signals are used to verify the method within the MATLAB environment. Results are then compared with those of the original Translation-Invariant algorithm and evaluated using the performance measures Mean Square Error (MSE), Peak Signal to Noise Ratio (PSNR), Signal to Distortion Ratio (SDR) and Signal to Interference Ratio (SIR). Experiments revealed that the BMICA Translation-Invariant Wavelet Transform outperformed the original in all four measures. This indicates that it is superior to the basic Translation-Invariant Wavelet Transform algorithm, producing cleaner EEG signals, which can influence diagnosis as well as clinical studies of the brain.

    Keywords—B-Spline; Independent Component Analysis; Mutual Information; Translation-Invariant Wavelet Transform

                       I.    INTRODUCTION

The nervous system sends commands and communicates by trains of electric impulses. When the neurons of the human brain process information, they do so by changing the flow of electrical current across their membranes. These changing currents (potentials) generate electric fields that can be recorded from the scalp. Studies are interested in these electrical potentials, but they can only be received by direct measurement, which requires a patient to undergo surgery for electrodes to be placed inside the head. This is not acceptable because of the risk to the patient [25]. Researchers therefore collect recordings from the scalp, receiving global descriptions of the brain activity. Because the same potential is recorded from more than one electrode, signals from the electrodes are expected to be highly correlated. Figure 1 shows how the potentials are collected from the scalp. They are collected by the use of an electroencephalograph and called electroencephalogram (EEG) signals.

                  Figure 1: Collecting EEG signals

    EEG is widely used by physicians and scientists to study brain function and to diagnose neurological disorders. Any misinterpretation can lead to misdiagnosis. These signals must therefore present a true and clear picture of brain activities, as seen in Figure 2. EEG signals are, however, highly attenuated and mixed with non-cerebral impulses called artifacts or noise [15]. The presence of this noise introduces spikes which can be confused with neurological rhythms. Noise also mimics EEG signals, overlaying them and resulting in signal distortion (Figure 3). Correct analysis is therefore impossible; a true diagnosis can only be made when all noise is eliminated or attenuated. EEG recordings are therefore really a combination of noise and the pure EEG signal, defined mathematically below (using S as the pure EEG signal, N the noise and E representing the recorded signal):

                 E(t) = S(t) + N(t)                          (1)
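As a quick illustration of the additive model in Eq. (1) and of the MSE and PSNR measures used for evaluation (a sketch, not the paper's code; the signal parameters and noise level are arbitrary stand-ins):

```python
import numpy as np

def mse(s, e):
    """Mean Square Error between a pure signal s and an estimate e."""
    s, e = np.asarray(s, float), np.asarray(e, float)
    return float(np.mean((s - e) ** 2))

def psnr(s, e):
    """Peak Signal to Noise Ratio in dB, using the pure signal's peak."""
    m = mse(s, e)
    peak = float(np.max(np.abs(s)))
    return float("inf") if m == 0 else 10.0 * np.log10(peak ** 2 / m)

# Additive model of Eq. (1): E(t) = S(t) + N(t)
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 512)
S = np.sin(2 * np.pi * 10 * t)         # stand-in for the pure EEG signal S(t)
N = 0.1 * rng.standard_normal(t.size)  # stand-in for artifact noise N(t)
E = S + N                              # the recorded signal E(t)

print(round(mse(S, E), 4))  # ≈ 0.01 for noise of standard deviation 0.1
```

A denoising method is judged by how much closer its estimate of S gets to the true S than E is: lower MSE and higher PSNR are better.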




                    Figure 2: Clean pure EEG Signal

        Figure 3: EEG Signal corrupted with EKG and line signals

   Numerous methods have been proposed by researchers to remove artifacts in EEG and are reviewed in [6, 13, 20, 22, 24]. The goal of these methods is to decompose the EEG signals into spatially and temporally distinguishable components. After identification of the components constituting noise, the EEG is reconstructed without them. Methods include Principal Components Analysis (PCA), the use of a dipole model and, more recently, Independent Component Analysis (ICA) and Wavelet Transform (WT). Which method is considered the best is not the topic of this research. Here we focus on improving WT using a new ICA method called B-Spline Mutual Information Independent Component Analysis (BMICA).

  The paper is organized as follows: after this introduction of EEG signals and the need to denoise them, Section 2 presents the denoising methods utilized in the paper. We then review the reasons for the merger in Section 3 and describe the experiments conducted in Section 4. In Section 5 we present the results, a comparison of these results and a summary. Finally, in Section 6 we present the conclusion.

                     II.   LITERATURE REVIEW

A. Wavelet Transform

Wavelet Transform (WT) is a form of time-frequency analysis that has been used successfully in denoising biomedical signals by decomposing signals in the time-scale space instead of the time-frequency space. It does so using a method called wavelet shrinkage, proposed by Donoho and Johnstone [7]. Each decomposed signal is called a wavelet. Figure 4 shows the difference between a wave/signal and a wavelet.

   There are two basic types of WT. One type is designed to be easily reversible (invertible), meaning that the original signal can be easily recovered after it has been transformed. This kind of WT is used for image compression and cleaning (noise and blur reduction). Typically, the WT of the image is first computed, the wavelet representation is then modified appropriately, and then the WT is reversed (inverted) to obtain a new image.

         Figure 4. Demonstration of (a) a signal and (b) a wavelet

   The second type is designed for signal analysis, for the study of EEG or other biomedical signals. In these cases a modified form of the original signal is not needed and the WT need not be inverted (it can be done in principle, but requires a lot of computation time in comparison with the first type of WT).
   WT decomposes a signal into a set of coefficients called the discrete wavelet transform (DWT) according to:

                 C_{j,k} = Σ_{t∈Z} E(t) g_{j,k}(t)           (2)

where C_{j,k} is the wavelet coefficient and g_{j,k} is the scaling function defined in [23] as:

                 g_{j,k}(t) = 2^{-j/2} g(2^{-j} t − k)       (3)

The wavelet and scaling functions depend on the chosen wavelet family, such as Haar, Daubechies and Coiflet. Compressed versions of the wavelet function match the high-frequency components, while stretched versions match the low-frequency components. By correlating the original signal with wavelet functions of different sizes, the details of the signal can be obtained at several scales or moments. These correlations with the different wavelet functions can be arranged in a hierarchical scheme called multi-resolution decomposition. The multi-resolution decomposition algorithm separates the signal into "details" at different moments and wavelet coefficients [19-20]. As the moments increase, the amplitude of the discrete details becomes smaller; however, the coefficients of the useful signals increase [27-28].
   Considering Eq. (1), the wavelet transform of E(t) produces wavelet coefficients of the noiseless signal S(t) and coefficients of the noise N(t). Researchers found that wavelet denoising is performed by taking the wavelet transform of the noise-corrupted E(t) and passing the detail coefficients of the wavelet transform through a threshold filter, where the details, if small enough, might be omitted without substantially affecting the main signals. There are two main threshold filters, soft and hard. Research has shown that soft-thresholding has better mathematical characteristics [27-29] and provides smoother results [10]. Once discarded, these coefficients are replaced with zeroes during reconstruction using an inverse




wavelet transform to yield an estimate for the true signal, defined as:

         Ŝ(t) = D(E(t)) = W^{-1}(Λ_th(W(E(t))))              (4)

where Λ_th is the diagonal thresholding operator that zeroes out wavelet coefficients less than the threshold th. It has been shown that this algorithm offers the advantages of smoothness and adaptation; however, it may also result in a blur of the signal energy over several transform details of smaller amplitude, which may be masked in the noise. This results in the detail being subsequently truncated when it falls below the threshold. These truncations can result in overshooting and undershooting around discontinuities, similar to the Gibbs phenomena, in the reconstructed denoised signal. Coifman and Donoho [4] proposed a solution by designing a cycle-spinning denoising algorithm which
(i) shifts the signal by a collection of shifts, within the range of cycle spinning;
(ii) denoises each shifted signal using a threshold (hard or soft);
(iii) inverse-shifts each denoised signal to get a signal in the same phase as the noisy signal;
(iv) averages the estimates.
   The Gibbs artifacts of different shifts partially cancel each other, and the final estimate exhibits significantly weaker artifacts [4]. This method is called a translation-invariant (TI) denoising scheme. Experimental results in [1] confirm that single TI wavelet denoising performs better than traditional single wavelet denoising. Research has also shown that TI produces a smaller approximation error when approximating a smooth function, as well as mitigating Gibbs artifacts when approximating a discontinuous function.

B. Independent Component Analysis

Independent Component Analysis (ICA) is an approach to the solution of the blind source separation (BSS) problem [5]. It can be represented mathematically according to Hyvarinen, Karhunen & Oja [12] as:

                 X = As + n                                  (5)

where X is the observed signal, n is the noise, A is the mixing matrix and s the independent components (ICs) or sources. (It can be seen that mathematically this is similar to Eq. (1).) The problem is to determine A and recover s knowing only the measured signal X (equivalent to E(t) in Eq. (1)). This leads to finding the linear transformation W of X, i.e. the inverse of the mixing matrix A, to determine the independent outputs as:

                 u = WX = WAs                                (6)

where u is the estimated ICs. For this solution to work, the assumption is made that the components are statistically independent, while the mixture is not. This is plausible since biological areas are spatially distinct and generate a specific activation; they do, however, correlate in their flow of information [11].
   ICA algorithms are suitable for denoising EEG signals because
      (i) the signals recorded are the combination of temporal ICs arising from spatially fixed sources;
      (ii) the signals tend to be transient (localized in time), restricted to certain ranges of temporal and spatial frequencies (localized in scale) and prominent over certain scalp regions (localized in space) [20].

B-Spline Mutual Information Independent Component Analysis (BMICA)

   There have been many Mutual Information (MI) estimators in the ICA literature which are very powerful yet difficult to estimate, resulting in unreliable, noisy and even biased estimation. Most algorithms have their estimators based on cumulant expansions because of ease of use [16]. B-Spline estimators, according to our previous research [26], have however been shown to be one of the best nonparametric approaches, second only to wavelet density estimators. In numerical estimation of MI from continuous microarray data, a generalized indicator function based on B-Splines has been proposed to get more accurate estimation of probabilities; hence we have designed a B-Spline-defined MI contrast function. Our MI function is expressed in terms of entropy as:

         I(X, Y) = H(X) + H(Y) − H(X, Y)                     (7)

where

         H(X) = −Σ_i p(x_i) log p(x_i)
         H(X, Y) = −Σ_{i,j} p(x_i, y_j) log p(x_i, y_j)      (8)

Eq. (7) contains the term −H(X, Y), which means that maximizing MI is related to minimizing joint entropy. MI is better than joint entropy, however, because it includes the marginal entropies H(X) and H(Y) [13]. Entropy in our design is based on probability distribution functions (pdfs), and our design defines a pdf using a B-Spline calculation, resulting in

         p(x_i) = (1/N) Σ_{u=1}^{N} B̃_{i,k}(x_u)            (9)

where

         B(x) = Σ_{i=1}^{n+1} D_i B_i^k(x)                   (10)


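The generative model and its inversion in Eq. (6) can be illustrated with a small numerical sketch. The two-source mixture below is a hypothetical example for illustration only, not data from the study, and the perfectly estimated W is an idealisation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-source example: s holds the independent sources,
# A the mixing matrix, and X = As the measured signals of Eq. (1).
s = rng.laplace(size=(2, 1000))        # super-Gaussian sources
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])             # mixing matrix (unknown in practice)
X = A @ s

# An ICA algorithm estimates W; if W converges to the inverse of A,
# then u = WX = WAs recovers the sources, as in Eq. (6).
W = np.linalg.inv(A)                   # idealised, perfectly estimated W
u = W @ X
print(np.allclose(u, s))               # True: sources recovered
```

In practice an ICA algorithm recovers W only up to permutation and scaling of its rows; the exact inverse is used here purely to make Eq. (6) concrete.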

                                                                      50                                      http://sites.google.com/site/ijcsis/
                                                                                                              ISSN 1947-5500
                                                                (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                                                          Vol. 9, No.6, 2011
and D_i is calculated based on Cheney and Kincaid (1994).
     MI was used to create our fixed-point Independent
Component Analysis algorithm, called B-Spline Mutual
Information Independent Component Analysis (BMICA).
BMICA utilizes prewhitening strategies as well as the
nonlinearity g(u) = tanh(u) and symmetric orthogonalization.
Unmixed signals are determined by:

        B = z\,g(y)'/m - \left( \sum \left( 1 - g(y)^2 \right)/m \right) B        (11)

where z is the result of prewhitening and y is the whitened
signal determined by

        y = z' \times B                                               (12)

                  III.    REASONS FOR MERGER

   WT and ICA in recent years have often been used in Signal
Processing [21, 27]. More recently there has been research
comparing the denoising techniques of both. It was found that
     (i) if noise and signals are of nearly the same or higher
           amplitude, wavelets have difficulty distinguishing
           them. ICA, on the other hand, looks at the underlying
           distributions, thus distinguishing each [29].
     (ii) ICA gives high performance when datasets are large.
           It suffers from the trade-off between a small data set
           and high performance [13]. The larger the set,
           however, the higher the probability that the effective
           number of sources will exceed the number of
           channels (fixed over time), resulting in an over-
           complete ICA. Such an algorithm might not be able
           to separate noise from the signals.
     (iii) ICA algorithms cannot filter noise that overlaps with
           EEG signals without discarding the true signals as
           well. This results in data loss. With WT, however,
           once wavelet coefficients are created, noise can be
           identified as it concentrates on scale 2^1, decreasing
           significantly as the scale increases, while EEG
           concentrates on the 2^2-2^5 scales. Elimination of the
           smaller scales denoises the EEG signals [1]. WT
           therefore removes any overlapping of noise and EEG
           signals that ICA cannot filter out.
Research therefore shows that ICA and wavelets complement
each other, removing the limitations of each [21].

                   IV.     EXPERIMENT SETUP

A. Data Sets
There are two types of data that can be used in experiments:
real and synthetic. In synthetic data the source signals are
known, as well as the mixing matrix A. In these cases the
separation performance of the unmixing matrix W can be
assessed using the known A, and the quality of the unmixed
signals y_i can be evaluated using the known sources s_i.
Biomedical signals, however, produce unknown source signals.
In this study we therefore utilize real data collected from four
sites:
     (i) http://sccn.ucsd.edu/~arno/fam2data/publicly_available_EEG_data.html.
           All data are real, comprising EEG signals from both
           humans and animals. Data were of different types.
           (a) A collection of 32-channel data from one male
                subject who performed a visual task.
           (b) Human data based on five disabled and four
                healthy subjects. The disabled subjects (1-5)
                were all wheelchair-bound but had varying
                communication and limb muscle control abilities.
                The four healthy subjects (6-9) were all male
                PhD students, aged 30, who had no known
                neurological deficits. Signals were recorded at a
                2048 Hz sampling rate from 32 electrodes placed
                at the standard positions of the 10-20
                international system.
           (c) A collection of 32-channel data from 14 subjects
                (7 males, 7 females) who performed a go-nogo
                categorization task and a go-nogo recognition
                task on natural photographs presented very
                briefly (20 ms). Each subject responded to a total
                of 2500 trials. The data is CZ-referenced and
                sampled at 1000 Hz.
           (d) Five data sets containing quasi-stationary, noise-
                free EEG signals in both normal and epileptic
                subjects. Each data set contains 100 single-
                channel EEG segments of 23.6 sec duration.
     (ii) http://www.cs.tut.fi/~gomezher/projects/eeg/databases.htm.
           Data here contains:
           (a) Two EEG recordings (linked-mastoids reference)
                from a healthy 27-year-old male in which the
                subject was asked to intentionally generate
                artifacts in the EEG.
           (b) Two 35-year-old males, where the data were
                collected from 21 scalp electrodes placed
                according to the international 10-20 system with
                additional electrodes T1 and T2 on the temporal
                region. The sampling frequency was 250 Hz and
                an average reference montage was used. The
                electrocardiogram (ECG) for each patient was
                also simultaneously acquired and is available in
                channel 22 of each recording.
     (iii) http://idiap.ch/scientific-research/resources/. Data
           here comes from 3 normal subjects during non-
           feedback sessions. The subjects sat in a normal chair,
           relaxed, arms resting on their legs.
     (iv) sites.google.com/site/projectbci. Data here is from a
           21-year-old right-handed male with no medical
           conditions. The EEG consists of actual random
           movements of the left and right hand recorded with
           eyes closed. Each row represents one electrode. The
           order of electrodes is FP1, FP2, F3, F4, C3, C4, P3, P4, O1, O2,




          F7, F8, T3, T4, T5, T6, FZ, CZ, PZ. Recording was
          done at 500 Hz using the Neurofax EEG system.
   These four sites produce real signals of different sizes;
however, all were 2D signals.

B. Methodology
    In this paper we compare the merger of BMICA with
TIWT against the results of the normal TIWT. In this research
the TIWT method for both tests involves the following steps:

     1. Signal Collection
This algorithm is designed to denoise both naturally and
artificially noised EEG signals. They should therefore be
mathematically defined based on Eq. (1).

    2. Apply CS to Signal
The number of time shifts is determined; in so doing, signals
are forcibly shifted so that their features change positions,
removing the undesirable oscillations which result in pseudo-
Gibbs phenomena. The circular shift by h is defined as:

        S_h(f(n)) = f((n + h) \bmod N)                                (13)

where f(n) is the signal, S is the time-shift operator and N is the
length of the signal. The time-shift operator S is unitary and
therefore invertible, i.e. (S_h)^{-1} = S_{-h}.

     3. Decomposition of Signal
The signals are decomposed into 5 levels of DWT using the
Symmlet family, separating noise and true signals. Symmlets
are orthogonal and their regularity increases with the number
of vanishing moments [8]. After experiments the number of
vanishing moments chosen is 8 (Sym8).

    4. Choose and Apply Threshold Value
Denoise using the soft-thresholding method, discarding all
coefficients below the threshold value using HardShrink, based
on the universal threshold defined by Donoho & Johnstone [7],
given as:

        T = \sqrt{2\sigma^2 \log N}                                   (14)

where N is the number of samples and σ² is the noise power.

   5. Reconstruction of Signals
EEG signals are reconstructed using the inverse DWT.

    6. Apply CS
Revert signals to their original time shift and average the
results obtained to produce the denoised EEG signals.

  The proposed algorithm can be expressed as Avg[Shift -
Denoise - Unshift], i.e. using Eq. (8) it is defined as:

        \operatorname{avg}_{h \in H} \left( S_{-h}\, T\, S_h(f) \right)          (15)

where H is the range of shifts, T is the wavelet shrinkage
denoising operator, h the circular shift, and the maximum of H
is the length of the signal N from Eq. (8).

C. Performance Matrix
  The analysis of the algorithm performance consisted of
estimating (1) the accuracy with which each algorithm was
able to separate components, and (2) the speed with which
each algorithm was able to reproduce EEG signals. For (1),
experiments were mainly aimed at assessing the algorithms'
ability to perform ICA (extraction of ICs) and not blind source
separation (recovery of original sources). The performance
measures used throughout are based on two categories of
calculation:
     1. Separation Accuracy Measures - Signal to Distortion
        Ratio (SDR), Signal to Interference Ratio (SIR); and
     2. Noise/Signal Measures - Mean Square Error (MSE),
        Peak Signal to Noise Ratio (PSNR), Signal to Noise
        Ratio (SNR).
Testing on (2) was not executed.

                 V.    RESULTS/DISCUSSION
   Experiments were conducted using the above-mentioned
signals, in Matrix Laboratory (MATLAB) 7.10.0.499 (R2010)
on a laptop with an AMD Athlon 64 X2 dual-core processor at
1.80 GHz. Figure 5 shows one mixed EEG signal set where
there are overlays in signals Nos. 6-8 and Nos. 14-18. Figures
6 and 7 show the same signal set after applying TIWT and the
BMICA-TIWT merger, showing that the overlays have been
minimized: noise has been removed. With BMICA-TIWT it
can be seen that more noise has been eliminated, especially in
signals Nos. 14-18.

                      Figure 5: Raw EEG
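The Avg[Shift - Denoise - Unshift] cycle of steps 2-6 and Eq. (15) can be sketched in code. This is a hedged illustration only: `soft_threshold` applied directly to the samples stands in for the full 5-level Sym8 wavelet shrinkage operator T, and the noise-power estimate is a simple assumption, not the paper's procedure:

```python
import numpy as np

def circular_shift(f, h):
    """S_h of Eq. (13): maps f(n) to f((n + h) mod N)."""
    return np.roll(f, -h)

def soft_threshold(c, t):
    """Shrinkage stand-in for the wavelet denoising operator T."""
    return np.sign(c) * np.maximum(np.abs(c) - t, 0.0)

def cycle_spin_denoise(f, n_shifts):
    """Avg over h in H of S_{-h} T S_h(f), as in Eq. (15)."""
    sigma2 = np.var(np.diff(f)) / 2.0           # crude noise-power estimate (assumption)
    t = np.sqrt(2.0 * sigma2 * np.log(f.size))  # universal threshold, Eq. (14)
    acc = np.zeros_like(f, dtype=float)
    for h in range(n_shifts):
        shifted = circular_shift(f, h)          # S_h   : shift
        denoised = soft_threshold(shifted, t)   # T     : denoise
        acc += circular_shift(denoised, -h)     # S_{-h}: unshift
    return acc / n_shifts                       # average over all shifts

rng = np.random.default_rng(2)
noisy = np.sin(np.linspace(0, 4 * np.pi, 256)) + 0.1 * rng.standard_normal(256)
denoised = cycle_spin_denoise(noisy, n_shifts=16)
```

Because S is unitary, denoising every shifted copy independently and averaging the unshifted results suppresses the pseudo-Gibbs oscillations that a single fixed alignment would leave behind.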




                        Figure 6: WT

                     Figure 7: BMICA-WT

A. Separation Accuracy Measures

SIR
   The most common situation in many applications is the
degenerate BSS problem, i.e. n < m. This is most likely the
case when we try to separate the underlying brain sources
from electroencephalographic (EEG) or
magnetoencephalographic (MEG) recordings using a reduced
set of electrodes. In degenerate demixing, the accuracy of a
BSS algorithm cannot be described using only the estimated
mixing matrix. In this case it becomes of particular importance
to measure how well BSS algorithms estimate the sources with
adequate criteria. The most commonly used index to assess the
quality of the estimated sources is the Signal to Interference
Ratio (SIR) [14]:

        \mathrm{SIR(dB)} = \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{j} \frac{|p_{ij}|}{\max_k |p_{ik}|} - 1 \right)        (16)

        Figure 8: SIR relations between BMICA-WT and TIWT

   SIR takes into account the fact that, in general, BSS is able
to recover the sources only up to (a permutation and) a gain
factor α. It is easy to check that if ŝ_i = αs_i the SIR is infinite.
On the contrary, when the estimated source is orthogonal to the
true source, the SIR is equal to zero.
   Investigations on the EEG data sets described above showed
that BMICA-WT produced higher SIR calculations than
TIWT. This can be seen in Figure 8, where for 18 signal sets
BMICA-WT produced a higher SIR 94% of the time. This
suggests that when merged with BMICA, TIWT achieved
better separation of EEG signals.

SDR
   While SIR assesses the quality of the estimated sources, and
the Amari index assesses the accuracy of the estimated mixing
matrix, the accuracy of the separation of an ICA algorithm in
terms of the signals (i.e. the overall separation performance) is
calculated by the total Signal to Distortion Ratio (SDR),
defined as:

        \mathrm{SDR}(x_i, y_i) = \frac{\sum_{n=1}^{L} x_i(n)^2}{\sum_{n=1}^{L} \left( y_i(n) - x_i(n) \right)^2}, \quad i = 1, \dots, m        (17)

where x_i(n) is the original source signal and y_i(n) is the
reconstructed signal. The SDR is expressed in decibels (dB).
The higher the SDR value, the better the separation of the
signal from the noise. When the SDR is calculated, if it is
found to be below 8-10 dB the algorithm is considered to have
failed separation.
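Eq. (17) together with the 8-10 dB failure band can be sketched as follows; the conversion to dB via 10·log10 and the toy signals are illustrative assumptions, not the paper's data:

```python
import numpy as np

def sdr_db(x, y):
    """Signal-to-Distortion Ratio of Eq. (17), converted to dB."""
    ratio = np.sum(x ** 2) / np.sum((y - x) ** 2)
    return 10.0 * np.log10(ratio)

rng = np.random.default_rng(1)
x = np.sin(np.linspace(0, 8 * np.pi, 1000))      # original source x_i(n)
good = x + 0.01 * rng.standard_normal(x.size)    # accurate reconstruction y_i(n)
poor = x + 0.8 * rng.standard_normal(x.size)     # heavily distorted reconstruction
print(sdr_db(x, good))   # well above the 8-10 dB failure band
print(sdr_db(x, poor))   # below it: separation considered failed
```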
                                                                    Examinations of experiment results show that BMICA-WT
                                                                tends to produce higher SDRs. In Table 1 it can be seen that




BMICA-TIWT produces a higher SDR 65% of the time. This
indicates that for almost every TIWT test there is a BMICA-
TIWT test which produces a more accurate separation of
signal and noise.

             TABLE I: SDR FOR 19 EEG SIGNAL SETS

                    BMICA-WT        TIWT
                    3.54E+03        2.14E+03
                    -88.843         -1.27E+02
                    -57.376         -80.281
                    -112.4126       -121.4977
                    -564.4613       -640.939
                    -217.66         -260.2769
                    -2.48E+03       -3.40E+03
                    -8.62E+04       -8.57E+04
                    27.0891         -0.002
                    4.77E+04        1.39E+03
                    6.80E+02        7.67E+02
                    2.38E+03        786.5632
                    2.73E+02        269.5584
                    1.83E+03        1.66E+03
                    1.12E+00        7.08E+02
                    6.50E+02        9.97E+02
                    8.55E+02        8.81E+02
                    4.74E+02        9.95E+02
                    2.71E+04        2.13E+04

B. Noise/Signal Measures

PSNR
   Peak Signal-to-Noise Ratio, often abbreviated as PSNR, is
an engineering term for the ratio between the maximum
possible power of a signal and the power of corrupting noise
that affects the fidelity of its representation. Because many
signals have a very wide dynamic range, PSNR is usually
expressed in terms of the logarithmic decibel scale:

        \mathrm{PSNR} = 10 \times \log_{10} \left( \frac{MAX^2}{MSE} \right)        (18)

   Figure 9 shows the relationship between BMICA-TIWT and
TIWT for PSNR. Close examination shows that for all 18
signal sets the PSNR for BMICA-TIWT was higher than that
of TIWT. BMICA-TIWT therefore produces a reconstructed
signal of higher quality and can be considered the better
algorithm for denoising.
   In this research MAX takes the value of 255. Unlike MSE,
which represents the cumulative squared error between the
denoised and mixed signals, PSNR represents a measure of the
peak error: when the two signals are identical the MSE is equal
to zero, resulting in an infinite PSNR. The higher the PSNR,
therefore, the higher the quality of the reconstruction and the
better the algorithm is considered.

        Figure 9: PSNR relations between BMICA-WT and TIWT

MSE
   The Mean Square Error (MSE) measures the average of the
square of the "error", which is the amount by which the
estimator differs from the quantity to be estimated. The
difference occurs because of randomness or because the
estimator doesn't account for information that could produce a
more accurate estimate. MSE thus assesses the quality of an
estimator in terms of its variation and unbiasedness. Note that
the MSE is not equivalent to the expected value of
the absolute error.

        \mathrm{MSE} = \frac{1}{N} \sum_{y=1}^{N} \left[ I(x, y) - I'(x, y) \right]^2        (19)

Since MSE is an expectation, it is a scalar, and not a random
variable. It may be a function of the unknown parameter θ, but
it does not depend on any random quantities. However, when
MSE is computed for a particular estimator of θ whose true
value is not known, it will be subject to an estimation error. In
a Bayesian sense, this means that there are cases in which it
may be treated as a random variable.
   Examination of the experiments shows that BMICA-WT
produces a smaller MSE than TIWT; see Table 2. Normally
MSE is inversely proportional to PSNR, i.e. when the
calculated MSE is equal to zero, the PSNR is infinite. A good
algorithm will therefore have a small MSE and a large PSNR.
Investigations show that BMICA-TIWT produces smaller
MSE and larger PSNR than TIWT: it is the better algorithm, as
it produces results closer to the actual data.
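The inverse relationship between MSE and PSNR in Eqs. (18)-(19) can be sketched as follows; the ramp signal and the specific error levels are illustrative assumptions, while MAX = 255 follows the text:

```python
import numpy as np

def mse(i_true, i_rec):
    """Mean Square Error, Eq. (19)."""
    return np.mean((np.asarray(i_true, float) - np.asarray(i_rec, float)) ** 2)

def psnr(i_true, i_rec, max_val=255.0):
    """Peak Signal-to-Noise Ratio, Eq. (18); infinite for identical signals."""
    m = mse(i_true, i_rec)
    return float("inf") if m == 0 else 10.0 * np.log10(max_val ** 2 / m)

clean = np.linspace(0.0, 255.0, 100)
small_err = clean + 5.0            # constant error of 5  -> MSE ~ 25
large_err = clean + 20.0           # constant error of 20 -> MSE ~ 400
print(mse(clean, small_err))       # ~25.0
print(psnr(clean, small_err))      # larger PSNR than for large_err
print(psnr(clean, large_err))      # larger MSE, smaller PSNR
print(psnr(clean, clean))          # inf: identical signals
```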




             TABLE II: MSE FOR 19 EEG SIGNAL SETS

                    BMICA-WT        TIWT
                    1.33E+03        2.27E+04
                    44.032          583.0681
                    21.863          529.9447
                    25.8048         501.1608
                    5.404           1.24E+03
                    2.8685          917.3362
                    15.7071         1.57E+03
                    53.7782         3.25E+05
                    3.67E+03        5.94E+11
                    1.04E+04        4.24E+04
                    6.74E+03        4.01E+04
                    1.90E+04        3.16E+04
                    1.10E+02        4.33E+03
                    1.84E+04        2.13E+04
                    6.06E+03        4.05E+04
                    2.98E+03        4.08E+04
                    2.75E+03        3.72E+04
                    6.32E+03        3.15E+04

                         REFERENCES
[1]  M. Alfaouri and K. Daqrouq, "ECG Signal Denoising by Wavelet
     Transform Thresholding," American Journal of Applied Sciences,
     vol. 5, no. 3, pp. 276-281, 2008.
[2]  M. Alfaouri, K. Daqrouq, I. N. Abu-Isbeih, E. F. Khalaf, A. Al-Qawasmi,
     and W. Al-Sawalmeh, "Quality Evaluation of Reconstructed Biological
     Signals," American Journal of Applied Sciences, vol. 6, no. 1,
     pp. 187-193, 2009.
[3]  M. I. Bhatti, A. Pervaiz, and M. H. Baig, "EEG Signal Decomposition
     and Improved Spectral Analysis Using Wavelet Transform," in
     Proceedings of the 23rd Engineering in Medicine and Biology Society,
     vol. 2, pp. 1862-1864, 2001.
[4]  R. R. Coifman and D. L. Donoho, "Translation-Invariant De-Noising,"
     Lecture Notes in Statistics: Wavelets and Statistics, pp. 125-150, 1995.
[5]  P. Comon, "Independent Component Analysis, a New Concept?" Signal
     Processing, vol. 36, no. 3, pp. 287-314, 1994.
[6]  R. J. Croft and R. J. Barry, "Removal of Ocular Artifacts from the EEG:
     A Review," Clinical Neurophysiology, vol. 30, no. 1, pp. 5-19, 2000.
[7]  D. L. Donoho and I. M. Johnstone, "Adapting to Unknown Smoothness
     via Wavelet Shrinkage," Journal of the American Statistical Association,
     vol. 90, no. 432, pp. 1200-1224, 1995.
[8]  B. Ferguson and D. Abbott, "Denoising Techniques for Terahertz
     Response of Biological Samples," Microelectronics Journal, vol. 32,
     pp. 943-953, 2001.
[9]  R. Gribonval, E. Vincent, and C. Févotte, "Proposals for Performance
     Measurement in Source Separation," in Proceedings of the 4th
     International Symposium on Independent Component Analysis and Blind
     Signal Separation (ICA2003), Nara, Japan, pp. 763-768, 2003.
[10] Y. M. Hawwar, A. M. Reza, and R. D. Turney, "Filtering (Denoising) in
     the Wavelet Transform Domain," Department of Electrical Engineering
     and Computer Science, University of Wisconsin-Milwaukee, 2002.
     Unpublished.
[11] S. Hoffmann and M. Falkenstein, "The Correction of Eye Blink
     Artefacts in the EEG: A Comparison of Two Prominent Methods,"
     PLoS One, vol. 3, no. 8, e3004, 2008.
[12] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component
     Analysis. Wiley & Sons, 2001.
[13] G. Inuso, F. La Foresta, N. Mammone, and F. C. Morabito, "Wavelet-
     ICA Methodology for Efficient Artifact Removal from
     Electroencephalographic Recordings," in Proceedings of the
     International Joint Conference on Neural Networks, pp. 1524-1529,
     2007.
[14] V. Krishnaveni, S. Jayaraman, A. Gunasekaran, and K. Ramadoss,
                     VI.     CONCLUSIONS                                     “Automatic Removal of Ocular Artifacts using JADE Algorithm and
                                                                             Neural Network”, International Journal of Intelligent Systems and
   Research have found that WT is the best suited for                        Technologies, vol. 1 no. 4, pp. 322-333, 2006.
denoising as far as performance goes because of its properties        [15]   T.L. Lee-Chiong, Sleep: A Comprehensive Handbook eds John Wiley &
like sparsity, multiresolution and multiscale nature. Non-                   Sons, 2006.
orthogonal wavelets such as UDWT and Multiwavelets                    [16]   M. Lennon, G. Mercier, M.C. Mouchot, and L. Hubert-Moy,
improve the performance at the expense of a large overhead in                “Curvilinear Component Analysis for non-linear dimensionality
                                                                             reduction of hyperspectral images”, in the Proceedings of the SPIE
their computation [28]. Research also shows that TIWT is                     Symposium on Remote Sensing Conference on Image and Signal
considered to be an improvement on WT, removing Gibbs                        Processing for Remote Sensing VII 4541, p 157, 2001.
phenomena. In this work we have found that the addition of            [17]   M.C. Motwani, M.C. Gadiya R.C., Motwani and F.C. Harris Jr.,
BMICA to TIWT has been found to improve its performance.                     “Survey of Image Denoising Techniques”, In the Proceedings of the
With the BMICA merger the separation accuracy of TIWT                        Global Signal Processing Expo and Conference (GSPx), pp 27-30, 2004.
increased although it was not so 100% of time with SDR. As            [18]   V.V.K.D.V. Prasad, P. Siddaiah, and B. Prabhaksrs Rao, “A New
                                                                             Wavelet Based Method for Denoising of Biological Signals”,
far as the noise/signal separation goes however the merger                   International Journal of Computer Science and Network Security
produces a better quality reconstructed signal 100% of the                   (IJCSNS), vol. 8, no. 1, pp. 238-244, 2008.
time.                                                                 [19]   N. Ramachandran, and A.K. Chellappa, “Feature extraction from EEG
                                                                             using wavelets: spike detection algorithm”, In the Proceedings of the 7th
   .                                                                         International Conference on Mathematics in Signal Processing, 2006
                                                                      [20]   R. Romo-Vazquez, R. Ranta, V. Louis-Dorr, and D. Maquin, “Ocular
                        REFERENCES                                           Artifacts Removal in Scalp EEG: Combining ICA and Wavelet
                                                                             Denoising”, Physics in Signal and Image Processing (PSISP 07), 2007
                                                                      [21]   P. Senthil Kumar, R. Arumuganathan, K. Sivakumar, and C. Vimal, “A
                                                                             Wavelet based Statistical Method for De-noising of Ocular Artifacts in




                                                                 55                                      http://sites.google.com/site/ijcsis/
                                                                                                         ISSN 1947-5500
                                                                           (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                                                                     Vol. 9, No.6, 2011
       EEG Signals”, International Journal of Computer Science and Network
       Security (IJCSNS) vol. 8 no. 9, pp. 87-92, 2008.
[22] P. Senthil Kumar, R. Arumuganathan, K. Sivakumar, and C. Vimal, "Removal of Ocular Artifacts in the EEG through Wavelet Transform without using an EOG Reference Channel", International Journal of Open Problems in Computer Science and Mathematics (IJOPCM), vol. 1, no. 3, pp. 189-198, 2008.
[23] P. Senthil Kumar, R. Arumuganathan, K. Sivakumar, and C. Vimal, "An Adaptive method to remove ocular artifacts from EEG signals using Wavelet", Journal of Applied Sciences Research, vol. 5, no. 7, pp. 741-745, 2009.
[24] L. Su, and G. Zhao, "Denoising of ECG Signal Using Translation Invariant Wavelet Denoising Method with Improved Thresholding", in the 27th Annual Conference IEEE Engineering in Medicine and Biology, pp. 5946-5949, 2005.
[25] M. Ungureanu, C. Bigan, R. Strungaru, and V. Lazarescu, "Independent Component Analysis Applied in Biomedical Signal Processing", Measurement Science Review, vol. 4, no. 2, 2004.
[26] J. Walters-Williams, and Y. Li, "Estimation of Mutual Information: A Survey", in the Proceedings of the 4th International Conference on Rough Set and Knowledge Technology (RSKT2009), pp. 389-396, 2009.
[27] W. Zhou, and J. Gotman, "Removal of EMG and ECG Artifacts from EEG Based on Wavelet Transform and ICA", in the Proceedings of the 26th Annual International Conference of the IEEE EMBS, pp. 392-395, 2004.
[28] W. Zhou, and J. Gotman, "Removing Eye-movement Artifacts from the EEG during the Intracarotid Amobarbital Procedure", Epilepsia, vol. 46, no. 3, pp. 409-411, 2005.
[29] G. Zouridakis, and D. Iyer, "Comparison between ICA and Wavelet-based Denoising of single-trial evoked potentials", in the Proceedings of the 26th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 87-90, 2004.

                                                   AUTHORS PROFILE

Janett Walters-Williams received the B.S. and M.S. degrees from the University of the West Indies in 1994 and 2001, respectively. She is presently a doctoral student at the University of Southern Queensland. After working as an assistant lecturer (from 1995) in the Dept. of Computer Studies at the University of Technology, she has been a lecturer in the School of Computing & Information Technology since 2001. Her research interests include Independent Component Analysis, neural network applications, signal/image processing, bioinformatics and artificial intelligence.

Yan Li received the B.E., M.E., and Dr. Eng. degrees from Hiroshima Univ. in 1982, 1984, and 1990, respectively. She has been an associate professor at the University of Queensland since 2008. She is the winner of the 2008 Queensland Smart Woman-Smart State Award in ICT, as well as one of the Head of Department awardees for research publications in 2006 and 2008. She is an Australian Reader assessing Australian Research Council Discovery and Linkage Project proposals, and has organized the RSKT 2009 and CME 2010 international conferences. Her research interests include signal/image processing, independent component analysis, biomedical engineering, blind signal separation and artificial intelligence.





    Hole Filling IFCNN Simulation by Parallel RK(5,6)
                      Techniques
                                                   (Hole Filling by Parallel RK(5,6))


                   Sukumar Senthilkumar*                                                             Abd Rahni Mt Piah
                  Universiti Sains Malaysia                                                       Universiti Sains Malaysia
              School of Mathematical Sciences                                                  School of Mathematical Sciences
                  11800 USM Pulau Pinang                                                          11800 USM Pulau Pinang
                         MALAYSIA                                                                        MALAYSIA
           E-mail: ssenthilkumar1974@yahoo.co.in                                                 E-mail: arahni@cs.usm.my
                   ssenthilkumar@usm.my


Abstract— This paper employs different parallel RK(5,6) techniques for hole filling via the unique characteristics of improved fuzzy cellular neural network (IFCNN) simulation, in order to improve the performance of image or handwritten character recognition. Results are presented according to the range of template parameters selected for simulation.

    Keywords- Parallel 5-order 6-stage numerical integration techniques, Improved fuzzy cellular neural network, Hole filling, Simulation, Ordinary differential equations.

                          I.     INTRODUCTION

    Parallel computing techniques are used to carry out computations simultaneously, operating on the principle that large problems can often be divided into smaller ones, which can then be solved concurrently. It is the simultaneous use of multiple computing resources to solve a computational problem easily and quickly. Researchers widely hold that employing parallel algorithms effectively is a practical way of solving many significant, computationally intensive problems in science and engineering.

    From the literature, it is observed that most real-time problems are solved by adapting Runge-Kutta (RK) methods, which are applied to compute numerical solutions of various problems modeled as initial value problems, as in Alexander and Coyle [3], Evans [4], Hung [5], Shampine and Watts [6] and Shampine and Gordon [7]. Shampine and Watts [6] developed mathematical codes for the fourth order Runge-Kutta method to solve many numerical problems. A fifth order Runge-Kutta formula was developed by Butcher [8-10] to solve many computational problems. Evans and Sanugi [11] developed parallel Runge-Kutta integration techniques for the step-by-step solution of ordinary differential equations. Ponalagusamy and Ponammal [12-14] developed a new parallel fifth order algorithm to solve a robot arm model, a time-varying network for first order initial value problems, and a new generalised plasticity equation for compressible powder metallurgy materials, with results on the stability region for the test equation. Keyes et al. [15] provided a survey of applications requiring the memories and processing rates of large-scale parallelism and of the corresponding parallel numerical algorithms, focusing further on practical medium-granularity parallelism approachable through traditional programming languages. Gear [16] examined the potential for parallelism in solving real-time problems modeled by ordinary differential equations. Surveys of the potential for parallelism in Runge-Kutta techniques, and of parallel numerical techniques for initial value problems in ordinary differential equations, are given by Norsett and Jackson [17] and Jackson [18]. Using a fourth order explicit Runge-Kutta method, a parallel mesh-chopping algorithm for a class of initial value problems is illustrated by Katti and Srivastava [19]. Harrer et al. [20] introduced explicit Euler, predictor-corrector and fourth-order Runge-Kutta algorithms for simulating cellular neural networks. The RK-Butcher algorithm was introduced by Bader [21, 22] for finding truncation error estimates, intrinsic accuracies and early detection of stiffness in coupled differential equations arising in theoretical chemistry problems. Senthilkumar and Piah [23] implemented a parallel Runge-Kutta arithmetic mean algorithm to solve a second order robot arm system. Oliveira [24] introduced a popular sequential RK-Gill algorithm to evaluate the effectiveness factor of immobilized enzymes. In this paper, a new attempt is made to employ parallel RK(5,6) algorithms for the hole-filling problem in an IFCNN environment.

This research work was carried out by the first author under a post doctoral fellow scheme at the School of Mathematical Sciences, Universiti Sains Malaysia, 11800 USM Pulau Pinang, MALAYSIA.
*Corresponding Author.
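The excerpt does not reproduce the RK(5,6) coefficients themselves, so as an illustration only of the class of Runge-Kutta steps surveyed above, here is a classical fourth order step in Python; in the parallel schemes discussed, stages that do not depend on one another would be evaluated concurrently:

```python
def rk4_step(f, t, x, h):
    """One classical fourth order Runge-Kutta step for dx/dt = f(t, x).

    A sequential stand-in for illustration: this is NOT the paper's
    parallel RK(5,6) scheme, whose coefficients are not given in this
    excerpt.
    """
    k1 = f(t, x)
    k2 = f(t + h / 2.0, x + h * k1 / 2.0)
    k3 = f(t + h / 2.0, x + h * k2 / 2.0)
    k4 = f(t + h, x + h * k3)
    return x + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
```

Driving this step repeatedly over, say, dx/dt = -x recovers the exponential decay solution to high accuracy even with a moderate step size.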





    Computing value is easy when implementing VLSI CNN chips, thereby making real-time operation possible. Roska [28] and Roska et al. [29] presented the first widely used simulation system, which allows the simulation of a large class of CNNs and is especially suited to image processing applications; it also covers signal processing, pattern recognition, and the solution of ordinary and partial differential equations, as in Gonzalez et al. [30]. The hole-filling problem has been studied by Murugesh and Badri [32] with the existing fifth order RK-Butcher method via a CNN simulation model. Similarly, the hole-filling problem has been analyzed by Murugesan and Elango [50] by means of the existing fourth order RK method under CNN simulation. Dalla Betta et al. [46] reported a CMOS implementation of an analog programmable cellular neural network. Anguita et al. [31] discussed in detail parameter configurations for hole extraction in cellular neural networks.

    Zadeh [35] and Zadeh et al. [36] introduced the concept of fuzzy set (FS) theory. Different notions of higher-order FSs have been proposed by different researchers. Recently, the fuzzy cellular neural network model [43-45] has attracted a great deal of interest among researchers from different disciplines. CNNs, introduced by Chua and Yang [25-26] and Chua [27], are locally interconnected, regularly repeated, analogue (continuous- or discrete-time) circuits with a one-, two- or three-dimensional grid architecture. Each cell (neuron) in a CNN is a non-linear dynamic system coupled only to its nearest neighbors. Because of this local interconnection property, CNNs have been considered specifically suitable for very-large-scale integration implementations. Shitong et al. [37] proposed improved fuzzy cellular networks that incorporate a novel fuzzy status, containing the useful information beyond a white blood cell, into the state equation, thereby enhancing boundary integrity. Laiho et al. [38] proposed template design for CNNs with 1-bit weights.

    This paper is organized as follows. A brief introduction to the improved fuzzy cellular non-linear network is presented in section 2. Section 3 deals with the hole-filler template design and simulation results. Section 4 discusses parallel RK(5,6) numerical integration techniques. Finally, concluding remarks are presented in section 5.

              II.   A BRIEF OVERVIEW OF IFCNN

    The capability of the conventional cellular neural network to solve different kinds of image processing problems, and the capability of fuzzy logic to cope with uncertainty in images, are the inherent features of FCNN [37]. Moreover, it also has inbuilt connections with mathematical morphology. The unique characteristic of IFCNN is the incorporation of a novel fuzzy status into the feed-forward and feedback templates of FCNN, such that the useful information beyond the region can be sufficiently utilized. FCNN is a locally connected network [37]: the output of a neuron is connected to the inputs of every neuron/cell in its r × r neighborhood, and similarly the inputs of a neuron are only connected to the outputs of every neuron in its r × r neighborhood. It is apparent that feedback (not recurrent) connections are present. The architecture of IFCNN is shown in Figure 1.

                                                   Figure 1. Architecture of IFCNN

The state equation of IFCNN is given by

\[
c\,\frac{dx_{ij}}{dt} = -\frac{1}{R_x}\,x_{ij}
+ \sum_{C(k,l)\in N_r(i,j)} A(i,j;k,l)\,y_{kl}
+ \sum_{C(k,l)\in N_r(i,j)} B(i,j;k,l)\,u_{kl}
+ I_{ij}
+ \widetilde{\bigwedge}_{C(k,l)\in N_r(i,j)} \bigl(A_{f\min}(i,j;k,l) + y_{kl}\bigr)
+ \widetilde{\bigvee}_{C(k,l)\in N_r(i,j)} \bigl(A_{f\max}(i,j;k,l) + y_{kl}\bigr)
+ \widetilde{\bigwedge}_{C(k,l)\in N_r(i,j)} B_{f\min}(i,j;k,l)\,u_{kl}
+ \widetilde{\bigvee}_{C(k,l)\in N_r(i,j)} B_{f\max}(i,j;k,l)\,u_{kl}
+ \widetilde{\bigwedge}_{C(k,l)\in N_r(i,j)} F_{f\min}(i,j;k,l)\,x_{kl}
+ \widetilde{\bigvee}_{C(k,l)\in N_r(i,j)} F_{f\max}(i,j;k,l)\,x_{kl}
\tag{1}
\]
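The paper gives no code, but the state equation above can be sketched in Python/NumPy for a single cell. Everything below is an illustration, not the authors' implementation: the fuzzy operators are interpreted as plain min/max over the neighborhood, the images are assumed zero-padded, and all template contents are placeholders.

```python
import numpy as np

def ifcnn_cell_derivative(x, u, y, i, j, A, B, Af_min, Af_max,
                          Bf_min, Bf_max, Ff_min, Ff_max, I_bias,
                          c=1.0, Rx=1.0, r=1):
    """Right-hand side of the IFCNN state equation (1) for cell (i, j).

    A, B and the six fuzzy templates are (2r+1) x (2r+1) arrays; x, u, y
    are the state, input and output images, assumed padded so that the
    neighborhood never runs off the edge.  Template values are left to
    the caller -- the paper determines them per task.
    """
    xn = x[i - r:i + r + 1, j - r:j + r + 1]  # state neighborhood
    un = u[i - r:i + r + 1, j - r:j + r + 1]  # input neighborhood
    yn = y[i - r:i + r + 1, j - r:j + r + 1]  # output neighborhood

    dx = (-x[i, j] / Rx
          + np.sum(A * yn) + np.sum(B * un) + I_bias
          # fuzzy feedback terms: min/max over the neighborhood
          + np.min(Af_min + yn) + np.max(Af_max + yn)
          # fuzzy feed-forward terms
          + np.min(Bf_min * un) + np.max(Bf_max * un)
          # novel fuzzy status terms on the state itself
          + np.min(Ff_min * xn) + np.max(Ff_max * xn))
    return dx / c
```

A full simulation would evaluate this derivative for every cell and advance all states with a numerical integrator such as the parallel RK(5,6) techniques of section 4.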





and the input equation of cell Cij is given by

\[
u_{ij} = E_{ij} \ge 0, \qquad 1 \le i \le M,\; 1 \le j \le N, \tag{2}
\]

and the output equation of Cij by

\[
y_{ij} = f(x_{ij}) = \tfrac{1}{2}\bigl(\,|x_{ij} + 1| - |x_{ij} - 1|\,\bigr), \qquad 1 \le i \le M,\; 1 \le j \le N. \tag{3}
\]

The constraints/conditions are given by

\[
A_{f\min}(i,j;k,l) = A_{f\min}(k,l;i,j); \quad
A_{f\max}(i,j;k,l) = A_{f\max}(k,l;i,j);
\]
\[
F_{f\min}(i,j;k,l) = F_{f\min}(k,l;i,j); \quad
F_{f\max}(i,j;k,l) = F_{f\max}(k,l;i,j); \qquad 1 \le i \le M,\; 1 \le j \le N, \tag{4}
\]
\[
|x_{ij}(0)| \le 1, \quad |u_{ij}| \le 1, \quad A(i,j;k,l) = A(k,l;i,j), \qquad 1 \le i \le M,\; 1 \le j \le N.
\]

In the above Eqs. (1)-(4), \(\widetilde{\wedge}\), \(\widetilde{\vee}\), \(N_r(i,j)\) and \(\widetilde{A}\) are identical to those in FCNN. Comparing with FCNN, the only discrepancy between the equations is the novel fuzzy status

\[
\widetilde{\bigwedge}_{C_{kl}\in N_r(i,j)} F_{f\min}(i,j;k,l)\,x_{kl}
+ \widetilde{\bigvee}_{C_{kl}\in N_r(i,j)} F_{f\max}(i,j;k,l)\,x_{kl} \tag{5}
\]

adhered to in Eq. (1), which reflects the required information, where Ffmin(i,j;k,l) and Ffmax(i,j;k,l) indicate the connection weights between cells Cij and Ckl, respectively. Hence the complete template, which determines the connection between a cell and its neighbors, consists of the (2r + 1) × (2r + 1) matrices A, B, Ffmin and Ffmax. Symmetric matrices are considered in the above template to meet the IFCNN's symmetry requirements.

      III.   A BRIEF SKETCH ON HOLE-FILLER AND SIMULATION RESULTS

In hole-filling IFCNN simulation [46-50], all the holes in a bipolar image are filled while the image remains unaltered outside the holes. Let \(R_x = 1\), \(C = 1\), and take +1 to represent a black pixel and -1 a white pixel. If the bipolar image \(U = \{u_{ij}\}\) is input to the IFCNN and the holes in the image are enclosed by black pixels, then the initial state values are set to \(x_{ij}(0) = 1\). The output values \(y_{ij}(0) = 1,\; 1 \le i \le M,\; 1 \le j \le N\) are then obtained from equation (3). Consider the templates A, B and the independent current source I as

\[
A = \begin{bmatrix} 0 & a & 0 \\ a & b & a \\ 0 & a & 0 \end{bmatrix}, \; a > 0,\; b > 0; \qquad
B = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \qquad I = -1, \tag{6}
\]

where the template parameters a and b are to be determined. In order to make the outer edge cells behave like inner ones, auxiliary cells are normally added along the outer boundary of the image; their state values are set to zero by circuit realization, resulting in zero output values. The state equation (1) can then be rewritten as

\[
\frac{dx_{ij}}{dt} = -x_{ij}
+ \widetilde{\bigwedge}_{C(k,l)\in N_r(i,j)} \bigl(A_{f\min}(i,j;k,l) + y_{kl}\bigr)
+ \widetilde{\bigvee}_{C(k,l)\in N_r(i,j)} \bigl(A_{f\max}(i,j;k,l) + y_{kl}\bigr)
+ 4u_{ij}(t) - I. \tag{7}
\]

For instance, here the cells C(i+1,j), C(i-1,j), C(i,j+1) and C(i,j-1) are non-diagonal cells. The design of the hole-filler template [31] and its various sub-problems are discussed using CNN simulations [46-50]. Figures 2 and 3 show the hole filling of an image (before and after) by employing a parallel RK(5,6) type-III technique. The settling time Ts and computation time Tc for different step sizes are considered for the purpose of comparison. The settling time Ts is the time
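The piecewise-linear output nonlinearity of eq. (3), used throughout the hole-filling section, can be written as a one-line sketch (an illustration, not code from the paper): it is the identity on [-1, 1] and saturates to ±1 outside that interval.

```python
def ifcnn_output(x):
    """Piecewise-linear output of eq. (3): y = (|x + 1| - |x - 1|) / 2.

    Acts as the identity for x in [-1, 1] and saturates to -1 or +1
    outside, which is what keeps the pixel outputs bipolar.
    """
    return 0.5 * (abs(x + 1.0) - abs(x - 1.0))
```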




                                                                           (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                                                               Vol. 9, No. 6, June 2011


The settling time Ts is the time from the start of computation until the last cell leaves the interval [-1.0, 1.0], based on a specified limit (e.g., |dx/dt| < 0.01). The computation time Tc is the time taken for settling the network and adjusting each cell to its proper position once the network is settled. The simulation shows the desired output for every neuron/cell; specifically, +1 and -1 indicate black and white pixels, respectively. The selected template parameters a and b are restricted to the shaded area shown in Figure 4 for the simulation.
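The settling criterion above can be checked numerically by monitoring |dx/dt| at every cell and recording the first time all derivatives fall below the limit. A small sketch using a hypothetical scalar dynamic dx/dt = -x as a stand-in for the CNN state equation:

```python
# Settling-time detection in the sense used above: integrate until
# |dx/dt| < 0.01 holds for every cell, and report that time as Ts.
# The dynamic below (dx/dt = -x) is a stand-in, not the CNN equation.

def settling_time(x0, deriv, dt=0.01, limit=0.01, t_max=100.0):
    x = list(x0)
    t = 0.0
    while t < t_max:
        dx = [deriv(xi) for xi in x]
        if all(abs(d) < limit for d in dx):
            return t          # network has settled
        x = [xi + dt * d for xi, d in zip(x, dx)]  # forward Euler step
        t += dt
    return None               # did not settle within t_max

ts = settling_time([1.0, -0.5, 2.0], deriv=lambda x: -x)
print(round(ts, 2))  # -> 5.28 (slowest cell decays below the limit last)
```

In the paper's setting the forward Euler step would be replaced by the parallel RK(5,6) update, and Ts compared across step sizes.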
Figure 2(a). Original image and hole-filled image

Figure 2(b). Original image and hole-filled image

Figure 2. Hole filling before and after adapting the type-III parallel RK(5,6) technique

Figure 3. Hole filling before and after employing the type-III parallel RK(5,6) technique

Figure 4. Range of the template

    IV. PARALLEL RUNGE-KUTTA FIFTH-ORDER TECHNIQUES: A BRIEF OVERVIEW

A. Parallel Runge-Kutta 5-Order 6-Stage Type-I Technique

A parallel Runge-Kutta 5-order 6-stage type-I technique [12-14] is one of the simplest methods used to solve ordinary differential equations. It is an explicit formula which adopts the Taylor series expansion in order to obtain the approximation. A parallel Runge-Kutta 5-order 6-stage type-I technique is used to determine y_j and y'_j, j = 1, 2, 3, ..., m, such that

y_{n+1} = y_n + [(7/90)k1 + (32/90)k3 + (12/90)k4 + (32/90)k5 + (7/90)k6].   (8)

Thus, the corresponding parallel Runge-Kutta 5-order 6-stage type-I technique is represented by the Butcher array

0    |
2/5  | 2/5
1/4  | 11/64   5/64
1/2  | 3/16    5/16    0
3/4  | 9/32   -27/32   3/4     9/16
1    | -9/28   35/28   0      -12/7    8/7
-----+---------------------------------------------
     | 7/90    0       32/90   12/90   32/90   7/90
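Any explicit Butcher array of this kind can be executed by one generic stepper that reads off the stage matrix and the weight vector. A minimal sketch, exercised here with the classical 4-stage RK4 array (chosen because its behaviour is easy to verify; the 6-stage arrays of this section plug into the same routine):

```python
# Generic explicit Runge-Kutta stepper driven by a Butcher array:
# a lower-triangular stage matrix A (one row per stage, row i holding
# a_i1..a_i,i-1) and a weight vector b. Demonstrated with classical RK4.

def rk_step(f, y, h, A, b):
    """One explicit RK step for the autonomous ODE y' = f(y)."""
    k = []
    for row in A:                       # stage i uses k_1..k_{i-1}
        yi = y + h * sum(a * kj for a, kj in zip(row, k))
        k.append(f(yi))
    return y + h * sum(bi * ki for bi, ki in zip(b, k))

# Classical RK4 Butcher array (stage rows and weights).
A_rk4 = [[], [0.5], [0.0, 0.5], [0.0, 0.0, 1.0]]
b_rk4 = [1/6, 1/3, 1/3, 1/6]

# Integrate y' = -y from y(0) = 1 to t = 1; the exact answer is e^{-1}.
import math
y, h = 1.0, 0.1
for _ in range(10):
    y = rk_step(lambda u: -u, y, h, A_rk4, b_rk4)
print(abs(y - math.exp(-1)))  # small error, consistent with 4th order
```

The 6-stage arrays above are used in exactly the same way: replace `A_rk4`/`b_rk4` with the stage rows and weights of the chosen tableau.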


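The point of choosing a43 = 0 in the type-I array (and a65 = 0, a54 = 0 in types II and III) is that two stages then depend on the same earlier k values, so their function evaluations can run at the same time on two processors. A sketch of this idea with Python threads, using the type-I coefficients for k3 and k4; the right-hand side f here is a placeholder, not the CNN dynamics:

```python
# With a43 = 0, stages k3 and k4 both depend only on k1 and k2, so the
# two evaluations can be issued concurrently on two processors.
from concurrent.futures import ThreadPoolExecutor

def f(y):
    return -y                     # placeholder right-hand side

def stage(y, h, coeffs, ks):
    """k_i = f(y + h * sum_j a_ij k_j) for an explicit RK stage."""
    return f(y + h * sum(a * k for a, k in zip(coeffs, ks)))

y, h = 1.0, 0.1
k1 = f(y)
k2 = stage(y, h, [2/5], [k1])
with ThreadPoolExecutor(max_workers=2) as pool:
    # a43 = 0: k3 and k4 need only k1 and k2 -> evaluate simultaneously
    fut3 = pool.submit(stage, y, h, [11/64, 5/64], [k1, k2])
    fut4 = pool.submit(stage, y, h, [3/16, 5/16], [k1, k2])
    k3, k4 = fut3.result(), fut4.result()
print(k3, k4)
```

The same pattern covers k5/k6 in the type-II method (a65 = 0) and k4/k5 in the type-III method (a54 = 0).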


The 6-stage 5th-order algorithm with 5 parallel stages on 2 processors, obtained by selecting a43 = 0 so that k3 and k4 can be evaluated simultaneously, is given by

k1^ij = Δt f(x_ij(t_n)),
k2^ij = Δt f(x_ij(t_n) + (2/5)k1^ij),
k3^ij = Δt f(x_ij(t_n) + (11/64)k1^ij + (5/64)k2^ij) = k3*^ij,
k4^ij = Δt f(x_ij(t_n) + (3/16)k1^ij + (5/16)k2^ij) = k4*^ij,
k5^ij = Δt f(x_ij(t_n) + (9/32)k1^ij - (27/32)k2^ij + (3/4)k3^ij + (9/16)k4^ij) = k5*^ij,
k6^ij = Δt f(x_ij(t_n) - (9/28)k1^ij + (35/28)k2^ij - (12/7)k4^ij + (8/7)k5^ij) = k6*^ij.   (9)

Therefore, the final integration is a weighted sum of the five calculated derivatives, which is given as

∫_{t_n}^{t_{n+1}} f(x(t)) dt = (Δt/90)[7k1^ij + 32k3^ij + 12k4^ij + 32k5^ij + 7k6^ij].   (10)

B. Parallel Runge-Kutta 5-Order 6-Stage Type-II Technique

A parallel Runge-Kutta 5-order 6-stage type-II technique [12-14] is also one of the simplest methods used to solve ordinary differential equations. It is an explicit formula which adopts the Taylor series expansion in order to obtain the approximation. A parallel Runge-Kutta 5-order 6-stage type-II technique determines y_j and y'_j, j = 1, 2, 3, ..., m, such that

y_{n+1} = y_n + h[(23/192)k1 + (125/192)k3 - (81/192)k5 + (125/192)k6].   (11)

Thus, the corresponding parallel Runge-Kutta 5-order 6-stage technique of type-II is represented by the Butcher array

0    |
1/3  | 1/3
2/5  | 4/25     6/25
1    | 1/4    -3        15/4
2/3  | 6/81    90/81   -50/81    8/81
4/5  | 6/75    36/75    10/75    8/75
-----+------------------------------------------------
     | 23/192  0        125/192  0    -81/192  125/192

Therefore, the final integration is a weighted sum of four calculated derivatives per time step, which is given by

y_{n+1} = y_n + h[(23/192)k1 + (125/192)k3 - (81/192)k5 + (125/192)k6].   (12)

The 6-stage 5th-order algorithm with 5 parallel stages on 2 processors, obtained by selecting a65 = 0 so that k5 and k6 can be evaluated simultaneously, is given by

k1^ij = Δt f(x_ij(t_n)),
k2^ij = Δt f(x_ij(t_n) + (1/3)k1^ij),
k3^ij = Δt f(x_ij(t_n) + (4/25)k1^ij + (6/25)k2^ij),
k4^ij = Δt f(x_ij(t_n) + (1/4)k1^ij - 3k2^ij + (15/4)k3^ij),
k5^ij = Δt f(x_ij(t_n) + (6/81)k1^ij + (90/81)k2^ij - (50/81)k3^ij + (8/81)k4^ij) = k5*^ij,
k6^ij = Δt f(x_ij(t_n) + (6/75)k1^ij + (36/75)k2^ij + (10/75)k3^ij + (8/75)k4^ij) = k6*^ij.

Therefore, the final integration is a weighted sum of the four calculated derivatives, which is given by

∫_{t_n}^{t_{n+1}} f(x(t)) dt = (Δt/192)[23k1^ij + 125k3^ij - 81k5^ij + 125k6^ij].   (13)

C. Parallel Runge-Kutta 5-Order 6-Stage Type-III Technique

A parallel Runge-Kutta 5-order 6-stage type-III technique [12-14] is another simple method used to solve ordinary differential equations. It is also an explicit formula which adopts the Taylor series expansion for an approximation. A parallel Runge-Kutta 5-order 6-stage type-III technique determines y_j and y'_j, j = 1, 2, 3, ..., m, such that

y_{n+1} = y_n + h[-(17/306)k1 - (250/153)k3 + (442/255)k4 + (8192/9945)k5 + (31/234)k6].   (14)

Thus, the corresponding parallel Runge-Kutta 5-order 6-stage technique of type-III is represented by the Butcher array

0     |
1/5   | 1/5
2/5   | 39/160    5/32
1/2   | 1/24     -5/24     2/3
3/16  | 1/8      -3/16     1/4
1     | -9/14     15/14    8/7      -12/7     8/7
------+------------------------------------------------------
      | -17/306   0       -250/153   442/255   8192/9945   31/234

Therefore, the final integration is a weighted sum of five calculated derivatives per time step, which is given by

y_{n+1} = y_n + h[-(17/306)k1 - (250/153)k3 + (442/255)k4 + (8192/9945)k5 + (31/234)k6].   (15)

The 6-stage 5th-order algorithm with 5 parallel stages on 2 processors, obtained by selecting a54 = 0 so that k4 and k5 can be evaluated simultaneously, is given by

k1^ij = Δt f(x_ij(t_n)),
k2^ij = Δt f(x_ij(t_n) + (1/5)k1^ij),
k3^ij = Δt f(x_ij(t_n) + (39/160)k1^ij + (5/32)k2^ij),
k4^ij = Δt f(x_ij(t_n) + (1/24)k1^ij - (5/24)k2^ij + (2/3)k3^ij) = k4*^ij,
k5^ij = Δt f(x_ij(t_n) + (1/8)k1^ij - (3/16)k2^ij + (1/4)k3^ij) = k5*^ij,
k6^ij = Δt f(x_ij(t_n) - (9/14)k1^ij + (15/14)k2^ij + (8/7)k3^ij - (12/7)k4^ij + (8/7)k5^ij).   (16)

Therefore, the final integration is a weighted sum of the five calculated derivatives, which is given by

∫_{t_n}^{t_{n+1}} f(x(t)) dt = Δt[-(17/306)k1^ij - (250/153)k3^ij + (442/255)k4^ij + (8192/9945)k5^ij + (31/234)k6^ij].   (17)

                         V. CONCLUDING REMARKS

In this paper, the hole-filling problem is addressed under the IFCNN model using parallel RK(5,6) techniques, and its validity is illustrated by simulation results. It is observed that the hole is


filled and the outside image remains unaffected; that is, the edges of the image are preserved and remain intact. The templates of the cellular neural network are not unique, and this is important for its implementation. The significance of this work is to improve the performance of handwritten character recognition: many language scripts, numerals and images contain holes, and the CNN described above can be used in addition to the connected component detector. It is also noticed that the IFCNN preserves boundary integrity.

                          ACKNOWLEDGMENT

    The first author would like to extend his sincere gratitude to Universiti Sains Malaysia for supporting this work under its post-doctoral fellowship scheme. Much of this work was carried out during his stay at Universiti Sains Malaysia in 2011, and he wishes to acknowledge the university's financial support.

                            REFERENCES

  [1]  M. Korch, "Simulation-based analysis of parallel Runge-Kutta solvers", LNCS 3732, pp. 1105-1114, 2006.
  [2]  Z. Jia, "A parallel multiple time-scale reversible integrator for dynamics simulation", Future Generation Computer Systems, Vol. 19, pp. 415-424, 2003.
  [3]  R.K. Alexander and J.J. Coyle, "Runge-Kutta methods for differential-algebraic systems", SIAM Journal of Numerical Analysis, Vol. 27, pp. 736-752, 1990.
  [4]  D.J. Evans, "A new 4th order Runge-Kutta method for initial value problems with error control", International Journal of Computer Mathematics, Vol. 139, pp. 217-227, 1991.
  [5]  C. Hung, "Dissipativity of Runge-Kutta methods for dynamical systems with delays", IMA Journal of Numerical Analysis, Vol. 20, pp. 153-166, 2000.
  [6]  L.F. Shampine and H.A. Watts, "The art of a Runge-Kutta code. Part-I", Mathematical Software, Vol. 3, pp. 257-275, 1977.
  [7]  L.F. Shampine and M.K. Gordon, "Computer Solutions of Ordinary Differential Equations", W.H. Freeman, San Francisco, p. 23, 1975.
  [8]  J.C. Butcher, "On Runge processes of higher order", Journal of the Australian Mathematical Society, Vol. 4, p. 179, 1964.
  [9]  J.C. Butcher, "The Numerical Analysis of Ordinary Differential Equations: Runge-Kutta and General Linear Methods", John Wiley & Sons, Chichester, 1987.
 [10]  J.C. Butcher, "On order reduction for Runge-Kutta methods applied to differential-algebraic systems and to stiff systems of ODEs", SIAM Journal of Numerical Analysis, Vol. 27, pp. 447-456, 1990.
 [11]  D.J. Evans and B.B. Sanugi, "A parallel Runge-Kutta integration method", Parallel Computing, Vol. 11, pp. 245-251, 1989.
 [12]  R. Ponalagusamy and K. Ponnammal, "Investigations on robot arm model using a new parallel RK-fifth order algorithm", International Journal of Computer, Mathematical Sciences and Applications, Vol. 2, pp. 155-164, 2008.
 [13]  R. Ponalagusamy and K. Ponnammal, "A new parallel RK-fifth order algorithm for time varying network and first order initial value problems", Journal of Combinatorics, Information & System Sciences, Vol. 33, pp. 397-409, 2008.
 [14]  R. Ponalagusamy and K. Ponnammal, "New generalised plasticity equation for compressible powder metallurgy materials: A new parallel RK-Butcher method", International Journal of Nanomanufacturing, Vol. 6, pp. 395-408, 2010.
 [15]  D.E. Keyes, A. Sameh and V.V. Krishnan, "Parallel Numerical Algorithms", Kluwer Academic Publishers, 1997.
 [16]  C.W. Gear, "The potential for parallelism in ordinary differential equations", Technical Report UIUCDCS-R-86-1246, Computer Science Department, University of Illinois, Urbana, IL, 1986.
 [17]  K.R. Jackson and S.P. Norsett, "The potential for parallelism in Runge-Kutta methods. Part-I: RK formulas in standard form", SIAM Journal on Numerical Analysis, Vol. 32, pp. 49-82, 1995.
 [18]  K.R. Jackson, "A survey of parallel numerical methods for initial value problems for ordinary differential equations", IEEE Transactions on Magnetics, Vol. 27, pp. 3792-3797, 1991.
 [19]  C.P. Katti and D.K. Srivastava, "On a parallel mesh chopping algorithm for fourth order explicit Runge-Kutta method", Applied Mathematics and Computation, Vol. 143, pp. 563-570, 2003.
 [20]  H. Harrer, A. Schuler and E. Amelunxen, "Comparison of different numerical integrations for simulating cellular neural networks", CNNA-90 Proceedings of the IEEE International Workshop on Cellular Neural Networks and their Applications, pp. 151-159, 1990.
 [21]  M. Bader, "A comparative study of new truncation error estimates and intrinsic accuracies of some higher order Runge-Kutta algorithms", Computers & Chemistry, Vol. 11, pp. 121-124, 1987.
 [22]  M. Bader, "A new technique for the early detection of stiffness in coupled differential equations and application to standard Runge-Kutta algorithms", Theoretical Chemistry Accounts, Vol. 99, pp. 215-219, 1998.
 [23]  S. Senthilkumar and A.R.M. Piah, "Solution to a system of second order robot arm by parallel Runge-Kutta arithmetic mean algorithm", InTechOpen, pp. 39-50, 2011.
 [24]  S.C. Oliveira, "Evaluation of effectiveness factor of immobilized enzymes using Runge-Kutta-Gill method: How to solve mathematical undetermination at particle center point?", Bioprocess Engineering, Vol. 20, pp. 185-187, 1999.
 [25]  L.O. Chua and L. Yang, "Cellular neural networks: Theory", IEEE Transactions on Circuits and Systems, Vol. 35, pp. 1257-1272, 1988.
 [26]  L.O. Chua and L. Yang, "Cellular neural networks: Applications", IEEE Transactions on Circuits and Systems, Vol. 35, pp. 1273-1290, 1988.
 [27]  L.O. Chua, "CNN: A Paradigm for Complexity", World Scientific Series on Nonlinear Science, Series A, Vol. 31, 1998.
 [28]  T. Roska, "CNN Software Library", Version 1.1, Hungarian Academy of Sciences, Analogical and Neural Computing Laboratory, 2000. [Online]. Available: http://lab.analogic.sztaki.hu/Candy/csl.html
 [29]  T. Roska et al., "CNNM Users Guide", Version 5.3x, Budapest, 1994.
 [30]  R.C. Gonzalez, R.E. Woods and S.L. Eddins, "Digital Image Processing using MATLAB", Pearson Education, Upper Saddle River, NJ, 2009.
 [31]  M. Anguita, F.J. Fernandez, A.F. Diaz, A. Canas and F.J. Pelayo, "Parameter configurations for hole extraction in cellular neural networks", Analog Integrated Circuits and Signal Processing, Vol. 32, pp. 149-155, 2002.
 [32]  V. Murugesh and K. Badri, "An efficient numerical integration algorithm for cellular neural network based hole-filler template design", International Journal of Computers, Communications and Control, Vol. 2, pp. 367-374, 2007.
 [33]  K.K. Lai and P.H.W. Leong, "Implementation of time-multiplexed CNN building block cell", IEEE Proceedings of Microwave, pp. 80-85, 1996.
 [34]  K.K. Lai and P.H.W. Leong, "An area efficient implementation of a cellular neural network", NNES '95 Proceedings of the 2nd New Zealand Two-Stream International Conference on Artificial Neural Networks and Expert Systems, pp. 51-54, 1995.
 [35]  L.A. Zadeh, "Fuzzy sets", Information and Control, Vol. 8, No. 3, pp. 338-353, 1965.
 [36]  L.A. Zadeh, K. Fu, K. Tanaka and M. Shimura (eds.), "Fuzzy Sets and Their Applications to Cognitive and Decision Processes", Academic Press, New York, 1975.
 [37]  W. Shitong, K.F.L. Chung and F. Duan, "Applying the improved fuzzy cellular neural network IFCNN to white blood cell detection", Neurocomputing, Vol. 70, pp. 1348-1359, 2007.




                                                                               63                                http://sites.google.com/site/ijcsis/
                                                                                                                 ISSN 1947-5500
                                                                     (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                                                         Vol. 9, No. 6, June 2011

[38]   M. Laiho, A. Paasio, J. Flak and K.A.I. Halonen, “Template design for cellular nonlinear networks with 1-bit weights”, IEEE Transactions on Circuits and Systems-I: Regular Papers, Vol. 55, No. 3, pp. 904-913, 2008.
[39]   T. Yang, L.B. Yang, C.W. Wu and L.O. Chua, “Fuzzy cellular neural networks: Theory”, in Proceedings of the IEEE International Workshop on Cellular Neural Networks and Applications, pp. 181-186, 1996.
[40]   T. Yang and L.B. Yang, “The global stability of fuzzy cellular neural networks”, IEEE Transactions on Circuits and Systems-I, Vol. 43, pp. 880-883, 1996.
[41]   T. Yang and L.B. Yang, “Fuzzy cellular neural network: A new paradigm for image processing”, International Journal of Circuit Theory and Applications, Vol. 25, pp. 469-481, 1997.
[42]   T. Yang and L.B. Yang, “Application of fuzzy cellular neural networks to Euclidean distance transformation”, IEEE Transactions on Circuits and Systems-I, CAS-44, pp. 242-246, 1997.
[43]   A. Kandel, “Fuzzy Techniques in Pattern Recognition”, John Wiley, New York, 1982.
[44]   R.R. Yager and L.A. Zadeh (eds), “An Introduction to Fuzzy Logic in Intelligent Systems”, Kluwer, Boston, 1992.
[45]   J.A. Nossek, G. Seiler, T. Roska and L.O. Chua, “Cellular neural networks: Theory and circuit design”, International Journal of Circuit Theory and Applications, Vol. 20, pp. 533-553, 1992.
[46]   G.F. Dalla Betta, S. Graffi, M. Kovacs and G. Masetti, “CMOS implementation of an analog programmed cellular neural network”, IEEE Transactions on Circuits and Systems-II, Vol. 40, pp. 206-214, 1993.
[47]   C.L. Yin, J.L. Wan, H. Lin and W.K. Chen, “Brief Communication: The cloning template design of a cellular neural network”, Journal of the Franklin Institute, Vol. 336, pp. 903-909, 1999.
[48]   L.O. Chua and P. Thiran, “An analytic method for designing simple cellular neural networks”, IEEE Transactions on Circuits and Systems-I, Vol. 38, pp. 1332-1341, 1991.
[49]   T. Matsumoto, L.O. Chua and R. Furukawa, “CNN cloning template: hole filler”, IEEE Transactions on Circuits and Systems, Vol. 37, pp. 635-638, 1990.
[50]   K. Murugesan and P. Elango, “CNN based hole filler template design using numerical integration technique”, LNCS 4668, pp. 490-500, 2007.

Senthilkumar was born in Neyveli Township, Cuddalore District, Tamilnadu, India on 18th July 1974. He received his B.Sc in Mathematics from Madras University in 1994, M.Sc in Mathematics from Bharathidasan University in 1996, M.Phil in Mathematics from Bharathidasan University in 1999 and M.Phil in Computer Science & Engineering from Bharathiar University in 2000. He also holds a PGDCA and a PGDCH, in Computer Science and Applications and in Computer Hardware respectively, from Bharathidasan University, obtained in 1996 and 1997. He has a doctoral degree in Mathematics and Computer Applications from the National Institute of Technology [REC], Tiruchirappalli, Tamilnadu, India. Currently, he is a post doctoral fellow at the School of Mathematical Sciences, Universiti Sains Malaysia, 11800 USM Pulau Pinang, Malaysia. Prior to this appointment, he was a lecturer/assistant professor in the Department of Computer Science at Asan Memorial College of Arts and Science, Chennai, Tamilnadu, India. He has published many research papers in international conference proceedings and peer-reviewed/refereed international journals, and has contributed to a range of research-related activities. He is also an associate editor, editorial board member, reviewer and referee for many scientific international journals. His current research interests include advanced cellular neural networks, advanced digital image processing, advanced numerical analysis and methods, advanced simulation and computing and other related areas.

Abd Rahni Mt Piah was born in Baling, Kedah, Malaysia on 8th May 1956. He received his B.A. (Cum Laude) in Mathematics from Knox College, Illinois, USA in 1979 and his M.Sc in Mathematics from Universiti Sains Malaysia in 1986. He obtained his Ph.D in Approximation Theory from the University of Dundee, Scotland, UK in 1993. He has been an academic staff member of the School of Mathematical Sciences, Universiti Sains Malaysia since 1981 and at present is an Associate Professor. He served as a program chairman and deputy dean in the School of Mathematical Sciences, Universiti Sains Malaysia for many years. He has published various research papers in refereed national and international conference proceedings and journals. His current research areas include Computer Aided Geometric Design (CAGD), Medical Imaging, Numerical Analysis and Techniques and other related areas.





Location Estimation and Mobility Prediction Using Neuro-fuzzy Networks in Cellular Networks

                        Maryam Borna                                                     Mohammad Soleimani
           Department of Electrical Engineering                                   Department of Electrical Engineering
        Iran University of Science and Technology                             Iran University of Science and Technology
                       Tehran, Iran                                                           Tehran, Iran
               maryam.borna@gmail.com                                                     soleimani@iust.ac.ir


Abstract- In this paper an approach is proposed for location estimation, tracking and mobility prediction in cellular networks in dense urban areas, using neural and neuro-fuzzy networks. In urban areas with tall buildings, the accuracy of positioning methods based on direction finding and ranging degrades significantly due to multipath fading and Non-Line-of-Sight conditions. In these areas, high user traffic also calls for careful management of network resources, for which knowing the user's next probable position is helpful. Here, using the fingerprint positioning concept, appropriate fingerprint parameters are first chosen for GSM cellular networks, and MLP and RBF neural networks are used for position estimation. A neuro-fuzzy tracking and post-processing method is then applied to the estimated locations, and an ANFIS neuro-fuzzy network is employed for mobility prediction.

    Keywords- position estimation; neuro-fuzzy; prediction; cellular networks.

                     I.    INTRODUCTION

    Positioning in wireless networks means estimating a node's distance from a fixed reference node, or locating the node by its geographical coordinates. Positioning relies on parameters already used by mobile or fixed nodes for communication, such as Received Signal Strength (RSS), Time of Arrival (TOA) and Angle of Arrival (AOA); which parameters are available depends on the type of wireless network and its transmission protocols.

    Among the several types of wireless networks, cellular phone networks are, owing to the ever-increasing use of cell phones, the most widespread and have the most subscribers: one of the most probable items found in anyone's pocket is a cell phone. With location information, cellular networks can provide various services based on the user's location, ranging from commercial and advertising services to routing, navigation and emergency calls; these are referred to as Location Based Services (LBS). Locating a user's exact position also lets the network's resources be managed and allocated efficiently, leading to proper handover between cells and reduced co-channel interference.

    Moreover, to manage network resource consumption and reduce the costs of location update and call delivery procedures, predicting the user's next probable location can be helpful. This is done by analyzing patterns of the user's mobility behavior, so that the search for a user is confined to a smaller group of cells, avoiding expensive queries to the Home Location Register (HLR). The same idea is useful in other wireless networks, such as ad-hoc networks, for efficient bandwidth allocation and uninterrupted handover between access points.

    The remaining sections are organized as follows. Section II describes the problem of positioning in dense urban areas and related studies in the literature. Section III explains the proposed approach for positioning and mobility prediction in three subsections: fingerprint-based positioning (A), post-processing of the estimated path (B) and path prediction (C). Section IV presents the results of evaluating the proposed approach on a database collected from the GSM mobile phone network in the city of Tehran. The results are discussed in Section V, and the last section concludes the paper.

                II. PROBLEM DEFINITION

    With developments in cellular phone networks, different methods have been considered to facilitate user positioning, such as Cell-ID, Cell-ID+TA, A-GPS, AOA, … As accuracy increases, deployment becomes more expensive, and the need for hardware and software changes in both the handset and the network infrastructure rises. Moreover, most of these methods are sensitive to Non-Line-of-Sight (NLOS) communication between transmitter and receiver and to multipath fading, conditions typical of dense urban areas. Although more and more mobile phones are equipped with GPS receivers offering positioning accuracy down to a few meters, positioning accuracy still degrades considerably in urban areas with tall buildings, where line-of-sight communication with at least 3 GPS satellites is unlikely, and inside buildings, where signals attenuate significantly while passing through walls. In such cases an auxiliary method is needed to overcome these problems.




    The difficulty in these areas is the complexity of the propagation model of electromagnetic waves, caused by multipath fading, diffraction and scattering, which makes it hard for geometrical and statistical positioning methods to rely on relations between signal parameters and Tx-Rx separation.

    Fingerprint-based positioning methods are better suited to such cases [1] [2]. In these methods a database of signal parameters at known places is first collected, with no knowledge of the propagation model of the environment, and position estimation is then performed from this information and a possible mapping over it.

    One way to find the mapping relations in the fingerprint database is to use Artificial Neural Networks, which can estimate complex nonlinear functions such as these mapping relations through the parallel processing of neurons. Position estimation can then be posed as a function approximation problem: the network aims to find the nonlinear mapping between inputs (fingerprints) and outputs or targets (the mobile phone's coordinates). Neural networks compare favorably with other database lookup methods such as K-Nearest Neighbor (KNN), which uses the fingerprint parameters to find the nearest Euclidean neighbors. Furthermore, since neural networks approximate functions, and fingerprint details are related to the delay and power loss of arriving signals, which in turn depend on Tx-Rx separation, neural networks appear to combine the features of both RSS and TOA-TDOA based systems [3] [4]. The two most common neural network models for function approximation, the multi-layer perceptron (MLP) and Radial Basis Function (RBF) networks, are used most often [2] [3] [5] [6].

    After localization, the history of the places a user has travelled can be treated as a time series; by recognizing the user's mobility pattern, the next location can be predicted. In the literature on path prediction in wireless networks of different standards, Recurrent Neural Networks (RNN), Bayesian Neural Networks (BNN) or neuro-fuzzy networks have been employed; some use the user's behavioral pattern over a long period of time and in different situations to learn his mobility pattern, and that of similar users, and then predict their next location [8] [9] [10].

    In this paper, after gathering enough fingerprints with appropriate parameters in a GSM cellular phone network, positioning of the mobile phone is performed by searching for the best MLP and RBF neural network architectures in a dense urban area. Afterwards, using the tracking and prediction features of the ANFIS neuro-fuzzy network, the estimated path is post-processed and the user's upcoming path is predicted.

                  III.   PROPOSED APPROACH

A. Fingerprint based positioning
    In fingerprint-based positioning methods, a database of fingerprints in a certain area is first collected. The fingerprint of a point includes identifying information such as the geographical coordinates of the point, together with measured signal parameters that differ depending on the wireless network standard and the data collection tool. Data collection and pattern learning are done offline, before online real-time positioning.

    The input parameters of the neural networks, the fingerprint of a point, must be measurable, collectable and different from place to place. Here we aim to obtain the intended fingerprints from information provided by the mobile phone itself, without software or hardware changes in the device and without additional signaling between MS and BTS. The data can be obtained from the mobile phone's routing table, which is used for selecting the best cell to reside in and is re-sorted every few seconds. In the GSM900 standard, this table contains a list of 30 radio channels (ARFCN) sorted in descending order of received power. In addition to received power, other parameters of the currently selected and neighboring cells are available, such as the cell name, the absolute radio frequency channel number broadcasting the cell's status, the received power level, the received signal quality and the timing advance (TA). The attributes of the BTS antenna of each cell, such as its height and installation coordinates, are also accessible.

    From these parameters, those with the following properties were chosen for the fingerprint:

        •    Sensitivity to spatial changes. Parameters that are fixed within a cell boundary, such as radio channel numbers, cell antenna height and the like, are therefore not suitable for fingerprinting.

        •    Representativeness of multipath fading effects in the propagation environment. The received signal level, received signal quality and Timing Advance (TA) are such parameters. However, TA is a discrete value of the estimated BTS-MS separation with a granularity of about 550 meters: TA=1 means the MS is within a radius of 550 meters from the BTS, and TA=2 means the MS is 550 to 1100 meters from the cell antenna. Hence, in a cell with a radius of less than 550 meters, like most urban cells, the TA value is of little help in positioning.

        •    Low signaling cost between mobile phone and BTS, so it is better to acquire the fingerprint in IDLE mode rather than ACTIVE mode. TA and received signal quality are determined in ACTIVE mode, while RSS is monitored periodically even in IDLE mode.

    We chose parameters that fulfill these requirements, namely the received signal strength from the cell antenna, together with the antenna coordinates, for the serving cell and two of the neighboring cells. There are thus 9 parameters to record in a single fingerprint, besides the coordinates of the data gathering location. By collecting these fingerprints at a sufficient number of data points in the designated area, we obtain a suitable database for further analysis by fingerprint-based positioning methods. As discussed in the previous section, neural networks are preferred over other methods for processing this database, so they are employed here. Each training tuple consists of the 9 mentioned parameters as input, and the latitude and longitude of the respective data point as target output for the neural networks.
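The 9-parameter fingerprint and training tuple described above can be sketched as follows; the dictionary keys and the numeric values are hypothetical, chosen only to illustrate the record layout, since the paper does not fix a concrete schema.

```python
# Sketch of assembling one training tuple from the 9 fingerprint parameters:
# received signal strength plus antenna coordinates for the serving cell and
# two neighbouring cells. Field names and values are hypothetical.

def fingerprint_vector(cells):
    """cells: list of 3 dicts (serving cell first, then 2 neighbours), each
    with keys 'rss', 'ant_lat', 'ant_lon'. Returns the 9-element NN input."""
    assert len(cells) == 3, "serving cell plus two neighbours expected"
    vec = []
    for c in cells:
        vec += [c["rss"], c["ant_lat"], c["ant_lon"]]
    return vec


# One database record: 9 inputs plus the GPS-tagged target coordinates.
cells = [
    {"rss": -71, "ant_lat": 35.755, "ant_lon": 51.460},  # serving cell
    {"rss": -83, "ant_lat": 35.758, "ant_lon": 51.455},  # neighbour 1
    {"rss": -88, "ant_lat": 35.751, "ant_lon": 51.466},  # neighbour 2
]
x = fingerprint_vector(cells)  # neural-network input (9 values)
y = (35.756, 51.461)           # target: data-point latitude, longitude
```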
    One of the problems in neural network design, particularly for MLP networks, is the lack of definite equations for determining the ideal architecture of the network and the number of neurons in its hidden layers. In the training phase of a NN, while evaluating its ability to learn, its response to new untrained data should also be considered so that the network generalizes well. Accordingly, 84% of the data set was used for training and the remainder for testing the trained network.

    To find the best architecture for the MLP NN, an upper limit on the total number of network parameters, i.e. all weights and biases, is first set with regard to the size of the training set; then, by varying the number of hidden layers and the neurons in each layer within the defined range, the architecture yielding the lowest positioning error on the training, testing and whole data sets is chosen. For the available database, the best MLP architecture had one hidden layer with 23 neurons. The input layer has 9 neurons, and the output layer has 2 neurons for the estimated latitude and longitude of the mobile phone's location (Fig. 1).

Figure 1. Proposed architecture for MLP neural network

    In a standard Radial Basis Function NN, a governing design parameter is the radial neuron's spread, which determines its sensitivity to the resemblance between the network's inputs and weights. A search over the spread parameter yielded a value of 196, giving the lowest error on the testing data set. The number of neurons was set to its maximum, i.e. the number of training set members (Fig. 2).

Figure 2. Proposed architecture for RBF neural network

    Another type of RBF network employed here is the Generalized Regression Neural Network (GRNN), which has a fixed structure differing only slightly from standard RBF networks. The spread parameter of its radial neurons was obtained as in the former case and set to 2.

    After position estimation with the designed neural networks, there were rather large errors at a few points, probably caused by an insufficient number of training samples and the unavailability of RSS at some points. To reduce this error we applied post-processing to the estimated path, described in the next subsection.

B. Post-processing the estimated path
    In this section we use ANFIS (Adaptive Neuro-Fuzzy Inference Structure) to process the user's path as previously estimated by the NN. The employed neuro-fuzzy network is the neural network equivalent of a Sugeno FIS (Fuzzy Inference Structure). In comparison with an MLP, ANFIS has a fixed architecture, so no search for the best structure is needed, and it responds faster with less computational resource consumption (Fig. 3).

Figure 3. Neuro-fuzzy network for Sugeno's fuzzy inference structure

    The inputs of the ANFIS network were the estimated path with 5 delays, and the output was the same path with no delay. For the initial FIS fed to ANFIS we used subtractive clustering, with the influence radius of every cluster set to 0.5 for all 11 dimensions of the data.

    ANFIS is used here to estimate the user's movement function and smooth the NN-estimated path, so that the road the user is moving on can be identified, which is useful for map routing and navigation.

C. Predicting user's next location
    The subsequent locations travelled by a mobile phone user can be treated as a time series. Here we used the prediction ability of the ANFIS network, with the same structure as before. For training the network, the first 20% of the travelled path, with 2 delays, was selected as input, and the same path shifted one step ahead as output; the remaining 80% of the path was used for testing. In this way the trained network is able to predict the next location from the user's present and one previous location. Here we used the path estimated by the neural networks of Section III and calculated the error with respect to the real path.

           IV. EVALUATION OF THE PROPOSED APPROACH
    For data collection we used the TEMS® drive test tool, which is used for optimization and troubleshooting of mobile phone networks by monitoring their status. It presents the network data intercepted by the mobile phone in a computer interface for further processing, and can also record the coordinates of each data collection point via a GPS receiver.

    We used this tool to collect a fingerprint database from the GSM network in the city of Tehran at about 250 data points. For network training and simulation we used the MATLAB® neural network and fuzzy logic toolboxes.
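The architecture search described above bounds the total number of weights and biases and then scans candidate hidden-layer configurations. A minimal sketch of that bookkeeping follows; the search range and budget value are illustrative assumptions, not the paper's exact settings.

```python
# Count the trainable parameters of a fully connected network and keep only
# the single-hidden-layer MLP candidates (9 inputs, 2 outputs) whose
# weight+bias count stays under a given budget. Ranges here are illustrative.

def n_params(layer_sizes):
    """Total weights and biases for layers [inputs, hidden..., outputs]."""
    return sum(layer_sizes[i] * layer_sizes[i + 1] + layer_sizes[i + 1]
               for i in range(len(layer_sizes) - 1))


def candidate_architectures(budget, max_neurons=40):
    """Single-hidden-layer candidates within the parameter budget."""
    return [(9, h, 2) for h in range(1, max_neurons + 1)
            if n_params([9, h, 2]) <= budget]


# The paper's chosen architecture: one hidden layer of 23 neurons.
# 9*23 weights + 23 biases + 23*2 weights + 2 biases = 278 parameters.
print(n_params([9, 23, 2]))
```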




    Simulation results of the trained neural network showed that
the designed MLP network performs better than RBF networks;
RBF networks, however, are faster and easier to design, and are
suitable when the number of training set members is very high.

      Figure 4. Position estimation and post-processing result on a road
  (longitude versus latitude, showing the real track, the neural network
  estimate, the ANFIS post-processing result and the BTS positions)

      Figure 7. Proposed user's mobility prediction with ANFIS (longitude
  versus latitude on the Tajrish-Zarrabkhane route, showing the real,
  trained and predicted tracks)

    Fig. 4 displays the estimated location after post-processing
with ANFIS, indicating mitigation of high errors. Fig. 5 shows the
cumulative error probability before and after post-processing,
again showing alleviation of high errors. In Fig. 6 the same path
is displayed on the map by Google Earth®. It can be seen that,
after ANFIS post-processing, the road the user is travelling on
can be defined more accurately.
    Fig. 7 displays the real, trained and predicted paths produced
by ANFIS. CEP-60% is less than 115 meters, which makes this
prediction useful in determining the user's next probable cell of
residence, leading to better management of network resources and
successful cell reselection.

     Figure 5. Cumulative Error Probability before and after post-processing
  (CDF of the positioning error in meters for the NN estimate and the
  ANFIS post-processing result)

                         V. CONCLUSION
    In the proposed approach for location estimation in cellular
networks by neural networks in dense urban areas, a mean
positioning error of less than 80 meters and a CEP-60% of 65 m
were obtained in a 3 by 4 km area; in comparison with most
commercial positioning methods implemented in cellular networks,
such as E-CGI, E-OTD and AOA, which give about 200 m positioning
error in such conditions, a 50% improvement was achieved. This
method can also complement GPS positioning in cases where GPS
signals are weak. The method requires no additional signaling and
no extra hardware or software installation in either the phone
device or the network. We used a fingerprint database of RSS
parameters, which is available in most wireless networks. In
comparison with other positioning methods based on neural
networks, we avoided a fixed structure for the MLP NNs by
searching for the one best suited to a given database with a
simple script. Applying ANFIS post-processing, which approximates
the user's movement function, decreased high errors. The accuracy
of the proposed mobility prediction by ANFIS, relative to the cell
radii of about 100 to 150 m in most cities, makes it useful for
anticipating the user's next cell, decreasing the costs of the
location update and paging procedures.
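The "simple script" the authors used to search for a suitable MLP structure is not reproduced in the paper. The sketch below shows what such a search could look like, using scikit-learn's MLPRegressor in place of the authors' MATLAB toolboxes, with a synthetic stand-in for the RSS fingerprint database; both the library choice and the data are assumptions, not the authors' code.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

def best_mlp_structure(X, y, candidate_layers):
    """Return the hidden-layer layout with the lowest cross-validated error.

    X: RSS fingerprint vectors; y: (latitude, longitude) targets.
    Both are hypothetical stand-ins for the paper's database.
    """
    best_layout, best_score = None, -np.inf
    for layout in candidate_layers:
        model = MLPRegressor(hidden_layer_sizes=layout,
                             max_iter=500, random_state=0)
        # Negative MSE: higher is better for scikit-learn scorers.
        score = cross_val_score(model, X, y, cv=3,
                                scoring="neg_mean_squared_error").mean()
        if score > best_score:
            best_layout, best_score = layout, score
    return best_layout

# Toy data: 60 fingerprints of 6 RSS values, mapped to 2-D positions.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 6))
y = X[:, :2] + 0.01 * rng.normal(size=(60, 2))

candidates = [(5,), (10,), (10, 5)]
best = best_mlp_structure(X, y, candidates)
```

The cross-validated score, rather than training error, is what makes the comparison between layouts meaningful on a small fingerprint set.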




                                   Figure 6. Part of the path of Fig. 4 on the map of Tehran






                              REFERENCES
[1]  Ines Ahriz, Yacine Oussar, Bruce Denby, and Gerard Dreyfus, "Full-
     Band GSM Fingerprints for Indoor Localization Using a Machine
     Learning Approach," International Journal of Navigation and
     Observation, 2010.
[2]  Claude Takenga and Kyandoghere Kyamakya, "A Low-cost Fingerprint
     Positioning System in Cellular Networks," in Second International
     Conference on Communications and Networking in China, CHINACOM '07,
     2007.
[3]  Anthony Taok, Nahi Kandil, and Sofiene Affes, "Neural Network for
     Fingerprinting-Based Indoor Localization Using Ultra-Wideband,"
     Journal of Communications, vol. 4, no. 4, 2009.
[4]  M. H. Hung, Shi-Shung Lin, Jui-Yu Cheng, and Wu-Lung Chien, "A
     ZigBee Indoor Positioning Scheme Using Signal-Index-Pair Data
     Preprocess Method to Enhance Precision," in IEEE International
     Conference on Robotics and Automation (ICRA), 2010.
[5]  Aylin Aksu, Joseph Kabara, and Michael B. Spring, "Reduction of
     Location Estimation Error Using Neural Networks," in Proceedings of
     the First ACM International Workshop on Mobile Entity Localization
     and Tracking in GPS-less Environments, MELT '08, 2008.
[6]  C. Laoudias et al., "Ubiquitous Terminal Assisted Positioning
     Prototype," in IEEE Wireless Communications and Networking
     Conference, WCNC, 2008.
[7]  Hani Kaaniche and Farouk Kamoun, "Mobility Prediction in Wireless
     Ad Hoc Networks Using Neural Networks," Journal of
     Telecommunications, vol. 2, no. 1, 2010.
[8]  Sherif Akoush and Ahmed Sameh, "Mobile User Movement Prediction
     Using Bayesian Learning for Neural Networks," in IWCMC '07:
     Proceedings of the 2007 International Conference on Wireless
     Communications and Mobile Computing, 2007.
[9]  J. Amar Prathap Singh and M. Karnan, "Intelligent Location
     Management for UMTS Networks Using Fuzzy Neural Networks," Journal
     of Engineering and Technology Research, vol. 2, 2010.

                           AUTHORS PROFILE
    Maryam Borna received her Bachelor of Science in Electrical
Engineering, majoring in Telecommunications, from Shahed University,
Tehran, Iran, and her Master of Science in IT Engineering, majoring
in Secure Communications, from the Department of Electrical
Engineering of Iran University of Science and Technology, Tehran,
Iran. Her research interests include mobile phone networks, neural
networks, and microstrip antennas.
    Mohammad Soleimani received the B.S. degree in electrical
engineering from the University of Shiraz, Shiraz, Iran, in 1978, and
the M.S. and Ph.D. degrees from Pierre and Marie Curie University,
Paris, France, in 1981 and 1983, respectively. He is a Professor with
the Iran University of Science and Technology, Tehran, Iran. His
research interests are in antennas, small satellites,
electromagnetics, and radars. He has served in many executive and
research positions, including Minister of ICT, Student Deputy of the
Ministry of Science, Research and Technology, Head of the Iran
Research Organization for Science and Technology, Head of the Center
for Advanced Electronics Research, and Technology Director for Space
Systems in Iran Telecommunication Industries.





       A Fuzzy Clustering Based Approach for Mining
            Usage Profiles from Web Log Data
                     Zahid Ansari1, Mohammad Fazle Azeem2, A. Vinaya Babu3 and Waseem Ahmed4

                  1,4 Dept. of Computer Science Engineering
            2 Dept. of Electronics and Communication Engineering
                        P.A. College of Engineering
                              Mangalore, India
                          1 zahid.ansari@acm.org
                           2 mf.azeem@gmail.com
                          4 waseem@computer.org
                   3 Dept. of Computer Science Engineering
                 Jawaharlal Nehru Technological University
                             Hyderabad, India
                         dravinayababu@jntuh.ac.in



Abstract— The World Wide Web continues to grow at an amazing rate in
both the size and complexity of Web sites and is well on its way to
being the main reservoir of information and data. Due to this growth
in the size and complexity of the WWW, web site publishers face
increasing difficulty in attracting and retaining users. To design
popular and attractive websites, publishers must understand their
users' needs; analyzing users' behaviour is therefore an important
part of web page design. Web Usage Mining (WUM) is the application of
data mining techniques to web usage log repositories in order to
discover the usage patterns that can be used to analyze the user's
navigational behavior [1]. WUM contains three main steps:
preprocessing, knowledge extraction and results analysis. The goal of
the preprocessing stage in Web usage mining is to transform the raw
web log data into a set of user profiles. Each such profile captures
a sequence or a set of URLs representing a user session.

This sessionized data can be used as the input for a variety of data
mining tasks such as clustering [2], association rule mining [3],
sequence mining [4] etc. If the data mining task at hand is
clustering, the session files are filtered to remove very small
sessions in order to eliminate noise from the data [5]. But direct
removal of these small sessions may result in the loss of a
significant amount of information, especially when the number of
small sessions is large. We propose a "Fuzzy Set Theoretic" approach
to deal with this problem. Instead of directly removing all the small
sessions below a specified threshold, we assign weights to all the
sessions using a "Fuzzy Membership Function" based on the number of
URLs accessed by each session. After assigning the weights we apply a
"Fuzzy c-Means Clustering" algorithm to discover the clusters of user
profiles. In this paper, we discuss our methodology to preprocess the
web log data, including data cleaning, user identification and
session identification. We also describe our methodology for feature
selection (or dimensionality reduction) and session weight assignment
tasks. Finally, we compare our soft computing based approach of
session weight assignment with the traditional hard computing based
approach of small session elimination.

   Keywords- web usage mining; data preprocessing; fuzzy clustering;
knowledge discovery

                        I.    INTRODUCTION
    Due to the digital revolution and advancements in computer
hardware and software technologies, digitized information is easy to
capture and fairly inexpensive to store [6], [7]. As a result, huge
amounts of data have been collected and stored in databases, and the
rate at which such data is stored is growing at a phenomenal rate.
This fast-growing, tremendous amount of data, collected and stored in
large and numerous data repositories, has far exceeded our human
ability for comprehension without powerful tools. The abundance of
data, coupled with the need for powerful data analysis tools, has
been described as a "data rich but information poor" situation.
Hence, there is an urgent need for a new generation of computational
techniques and tools to assist humans in extracting useful
information (knowledge) from the rapidly growing volumes of data [8].
Data mining is the process of exploration and analysis, by automatic
or semi-automatic means, of large quantities of data in order to
discover meaningful patterns or rules. It deals with the "knowledge
in the database" [8]. The term KDD refers to the overall process of
knowledge discovery in databases. Data mining is a particular step in
this process, involving the application of specific algorithms for
extracting patterns from data. The additional steps in the KDD
process, such as data preparation, data selection, data cleaning,
incorporation of appropriate prior knowledge, and proper
interpretation of the results of mining, ensure that useful knowledge
is derived from the data [9].



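The session-weighting step summarized in the abstract can be sketched as follows. The paper's actual membership function and its breakpoints are not given at this point in the text, so the S-shaped function and the parameters a and b below are illustrative assumptions only.

```python
def session_weight(url_count, a=1, b=6):
    """Illustrative S-shaped fuzzy membership in the 'significant session' set.

    Sessions with at most `a` URLs get weight 0, sessions with at
    least `b` URLs get weight 1, and sessions in between get a smooth
    ramp. The breakpoints a and b are assumed, not taken from the paper.
    """
    if url_count <= a:
        return 0.0
    if url_count >= b:
        return 1.0
    m = (a + b) / 2.0  # crossover point of the S-function
    if url_count <= m:
        return 2.0 * ((url_count - a) / (b - a)) ** 2
    return 1.0 - 2.0 * ((b - url_count) / (b - a)) ** 2

# Every session keeps a weight instead of being dropped outright:
weights = {n: round(session_weight(n), 3) for n in range(1, 8)}
```

These weights can then scale each session's contribution to the clustering objective, so that small sessions are down-weighted rather than discarded.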
    Data mining often builds on an interdisciplinary bundle of
specialized techniques from fields such as statistics, artificial
intelligence, machine learning, databases, pattern recognition, and
computer-based visualization. The more common model functions in
current data mining practice include classification, regression,
clustering, rule generation, association discovery, summarization and
sequence analysis [10]. The World Wide Web, as a large and dynamic
information source that is structurally complex and ever growing, is
a fertile ground for data mining principles, or Web Mining. Web
mining is primarily aimed at deriving actionable knowledge from the
Web through the application of various data mining techniques [11].
Web data is typically unlabelled, distributed, heterogeneous,
semi-structured, time varying, and high dimensional. Web data can be
grouped into the following categories [12]: i) contents of actual Web
pages, ii) intra-page structures of the Web pages, iii) inter-page
structures specifying linkage between Web pages, iv) Web usage data
describing how Web pages are accessed, and v) user profiles, which
include demographic and registration information about users. Web
Usage Mining is the discovery of user access patterns from Web
servers [1]. Web Usage Mining analyzes the results of user
interactions with a Web server, including Web logs, click streams,
and database transactions at a Web site or a group of related sites.
Web usage mining includes clustering (e.g. finding natural groupings
of users, pages etc.), associations (e.g. which URLs tend to be
requested together), and sequential analysis (the order in which URLs
tend to be accessed) [13]. As with any knowledge discovery and data
mining (KDD) process, WUM performs three main steps: preprocessing,
pattern extraction and results analysis. Figure 1 describes the WUM
process.

                 Figure 1. Web Usage Mining Process.

    The goal of the preprocessing stage in Web usage mining is to
transform the raw click stream data into a set of user profiles. Each
such profile captures a sequence or a set of URLs representing a user
session. Web usage data preprocessing exploits a variety of
algorithms and heuristic techniques for preprocessing tasks such as
data fusion and cleaning, and user and session identification. Figure
2 depicts the primary tasks involved in web log data preprocessing in
order to discover the user sessions.

    Data fusion refers to the merging of log files from several Web
servers, which requires global synchronization across these servers
[14]. Data cleaning involves tasks such as removing extraneous
references to embedded objects, style files, graphics, or sound
files, and removing references due to spider navigation. Popular Web
sites generate log files whose size is measured in gigabytes per
hour, and manipulating such large files is a complicated task. By
filtering out useless data, we can reduce the log file size to
enhance the upcoming mining tasks.

      Figure 2. Web Log Processing to Discover Weighted Sessions.

    User identification refers to the process of identifying unique
users from the user activity logs. Usually the log file in Extended
Common Log Format provides only the computer's address and the user
agent. For Web sites requiring user registration, the log file also
contains the user login, and in such cases this information can be
used for user identification. Where user login information is not
available, we consider each IP address as a user. User session
identification is the process of segmenting the user activity log of
each user into sessions, each representing a single visit to the
site. Identification of user sessions from the web log file is a
complicated task, due to the existence of proxy servers, dynamic
addresses, and cases where multiple users access the same computer
[23][2][25][26]. It is also possible that one user might be using
multiple browsers or computers.

    Once user sessions are discovered, this sessionized data can be
used as the input for a variety of data mining tasks such as
clustering, association rule mining, sequence mining etc. If the data
mining task at hand is clustering, the session files are filtered to
remove very small sessions in order to eliminate noise from the data.
But direct removal of these small sessions may result in the loss of
a significant amount of information, especially when the number of
small sessions is large. We propose a "Fuzzy Set Theoretic" approach
to deal with this problem. Instead of directly removing all the small
sessions below a specified threshold, we assign weights to all the
sessions using a "Fuzzy Membership Function" based on the number of
URLs accessed by each session. After assigning the weights we apply a
"Fuzzy c-Means Clustering" algorithm to discover the clusters of user
profiles. Fuzzy clustering techniques perform non-unique partitioning
of the data items, where each data point is assigned a membership
value for each of the clusters. This allows the clusters to grow into
their natural shapes [15]. A membership value of zero indicates that
the data point is not a member of that cluster; a non-zero membership
value shows the degree to which the data point represents a cluster.
Fuzzy clustering algorithms can handle the
outliers by assigning them a very small membership degree in the
surrounding clusters. Fuzzy clustering is therefore a more robust
method for handling natural data with vagueness and uncertainty.

    The rest of the paper is organized as follows. In Section II we
describe the techniques used to preprocess the web log data,
including data cleaning and user and session identification. In
Section III we describe our methodology for feature selection (or
dimensionality reduction) and session weight assignment; in the same
section we also discuss the application of the Fuzzy c-Means
clustering algorithm to weighted user sessions. Section IV provides
the experimental results of our methodology applied to the access
logs of a real Web site. Finally, Section V presents the conclusion
and future work.

            II.   PREPROCESSING OF WEB LOG DATA

    The primary data sources used in Web usage mining are the server
log files, which include Web server access logs and application
server logs.

1212265085.247 741 192.168.23.62 TCP_MISS/200 10858 GET
http://www.pace.edu.in/index.php - DEFAULT_PARENT/192.168.20.1
Mozilla/5.0

                 Figure 3. A sample Web log entry.

    A sample web server log file entry in Extended Common Log Format
(ECLF) is given in Figure 3, and a description of the various fields
is given in Table I.

              TABLE I.   DESCRIPTION OF LOG FIELDS

  Field Value                       Description
  1212265085.247                    Time of the request, in
                                    Coordinated Universal Time
  741                               Elapsed time of the HTTP request
  192.168.23.62                     IP address of the client
  TCP_MISS/200                      Cache result and HTTP reply
                                    status code
  10858                             Bytes sent by the server in
                                    response to the request
  GET                               The requested action
  http://www.pace.edu.in/index.php  URI of the object being requested
  -                                 Client user name; if disabled, it
                                    is logged as "-"
  DEFAULT_PARENT/192.168.20.1       Hostname of the machine from
                                    which the object was obtained
  -                                 Content type of the object

A. Data Cleaning

    A user's request to view a particular page often results in
several log entries, since graphics and scripts are downloaded in
addition to the HTML file. In most cases only the log entry of the
HTML file request is relevant and should be kept for the user session
file. This is because, in general, a user does not explicitly request
all of the graphics on a Web page; they are downloaded automatically
because of the HTML tags. Since the main purpose of Web usage mining
is to get a picture of the user's behavior, it does not make sense to
include file requests that the user did not explicitly make. During
the data cleaning process we therefore removed the extraneous
references to embedded objects, style files, graphics and sound
files. Elimination of the irrelevant items was accomplished by
checking the suffix of the URL name: all log entries with filename
suffixes such as gif, jpeg, GIF, JPEG, jpg, JPG and map were removed,
using a default list of suffixes. Another main activity of the
cleaning process is the removal of robots' requests. Web robots (WR),
or spiders, scan a Web site to extract its content; they
automatically follow all the hyperlinks on a Web page, so the number
of requests from a web robot is at least the number of the site's
URLs. Removing WR-generated log entries removes uninteresting
sessions from the log file and simplifies the subsequent mining
tasks. In order to identify WR hosts we used a list of all user
agents known to be robots, as suggested by [16]; we obtained this
list from the site "http://www.robotstxt.org". Figure 4 describes the
algorithm for data cleaning and transformation.

 Input: Access log file W
 Output: Cleaned file C
 For each line L ∈ W do
     1)  Split L and extract the various fields
     2)  If the URL includes a query string, remove it
     3)  Remove all irrelevant requests whose URL suffix appears in
         the irrelevant-suffix list
     4)  Remove all WR-generated requests
     5)  Encrypt the IP address to hide the user's identity
     6)  Store the URL in a URL map along with its corresponding URL
         number
     7)  Print the required fields to the output file

      Figure 4. Algorithm for data cleaning and transformation.

Table II describes the format of the output file C generated as a
result of the cleaning and transformation of the web logs. Client IP
addresses are replaced with aliases in order to hide the identity of
the user, and, as the URL column of the table shows, URL strings are
replaced by numbers in order to speed up further processing. We
maintain a map of URL strings and their corresponding URL numbers.

          TABLE II.   FILE FORMAT AFTER DATA CLEANING

                            User    Elapsed
       Time          IP     Agent    Time     Bytes    URL
   20080601014805   IP1     UA1      741      10858     1
   20080601014806   IP1     UA1      1735     19247     2
   20080601014808   IP2     UA2      239       209      1
   20080601014809   IP1     UA3      674       156      3
   20080601014813   IP2     UA2      680       179      4



                                                                                72                                      http://sites.google.com/site/ijcsis/
                                                                                                                        ISSN 1947-5500
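As a concrete illustration of the Figure 4 pipeline, the cleaning and transformation steps can be sketched in Python as below. The suffix list, the robot user-agent names, the field layout of the log line, and the MD5-based IP aliasing are assumptions made for this sketch, not the authors' actual configuration (the paper takes its robot list from http://www.robotstxt.org).

```python
import hashlib

# Assumed irrelevant-suffix and robot user-agent lists (illustrative only)
IRRELEVANT_SUFFIXES = {"gif", "jpeg", "jpg", "png", "css", "js", "map"}
ROBOT_AGENTS = ("Googlebot", "Slurp", "bingbot")

def clean_log(lines, url_map):
    """Apply steps 1-7 of Figure 4 to an iterable of log lines and
    return (time, ip_alias, agent, elapsed, bytes, url_no) tuples."""
    cleaned = []
    for line in lines:
        # 1) Split the line and extract the fields (assumed layout:
        #    time elapsed ip status bytes method url user peer agent)
        f = line.split()
        time, elapsed, ip, size, url, agent = f[0], f[1], f[2], f[4], f[6], f[-1]
        # 2) Drop the query string, if any
        url = url.split("?", 1)[0]
        # 3) Skip requests whose URL suffix is in the irrelevant list
        if url.rsplit(".", 1)[-1].lower() in IRRELEVANT_SUFFIXES:
            continue
        # 4) Skip web-robot (WR) generated requests
        if any(robot in agent for robot in ROBOT_AGENTS):
            continue
        # 5) Hide the user's identity behind a hashed IP alias
        ip_alias = "IP" + hashlib.md5(ip.encode()).hexdigest()[:6]
        # 6) Map the URL string to a URL number
        url_no = url_map.setdefault(url, len(url_map) + 1)
        # 7) Keep only the required fields
        cleaned.append((time, ip_alias, agent, elapsed, size, url_no))
    return cleaned
```

Run against the sample entry of Figure 3, the .php request survives while image and robot requests are dropped, and the URL map assigns consecutive URL numbers as in Table II.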
B. User Identification

    Once the web log files have been cleaned, the next step in data
preparation is the identification of users. Since the log files of
the web server we are working with do not contain user login
information, we consider each unique IP address and User-Agent
combination as a separate user. We then separate out all the requests
corresponding to each individual user. Figure 5 describes the
algorithm that generates the requests corresponding to each
individual user.

 Input: File C, the cleaned access log file
 Output: File U, which contains the user-wise lists of accessed URLs
     1)  For each line L ∈ C do
             a)  Split L to get the required fields
             b)  Store them in a map M1 with (IP, UserAgent) as the
                 key and another map M2 as the value; the key of map
                 M2 is the time and the value is the rest of the
                 fields
     2)  Sort the inner map M2 on the time key
     3)  Print the contents of map M1 to the output file U

  Figure 5. Algorithm to separate the requests of each individual user.

The format of the output file U generated after user identification
is depicted in Table III below.

       TABLE III.   FILE FORMAT AFTER USER IDENTIFICATION

                             Elapsed
    User        Time          Time     Bytes    URL
     U1    20080601014805     741      10858     1
           20080601014806     1735     19247     2
                 …             …         …       …
     U2    20080601014809     674       156      3
                 …             …         …       …
     U3    20080601014808     239       209      1
           20080601014813     680       179      4

C. User Session Identification

    User session identification is the process of segmenting the
activity log of each user into sessions, each representing a single
visit to the site. Web sites without user authentication information
mostly rely on heuristic methods for sessionization. The
sessionization heuristic helps in extracting the actual sequence of
actions performed by one user during one visit to the site. In order
to identify user sessions we experimented with two different
time-oriented heuristics (TOH), described below:

    •  TOH1: The duration of a session must not exceed a threshold
       β. Let T1 be the timestamp of the first URL request in a
       session. A URL request with timestamp Ti is assigned to this
       session if and only if Ti − T1 ≤ β. The first URL request with
       a timestamp larger than T1 + β is considered the first request
       of the next session.
    •  TOH2: The time spent on a page visit must not exceed a
       threshold β. Let Ti be the timestamp of the URL most recently
       assigned to a session. The next URL request, with timestamp
       Ti+1, belongs to the same session if and only if Ti+1 − Ti ≤
       β. Otherwise, it is considered the first request of the next
       session.

    The time-oriented heuristic TOH1 thus uses an upper bound on the
time spent in the entire site during a visit: the timestamp of every
URL access request is compared with that of the first request of the
current session, and if the difference is larger than β the request
becomes the first request of a new session; otherwise it belongs to
the current session. The time-oriented heuristic TOH2, on the other
hand, uses an upper bound on page-stay time: the timestamp of every
URL access request is compared with that of the previous request, and
if the difference is larger than β the request becomes the first
request of a new session; otherwise it belongs to the current
session. We selected 30 minutes as the value of the threshold β for
both schemes.

 Input: File U, containing the access logs of the various users
 Output: File S, which contains the sessions identified by TOH1
 For each line L ∈ U do
     1)  if L represents a user then
     2)      UserId ← L
     3)      Output L to file S
     4)  else if L is the first accessed log of the user then
     5)      T1 ← L.time
     6)  else
     7)      T2 ← L.time
             // Compare the timestamps of the current and the first request
     8)      if T2 − T1 ≤ β then
     9)          Output L to file S
     10)     else
     11)         Output UserId to file S
     12)         Output L to file S
     13)         T1 ← L.time

      Figure 6. Algorithm to generate user sessions based on TOH1.

    The algorithm to generate user sessions based on the
time-oriented heuristic TOH1 is specified in Figure 6.

   TABLE IV.   FILE FORMAT AFTER USER SESSION IDENTIFICATION

     User                      Elapsed
    Session        Time         Time     Bytes    URL
    U1-S1    20080601014805     741      10858     1
             20080601014806     1735     19247     2
                   …             …         …       …
    U1-S2          …             …         …       …
      …
    U2-S1    20080601014809     674       156      3
                   …             …         …       …
      …
    U3-S1    20080601014808     239       209      1
             20080601014813     680       179      4

Table IV shows the format of the output file S containing the user
sessions. Once the user sessions are generated, we scan each session
and remove the duplicate URLs from it. For each unique URL within a
user session a single copy of the URL is kept, along with its
frequency of occurrence. We also maintain a count of the total number
of unique URLs in each session.
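The grouping step of Figure 5 can be sketched as follows. The tuple layout of the cleaned records is an assumption carried over for illustration, not a format fixed by the paper.

```python
from collections import defaultdict

def identify_users(cleaned):
    """Group cleaned records by (IP, User-Agent) and sort each user's
    requests on the time key, mirroring maps M1 and M2 of Figure 5.
    Each record is assumed to be (time, ip, agent, elapsed, bytes, url_no)."""
    m1 = defaultdict(dict)                                 # M1: (ip, agent) -> M2
    for time, ip, agent, elapsed, size, url_no in cleaned:
        m1[(ip, agent)][time] = (elapsed, size, url_no)    # M2: time -> rest
    # Sort every inner map M2 on its time key
    return {user: sorted(m2.items()) for user, m2 in m1.items()}
```

Applied to the records of Table II, this yields one time-ordered request list per (IP, User-Agent) pair, matching the layout of Table III.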




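A minimal sketch of the TOH1 sessionizer of Figure 6, for a single user whose requests are already time-ordered; the 30-minute default for β follows the threshold chosen in the text, and the pair layout of the requests is an assumption of the sketch.

```python
def sessionize_toh1(requests, beta=30 * 60):
    """Split one user's time-ordered requests into sessions using TOH1:
    the total duration of a session must not exceed beta seconds.

    `requests` is a list of (timestamp, url_no) pairs sorted by time."""
    sessions = []
    current, t1 = [], None
    for ts, url in requests:
        if t1 is None or ts - t1 > beta:
            # First request, or the site-visit bound T1 + beta is
            # exceeded: start a new session anchored at this request.
            current = []
            sessions.append(current)
            t1 = ts
        current.append((ts, url))
    return sessions
```

TOH2 differs only in the comparison: the current timestamp is tested against the previous request's timestamp rather than the session's first.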
           III.   DISCOVERY OF USER SESSION CLUSTERS

A. Feature Subset Selection of User Sessions

    Each user session can be thought of as a single transaction of
many URL references. We map the user sessions to vectors of URL
references in an n-dimensional space. Let U = {u_1, u_2, …, u_n} be
the set of n unique URLs appearing in the preprocessed log, and let
S = {s_1, s_2, …, s_m} be the set of m user sessions discovered by
preprocessing the web log data. Each user session s_i ∈ S can then be
represented as a bit vector s_i = {w_{u_1}, w_{u_2}, …, w_{u_n}},
where w_{u_j} = 1 if u_j ∈ s_i, and w_{u_j} = 0 otherwise.

    Instead of binary weights, feature weights can also be used to
represent a user session. These feature weights may be based on the
frequency of occurrence of a URL reference within the user session,
the time a user spends on a particular page, or the number of bytes
downloaded by the user from a page. However, the URLs appearing in
the access logs can number in the thousands, and distance-based
clustering methods often perform very poorly on such high-dimensional
data. Filtering the logs by removing references to low-support URLs
(i.e. URLs that are not supported by a specified number of user
sessions) can therefore provide an effective dimensionality reduction
method while improving the clustering.

B. Assigning Weights to User Sessions

    If the data mining task at hand is clustering, the session files
can be filtered to remove very small sessions in order to eliminate
noise from the data [5]. But the direct removal of these small
sessions may result in the loss of a significant amount of
information, especially when the number of small sessions is large.
We propose a fuzzy set theoretic approach to deal with this problem:
instead of directly removing all the sessions below a specified
threshold, we assign weights to all the sessions using a fuzzy
membership function based on the number of URLs accessed in each
session.

   Figure 7. Fuzzy membership function for session weight assignment.

    Figure 7 depicts a linear fuzzy membership function for session
weight assignment. Here LB represents a lower bound and UB an upper
bound on the number of URLs accessed in a session. Let |s_i| be the
number of URLs accessed in session s_i; the fuzzy membership function
then takes the following values:

    W(s_i) = 0,                        if |s_i| ≤ LB
    W(s_i) = 1,                        if |s_i| ≥ UB            (1)
    W(s_i) = (|s_i| − LB) / (UB − LB), otherwise

C. Clustering the User Sessions

    Once the user sessions are represented in the form of vectors, a
clustering algorithm can be run on them. The goal of this process is
to discover session clusters that represent similar URL access
patterns; for example, two session vectors are similar if the
Euclidean distance between them is small enough. Clustering aims to
divide a data set into groups, or clusters, such that the
intra-cluster similarities are maximized while the inter-cluster
similarities are minimized. Details of various clustering techniques
can be found in the survey articles [18][19][20]. The ultimate goal
of clustering is to assign the data points to a finite system of k
clusters whose union equals the full data set, with the possible
exception of outliers.

    The k-means clustering algorithm is one of the most commonly used
methods for partitioning data. The algorithm partitions a set of m
objects into k clusters. It proceeds by computing the distance
between each data point and each cluster center and assigning the
data point to the nearest cluster, so that the intra-cluster
similarity is high and the inter-cluster similarity is low. The
Euclidean distance can be used as the measure of distance between the
data points and the cluster centers:

    d(x_i, v_j) = \sqrt{ \sum_{k=1}^{n} (x_k^i − v_k^j)^2 }       (2)

where x_i is the i-th data point, v_j is the j-th cluster center,
d(x_i, v_j) is the distance between x_i and v_j, n is the number of
dimensions of each data point, x_k^i is the value of the k-th
dimension of x_i, and v_k^j is the value of the k-th dimension of
v_j.

    The k-means algorithm first initializes the cluster centers
randomly. Each data point x_i is then assigned to the cluster v_j
that has the minimum distance to it. Once all the data points have
been assigned to clusters, the cluster centers are updated by taking
the average of all the data points in each cluster. This
recalculation of the cluster centers results in a better set of
centers, and the process is continued until there is no change in the
cluster centers. Although the k-means algorithm is efficient in
handling crisp data with clear-cut boundaries, in real-world data the
clusters have ill-defined and often overlapping boundaries, because
natural data frequently suffer from ambiguity, uncertainty and
vagueness [21].

    Fuzzy c-means clustering incorporates the fuzzy set theoretic
concept of partial membership and may result in the formation
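The membership function of Eq. (1) translates directly into code. The bounds LB = 2 and UB = 10 below are illustrative defaults; the paper does not fix particular values here.

```python
def session_weight(num_urls, lb=2, ub=10):
    """Linear fuzzy membership function of Eq. (1): sessions with at
    most LB URLs get weight 0, sessions with at least UB URLs get
    weight 1, and sessions in between are weighted linearly."""
    if num_urls <= lb:
        return 0.0
    if num_urls >= ub:
        return 1.0
    return (num_urls - lb) / (ub - lb)
```

For example, with LB = 2 and UB = 10, a session of 6 URLs gets weight 0.5, so small-but-not-tiny sessions still contribute to the clustering instead of being discarded outright.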
                                                                                concept of partial membership and may result in the formation



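The k-means loop described above can be sketched as follows, using the Euclidean distance of Eq. (2). This is a from-scratch illustration; the random initialization from the data points and the "centers stop changing" termination rule are the usual textbook choices, not details fixed by the paper.

```python
import math
import random

def euclidean(x, v):
    """Eq. (2): Euclidean distance between a data point and a center."""
    return math.sqrt(sum((xk - vk) ** 2 for xk, vk in zip(x, v)))

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means: pick k random initial centers, assign each point
    to its nearest center, recompute each center as the mean of its
    members, and repeat until the centers stop changing."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: euclidean(p, centers[j]))
            clusters[j].append(p)
        new_centers = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:   # converged
            break
        centers = new_centers
    return centers, clusters
```

On two well-separated groups of session vectors the loop converges in a few iterations to the group means, which is the behavior the fuzzy variant below generalizes with partial memberships.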
of overlapping clusters. The algorithm calculates the cluster centers
and assigns each data item a membership value between 0 and 1 for
every cluster. The algorithm uses a fuzziness index parameter q,
where q ∈ [1, ∞) [22], which determines the degree of fuzziness of
the clusters. As the value of q approaches 1 the algorithm behaves
like a crisp partitioning algorithm, while increasing q results in
more overlap between the clusters.

    Let X = {x_i | i = 1, …, m} be a set of n-dimensional data point
vectors, where m is the number of data points and each
x_i = {x_1^i, x_2^i, …, x_n^i}. Let V = {v_j | j = 1, …, c} be the
set of n-dimensional vectors corresponding to the centers of the c
clusters, where each v_j = {v_1^j, v_2^j, …, v_n^j}. Let u_ij
represent the grade of membership of data point x_i in cluster j,
with u_ij ∈ [0, 1] for all i = 1, …, m and j = 1, …, c. The m × c
matrix U = [u_ij] is a fuzzy c-partition matrix, which describes the
allocation of the data points to the various clusters and satisfies
the following conditions:

    \sum_{j=1}^{c} u_ij = 1,       ∀i = 1, …, m
                                                                  (3)
    0 < \sum_{i=1}^{m} u_ij < m,   ∀j = 1, …, c

    The performance index J(U, V, X) of fuzzy c-means clustering can
be specified as the weighted sum of the distances between the data
points and the corresponding cluster centers. In general it takes the
form

    J(U, V, X) = \sum_{j=1}^{c} \sum_{i=1}^{m} u_ij^q d_ij^2(x_i, v_j)   (4)

where q ∈ [1, ∞) is the fuzziness index of the clustering and
d_ij^2(x_i, v_j) is the session-weighted squared distance between x_i
and v_j,

    d_ij^2(x_i, v_j) = \sum_{k=1}^{n} w(x_i) (x_k^i − v_k^j)^2

with w(x_i) the weight of the data point x_i.

    Minimization of the performance index J(U, V, X) is usually
achieved by updating the grades of membership of the data points and
the centers of the clusters in an alternating fashion until
convergence; the performance index is based on the sum-of-squares
criterion. During each iteration the cluster centers are updated as

    v_j = \sum_{i=1}^{m} u_ij^q x_i / \sum_{i=1}^{m} u_ij^q          (5)

and the grades of membership are updated as

    u_ij = (1 / d_ij^2(x_i, v_j))^{1/(q−1)} /
           \sum_{k=1}^{c} (1 / d_ik^2(x_i, v_k))^{1/(q−1)}           (6)

    In order to decide the optimum number of clusters for the data
set X we use a validity function S, the ratio of compactness to
separation [22], given by

    S = \sum_{j=1}^{c} \sum_{i=1}^{m} u_ij^2 ||x_i − v_j||^2 /
        ( m · \min_{l ≠ k} ||v_l − v_k||^2 )                         (7)

computed for each c = c_min, …, c_max.

    Let Ω_c denote the set of optimal candidates at each c. The
solution of the following minimization problem then yields the most
valid fuzzy clustering of the data set:

    \min_{c_min ≤ c ≤ c_max} ( \min_{Ω_c} S )                        (8)

    The clusters formed by the application of the clustering
algorithm represent groups of user sessions that are similar in terms
of the co-occurrence patterns of their URL references. Clustering the
user sessions results in a set C = {c_1, c_2, …, c_k} of clusters,
where each c_i is a subset of S, i.e. a set of user sessions. Each
cluster represents a group of users with similar navigational
patterns.

                    IV.   EXPERIMENTAL RESULTS

    In order to discover the clusters that exist in the user access
sessions of a web site, we carried out a number of experiments. The
Web access logs were taken from the web site of the P.A. College of
Engineering, Mangalore, at the URL http://www.pace.edu.in. The site
hosts a variety of information, including departments, faculty
members, research areas and course information. The Web access logs
covered a period of one month, from February 1, 2011 to March 1,
2011, and contained 74,924 logged requests in total.

    After the cleaning step the output file contains 30,720 entries.
The number of site URLs with an access count greater than or equal to
5 is 159, and the total number of unique users identified is 24.
Table V summarizes the results of the cleaning and user
identification steps.

    TABLE V.   RESULTS OF CLEANING AND USER IDENTIFICATION

               Items                        Count
     Initial No. of Log Entries             74924
     Log Entries after Cleaning             30720

   Membership values are calculated by the following                                                         No. of site ULRs                            159
formula:                                                                                                     No of Users Identified                      24
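In outline, the cleaning and user-identification step summarized in Table V can be sketched as below. This is a hedged illustration, not the paper's implementation: it assumes logs in Common Log Format and approximates a user by the client host field, whereas the actual preprocessing pipeline [17] may apply further heuristics; all identifiers are ours.

```python
import re

# Common Log Format record: host, identities, timestamp, request line, status.
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3})'
)

# Embedded resources that do not correspond to a page request.
NOISE_EXT = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")

def clean_log(lines):
    """Cleaning step: keep successful page requests, drop embedded resources.

    Returns the cleaned (host, url) entries and the distinct users,
    approximated here by the client host alone.
    """
    entries = []
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue                          # malformed record
        url = m.group("url").split("?")[0].lower()
        if url.endswith(NOISE_EXT):
            continue                          # images, scripts, stylesheets
        if not m.group("status").startswith("2"):
            continue                          # keep only successful requests
        entries.append((m.group("host"), url))
    users = {host for host, _ in entries}
    return entries, users
```

Applying such a filter to the raw log is what shrinks the 74,924 raw requests to the cleaned entry set before user and session identification.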




                                                                                75                                                    http://sites.google.com/site/ijcsis/
                                                                                                                                      ISSN 1947-5500
                                                                 (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                                                           Vol. 9, No. 6, 2011
          Figure 8. Percentage of URLs versus URL Access Frequency

       TABLE VI.        RESULTS OF SESSION IDENTIFICATION (TOH1)

              Items                                               Count
              No. of User Sessions                                968
              Minimum no. of URLs accessed in a session           1
              Maximum no. of URLs accessed in a session           545
              Average no. of URLs accessed in a session           26.12
              Minimum no. of unique URLs accessed in a session    1
              Maximum no. of unique URLs accessed in a session    158
              Average no. of unique URLs accessed in a session    6.5

    The total number of unique URLs of the web site present in the log file entries is 6850. Figure 8 shows the percentage of URLs against the number of times they are accessed in the log file. It is clear from the graph that 78% of the URLs were accessed only once, 16% of them were accessed twice and only 6% were accessed three or more times. The maximum access count for a URL is 2234. On average each URL is accessed 4.47 times.

             Figure 9. Sessionization results for TOH1 and TOH2

    As far as clustering of the user sessions is concerned, URLs which are accessed only once do not play any significant role in forming the clusters, since they appear in only one of the user sessions. Therefore we eliminate all such URL requests from further analysis. This type of URL filtering is important in removing noise from the data: since a user session is represented by an n-dimensional vector, where n is the number of site URLs accessed in the log files, a reduction in the number of URLs also reduces the session vector dimensions. The count of URLs accessed only once is 5372. After eliminating them, the total number of unique URLs for subsequent analysis is 1478. In order to identify the user sessions we applied two different kinds of time-oriented heuristics, TOH1 and TOH2. Details of these results and a comparison of the two approaches can be found in our previous work [17]. The result of applying TOH1 is given in Table VI. The graph in Figure 9 depicts the results of applying the time-oriented heuristics TOH1 and TOH2.

    Figure 10 shows the number of URLs and their corresponding session support count. Our results show that 396 URLs have a session support count of one. We eliminate these URLs since they cannot play any significant role in cluster formation. This type of session support filtering provides a form of dimensionality reduction for the subsequent clustering tasks, where URLs appearing in the session file are used as features. Table 4 shows the results of user session identification after the elimination of these low-support URLs.

     Figure 10. No. of URLs Versus No. of Sessions They are Associated with

              Figure 11. No. of Sessions Versus No. of URLs
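The session-support filtering just described amounts to dropping low-support URLs before encoding each session as a vector over the surviving URLs. A small sketch under assumed data structures (sessions as sets of URL strings; names and representation choices are ours, not the paper's code):

```python
from collections import Counter

def build_session_vectors(sessions, min_support=2):
    """Filter URLs by session support, then encode sessions as binary vectors.

    sessions: list of sets of URL strings. URLs whose session support
    (number of distinct sessions containing them) falls below
    min_support are dropped, shrinking the session-vector dimension.
    """
    # Session support: in how many sessions does each URL occur?
    support = Counter(url for s in sessions for url in set(s))
    # Keep only sufficiently supported URLs, in a fixed (sorted) order.
    urls = sorted(u for u, cnt in support.items() if cnt >= min_support)
    # One binary feature per surviving URL.
    vectors = [[1 if u in s else 0 for u in urls] for s in sessions]
    return urls, vectors
```

With min_support=2 this reproduces the effect described above: the 396 URLs with session support one disappear, and every session vector loses those dimensions.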




   Figure 11 depicts the session counts against various URL counts. Our results show that there are quite a large number of user sessions containing only a few URLs. For example, there are 67 sessions containing only one URL, 134 containing two URLs and 56 sessions containing three URLs. User sessions with a smaller number of URLs are less significant for the purpose of clustering.
    We are interested only in those sessions that access more than a certain number of URLs, say MinURLs. For example, it is not very useful to cluster user sessions which just access the home page URL and leave. Therefore we impose constraints desirable for better clustering performance and outcome by using a fuzzy set theoretic approach that assigns weights to the user sessions based on the number of URLs they contain. Instead of directly removing all small sessions below a specified threshold, we assign weights to all sessions using a "Fuzzy Membership Function" based on the number of URLs accessed by each session.

              Figure 12. No. of Clusters Versus Performance Index

    In order to decide the optimum number of clusters we calculated the validity index (S), which is the ratio of compactness to separation, using equation (7).

    Based on the sessionization results shown in the graph of Figure 11, we choose the lower bound on the number of URLs accessed in a session (LB) as 1 and the upper bound (UB) as 6. Using equation (1), the weights assigned to the various sessions are specified in Table VII.

       TABLE VII.       SESSION WEIGHTS BASED ON THE URL COUNT

              Session URL Count     Session Weight
              1                     0
              2                     0.2
              3                     0.4
              4                     0.6
              5                     0.8
              6 or more             1

      Figure 13. Validity Index Versus No. of Clusters for Weighted Sessions
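The weights in Table VII follow a simple linear fuzzy membership between the bounds LB and UB. Equation (1) itself is not reproduced in this excerpt, so the function below is reconstructed from the tabulated values; treat it as an illustrative sketch rather than the authors' exact definition.

```python
def session_weight(url_count, lb=1, ub=6):
    """Fuzzy membership weight of a session from its URL count.

    Sessions at or below the lower bound lb get weight 0, those at or
    above the upper bound ub get weight 1, with a linear ramp between,
    matching the values listed in Table VII for lb=1, ub=6.
    """
    if url_count <= lb:
        return 0.0
    if url_count >= ub:
        return 1.0
    return (url_count - lb) / (ub - lb)
```

Mapping this over the sessions gives the weight vector that is later fed into the weighted clustering step instead of hard-deleting small sessions.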


    Once the user sessions are assigned weights based on their URL counts, the fuzzy c-means clustering algorithm is applied to discover session clusters that represent similar URL access patterns. Application of the fuzzy c-means clustering algorithm resulted in the formation of overlapping clusters. The performance index J(U,V,X) of fuzzy c-means clustering is calculated using equation (4). It is the weighted sum of distances between the data points and the corresponding centers of the clusters. Minimization of the performance index J(U,V,X) is achieved by updating the grades of membership of the data points and the centers of the clusters in an alternating fashion using equations (6) and (5) respectively, until convergence.

    Fuzzy c-means clustering is first applied by choosing the number of clusters as 4. In each subsequent run we increased the number of clusters by 1 until the number of clusters reached 60. We repeated this process for weighted as well as non-weighted sessions. The graph in Figure 12 shows the performance index (J) versus the number of clusters for weighted as well as non-weighted sessions. From the graph it is clear that the "Fuzzy Set Theoretic" weighted session approach results in better minimization of the performance index than the non-weighted session approach.

     Figure 14. Validity Index Vs. No. of Clusters for Non-Weighted Sessions

    Figures 13 and 14 provide the graphs of validity index (S) versus number of clusters for weighted and non-weighted sessions respectively. Our results show that for the weighted sessions the validity index is minimized when the number of clusters is 8, whereas for the non-weighted sessions it is minimized when the number of clusters is 21. Thus the optimal number of clusters for weighted sessions is 8 and for non-weighted sessions it is 21.
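The alternating minimization used above can be sketched in code. This is our illustrative reconstruction, not the authors' implementation: variable names are ours, and note that minimizing the weighted index of equation (4) places the session weights w(x_i) in the centre update as well as in the distances.

```python
import numpy as np

def weighted_fcm(X, w, c, q=2.0, n_iter=100, seed=0):
    """Weighted fuzzy c-means sketch.

    X : (m, n) session vectors, w : (m,) session weights, c : number of
    clusters, q : fuzziness index. Returns memberships U (m, c) and
    cluster centres V (c, n).
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    # Initialise centres from randomly chosen data points.
    V = X[rng.choice(m, size=c, replace=False)]
    U = np.full((m, c), 1.0 / c)
    for _ in range(n_iter):
        # Weighted squared distances as in equation (4):
        # d_ij^2 = w(x_i) * ||x_i - v_j||^2
        d2 = w[:, None] * ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
        d2 = np.fmax(d2, 1e-12)            # guard against division by zero
        # Membership update (equation (6)):
        # u_ij = 1 / sum_l (d_ij^2 / d_il^2)^(1/(q-1))
        U = 1.0 / ((d2[:, :, None] / d2[:, None, :])
                   ** (1.0 / (q - 1.0))).sum(axis=2)
        # Centre update; minimising equation (4) keeps w(x_i) here too.
        wUq = w[:, None] * U ** q
        V = (wUq.T @ X) / wUq.sum(axis=0)[:, None]
    return U, V
```

With the Table VII weights in w, a weight-0 session contributes nothing to the index or to the centres, which is precisely the intended soft alternative to deleting very small sessions outright.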

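Equation (7) is not reproduced in this excerpt, but a validity index defined as the ratio of compactness to separation matches the Xie-Beni index of [22]. A hedged sketch, with our own variable names:

```python
import numpy as np

def xie_beni_index(X, U, V):
    """Xie-Beni validity index: compactness over separation (cf. [22]).

    X (m, n) data, U (m, c) fuzzy memberships, V (c, n) cluster centres.
    Lower values indicate compact, well-separated clusters.
    """
    m = X.shape[0]
    # Squared distances ||x_i - v_j||^2 between points and centres.
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    compactness = (U ** 2 * d2).sum()
    # Pairwise squared distances between centres; ignore j == l.
    sep = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(sep, np.inf)
    return compactness / (m * sep.min())
```

Sweeping the number of clusters and picking the minimum of this index is the model-selection procedure that yields 8 clusters for weighted and 21 for non-weighted sessions above.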



                  V.     CONCLUSION AND FUTURE WORK

    In this paper we discussed our methodology for preprocessing web log data, including data cleaning, user identification and session identification. We also discussed in detail how to apply the fuzzy c-means clustering algorithm in order to cluster the user sessions.

    In order to improve the clustering results, we proposed a "Fuzzy Set Theoretic" approach for handling the sessions with very few URLs. Instead of directly removing all small sessions below a specified threshold, we assign weights to all sessions using a "Fuzzy Membership Function" based on the number of URLs accessed by each session. We described our methodology for performing feature subset selection of session vectors and session weight assignment. Finally we compared our soft computing based approach of session weight assignment with the traditional hard computing based approach of small session elimination. Our results show that the "Fuzzy Set Theoretic" approach of session weight assignment results in better minimization of the clustering performance index than clustering without session weight assignment.

    We believe that the above results can be further improved if we use a fuzzy set theoretic approach for the inclusion of a URL in a user session, instead of using the crisp time threshold β. In our current strategy a URL is not included in the current session if it arrives even one second later than the specified time threshold. We can also apply a similar fuzzy set theoretic approach to assign weights to the URLs based on how many times they are accessed.

                               REFERENCES

[1]  R. Cooley, B. Mobasher, and J. Srivastava, "Web mining: Information and pattern discovery on the world wide web," in Ninth IEEE International Conference on Tools with Artificial Intelligence, Proceedings, 1997, pp. 558–567.
[2]  Y. Fu, K. Sandhu, and M. Shih, "A generalization-based approach to clustering of web usage sessions," Lecture Notes in Computer Science, pp. 21–38, 2000.
[3]  B. Mobasher, H. Dai, T. Luo, and M. Nakagawa, "Effective personalization based on association rule discovery from web usage data," in Proceedings of the 3rd ACM Workshop on Web Information and Data Management (WIDM'01), Atlanta, Georgia, November 2001.
[4]  M. Spiliopoulou and L. C. Faulstich, "WUM: A web utilization miner," in Proceedings of EDBT Workshop WebDB'98, Valencia, Spain, LNCS 1590, Springer Verlag, 1999.
[5]  B. Mobasher, R. Cooley, and J. Srivastava, "Automatic personalization based on web usage mining," Commun. ACM, vol. 43, pp. 142–151, August 2000.
[6]  U. Fayyad and R. Uthurusamy, "Data mining and knowledge discovery in databases," Communications of the ACM, vol. 39, pp. 24–27, 1996.
[7]  W. H. Inmon, "The data warehouse and data mining," Communications of the ACM, vol. 39, pp. 49–50, 1996.
[8]  J. Han and M. Kamber, Data Mining: Concepts and Techniques. Academic Press, Morgan Kaufmann Publishers, 2001.
[9]  U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining. CA: AAAI/MIT Press, 1996.
[10] M.-S. Chen, J. Han, and P. S. Yu, "Data mining: an overview from a database perspective," IEEE Transactions on Knowledge and Data Engineering, vol. 8, 1996.
[11] P. Kolari and A. Joshi, "Web mining: research and practice," Computing in Science and Engineering, vol. 6, no. 4, pp. 49–53, 2004.
[12] W. Tong and H. Pi-lian, "Web log mining by an improved AprioriAll algorithm," in Proceedings of World Academy of Science, Engineering and Technology, 2005, pp. 97–100.
[13] A. Joshi and R. Krishnapuram, "Robust fuzzy clustering methods to support web mining," 1998.
[14] D. Tanasa and B. Trousse, "Advanced data preprocessing for intersites web usage mining," IEEE Intelligent Systems, vol. 19, no. 2, pp. 59–65, 2004.
[15] F. Klawonn and A. Keller, "Fuzzy clustering based on modified distance measures," in Advances in Intelligent Data Analysis, ser. Lecture Notes in Computer Science, D. Hand, J. Kok, and M. Berthold, Eds. Springer Berlin/Heidelberg, 1999, vol. 1642, pp. 291–301.
[16] D. Tanasa and B. Trousse, "Data preprocessing for WUM," IEEE Intelligent Systems, vol. 23, no. 3, pp. 22–25, 2004.
[17] Z. Ansari, M. F. Azeem, A. V. Babu, and W. Ahmed, "Preprocessing users web page navigational data to discover usage patterns," in The Seventh International Conference on Computing and Information Technology, Bangkok, Thailand, May 2011, proceedings vol. 1, pp. 18-189.
[18] P. Berkhin, "Survey of clustering data mining techniques," Springer, 2002.
[19] P. Berkhin, "A survey of clustering data mining techniques," in Grouping Multidimensional Data. Springer Berlin Heidelberg, 2006, pp. 25–71.
[20] R. Xu and D. Wunsch II, "Survey of clustering algorithms," IEEE Transactions on Neural Networks, vol. 16, no. 3, pp. 645–678, May 2005.
[21] M. Chau, R. Cheng, B. Kao, and J. Ng, "Uncertain data mining: An example in clustering location data," in Advances in Knowledge Discovery and Data Mining, ser. Lecture Notes in Computer Science, W. Ng, M. Kitsuregawa, J. Li, and K. Chang, Eds. Springer Berlin/Heidelberg, 2006, vol. 3918, pp. 199–204.
[22] X. L. Xie and G. Beni, "A validity measure for fuzzy clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841–847, 1991.
[23] R. Cooley, B. Mobasher, J. Srivastava et al., "Data preparation for mining world wide web browsing patterns," Knowledge and Information Systems, vol. 1, no. 1, pp. 5–32, 1999.
[24] B. Berendt, B. Mobasher, M. Nakagawa, and M. Spiliopoulou, "The impact of site structure and user environment on session reconstruction in web usage analysis," in WEBKDD 2002 - Mining Web Data for Discovering Usage Patterns and Profiles, ser. Lecture Notes in Computer Science. Springer Berlin/Heidelberg, 2003, vol. 2703, pp. 159–179.
[25] L. D. Catledge and J. E. Pitkow, "Characterizing browsing strategies in the world-wide web," Computer Networks and ISDN Systems, vol. 27, no. 6, pp. 1065–1073, 1995; proceedings of the Third International World-Wide Web Conference.
[26] B. Berendt and M. Spiliopoulou, "Analysis of navigation behaviour in web sites integrating multiple information systems," The VLDB Journal, vol. 9, pp. 56–75, 2000.

                            AUTHORS PROFILE

Zahid Ansari is a Ph.D. candidate in the Department of CSE, Jawaharlal Nehru Technical University, India. He received his M.E. from the Birla Institute of Technology, Pilani, India. He has worked at Tata Consultancy Services (TCS), where he was involved in the development of cutting-edge tools in the field of model driven software development. His areas of research include data mining, soft computing and model driven software development. He is currently a faculty member at the P.A. College of Engineering, Mangalore. He is also a member of the ACM.

Mohammad Fazle Azeem is working as Professor and Director of the Department of Electronics and Communication Engineering, P.A. College of Engineering, Mangalore. He received his B.E. in electrical engineering from M.M.M. Engineering College, Gorakhpur, India,




M.S. from Aligarh Muslim University, Aligarh, India and Ph.D. from     His current research interests are algorithms, information retrieval
Indian Institute of Technology (IIT) Delhi, India. His interests       and data mining, distributed and parallel computing, Network
include robotics, soft computing, evolutive computation, clustering    security, image processing etc.
techniques, application of neuro-fuzzy approaches for the modeling,
and control of dynamic system such as biological and chemical          Waseem Ahmed is a Professor in CSE at P.A. College of
processes.                                                             Engineering, Mangalore. He obtained his BE from RVCE, Bangalore,
                                                                       MS from the University of Houston, USA and PhD from the Curtin
A.Vinaya Babu is working as Director of Admissions and Professor       University of Technology, Western Australia. His current research
of CSE at J.N.T. University Hyderabad, India. He received his          interests include multicore/multiprocessor development for HPC and
M.Tech. and PhD in Computer Science Engineering from JNT               embedded systems, and data mining. He has been exposed to
University, Hyderabad. He is a life member of CSI, ISTE and            academic/work environments in the USA, UAE, Malaysia, Australia
member of FIE, IEEE, and IETE. He has published more than 35           and India where he has worked for more than a decade. He is a
research papers in International/National journals and Conferences.    member of the IEEE.





   Inception of Hybrid Wavelet Transform using Two
     Orthogonal Transforms and Its Use for Image
                     Compression
Dr. H. B. Kekre, Senior Professor, Computer Engineering Department, SVKM's NMIMS (Deemed-to-be University), Vile Parle (W), Mumbai, India. hbkekre@yahoo.com

Dr. Tanuja K. Sarode, Assistant Professor, Computer Engineering Department, Thadomal Shahani Engineering College, Bandra (W), Mumbai, India. tanuja_0123@yahoo.com

Sudeep D. Thepade, Associate Professor, Computer Engineering Department, SVKM's NMIMS (Deemed-to-be University), Vile Parle (W), Mumbai, India. sudeepthepade@gmail.com

Abstract—The paper presents a novel hybrid wavelet transform generation technique using two orthogonal transforms. The orthogonal transforms are used for analysis of the global properties of the data in the frequency domain. For studying the local properties of the signal, the concept of the wavelet transform is introduced, where the mother wavelet function gives the global properties of the signal and the wavelet basis functions, which are compressed versions of the mother wavelet, are used to study its local properties. The wavelets of some orthogonal transforms extract the global characteristics of the data better, while other orthogonal transforms may capture the local characteristics better. The idea of the hybrid wavelet transform comes into the picture in view of combining the traits of two different orthogonal transform wavelets to exploit the strengths of both.

         The paper proves the worth of hybrid wavelet transforms for image compression, which can further be extended to other image processing applications like steganography, biometric identification, content based image retrieval etc. Here the hybrid wavelet transforms are generated using four orthogonal transforms, alias the Discrete Cosine Transform (DCT), Discrete Hartley Transform (DHT), Discrete Walsh Transform (DWT) and Discrete Kekre Transform (DKT). The comparison of the hybrid wavelet transforms is also done with the original orthogonal transforms and their wavelet transforms. The experimental results have shown that the transform wavelets give better quality of image compression than the respective original orthogonal transforms, but the hybrid transform wavelets perform best. Here the hybrid of DCT and DKT gives the best results among the combinations of the four mentioned image transforms used for generating hybrid wavelet transforms.

  Keywords-Orthogonal transform; Wavelet transform; Hybrid Wavelet transform; Compression.

                      I.    INTRODUCTION

    Wavelets are mathematical tools that can be used to extract information from many different kinds of data, including images [4,5,6,7]. Sets of wavelets are generally needed to analyze data fully. A set of "complementary" wavelets will reconstruct data without gaps or overlap, so that the deconstruction process is mathematically reversible with minimal loss. Wavelets are the result of the thought processes of many people, starting with Haar's work in the early 20th century [19,20]. Generally, wavelets are purposefully crafted to have specific properties that make them useful for image processing. Wavelets can be combined, using a "shift, multiply and sum" technique called convolution, with portions of an unknown signal (data) to extract information from it. Wavelet transforms are now being adopted for a vast number of applications, often replacing the conventional Fourier transform [23,24,25,26]. They have advantages over traditional Fourier methods in analyzing physical situations where the signal contains discontinuities and sharp spikes [27,28,29]. In Fourier analysis the local properties of the signal are not detected easily. The STFT (Short Time Fourier Transform) [29] was introduced to overcome this difficulty; however, it gives local properties at the cost of global properties. Wavelets overcome this shortcoming of Fourier analysis [28,29] as well as of the STFT. Many areas of physics have seen this paradigm shift, including molecular dynamics, astrophysics, optics, quantum mechanics etc. This change has also occurred in image processing, blood-pressure, heart-rate and ECG analyses, DNA analysis, protein analysis, climatology, general signal processing, speech, face recognition, computer graphics and multifractal analysis. Wavelet transforms are also starting to be used for communication applications. One use of wavelet approximation is in data compression. Like other transforms, wavelet transforms can be used to transform data and then encode the transformed data, resulting in effective compression [24]. Wavelet compression can be either lossless or lossy. Wavelet compression methods are adequate for representing high-frequency components in two-dimensional images.

    Earlier, wavelets of only the Haar transform had been studied. In recent work [4,7,11,13] the wavelets of a few orthogonal transforms, alias Walsh [16,17,18], DCT [14,15], Kekre [21,22] and Hartley [1,2,3], are proposed. The wavelet transforms in many applications are proven to be better than the respective orthogonal transforms [8,9,10,12]. The paper presents an innovative hybrid wavelet transform generation method, which generates a hybrid wavelet transform from any two orthogonal transforms. This concept of hybrid wavelet transform can acquire the positive traits of both the orthogonal transforms used to generate it. The hybrid wavelet generation concept opens up new avenues of selection of orthogonal transforms for hybrids and their use in particular image processing applications



                                                                    80                              http://sites.google.com/site/ijcsis/
                                                                                                    ISSN 1947-5500
                                                                         (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                                                                   Vol. 9, No. 6, 2011
to gain some upper edge over individual orthogonal transforms or the respective wavelet transforms. The paper presents the use of hybrid wavelet transforms generated using the Discrete Walsh Transform (DWT), Discrete Kekre Transform (DKT), Discrete Hartley Transform (DHT) and Discrete Cosine Transform (DCT) for image compression. The experimental results prove that the hybrid wavelet transforms are better than the respective orthogonal transforms as well as their wavelet transforms.

          II.    GENERATION OF HYBRID WAVELET TRANSFORM

            | a11  a12  ...  a1p |          | b11  b12  ...  b1q |
        A = | a21  a22  ...  a2p |      B = | b21  b22  ...  b2q |      (1)
            | ...  ...  ...  ... |          | ...  ...  ...  ... |
            | ap1  ap2  ...  app |          | bq1  bq2  ...  bqq |

    [Equation 2, an image in the original showing the layout of the NxN hybrid wavelet transform matrix TAB built from A and B, could not be recovered here.]     (2)

    The hybrid wavelet transform matrix of size NxN (say 'TAB') can be generated from two orthogonal transform matrices (say A and B respectively, with sizes pxp and qxq, where N=p*q=pq) as given by equations 1 and 2. Here the first 'q' rows of the hybrid wavelet transform matrix are calculated as the product of each element of the first row of the orthogonal transform A with each of the columns of the orthogonal transform B. For the next 'q' rows of the hybrid wavelet transform matrix, the second row of the orthogonal transform matrix A is shift rotated after being appended with zeros, as shown in equation 2. Similarly the other rows of the hybrid wavelet transform matrix are generated (as a set of q rows each time for each of the 'p-1' rows of the orthogonal transform matrix A, starting from the second row up to the last row).

          III.    PROPERTIES OF HYBRID WAVELET TRANSFORM

    The crossbreed of two orthogonal transforms results in the hybrid wavelet transform, which itself satisfies the following properties.

A. Orthogonal
    The transform matrix K is said to be orthogonal if the following condition is satisfied.

                        [K] * [K]^t = [D]                               (3)

    where D is a diagonal matrix. The hybrid wavelet transform of size NxN generated from any two orthogonal transforms satisfies this property and hence it is orthogonal.

B. Non Involutional
    An involutional function is a function that is its own inverse, so an involutional transform is a transform which is the inverse transform of itself. The hybrid wavelet transform is a non involutional transform.

C. Transform on Vector
    The hybrid wavelet transform (say 'K') of a one-dimensional vector q is given by

                        Q = [K] * q                                     (4)

    and the inverse is given by

                        q = [K]^t * [ Qij / (μTi * μTj) ]               (5)

    where Qij is the value at the ith row, jth column of matrix Q, and the term μ in the normalization factor can be computed as given below through equations 6, 7 and 8.

                        μT = TAB * TAB^t                                (6)

    such that

                        μT1 = μA1 * μB1,
                        μT2 = μA1 * μB2,
                        μT3 = μA1 * μB3,
                        ...
                        μTq = μA1 * μBq                                 (7)

                        μT(q+1)      = μT(q+2)      = ... = μT(2q) = μA2
                        μT(2q+1)     = μT(2q+2)     = ... = μT(3q) = μA3
                        ...
                        μT((p-1)q+1) = μT((p-1)q+2) = ... = μT(pq) = μAp

    where, with reference to equation 1, μA and μB can be given as equation 8.

                        A * A^t = μA = diag( μA1, μA2, ..., μAp )       (8)
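The row-by-row construction of Section II can be sketched in code. The following minimal NumPy sketch assumes one plausible reading of equation 2 (whose original image is not recoverable): each later row of A is spread with zeros and shift rotated to yield q rows, here realized as Kronecker products with unit vectors. The function name `hybrid_wavelet` is illustrative, not from the paper.

```python
import numpy as np

def hybrid_wavelet(A, B):
    """Build an N x N hybrid wavelet matrix T_AB from orthogonal A (pxp)
    and B (qxq), N = p*q -- a sketch of the Section II construction."""
    p, q = A.shape[0], B.shape[0]
    N = p * q
    T = np.zeros((N, N))
    # First q rows: each element of A's first row multiplied with B
    T[:q, :] = np.kron(A[0, :], B)
    # Remaining rows: each later row of A spread with zeros and shift
    # rotated, giving q rows per row of A (one reading of equation 2)
    for i in range(1, p):
        for j in range(q):
            T[i * q + j, :] = np.kron(A[i, :], np.eye(q)[j])
    return T
```

With this reading, T_AB * T_AB^t comes out diagonal, i.e. the orthogonality condition of equation 3 is preserved.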




                        B * B^t = μB = diag( μB1, μB2, ..., μBq )

D. Transform on Two-Dimensional Image
    The hybrid wavelet transform of a two-dimensional image I is given by

                        Q = [K] * I * [K]^t                             (9)

    and the inverse is given by

                        I = [K]^t * [ Qij / (μTi * μTj) ] * [K]         (10)

    where Iij is the pixel intensity value at the ith row, jth column of image I, and the term μ in the normalization factor is calculated as given above through equations 6, 7 and 8.

                  IV.    RESULTS AND DISCUSSION
    The test bed used in the experimentation for proving the worth of the hybrid wavelet transform consists of 11 color images of size 256x256x3 and is shown in figure 1. On each image all three, alias the orthogonal transform, the wavelet transform and the hybrid wavelet transform, are applied. In the transform domain the high frequency data is removed and the images are transformed inversely back to the spatial domain. To judge the performance of the orthogonal transform, wavelet transform and hybrid wavelet transform in compression, the original images are compared with these modified images (having the data loss as compression) using mean squared error (MSE). In all, six data compression percentages are considered: 95%, 90%, 85%, 80%, 75% and 70%. The average of such MSEs over all images, for each transform and each considered percentage of data compression, is taken for performance analysis.
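Equations 9 and 10, together with the coefficient-removal step just described, can be sketched as follows. The magnitude-thresholding used to discard the stated percentage of coefficients is an assumption on our part (the paper only says the high frequency data is removed in the transform domain); function names are illustrative.

```python
import numpy as np

def forward(I, T):
    # Equation 9: Q = [K] * I * [K]^t
    return T @ I @ T.T

def inverse(Q, T):
    # Equations 6 and 10: normalize each Q_ij by mu_Ti * mu_Tj,
    # where the mu_T values are the diagonal of T * T^t
    mu = np.diag(T @ T.T)
    return T.T @ (Q / np.outer(mu, mu)) @ T

def compression_mse(I, T, data_compression=0.95):
    """MSE after discarding `data_compression` of the coefficients
    (smallest magnitudes zeroed -- an assumed reading of the removal step)."""
    Q = forward(I, T)
    thresh = np.quantile(np.abs(Q), data_compression)
    Q_lossy = np.where(np.abs(Q) >= thresh, Q, 0.0)
    return float(np.mean((I - inverse(Q_lossy, T)) ** 2))
```

Because T * T^t is diagonal for the hybrid wavelet matrix, `inverse(forward(I, T), T)` recovers I exactly when no coefficients are dropped, so any nonzero MSE comes from the compression step alone.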




  Figure 1: The test bed of eleven original color images belonging to different categories, namely (from left to right and top to bottom) Aishwarya, Balls, Bird, Boat, Flower, Dagdusheth-Ganesh, TajMahal, Strawberry, Scenery, Tiger and Viharlake-Powai.




Figure 2: Performance comparison of Image compression using Discrete Cosine transform (DCT), cosine wavelet transform (DCT wavelets) and the hybrid
wavelet transforms of DCT taken with Hartley (DCT_DHT) and Kekre transforms (DCT_DKT) with respect to 95% to 70% of data compression.


Figure 2 shows the average mean squared error (MSE) between the original and respective compressed image pairs, plotted against percentages of data compression from 95% to 70%, for image compression done using the Discrete Cosine transform (DCT), the cosine wavelet transform (DCT wavelets) and the hybrid wavelet transforms of DCT taken with the Hartley (DCT_DHT) and Kekre transforms (DCT_DKT). Here the performance of the hybrid wavelet transforms (DCT_DKT and DCT_DHT) is the best, as indicated by minimum MSE values compared with the respective DCT and DCT wavelet transform.




Figure 3: Performance comparison of Image compression using Discrete Walsh transform (DWT), Walsh wavelet transform (Walsh wavelets) and the hybrid
wavelet transforms of Walsh transform taken with Hartley (DWT_DHT) and Cosine transforms (DWT_DCT) with respect to 95% to 70% of data compression

The average mean squared error (MSE) between the original and respective compressed image pairs for image compression done using the Discrete Walsh transform (DWT), the Walsh wavelet transform (Walsh wavelets) and the hybrid wavelet transforms of the Walsh transform taken with the Hartley (DWT_DHT) and Cosine transforms (DWT_DCT), with respect to 95% to 70% of data compression, is plotted in figure 3. Here the performance of the hybrid wavelet transforms (DWT_DHT and DWT_DCT) is better than that of the Walsh transform and almost similar to that of the Walsh wavelet transform. The DWT_DCT hybrid wavelet transform performs marginally better in the cases of 95% and 80% data compression.




 Figure 4: Performance comparison of Image compression using Discrete Hartley transform (DHT), Hartley wavelet transform (Hartley wavelets) and the hybrid
 wavelet transforms of Hartley transform taken with Walsh (DHT_DWT) and Cosine transforms (DHT_DCT) with respect to 95% to 70% of data compression.








   Figure 5: Performance comparison of Image compression using Discrete Kekre transform (DKT), Kekre wavelet transform (Kekre wavelets) and the hybrid
               wavelet transforms of Kekre transform taken with Cosine transforms (DKT_DCT) with respect to 95% to 70% of data compression

Figure 4 gives the average mean squared error (MSE) between the original and respective compressed image pairs for image compression done using the Discrete Hartley transform (DHT), the Hartley wavelet transform (Hartley wavelets) and the hybrid wavelet transforms of the Hartley transform taken with the Walsh (DHT_DWT) and Cosine transforms (DHT_DCT), with respect to 95% to 70% of data compression. Here, except at 95% data compression, in all other percentages of data compression the performance of the hybrid wavelet transforms (DHT_DWT and DHT_DCT) is better than that of the Hartley transform and almost similar to that of the Hartley wavelet transform, with DHT_DCT proving to be marginally better.

In the case of image compression using the hybrid wavelet transform (DKT_DCT) generated from the discrete Kekre transform (DKT) and the discrete Cosine transform (DCT), the performance is almost similar to that of the Kekre wavelet transform but better than that of the Kekre transform, as shown in figure 5.




 Figure 6: Overall performance analysis of Image compression using the orthogonal transforms, their respective wavelet transforms and newly introduced hybrid
                      wavelet transforms for Cosine, Kekre, Walsh and Hartley transforms with respect to 95% to 70% of data compression


Figure 6 gives the overall performance comparison of image compression using all the proposed hybrid wavelet transforms against the respective orthogonal transform and wavelet transform based compression methods for various percentages of data compression, from 70% to as high as 95%. Overall, the best performance is given by DCT_DKT (the hybrid wavelet transform of the Cosine transform with the Kekre transform), followed by DCT_DWT and DCT_DHT (the hybrid wavelet transforms of the Cosine transform taken respectively with the Walsh transform and the Hartley transform). For all the respective orthogonal transforms, the hybrid wavelet transforms have shown better quality of image compression.








Figure 7: The compression of flower image using the hybrid wavelet transform (DCT_DHT Wavelet) generated using Discrete Cosine transform and Discrete
                                           Hartley transform with respect to 95% to 70% of data compression




Figure 8: The compression of flower image using the hybrid wavelet transform (DCT_DKT Wavelet) generated using Discrete Cosine transform and Discrete
                                            Kekre transform with respect to 95% to 70% of data compression








  Figure 9: The compression of flower image using the hybrid wavelet transform (DCT_DWT Wavelet) generated using Discrete Cosine transform and Discrete
                                              Walsh transform with respect to 95% to 70% of data compression.

Figures 7, 8 and 9 show the compression of the flower image for the various hybrid wavelet transforms with respect to 95% to 70% of data compression. The subjective quality of compression in all cases is quite acceptable, as negligible distortion is observed between the original and compressed images even at 95% data compression. Even the objective criterion (i.e. mean squared error) values of the differences between the original and compressed images are minimal.

                  V.    CONCLUSION
The innovative concept of hybrid wavelet transform generation using any two orthogonal transforms is proposed in the paper. Here the hybrid wavelet transforms are generated using the Discrete Walsh Transform (DWT), Discrete Kekre Transform (DKT), Discrete Hartley Transform (DHT) and Discrete Cosine Transform (DCT) for image compression. The experimental results prove that the hybrid wavelet transforms are better than the respective orthogonal transforms as well as their wavelet transforms. Various orthogonal transforms can be considered for crossbreeding to generate the hybrid wavelet transform, based on the expected behavior of the hybrid wavelet transform for a particular application. After proving the worth of hybrid wavelet transforms for image compression, future work could include the extension of the concept to other image processing applications like steganography, biometric identification, content based image retrieval etc.

                  VI.    REFERENCES
[1]  R. V. L. Hartley, "A more symmetrical Fourier analysis applied to transmission problems," Proceedings of IRE 30, pp. 144–150, 1942.
[2]  R. N. Bracewell, "Discrete Hartley transform," Journal of Opt. Soc. America, Volume 73, Number 12, pp. 1832–183, 1983.
[3]  R. N. Bracewell, "The fast Hartley transform," Proc. of IEEE, Volume 72, Number 8, pp. 1010–1018, 1984.
[4]  Dr. H. B. Kekre, Sudeep D. Thepade, Adib Parkar, "A Comparison of Haar Wavelets and Kekre's Wavelets for Storing Colour Information in a Greyscale Image," International Journal of Computer Applications (IJCA), Volume 1, Number 11, December 2010, pp. 32–38.
[5]  Dr. H. B. Kekre, Sudeep D. Thepade, Adib Parkar, "Storage of Colour Information in a Greyscale Image using Haar Wavelets and Various Colour Spaces," International Journal of Computer Applications (IJCA), Volume 6, Number 7, pp. 18–24, September 2010.
[6]  Dr. H. B. Kekre, Sudeep D. Thepade, Juhi Jain, Naman Agrawal, "IRIS Recognition using Texture Features Extracted from Walshlet Pyramid," ACM-International Conference and Workshop on Emerging Trends in Technology (ICWET 2011), Thakur College of Engg. and Tech., Mumbai, 26-27 Feb 2011. Also to be uploaded on the online ACM Portal.
[7]  Dr. H. B. Kekre, Sudeep D. Thepade, Akshay Maloo, "Face Recognition using Texture Features Extracted from Walshlet Pyramid," ACEEE International Journal on Recent Trends in Engineering and Technology (IJRTET), Volume 5, Issue 1, www.searchdl.org/journal/IJRTET2010
[8]  Dr. H. B. Kekre, Sudeep D. Thepade, Juhi Jain, Naman Agrawal, "Performance Comparison of IRIS Recognition Techniques using Wavelet Pyramids of Walsh, Haar and Kekre Wavelet Transforms," International Journal of Computer Applications (IJCA), Number 2, Article 4, March 2011, http://www.ijcaonline.org/proceedings/icwet/number2/2070-aca386
[9]  Dr. H. B. Kekre, Sudeep D. Thepade, Akshay Maloo, "Face Recognition using Texture Features Extracted from Haarlet Pyramid," International Journal of Computer Applications (IJCA), Volume 12, Number 5, December 2010, pp. 41–45. Available at www.ijcaonline.org/archives/volume12/number5/1672-2256
[10] Dr. H. B. Kekre, Sudeep D. Thepade, Juhi Jain, Naman Agrawal, "IRIS Recognition using Texture Features Extracted from Haarlet Pyramid," International Journal of Computer Applications (IJCA), Volume 11, Number 12, December 2010, pp. 1–5. Available at www.ijcaonline.org/archives/volume11/number12/1638-2202
[11] Dr. H. B. Kekre, Sudeep D. Thepade, Akshay Maloo, "Performance Comparison of Image Retrieval Techniques using Wavelet Pyramids of Walsh, Haar and Kekre Transforms," International Journal of Computer Applications (IJCA), Volume 4, Number 10, August 2010 Edition, pp. 1–8, http://www.ijcaonline.org/archives/volume4/number10/866-1216
[12] Dr. H. B. Kekre, Sudeep D. Thepade, Akshay Maloo, "Query by image content using color texture features extracted from Haar wavelet pyramid," International Journal of Computer Applications (IJCA), special edition on "Computer Aided Soft Computing Techniques for Imaging and Biomedical Applications," Number 2,




     Article 2, August 2010. http://www.ijcaonline.org/specialissues/casct/number2/1006-41
[13] Dr. H. B. Kekre, Sudeep D. Thepade, "Image Retrieval using Color-Texture Features Extracted from Walshlet Pyramid," ICGST International Journal on Graphics, Vision and Image Processing (GVIP), Volume 10, Issue I, Feb. 2010, pp. 9–18. Available online at www.icgst.com/gvip/Volume10/Issue1/P1150938876.html
[14] N. Ahmed, T. Natarajan and K. R. Rao, "Discrete Cosine Transform," IEEE Transactions on Computers, C-23, pp. 90–93, January 1974.
[15] W. Chen, C. H. Smith and S. C. Fralick, "A Fast Computational Algorithm for the Discrete Cosine Transform," IEEE Transactions on Communications, Com-25, pp. 1004–1008, Sept. 1977.
[16] George Lazaridis, Maria Petrou, "Image Compression By Means of Walsh Transform," IEEE Transactions on Image Processing, Volume 15, Number 8, pp. 2343–2357, 2006.
[17] J. L. Walsh, "A Closed Set of Orthogonal Functions," American Journal of Mathematics, Volume 45, pp. 5–24, 1923.
[18] Zhibin Pan, Kotani K., Ohmi T., "Enhanced fast encoding method for vector quantization by finding an optimally-ordered Walsh transform kernel," ICIP 2005, IEEE International Conference, Volume 1, pp. I-573–6, Sept. 2005.
[19] Charles K. Chui, "An Introduction to Wavelets," Academic Press, San Diego, 1992, ISBN 0585470901.
[20] Ingrid Daubechies, "Ten Lectures on Wavelets," SIAM, 1992.
[21] Dr. H. B. Kekre, Sudeep D. Thepade, "Image Retrieval using Non-Involutional Orthogonal Kekre's Transform," International Journal of Multidisciplinary Research and Advances in Engineering (IJMRAE), Ascent Publication House, Volume 1, No. I, pp. 189–203, 2009. Abstract available online at www.ascent-journals.com
[22] Dr. H. B. Kekre, Sudeep D. Thepade, Archana Athawale, Anant S., Prathamesh V., Suraj S., "Kekre Transform over Row Mean, Column Mean and Both using Image Tiling for Image Retrieval," International Journal of Computer and Electrical Engineering (IJCEE), Volume 2, Number 6, October 2010, pp. 964–971, is

head in the Department of Computer Engg. at Thadomal Shahani Engineering College, Mumbai. Now he is Senior Professor at MPSTME, SVKM's NMIMS. He has guided 17 Ph.D.s, more than 100 M.E./M.Tech. and several B.E./B.Tech. projects. His areas of interest are Digital Signal Processing, Image Processing and Computer Networking. He has more than 270 papers in National/International Conferences and Journals to his credit. He was a Senior Member of IEEE. Presently he is a Fellow of IETE and a Life Member of ISTE. Recently 11 students working under his guidance have received best paper awards. Two of his students have been awarded Ph.D. from NMIMS University. Currently he is guiding ten Ph.D. students.

Dr. Tanuja K. Sarode has received B.Sc. (Mathematics) from Mumbai University in 1996, B.Sc. Tech. (Computer Technology) from Mumbai University in 1999, the M.E. (Computer Engineering) degree from Mumbai University in 2004, and a Ph.D. from Mukesh Patel School of Technology, Management and Engineering, SVKM's NMIMS University, Vile-Parle (W), Mumbai, INDIA. She has more than 12 years of experience in teaching. She is currently working as Assistant Professor in the Dept. of Computer Engineering at Thadomal Shahani Engineering College, Mumbai. She is a life member of IETE and a member of the International Association of Engineers (IAENG) and the International Association of Computer Science and Information Technology (IACSIT), Singapore. Her areas of interest are Image Processing, Signal Processing and Computer Graphics. She has 90 papers in National/International Conferences/Journals to her credit.

Sudeep D. Thepade has received the B.E. (Computer) degree from North Maharashtra University with Distinction in 2003 and the M.E. in Computer Engineering from University of Mumbai in 2008 with Distinction, and has currently submitted his thesis for Ph.D. at SVKM's NMIMS, Mumbai. He has more than 08 years of experience in teaching and industry. He was Lecturer in the Dept. of Information Technology at
       available at www.ijcee.org/papers/260-E272.pdf                                                         Thadomal Shahani Engineering College,
[23]   K. P. Soman and K.I. Ramachandran. ”Insight into WAVELETS                                              Bandra(w), Mumbai for nearly 04 years.
       From Theory to Practice”, Printice -Hall India, pp 3-7, 2005.                                          Currently working as Associate Professor in
                                                                                                              Computer Engineering at Mukesh Patel School
[24]   Raghuveer M. Rao and Ajit S. Bopardika. “Wavelet Transforms –
                                                                                  of Technology Management and Engineering, SVKM’s NMIMS, Vile
       Introduction to Theory and Applications”, Addison Wesley
                                                                                  Parle(w), Mumbai, INDIA. He is member of International Association of
       Longman, pp 1-20, 1998.
                                                                                  Engineers (IAENG) and International Association of Computer Science
[25]   C.S. Burrus, R.A. Gopinath, and H. Guo. “Introduction to Wavelets          and Information Technology (IACSIT), Singapore. He is member of
       and Wavelet Transform” Prentice-hall International, Inc., New              International Advisory Committee for many International Conferences. He
       Jersey, 1998.                                                              is reviewer for various International Journals. His areas of interest are
[26]   Amara Graps, ”An Introduction to Wavelets”, IEEE Computational             Image Processing Applications, Biometric Identification. He has about 110
       Science and Engineering, vol. 2, num. 2, Summer 1995, USA.                 papers in National/International Conferences/Journals to his credit with a
[27]   Julius O. Smith III and Xavier SerraP“, An Analysis/Synthesis              Best Paper Award at International Conference SSPCCIN-2008, Second
       Program for Non-Harmonic Sounds Based on a Sinusoidal                      Best Paper Award at ThinkQuest-2009 National Level paper presentation
       Representation'', Proceedings of the International Computer Music          competition for faculty, Best paper award at Springer international
       Conference (ICMC-87, Tokyo), Computer Music Association, 1987.             conference ICCCT-2010 and second best research project award at
                                                                                  ‘Manshodhan-2010’.
[28]   S. Mallat, "A Theory of Multiresolution Signal Decomposition: The
       Wavelet Representation," IEEE Trans. Pattern Analysis and Machine
       Intelligence, vol. 11, pp. 674-693, 1989.
[29]   Strang G. "Wavelet Transforms Versus Fourier Transforms." Bull.
       Amer. Math. Soc. 28, 288-305, 1993.


                      AUTHORS PROFILE
Dr. H. B. Kekre has received B.E. (Hons.) in Telecomm. Engineering.
                          from Jabalpur University in 1958, M.Tech
                          (Industrial Electronics) from IIT Bombay in
                          1960, M.S.Engg. (Electrical Engg.) from
                          University of Ottawa in 1965 and Ph.D.
                          (System Identification) from IIT Bombay
                          in 1970 He has worked as Faculty of
                          Electrical Engg. and then HOD Computer
                          Science and Engg. at IIT Bombay. For 13
                          years he was working as a professor and




                                                                             87                                 http://sites.google.com/site/ijcsis/
                                                                                                                ISSN 1947-5500
                                                               (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                                                         Vol. 9, No. 6, 2011

              A Model for the Controlled Development of
                   Software Complexity Impacts
         Ghazal Keshavarz                                   Nasser Modiri                                        Mirmohsen Pedram
       Computer department                               Computer department                                Computer Engineering department
    Science and Research Branch,                         Islamic Azad University                                Tarbiat Moallem University
      Islamic Azad University                                 Zanjan, Iran                                         Karaj/Tehran, Iran
            Tehran, Iran                               nassermodiri@yahoo.com                                      pedram@tmu.ac.ir
   ghazalkeshavarz@gmail.com

Abstract— Several studies have shown that software complexity affects
different features of software, the most important being productivity,
quality and maintainability. Measuring and controlling complexity therefore
has an important influence on improving these features. So far, most of the
proposed approaches to controlling and measuring complexity operate in the
code and design phases and are mainly based on code and cognitive methods;
but measuring and controlling complexity in these phases is too late. In
this paper, with emphasis on the requirement engineering process, we analyze
the factors affecting complexity in the early stages of the software life
cycle and present a model. This model enables software engineers to identify
the causes of complexity that are the origin of many costs in later phases
(especially the maintenance phase) and to prevent error propagation. We also
specify the relationship between software complexity and important features
of software, namely quality, productivity and maintainability, and present a
model for it as well.

    Keywords- Requirement Engineering, Software Complexity, Software Quality

                       I.    INTRODUCTION
    In recent decades, software complexity has created a new era in computer
science.

    Software complexity can be seen as the main driver of the cost,
reliability and performance of software systems. There is no common
agreement on the definition of software complexity, but most definitions are
based on Zuse's view [1]: "software complexity is the degree of difficulty
in analyzing, maintaining, testing, designing and modifying software". In
other words, software complexity is an issue that pervades the entire
software development process and every stage of the product life cycle.

    In the development phases of software, complexity strongly influences
the effort required to analyze and describe requirements, design, code, test
and debug the system. In the maintenance phase, complexity determines the
difficulty of error correction and the effort required to change different
software modules.

    Requirements form the foundation of the software development process. A
loose foundation brings down the whole structure, and weak requirements
documentation (the result of the Requirement Engineering process) results in
project failure.

    Recent surveys suggest that 44% to 80% of all defects are inserted in
the requirements phase [2]. Thus, errors that are not identified in the
requirements phase lead to mistakes, a wrongly developed product and the
loss of valuable resources.

    However, it will not be possible to develop better quality requirements
without a well-defined Requirement Engineering (RE) process. Since RE is the
starting point of software engineering and the later stages of software
development rely heavily on the quality of the requirements, there is good
reason to pay close attention to it.

    According to the CHAOS report published by the Standish Group [3], good
RE practices contribute more than 42% towards the overall success of a
project, much more than other factors (see Table I).

                 TABLE I.    PROJECT SUCCESS FACTORS

    Project Success Factors           % of Responses   Strongly Related to RE
    User involvement                      15.9%                 Yes
    Executive Management Support          13.9%                 Yes
    Clear Statement of Requirement        13.0%                 Yes
    Proper Planning                        9.6%
    Realistic Expectation                  8.2%
    Smaller Project Milestones             7.7%
    Competent Staff                        7.2%
    Ownership                              5.3%
    Clear Vision and Objectives            2.9%
    Hard-working, focused Staff            2.4%
    Other                                 13.9%

    In the following chapters of this article, we introduce both complexity
and the Requirement Engineering field, and present our proposed model for
controlling complexity by identifying




influential factors on complexity in the early stages of the life cycle.

                II.   REQUIREMENT ENGINEERING
    Before discussing RE activities, it is worth having a definition of
Requirement Engineering. Zave [4] provides one of the clearest definitions:
"Requirement engineering is the branch of software engineering concerned
with the real world goals for, functions of, and constraints on software
systems. It is also concerned with the relationship of these factors to
precise specifications of software behavior, and to their evolution over
time and across software families."

    Brooks [5] said: "The hardest single part of building a software system
is deciding what to build. No other part of the conceptual work is as
difficult as establishing the detailed technical requirements, including all
the interfaces to people, to machines, and to other software systems."

    The Requirement Engineering process consists of five sub-processes:
Requirement Elicitation, Requirement Analysis, Requirement Specification,
Requirement Validation, and Requirement Management.

    Capturing user requirements and analyzing them form the first two phases
of the Requirement Engineering process. After elicitation, the requirements
are categorized and prioritized in the requirements analysis phase. Grouping
requirements into logical entities helps in planning, reporting, and
tracking them. Prioritization specifies the relative importance and risk of
each requirement, which helps in managing the project effectively. At the
requirements specification stage, the information collected during
requirements elicitation is structured into a set of functional and
non-functional requirements for the system, and the SRS is produced as the
output of the Requirement Engineering process. Note that a good SRS must
have the qualities expressed in the IEEE standard [6]: for example, it
should be unambiguous, complete, verifiable, consistent, modifiable,
traceable, etc. However, customers cannot always specify accurate and
complete requirements at the start of the process. Removing obsolete
requirements, adding new ones, and changing them are part of a never-ending
process during the software development life cycle. Traceability aids in
assessing the impact of changes and is a fundamental activity for the
Requirements Management process. On the other hand, Requirements Management
ensures that changes are maintained throughout the software development life
cycle (SDLC).

  III.   CURRENT WORK IN THE SOFTWARE COMPLEXITY AREA
    Software complexity is a broad topic in software engineering and has
attracted many researchers since 1976 [7]. Complexity control and management
have important roles in risk management, cost control, reliability
prediction, and quality improvement. Complexity can be classified into two
parts: problem complexity (or inherent complexity) and solution complexity
(also referred to as added complexity). Solution complexity is added during
the development stages following the requirements phase, mostly during the
design and coding phases.

    Many researchers believe that software complexity is made up of the
following kinds of complexity [8]:
   •   Problem complexity, which measures the complexity of the underlying
       problem. This type of complexity can be traced back to the
       requirements phase, when the problem is defined.
   •   Algorithmic complexity, which reflects the complexity of the
       algorithm implemented to solve the problem.
   •   Structural complexity, which reflects the complexity of the structure
       of the software used to implement the algorithm.
   •   Cognitive complexity, which measures the effort required to
       understand the software.

    Most existing work has been on identifying and measuring algorithmic,
structural and cognitive complexity. Algorithmic complexity measures the
algorithm implemented to solve the problem and is based on mathematical
methods. This complexity is measurable as soon as an algorithm for a
solution is created, usually during the design phase.

    Structural complexity is composed of data flow, control flow and data
structure. Several metrics have been proposed to measure this type of
complexity, for example McCabe's cyclomatic complexity [9] (which directly
measures the number of linearly independent paths within a module and is
considered a correct and reliable metric), the Henry and Kafura metric [10]
(which measures the information flow to and from a module; a high value of
information flow represents a lack of cohesion in the design, which causes
higher complexity) and the Halstead metric [11] (which is based on counting
the operators and operands and their respective occurrences in the code, and
is among the strongest indicators of code complexity).

    There are also metrics based on cognitive methods, such as the KLCID
complexity metric [12], which defines identifiers as the programmer-defined
variables and is based on identifier density; to calculate it, the number of
unique program lines is considered.

    Identifying and controlling complexity in the code or design stages of
development is too late and leads to error propagation in the whole system.

    So, to prevent the waste of valuable resources and unnecessary
complexity, it is better to focus on the early stages of the software life
cycle. The result of identifying complexity factors is lower cost and higher
quality in software development, especially in the maintenance stage. By
knowing these factors, the project team can try to prevent them from
occurring or establish suitable strategies in the design and implementation
phases.

        IV.   MODEL OF SOFTWARE COMPLEXITY FACTORS
    In the proposed model, we have organized software complexity factors
according to their importance in the first phase of the SDLC (see Figure 1).

    Based on this model, there are two main complexity factors in the
requirements phase: human resources and the requirements document (the
output of the Requirement Engineering process).
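    Before turning to these requirement-level factors, the design- and
code-phase metrics surveyed in Section III can be made concrete. The sketch
below is illustrative only (it is not part of this paper's model, and the
edge, node and token counts are hypothetical): McCabe's V(G) = E - N + 2P
from a control flow graph, and the standard Halstead measures from operator
and operand counts.

```python
import math

def cyclomatic_complexity(edges: int, nodes: int, components: int = 1) -> int:
    """McCabe's V(G) = E - N + 2P: the number of linearly
    independent paths through a control flow graph."""
    return edges - nodes + 2 * components

def halstead(n1: int, n2: int, N1: int, N2: int) -> dict:
    """Halstead measures from distinct operators/operands (n1, n2)
    and their total occurrences (N1, N2)."""
    vocabulary = n1 + n2
    length = N1 + N2
    volume = length * math.log2(vocabulary)
    difficulty = (n1 / 2) * (N2 / n2)
    return {"vocabulary": vocabulary, "length": length,
            "volume": volume, "difficulty": difficulty,
            "effort": difficulty * volume}

# A module whose control flow graph has 9 edges and 8 nodes
# has V(G) = 9 - 8 + 2 = 3 independent paths.
print(cyclomatic_complexity(edges=9, nodes=8))  # 3
print(halstead(n1=4, n2=5, N1=10, N2=12))
```

    Both metrics reduce to simple counts once the artifact exists, which is
precisely why they are only available from the design phase onward; a module
is often flagged for refactoring when V(G) exceeds McCabe's suggested
threshold of 10.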




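    The identifier-density idea behind the KLCID metric mentioned in
Section III can likewise be sketched. The published definition has variants,
so this hedged version simply assumes KLCID = (identifiers found in the set
of unique program lines) / (number of unique lines containing identifiers),
with a deliberately simplified keyword list standing in for a real
language's keyword set:

```python
import re

IDENT = re.compile(r"[A-Za-z_]\w*")
# Simplified keyword list; a real implementation would use the full
# keyword set of the language being measured.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "float", "void"}

def klcid(source: str) -> float:
    """Identifier density over the set of unique program lines."""
    unique_lines = {line.strip() for line in source.splitlines() if line.strip()}
    identifiers = 0
    lines_with_identifiers = 0
    for line in unique_lines:
        found = [tok for tok in IDENT.findall(line) if tok not in KEYWORDS]
        if found:
            identifiers += len(found)
            lines_with_identifiers += 1
    return identifiers / lines_with_identifiers if lines_with_identifiers else 0.0

snippet = "int a = b + c;\nint a = b + c;\nreturn a;"
print(klcid(snippet))  # 2.0: four identifiers over two unique lines
```

    Note how the duplicate line is counted only once: the metric measures
density over unique lines, so repeated code does not inflate the score.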
          Figure 1. The Model of Software Complexity Factors

    Human resources are considered at the stakeholder and project team
levels. Stakeholders are the most important complexity factor, because the
results of the requirements extraction process and their requested and
desired items form the basis of the system.

    Stakeholders are people with different backgrounds, organizational and
personal goals and social situations; each of them has their own way of
understanding and expressing knowledge and communicates in various ways with
other people. Complexity therefore depends widely on the stakeholders, who
are placed at the first level of the model.

    The input for the next phases of the software life cycle is the set of
documents derived from the requirements analysis phase. All the items and
stated requirements cause complexity, so the documents are considered the
second level of the model. Note that this level of complexity results in the
inherent complexity of the system.

    Finally, the project team (as a subset of human resources) is considered
another complexity factor, because of differences in cognitive, experiential
and subjective skills, and is placed at the third level of the model. In the
following, the model is discussed in more detail.

A. Inherent Complexity Factors
    The output of the Requirement Engineering process is the Software
Requirement Specification (SRS). The SRS covers the principles of software
acceptance and describes the software product, not the product development
process. The SRS is composed of several items, such as functional
requirements, non-functional requirements, design constraints, interfaces,
users, inputs and outputs, etc. All these items are the basis for complexity
identification.

   • Functional Requirements: Functional requirements define the fundamental
     actions that must take place in the software. Most researchers claim
     that the size of the product is one of the main factors determining its
     complexity. In other words, more functional requirements result in a
     larger and more complex system that requires more effort and resources
     (especially in the maintenance phase).
        o Stability Degree: Some systems are located in dynamic and
          competitive environments or interact with evolving systems, so
          their functional requirements are exposed to frequent changes.
          Systems which undergo frequent modification have higher error
          rates, because each modification represents an opportunity for new
          errors to be generated. It may also be the case that when systems
          are undergoing frequent changes, there is less opportunity and
          less interest in testing those changes thoroughly. All of this
          leads to complexity.
        o Sub-functions: It may be appropriate to partition the functional
          requirements into sub-functions or sub-processes. This does not
          imply that the software design will also be partitioned that way.
          A higher number of sub-functions in a functional requirement means
          a higher rate of complexity in that requirement.
   • Non-Functional Requirements: These refer to the qualitative
     requirements of the system; not fulfilling them leads to customer
     dissatisfaction. A higher number of non-functional requirements, and
     stronger obligations to meet them, lead to more complexity in the
     product. A way to rank requirements is to distinguish classes of
     requirements as essential, desirable, and optional.
   • Design Constraints: The SRS should specify design constraints that can
     be imposed by other standards, hardware limitations, etc. Some
     constraints are the implementation language, database integrity
     policies, the operating environment and the size of required resources;
     all of these limit the developers and add complexity to the system.
   • System Interfaces: These include hardware interfaces, software
     interfaces, user interfaces, communication interfaces, etc. They
     specify the logical characteristics between the software product and
     hardware components, other software products, users and different
     communication protocols. A higher number of interfaces represents more
     complexity in the system.
   • Input, Output, Files: Functional requirements process the inputs and
     generate the outputs, and files are the data stored in the system. So
     the number of files, input and output parameters, and the relationships
     between them are very important. A large number of these parameters
     represents many transactions and thus complexity in the system.
   • Users: These requirements are the number of supported terminals and the
     number of concurrent users. A high




     number of any of these items indicates high complexity in the system.

B. Added Complexity Factors
    Added complexity is added during the different phases of the software
life cycle by various factors, such as the inappropriate use of standards,
methodologies, methods and tools, lack of coordination in the project team,
lack of sufficient skills and experience, etc. Human resources are part of
the project resources, and shortcomings in them are one of the causes of
complexity. Human resources are investigated at the stakeholder and project
team levels.

   • Stakeholder level: Stakeholders are individuals or organizations that
     are affected by the project and directly or indirectly affect the
     system requirements. Requirements extraction is the process of
     identifying stakeholder needs, and the most common challenges during
     the requirements elicitation process are ensuring effective
     communication between the various stakeholders and eliciting implicit
     knowledge. Effective communication is therefore an important factor in
     project success and in developing a good SRS. In the following, we
     describe the challenges associated with stakeholders.
        o Heterogeneity of the Organization: When doing a project for an
          organization, there is a strong possibility that not all
          stakeholders are in one geographical location. This means that
          requirements extraction is done from various stakeholders in many
          different places. This problem occurs due to the heterogeneity of
          the organization. Research has been done into a number of
          'capability barriers' which prevent effective communication in
          geographically dispersed groups [13]. The three identified
          problems included not sharing a common first language, being
          separated by sixteen time zones, and differences in typing ability
          when communicating via a messaging program.
        o Number of Stakeholders: When conducting a project for a virtual
          organization, requirements extraction from numerous stakeholders
          wastes many resources (time and cost); furthermore, integrating
          the extracted requirements is time-consuming and hard.
        o Stakeholders' Skills: System users are a group of major
          stakeholders. Individually, they can enhance sustainability at the
          company they work
   • Project team level: The skill and experience of the project members
     have also been identified as possible factors affecting the complexity
     of user requirements. The skill of members could be measured in a
     number of very complex ways, but for the experience of project members
     one factor has been highlighted as important: has the project member
     worked on a similar project before; if so, how many similar projects
     has he or she been part of, and has the project member had experience
     in the same team before? Sub-contracting may also be considered a
     complexity factor: if the Requirement Engineering team is not present
     in the next phases of software development, the new team is not
     familiar with the initial SRS, and hence changing or improving the
     software may ignore some aspects and may lead to errors and complexity.

      V.   MODEL OF SOFTWARE COMPLEXITY RESULTS
    In this section, we provide a model of software complexity results and
examine its impact on the main features of the system, namely quality and
productivity.

          Figure 2. The Model of Software Complexity Results

    Cost, quality and maintenance are among the topics most closely related
to the software development process, and complexity is a determining factor
that may affect them (see Figure 2).

    Complexity affects two important aspects of software: error proneness
and maintenance. The main idea
               for by bringing their personal skills and                behind the relationship between complexity and error-
               experiences to aid change and innovation.                proneness is that when comparing two different solutions the
               Against, an inexperienced user by providing              more complex solution is also generating the more number of
               irrelevant,    contradictory    and     confused         errors. This relationship is one of the most analyzed by
               requirements and frequent changes in                     software metrics’ researchers and previous studies and
               requirements may cause the complexity and thus           experiments have found this relationship to be statistically
               imposes heavy costs on the software.                     significant [14].
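The complexity/error-proneness link discussed above is typically quantified with structural metrics, the best known being McCabe's cyclomatic complexity [9], cited later in this paper. As a minimal illustrative sketch (the function and the toy control-flow graph below are our own illustration, not taken from the paper), V(G) = E - N + 2P counts the edges and nodes of a control-flow graph:

```python
def cyclomatic_complexity(edges, nodes, components=1):
    """McCabe's V(G) = E - N + 2P for a control-flow graph."""
    return len(edges) - len(nodes) + 2 * components

# Control-flow graph of a function with a single if/else branch:
# entry -> cond; cond -> then; cond -> else; then -> exit; else -> exit
nodes = ["entry", "cond", "then", "else", "exit"]
edges = [("entry", "cond"), ("cond", "then"), ("cond", "else"),
         ("then", "exit"), ("else", "exit")]

print(cyclomatic_complexity(edges, nodes))  # 5 - 5 + 2 = 2
```

One decision point yields V(G) = 2; each additional independent branch raises it by one, which is why the metric correlates with the number of test paths and, empirically, with error counts.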




                                                                   91                                http://sites.google.com/site/ijcsis/
                                                                                                     ISSN 1947-5500
                                                                       (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                                                                 Vol. 9, No. 6, 2011
    High levels of software complexity make software more difficult to understand, and they increase the probability of hidden errors. These errors represent the cost of current disturbances and the cost of future activities to fix them.

    Since a great deal of software development cost is directed to software integration testing, it is crucial for project performance to possess instruments for predicting and identifying the types of errors that may occur in a specific module.

    Error proneness can influence quality. This impact is traceable through the "usability" and "reliability" of the software. The concept of usability is connected to what the customer expects from the product. If customers feel that they can use the product in the way they intend to, they will more likely be satisfied and regard it as a product of high quality. Thus, a large number of errors in software would presumably lower the usability of the program.

    The reliability of a system is often measured by trying to determine the mean time elapsed between occurrences of faults (the results of software errors) in the system. A more reliable product is more stable and has fewer unexpected interruptions than a less reliable product.

    A defective product has a large number of errors and must undergo frequent changes to fix them. Frequent changes are not desirable to users and have a negative effect on product quality. Moreover, such a product needs more resources to fix errors and thus indirectly impacts productivity.

    The relationship between complexity and maintainability is clear. According to Corbi's viewpoint [15], more maintenance cost is spent on understanding the system than on modifying and improving it. Therefore, higher levels of system complexity make the system difficult to understand, so maintenance becomes time-consuming and costly.

    On the other hand, as shown in Figure 3, the cost required to fix an error later in the life cycle increases exponentially: it costs 5-10 times more to repair errors during the coding phase, and 100-200 times more during the maintenance phase, than during the requirements phase [16].

       Figure 3. Relative Cost of Fixing Errors in the Project Lifecycle

    The idea behind the relation between error proneness and maintenance is that the maintainer spends a lot of financial and human resources to identify and correct errors, and this means lower productivity.

    Mainly, a complex system requires frequent maintenance. Software maintenance challenges include understanding the system, considering the side effects of changes, and testing the performed changes. Frequent changes may reduce interest in testing and surely lower product quality. On the other hand, repeated testing wastes resources, and the resulting losses make for low productivity.

    Therefore, lower product complexity yields higher maintainability. According to many quality models, maintainability is a determining factor of the system.

    Finally, we should consider that quality and productivity have a close relationship with each other. If we focus too hard on productivity, we may improve our efficiency and lower our project costs, but this gain is worthless if we are not building quality systems that meet the demands of our customers. Similarly, if we aim to create the perfect system, we may lose control of costs. Moreover, the time it takes to improve and correct the system may cause late delivery of the product.

                  VI.    CONCLUSION AND FUTURE WORK

    Software quality and productivity depend on several factors, such as on-time delivery, staying within budget, and fulfilling users' needs. Software complexity is one of the most important indicators affecting software quality and productivity. To achieve higher quality and better productivity, software complexity should be controlled from the initial phases of the SDLC.

    In this article, with emphasis on the Requirements Engineering process, we have analyzed the influential factors in software complexity, particularly in the first phase of software development, and provided a model. We have also proposed a model of complexity results. These models can be used as a roadmap to assist managers in identifying complexity factors and avoiding them. In addition, they make it possible to know the impact of complexity on important characteristics of the system.

    In future work, we will complete the model of complexity factors and provide a requirement-based metric. This metric is extracted from all the factors mentioned in this article. By using both, we can measure and control software complexity well before the actual implementation and design, thus saving cost and time, especially in the maintenance phase.

                               REFERENCES
[1]  Zuse, H., Software Complexity: Measures and Methods, Berlin: Walter de Gruyter & Co, 1991
[2]  Eberlein, A., Requirements Acquisition and Specification for Telecommunication Services, PhD Thesis, University of Wales, Swansea, UK, 1997
[3]  The Chaos Report, The Standish Group International, http://www.standishgroup.com/sample_research/index.php, 1995
[4]  Zave, P. and Jackson, M., Four Dark Corners of Requirements Engineering, ACM Transactions on Software Engineering and Methodology, pp. 1-30, 1997
[5]  Brooks, F.P., No Silver Bullet: Essence and Accidents of Software Engineering, IEEE Computer, pp. 10-19, April 1987
[6]  IEEE Std. 830-1984, IEEE Guide to Software Requirements Specifications, 1984
[7]  Stevens, W.P., Myers, G.J., and Constantine, L.L., "Structured Design", IBM Systems Journal, vol. 13, no. 2, 1974, pp. 115-139




[8]  Fenton, N. and Pfleeger, S., Software Metrics: A Rigorous and Practical Approach, London: International Thomson Computer Press, 1996
[9]  McCabe, T.J., "A Complexity Measure", IEEE Transactions on Software Engineering, pp. 308-320, 1976
[10] Henry, S. and Kafura, D., Software Structure Metrics Based on Information Flow, IEEE Transactions on Software Engineering, pp. 510-518, 1981
[11] Halstead, M., Elements of Software Science, Amsterdam: Elsevier, 1977
[12] Kushwaha, D.S. and Misra, A.K., Improved Cognitive Information Complexity Measure: A Metric that Establishes Program Comprehension Effort, ACM SIGSOFT Software Engineering Notes, vol. 31, no. 5, September 2006
[13] Toomey, L., Smoliar, S., and Adams, L., Trans-Pacific Meetings in a Virtual Space, FX Palo Alto Labs Technical Reports, 1998
[14] Curtis, B., Sheppard, S., and Milliman, P., Third Time Charm: Stronger Prediction of Programmer Performance by Software Complexity Metrics, in Proceedings of the 4th International Conference on Software Engineering, pp. 356-360, 1979
[15] Corbi, T.A., "Program Understanding: Challenge for the 1990s", IBM Systems Journal, pp. 294-306, 1989
[16] Wieringa, R.J., Requirements Engineering: Frameworks for Understanding, John Wiley and Sons, 1995

                          AUTHORS PROFILE
Ghazal Keshavarz received her B.A. degree in Computer Science & Engineering from Shiraz Technical University, Shiraz, Iran, in 2006. She is currently pursuing an M.Sc. in Computer Science & Engineering at Islamic Azad University (Science and Research Branch), Tehran, Iran, under the guidance of Dr. Modiri. She is presently working on a requirement-based complexity metric and a software quality and complexity model.

Dr. Nasser Modiri received his M.Sc. and PhD in Electronics Engineering from the University of Southampton (UK) and the University of Sussex (UK). He is an Assistant Professor in the Department of Computer Engineering at Islamic Azad University (Zanjan, Iran).

Dr. Mirmohsen Pedram is an Assistant Professor in the Department of Computer Engineering at Tarbiat Moallem University (Karaj, Iran).








A Hierarchical Overlay Design for Peer to Peer and
                 SIP Integration
              Md. Safiqul Islam #1 , Syed Ashiqur Rahman #2 , Rezwan Ahmed ∗3 , Mahmudul Hasan #4
                      #
                          Computer Science and Engineering Department, Daffodil International University
                                                     Dhaka, Bangladesh
                                                1
                                                     safiqul@daffodilvarsity.edu.bd
                                                2
                                                    ashiq797@daffodilvarsity.edu.bd
                                                    4
                                                      mhraju@daffodilvarsity.edu.bd
                                          ∗
                                              American International University - Bangladesh
                                                           Dhaka, Bangladesh
                                                         3
                                                             a.rezwan@aiub.edu


   Abstract—Peer-to-Peer Session Initiation Protocol (P2PSIP) is the upcoming migration from the traditional client-server based SIP system. A traditional centralized server-based SIP system is vulnerable to several problems, such as performance bottlenecks and a single point of failure. Integrating a Peer-to-Peer (P2P) system with the Session Initiation Protocol (SIP) will therefore improve the performance of a conventional SIP system, because a P2P system is highly scalable, robust, and fault tolerant due to its decentralized manner and the self-organization of the network. However, the P2PSIP architecture faces several challenges, including the trustworthiness of peers, resource lookup delay, Network Address Translation (NAT) traversal, etc. This paper focuses on understanding the needs of integrating P2P and SIP. It also reviews the existing approaches to identify their advantages and shortcomings. Based on the existing approaches, it proposes a layered architecture to address the major challenges introduced by P2PSIP.

                       I. INTRODUCTION

   The Session Initiation Protocol (SIP) is a signaling protocol [1] standardized by the IETF. It is also the default standard protocol for VoIP, and the majority of VoIP development is currently based on SIP. SIP is used to establish, modify, or tear down a multimedia session. Most VoIP systems rely on a fixed set of SIP servers, for which reason they suffer from performance bottlenecks, a single point of failure, and Denial of Service (DoS) attacks. On the other hand, a Peer-to-Peer (P2P) system [2] is a popular technology that does not rely on central control and is widely used for resource sharing. As there is no centralized server in a P2P system, such a system has greater robustness, scalability and fault tolerance. If SIP can be made to work over P2P systems, it will improve the performance of traditional SIP systems and eliminate the problems of using centralized SIP servers. Integrating P2P technologies and SIP introduces several challenges, such as resource lookup delays, node heterogeneity, NAT traversal, and the trustworthiness of peers, which have to be addressed before its advantages can be enjoyed.

   This paper describes the need to integrate P2P and SIP technologies and examines the cost-benefit of P2PSIP over a traditional fixed set of SIP servers. Further, it discusses how P2P can be integrated with a traditional SIP system. We also examine some existing approaches to achieving P2PSIP. The remainder of the paper is organized as follows. Section II gives an overview of traditional SIP systems and the terminology involved. P2P technology is described in Section III. P2PSIP is introduced in Sections IV and V. Section VI details the current approaches to P2PSIP. In Section VII, a proposal for the integration of P2PSIP is introduced based on the existing approaches. Finally, Section VIII concludes the paper.

              II. SESSION INITIATION PROTOCOL (SIP)

   SIP is an application-layer control protocol for initiating, terminating and modifying multimedia sessions (for example video, voice, instant messaging, online games and multimedia conferences). To establish a session, the traditional PSTN requires SS7 [3] signaling; in IP-based telephony, the signaling protocol is SIP. However, the Session Description Protocol (SDP) [4] and the Real-Time Transport Protocol (RTP) [5] must be used together with SIP to provide a complete IP telephony system. Traditional SIP architecture follows the client-server model. SIP servers are classified into proxy, registrar, and redirect servers [1]. A proxy server is an intermediate entity that can act as a server to accept a SIP request or as a client to forward a SIP request. A redirect server performs redirection of SIP requests. A registrar server accepts REGISTER requests from clients and maintains location information in order to support mobility. SIP user agents are classified into user agent clients (UAC) and user agent servers (UAS). A user agent client [1] is a SIP entity that creates a new request and uses client state machinery to send it. A user agent server is a SIP entity that receives SIP requests on behalf of users and responds to them.

   Each user agent is identified by a SIP uniform resource identifier (URI), for instance sip:username@somedomain.com. In order to initiate a session with another user, the caller first needs to know the SIP URI of that user. A caller can either send an INVITE request to a locally configured SIP server






or directly send an INVITE to the IP address and port of the user's address. A user agent registers its location with the registrar server before initiating a session.

                 III. PEER-TO-PEER (P2P)

   Peer-to-peer (P2P) technology is, by definition, a mesh network, as opposed to the star network of the client/server model. In a peer-to-peer network, all nodes act simultaneously as client and server. Some of the advantages of P2P systems are:
   • Scalability
   • Robustness
   • No single point of failure
   There are two types of P2P systems [6]: structured and unstructured. Structured networks impose techniques to tightly control the data placement and topology within the network, and currently only support search by identifier. Unstructured networks, on the other hand, rely on flooding techniques. In many scenarios, the increased search efficiency makes structured networks preferable to the widely deployed unstructured networks. The most widely used structured P2P system is the Distributed Hash Table (DHT). There are several flavors of DHT, each with some advantages over the others, and it is very important to choose a proper DHT algorithm to obtain good performance from SIP running on P2P.

   A DHT is a decentralized and distributed system in which all peer nodes and resources are identified by unique keys, and a hash table is provided for efficient location and retrieval operations. The most popular DHT algorithms are Chord [7], Bamboo [8], Pastry [10], and Tapestry [11]. Most P2PSIP architectures use the Chord algorithm to maintain the P2P overlay. The logical structure of Chord is ring-shaped, where each node is identified by a numeric identity. For each peer, the peer with the nearest lower identifier is called its predecessor, and the peer with the nearest higher identifier is its successor. Figure 1 illustrates a Chord ring where node 23 is responsible for objects O6 to O22. Each node maintains a finger table that contains information on half of the nodes clockwise from that node. Each node is responsible for the objects whose associated keys lie between the node's own ID and its predecessor's ID.

   Like Chord, Pastry is another second-generation large-scale P2P routing network [10]. Pastry forms a self-organized, robust overlay network over the Internet; the major challenge is again to devise an efficient routing algorithm. In a Pastry network, each peer or node has a unique numeric 128-bit identifier, which is assigned randomly when the peer joins the network. Each peer forms the overlay network on top of the hash table and maintains a leaf set, a routing table and a neighborhood list, organized based on the existing nodes of the network; this self-organization is very similar to the Chord algorithm, except that Pastry also updates its routing table. The leaf set contains the L/2 closest nodes, whereas the routing table contains prefix-based identifiers, where each node shares a prefix with the data key. There is also a unique identifier for each key. The algorithm routes a message to the node whose identifier is numerically closest to the key. During routing, the sender node sends the message to the node whose identifier's prefix is at least one digit longer than the sender node's; if no such node exists, it forwards the message to another node with the same prefix whose identifier is numerically closer to the data key. The expected number of routing steps is O(log N). The Pastry method provides a more flexible mechanism than Chord, because the successor under Pastry identifiers is not so strictly defined and Pastry adjusts the nodes in its routing table.

   The Bamboo DHT algorithm conceptualizes the namespace as a circle, like the Chord DHT, which means a peer is always located next to the peer with the largest possible Peer-ID [9]. Unlike Chord, but like Pastry, Bamboo uses prefix routing to converge on the peer responsible for the search key. It uses the Pastry geometry, where the term geometry refers to the pattern of neighbor links in the DHT, independent of the routing and neighbor-management algorithms used. The Bamboo algorithms are more incremental than Pastry's. In bandwidth-limited environments, the Bamboo algorithm tolerates continuous churn in membership and better accommodates changes in large memberships in the DHT.

         Fig. 1. Chord Ring (nodes N4, N23 and N35; N4 is the predecessor and N35 the successor of N23)

   A node searches for a target node by looking in its finger table for the node nearest to the target. Since each node knows about nearby nodes, the target node will be discovered after a repeated number of such searches. Choosing the proper DHT algorithm will improve the performance of P2PSIP. Table I compares the features of different DHT algorithms, adapted from [12]; from this list a suitable algorithm for a P2PSIP implementation could be chosen.
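The Chord responsibility rule described above, namely that a key belongs to the first node clockwise at or after it, can be sketched in a few lines. The node identifiers below follow Figure 1, while the `chord_id` helper and the ring size are our own illustrative assumptions:

```python
import hashlib

RING_BITS = 6                 # toy ring: identifiers 0..63
RING_SIZE = 2 ** RING_BITS

def chord_id(name: str) -> int:
    """Hash a node address or resource name onto the identifier ring."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % RING_SIZE

def successor(node_ids, key):
    """Peer responsible for `key`: the first node clockwise at or after it."""
    for n in sorted(node_ids):
        if n >= key:
            return n
    return min(node_ids)      # wrap around past the largest identifier

nodes = [4, 23, 35]           # N4, N23, N35, as in Figure 1
print(successor(nodes, 10))   # 23: objects between N4 and N23 belong to N23
print(successor(nodes, 40))   # 4: wrap-around past the largest identifier
```

The finger table only accelerates this search to O(log N) hops; the responsibility rule itself is the linear scan above.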






                                                             TABLE I
                  COMPARISON OF DIFFERENT DHT ALGORITHMS, ADAPTED FROM [12]

                    Chord           CAN             Pastry          Bamboo          Tapestry        Kademlia
 Lookup             Recursive,      Recursive,      Recursive,      Recursive,      Recursive,      Iterative
 methods            semi-recursive, semi-recursive, semi-recursive, semi-recursive, semi-recursive
                    iterative       iterative       iterative       iterative
 Parallel           Not suitable    No              Not suitable    Yes (on         No              Yes
 lookups                                                            iterative)
 Complexity         Simple          Simple          Quite complex   Quite complex   Not complex     Simple
 Bandwidth          Moderate        Moderate        High            Moderate        Quite high      Moderate
 consumption
 Node join and      Quite simple    Very simple     Complex join    Quite simple    Complex join    Simple
 departure

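To connect the DHT machinery compared above to SIP, the sketch below shows the core idea of a distributed location service: hash a SIP address-of-record onto the identifier space and store the registration at the responsible peer. The peer IDs, helper names and ring size are entirely our own illustrative assumptions, not taken from any P2PSIP specification:

```python
import hashlib

RING_SIZE = 2 ** 16           # toy identifier space

def dht_key(aor: str) -> int:
    """Map a SIP address-of-record onto the DHT identifier space."""
    return int(hashlib.sha1(aor.encode()).hexdigest(), 16) % RING_SIZE

def responsible_peer(peer_ids, key):
    """Chord-style rule: first peer clockwise at or after the key."""
    later = [p for p in sorted(peer_ids) if p >= key]
    return later[0] if later else min(peer_ids)

# A distributed registrar: each peer stores the mappings it is responsible for.
peers = [1200, 17000, 33000, 60000]
key = dht_key("sip:alice@example.com")
store = {responsible_peer(peers, key): {key: "sip:alice@192.0.2.7:5060"}}
```

A caller resolving sip:alice@example.com would hash the same AoR, route to the responsible peer, and retrieve the stored contact URI, replacing the centralized registrar lookup.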

       IV. PEER-TO-PEER SESSION INITIATION PROTOCOL (P2PSIP)

   P2PSIP is the combination of a P2P network and SIP, where the traditional fixed set of servers is replaced by a distributed mechanism; a DHT is one of the possible distributed mechanisms. All of the address-of-record to contact URI mappings are distributed among the peers in the P2P overlay. Currently, the P2PSIP working group of the Internet Engineering Task Force (IETF) has defined the terminology and concepts in [13] and use cases in [14]. Moreover, this working group is trying to standardize the P2PSIP peer protocol. P2PSIP can be implemented in two ways: one is SIP on top of P2P and the other implements P2P over SIP. SIP on top of P2P uses a P2P protocol to implement the SIP location service, while the other approach uses SIP messages to transport P2P traffic. A traditional P2P node-searching mechanism uses flooding to locate a node. However, to find a target node for a multimedia session, P2PSIP should avoid flooding.

       V. PINPOINTING CHALLENGES FOR P2PSIP

   In the following section, we discuss some common requirements for implementing P2PSIP. In an IETF draft, Bryan et al. define a set of requirements for P2PSIP [15]; we have taken some important points from this paper. First, P2PSIP peers should be capable of performing operations such as joining, leaving, storing information on behalf of the overlay, or transporting messages. Secondly, the peers must provide the functions offered by a traditional SIP network. For example, P2PSIP should support the establishment, modification, and termination of multimedia sessions. Thirdly, the implementation should not prevent the use of existing protocols like SSL or TLS as used in the P2P or SIP network. NAT and firewall traversal should also be supported. Finally, the functionality of the fixed set of centralized SIP servers should be distributed over the peers. Some of the other challenges are described in the following sections.

A. Resource Lookup Delay
   Locating a peer or a resource in a P2PSIP network takes much more time than in a traditional SIP-based network. Reducing this delay is a significant challenge for P2PSIP.

B. Network Address Translation (NAT) Traversal
   Most P2P nodes may be behind a NAT or firewall. There must be a relay with a public IP address between them in order to establish communication with other peers. This is one of the most important challenges for a P2PSIP network.

C. Node Heterogeneity
   In order to maintain the scalability and service availability of P2PSIP, node heterogeneity should be handled appropriately. Node heterogeneity can be differences in bandwidth, CPU, storage, and uptime of the peers.

D. Security Issues and Trustworthiness of Peers
   Security of a distributed P2P communication system is another of the major challenges. Security issues concern user identification, authentication and trustworthiness.

       VI. EXISTING APPROACHES

   Integration of P2P and SIP will improve the performance of a traditional SIP system, as P2P has several advantages. There are several requirements that should be met to integrate P2P and SIP; those requirements have already been described in Section V.

   Singh and Schulzrinne propose a hybrid architecture for the integration of P2P and SIP that, besides P2P scalability and reliability, introduces two additional advantages: interoperability with existing SIP servers and no maintenance cost [16]. Chord was used as the underlying DHT algorithm. Their architecture is based on the concept of super nodes




                                                                  96                               http://sites.google.com/site/ijcsis/
                                                                                                   ISSN 1947-5500
                                                     (IJCSIS) International Journal of Computer Science and Information Security,
                                                     Vol. 9, No. 6, June 2011


and ordinary nodes, where super nodes are powerful nodes with high bandwidth, plenty of CPU power and memory, long uptimes, and a public IP address compared to the ordinary nodes. These authors have also implemented a P2PSIP adaptor [17] which allows existing or new user agents to connect to the P2PSIP network. NAT/firewall detection is based on the ICE [18] algorithm, after which a super node is used as a relay to help ordinary nodes establish and participate in calls. Node heterogeneity is captured by the distinction between super nodes and ordinary nodes. Offline messaging services are provided by combining the storage of the sender and intermediate DHT nodes [19].

   Their architecture fails to reduce call setup delay, which is higher than in traditional SIP networks. Peer trustworthiness and security issues are not described in their report. The paper does not propose any super node selection mechanism. Node heterogeneity is introduced, but they do not describe what happens if a node more powerful than the current super node joins the existing network. Ordinary nodes always have to pay a high maintenance cost to maintain their Chord finger table and periodically send refresh messages to update their predecessors and successors.

   SOSIMPLE [20] is a P2PSIP architecture where nodes are organized using the Chord DHT algorithm. SIP messages with a newly defined header are used to maintain the DHT, register users, locate resources and establish sessions. Based on the registration process, the SOSIMPLE architecture has two levels of REGISTER operations: user registration, which is the traditional use of registration, and node registration, which is for DHT operation. The SOSIMPLE paper mentions several security and user authentication mechanisms, such as user certificates and email verification, that can be implemented on their architecture. However, the paper fails to describe node heterogeneity, bootstrap node selection, node maintenance cost for DHT operation, and resource lookup delay.

   In order to alleviate the problems in [16] and [20], such as call setup delay, node heterogeneity, and the high maintenance cost paid by ordinary nodes, Le and Kuo [21] propose a hierarchical and breathing overlay based network. In a hierarchical overlay, nodes form different sub-overlays based on node heterogeneity. Session setup delay is reduced by introducing two types of lookups based on knowledge of the destination sub-overlay: oriented lookup and un-oriented lookup. In an oriented lookup, a node relays its request to its father node to establish the session. In the case of an un-oriented lookup, a father node receives the request from a son node and this (and, if necessary, other) upper-level father nodes perform a DHT lookup in their sub-overlays. Node heterogeneity is handled by forming the hierarchical overlay, and the DHT maintenance cost is lowered. The paper fails to address peer trustworthiness and other security issues. The following subsections will describe their three techniques briefly.

   Dhara, et al. [22] propose a layered architecture for P2PSIP that separates the P2P related issues from the underlying voice or transport layer. The advantage of this system is that it allows dynamic changes of overlays based on the requirements and properties of users and devices. Their paper focuses on a layered architecture that allows the choice of P2P overlay based on specific parameters. The architecture is a Public Key Infrastructure (PKI) [23] based trust management system where SIP is the transport protocol and Chord is the DHT algorithm for the P2P structure. They mention that the device overlay will focus on NAT and firewall traversal, but no such implementation or methods were introduced. The paper fails to address the DHT maintenance cost on an ordinary node and bootstrap node selection. Besides the above approach, the Scalable Application-Layer Mobility Protocol (SAMP) [24] deals with session setup latency by introducing two optimization techniques: Hierarchical Registration (HR) and Two-Tier Caching (TTC). In HR, when the mobile node (MN) is in a foreign domain, it registers its Care-of Address with an anchor SIP server instead of the home SIP server in order to reduce session setup delay. On the other hand, TTC introduces a two-phase cache lookup, where the first phase is based on the MN's cache and the second phase is based on the anchor SIP server's cache. If the target is not found in either cache, then a traditional P2P lookup occurs. Hierarchical P2PSIP [25] is introduced to address the connectivity problem of heterogeneous P2P overlays and the overhead of the extra SIP messages needed to maintain the overlays.

   Another approach to P2PSIP is described in [26]. This approach deals with the manageability of a P2P system. Here the system architecture is divided into three layers: the signal control layer, the management layer, and the media transportation layer. In this architecture, signaling is used to initiate the system, maintain the system topology, and search for resources. The management layer deals with grouping, playing, and media uploading management. Inter-domain registration, resource locating, and DHT creation and maintenance are the functions of the SIP signal control layer. The media transport layer deals with storage management and media uploading. In Cooperative SIP (CoSIP) [27], both the server-based and P2PSIP networks work together.

       VII. LAYERED ARCHITECTURE FOR P2PSIP

   In this section, we propose a new P2PSIP architecture in order to address all of these challenges. Our P2PSIP architecture is a two-layered architecture based on [25], where the top layer consists of powerful super nodes and the bottom layers consist of ordinary nodes, similar to the HiLO and LoLO peers of [25]. Figure 2 shows our proposed architecture for P2PSIP. The top overlay uses Bamboo as its underlying algorithm. The reason for choosing Bamboo is its parallel lookup capability.
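The two-phase cache lookup of TTC described above can be sketched schematically, with plain dictionaries standing in for the MN's cache, the anchor SIP server's cache, and the DHT. The function and variable names here are ours for illustration, not SAMP's:

```python
def resolve(target, mn_cache, anchor_cache, dht_lookup):
    """Two-Tier Caching: try the mobile node's cache, then the anchor
    SIP server's cache, and only then fall back to a slow P2P lookup."""
    if target in mn_cache:                          # phase 1: MN's own cache
        return mn_cache[target], "mn-cache"
    if target in anchor_cache:                      # phase 2: anchor server's cache
        mn_cache[target] = anchor_cache[target]     # populate the local cache
        return anchor_cache[target], "anchor-cache"
    contact = dht_lookup(target)                    # fallback: traditional P2P lookup
    if contact is not None:
        anchor_cache[target] = mn_cache[target] = contact
    return contact, "dht"

# Example with toy data (addresses are placeholders):
mn, anchor = {}, {"sip:bob@example.com": "sip:bob@198.51.100.7"}
dht = {"sip:carol@example.com": "sip:carol@203.0.113.9"}
print(resolve("sip:bob@example.com", mn, anchor, dht.get))    # served from the anchor cache
print(resolve("sip:bob@example.com", mn, anchor, dht.get))    # now served from the MN cache
print(resolve("sip:carol@example.com", mn, anchor, dht.get))  # cache miss, falls back to the DHT
```

Only when both caches miss does the slow P2P lookup run, which is how TTC reduces session setup latency.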






Several challenges remain for our proposed approach. However, our proposed architecture has the following properties:

[Fig. 2. Layered Architecture of P2PSIP: intra-domain overlays (Chord, Pastry, CAN, Tapestry) of super nodes and ordinary nodes, connected through an inter-domain overlay that uses Bamboo as the underlying algorithm.]

A. Node Heterogeneity
   A joining node will contact the regional super node and share its capabilities with it. If the joining node has higher capabilities, it becomes the super node. Thus the super node is selected based on node heterogeneity.

B. High Maintenance Cost Problem
   As the ordinary nodes have to pay a high maintenance cost for the DHT, we use the concept of a breathing layer with a minor modification. Ordinary nodes have two states: sleep and active. After a long idle time, an ordinary node goes into sleep mode by giving a sleep indication to its super node.

C. Node Look Up
   We can use the two-tier caching scheme [24] to locate the target node, along with the parallel lookup feature. However, if the target node's information is cached in neither the super node's nor the ordinary node's cache, the super node will search using the parallel lookup feature of Bamboo in order to locate the node. This takes less time to find the super node responsible for the target node.

D. NAT Traversal Problem
   Super nodes are the most powerful nodes in terms of available bandwidth and CPU processing power, and they have a public IP address. In order to solve the NAT traversal problem, a super node can act as a relay for an ordinary node to establish a session.

E. Connectivity Problem
   Super nodes will use Bamboo to maintain the inter-domain overlay with other super nodes, with a regional algorithm like Chord or Pastry running in the intra-domain overlay. Thus, each super node will maintain two DHT tables based on two different algorithms. As a result, domains running different DHT algorithms can communicate.

F. Peer's Trustworthiness and Security Issues
   We can implement a login server module on the super node. When a peer wants to join a P2P network, it sends a SIP REGISTER message along with its public key, and the login server authenticates the node and provides it with a signed certificate. The node can then present this certificate to other peers to build trust.

       VIII. CONCLUSION

   This paper gave an overview of SIP and P2P and examined the integration of a P2P system with SIP. Moreover, it described some advantages of their integration, and the challenges and implications were also discussed. Several current approaches to P2PSIP were presented along with how they mitigate the challenges of P2PSIP. Finally, we proposed a layered architecture to address the major challenges of P2PSIP. Future work will focus on implementing our proposed architecture on a simulator.

       REFERENCES

[1] J. Rosenberg, H. Schulzrinne, G. Camarillo, A. Johnston, J. Peterson, R. Sparks, M. Handley, and E. Schooler. SIP: Session Initiation Protocol, RFC 3261, Internet Engineering Task Force, 2002.
[2] D.S. Milojicic, V. Kalogeraki, R. Lukose, K. Nagaraja, J. Pruyne, B. Richard, S. Rollins, and Z. Xu. Peer-to-Peer Computing, HP Laboratories Palo Alto, 2002.
[3] Performance Technologies, Inc. SS7 Tutorial, 2006. Available at: http://www.pt.com/tutorials/ss7/. Last visited April 2008.
[4] M. Handley, V. Jacobson. SDP: Session Description Protocol, RFC 2327, Internet Engineering Task Force, April 1998. Available at: http://www.ietf.org/rfc/rfc2327.txt
[5] H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson. RTP: A Transport Protocol for Real-Time Applications, RFC 1889, Internet Engineering Task Force, January 1996. Available at: http://www.ietf.org/rfc/rfc1889.txt
[6] D. Bryan, B. Lowekamp. Decentralizing SIP, ACM Portal, Vol. 5, Issue 2, pages 34-41, March 2007.
[7] I. Stoica, R. Morris, D. Liben-Nowell, D.R. Karger, M.F. Kaashoek, F. Dabek, and H. Balakrishnan. Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications, IEEE/ACM Transactions on Networking, page 17, 2003.
[8] S. Rhea, D. Geels, T. Roscoe, and J. Kubiatowicz. Handling Churn in a DHT, In Proc. of USENIX Annual Technical Conference, 2004.
[9] S. Rhea, B.-G. Chun, J. Kubiatowicz, and S. Shenker. Fixing the Embarrassing Slowness of OpenDHT on PlanetLab, Proceedings of USENIX WORLDS 2005, December 2005.
[10] A. Rowstron and P. Druschel. Pastry: Scalable, Distributed Object Location and Routing for Large-Scale Peer-to-Peer Systems, In IFIP/ACM International Conference on Distributed Systems Platforms, 2001.
[11] B.Y. Zhao, L. Huang, J. Stribling, S.C. Rhea, A.D. Joseph, and J.D. Kubiatowicz. Tapestry: A Resilient Global-scale Overlay for Service Deployment, IEEE Journal on Selected Areas in Communications, 2004.
[12] J. Hautakorpi, G. Camarillo. Evaluation of DHTs from the viewpoint of interpersonal communications, ACM International Conference Proceeding Series, Vol. 284, Proceedings of the 6th International Conference on Mobile and Ubiquitous Multimedia, pages 74-83, 2007.
[13] D. Bryan, P. Matthews, E. Shim, and D. Willis. Concepts and Terminology for Peer to Peer SIP, draft-ietf-p2psip-concepts-01, Nov. 2007. Expires: May 18, 2008.
[14] D. Bryan, E. Shim, and B. Lowekamp. Use Cases for Peer-to-Peer Session Initiation Protocol (P2P SIP), draft-bryan-p2psip-usecases-00.txt, July 2007. Expires: January 3, 2008.
[15] D. Bryan, S. Baset, M. Matuszewski, and H. Sinnreich. P2PSIP Protocol Framework and Requirements, draft-bryan-p2psip-requirements-00.txt, June 2007. Expires: Jan 2008.






[16] K. Singh and H. Schulzrinne. Peer-to-peer Internet Telephony Using SIP, In NOSSDAV '05: Proceedings of the International Workshop on Network and Operating Systems Support for Digital Audio and Video, pages 63-68. ACM Press, June 2005.
[17] K. Singh and H. Schulzrinne. SIPpeer: A Session Initiation Protocol (SIP)-based Peer-to-Peer Internet Telephony Client Adaptor, White paper, Computer Science Department, Columbia University, New York, NY, Jan 2005. http://www.cs.columbia.edu/~kns10/publication/sip-p2pdesign.pdf
[18] J. Rosenberg. Interactive Connectivity Establishment (ICE): A Methodology for Network Address Translator (NAT) Traversal for Offer/Answer Protocols, draft-ietf-mmusic-ice-08, 2006.
[19] K. Singh and H. Schulzrinne. Peer-to-peer Internet Telephony Using SIP, Technical Report CUCS-044-04, Department of Computer Science, Columbia University, New York, NY, Oct. 2004.
[20] D.A. Bryan, B.B. Lowekamp, and C. Jennings. SOSIMPLE: A Serverless, Standards-based, P2P SIP Communication System, In Proceedings of the 2005 International Workshop on Advanced Architectures and Algorithms for Internet Delivery and Applications (AAA-IDEA 2005), June 2005.
[21] L. Le, G. Kuo. Hierarchical and Breathing Peer-to-Peer SIP System, IEEE International Conference on Communications (ICC '07), pages 1887-1892, June 2007.
[22] K.K. Dhara, V. Krishnaswamy, S. Baset. Dynamic Peer-to-Peer Overlays for Voice Systems, Pervasive Computing and Communications Workshops (PerCom Workshops 2006), Fourth Annual IEEE International Conference, March 2006.
[23] Public Key Infrastructure (PKI) tutorial. Available at: http://www.cs.gmu.edu/~hfoxwell/EC511/pki.pdf. Last visited April 2008.
[24] S. Pack, K. Park, T. Kwon, Y. Choi. SAMP: Scalable Application-Layer Mobility Protocol, IEEE Communications Magazine, Vol. 44, Issue 6, pages 86-92, June 2006.
[25] J. Shi, Y. Wang, L. Gu, L. Li, W. Lin, Y. Li, Y. Ji, P. Zhang. A Hierarchical Peer-to-Peer SIP System for Heterogeneous Overlays Interworking, IEEE Global Telecommunications Conference, pages 93-97, November 2007.
[26] H. Jie, H. Yongfeng, L. Xing. MSPnet: Manageable SIP P2P Media Distribution System, Journal of Electronics (China), Volume 24, November 2007.
[27] A. Fessi, H. Niedermayer, H. Kinkelin, G. Carle. A Cooperative SIP Infrastructure for Highly Reliable Telecommunication Services, IPTCOMM, Proceedings of the 1st International Conference on Principles, Systems and Applications of IP Telecommunications, July 2007.








     Evaluation of CPU Consumption, Memory
 Utilization and Transfer Time Between Virtual
  Machines in a Network by Using HTTP and FTP
                    Techniques

              Igli TAFA, Elinda KAJO, Elma ZANAJ, Ariana BEJLERI, Aleksandër XHUVANI

              Polytechnic University of Tirana, Information Technology Faculty
                             Computer Engineering Department
                                       Tiranë, Albania
        itafaj@gmail.com, e_kajo@yahoo.com, ezanaj@gmail.com, arianabejleri@yahoo.com,
                                                axhuvani@yahoo.com

  Abstract: In this paper we evaluate the transfer time, memory utilization and CPU consumption between virtual machines in a network by using FTP and HTTP benchmarks. As a virtualization platform for running the benchmarks we have used the Xen hypervisor in para-virtualization mode. Virtual machine technology offers benefits such as live migration, fault tolerance, security, resource management, etc. The experiments performed show that virtual machines above the hypervisor consume more CPU and memory and have longer transfer times than a non-virtualized environment.

  Keywords: Transfer Time, Memory Utilization, CPU Consumption, Virtual Machines, Xen Hypervisor.

     I. INTRODUCTION

  Virtual machine technology offers a lot of benefits, as shown in previous research [1],[2],[3],[4],[5],[6],[7],[8], such as live migration, fault tolerance, security, resource management and reduced energy consumption. Some virtual machine technologies are based on a software layer called a hypervisor [9]. There are three main types of virtualization [10]: full virtualization, OS virtualization and para-virtualization. The para-virtualization approach gives more flexibility than the others. Based on this approach we can use ESX Server [11] or Xen. Because Xen is free, open source, and implements the ballooning method [12], we have used this hypervisor.

  Nevertheless, virtualization technology has some "black holes"; for example, the hypervisor introduces a slight delay during transfers from one machine to another [13]. CPU consumption is also higher in machines that include the hypervisor than in those without it. Another problem is physical memory utilization and data overhead during the live migration phase. Some researchers [13], [14] presented methods of memory overbooking and compression of this memory in the virtual machine, in order to improve memory utilization and migration performance.

  In this paper we analyze the transfer time, CPU consumption and memory utilization between virtual machines and physical machines by using FTP [15] and HTTP requests [13]. All results are presented in the respective tables.

  This paper is organized as follows. Section II describes the experimental architecture. Section III presents the experimental evaluation. Section IV presents conclusions and outlines areas of future work.

     II. EXPERIMENTAL ARCHITECTURE

  Figures 1 and 2 present the basics of the experimental architecture. In Figure 1 there are 2 computers which are connected with a UTP cat 7 cable using the twisted-pair technique. Communication between the two computers is full duplex. Both computers can communicate with each other over a fast-Ethernet network interface at 100/1000 Mbit/sec. In each computer we have set up the virtual machine environment with Xen 4.1 and CentOS 5.5 as the Dom0 operating system.
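As an aside, a transfer-time measurement of this kind can be scripted. The following is a minimal Python sketch using the standard ftplib module; the host address, credentials, and file names are illustrative placeholders, not the authors' actual setup or tooling:

```python
import time
from ftplib import FTP

def timed_ftp_download(host, user, password, remote_file, local_file):
    """Download remote_file over FTP and return the elapsed wall-clock
    time in seconds, measured from connect to transfer completion."""
    start = time.monotonic()
    with FTP(host) as ftp:
        ftp.login(user, password)
        with open(local_file, "wb") as out:
            # RETR streams the file; each received block is written to disk.
            ftp.retrbinary("RETR " + remote_file, out.write)
    return time.monotonic() - start

# Placeholder values for illustration only:
# elapsed = timed_ftp_download("192.0.2.10", "user", "secret",
#                              "XP.ISO", "/tmp/XP.ISO")
# print("transfer time: %.1f s" % elapsed)
```

Repeating such a measurement for each topology (two VMs on one host, VMs on different hosts, two physical machines) yields directly comparable wall-clock transfer times.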







Fig. 1. Communication between 2 physical machines with twisted pair. Above those hosts are guest virtual machines.

In Figure 2 we have installed 3 computers connected with a Gigabit switch. The routing topology is a bus. Communication is full duplex. We used a managed Gigabit Cisco switch, but a simple switch could be used too.

Fig. 2. Three computers connected with a Gigabit switch. Above the host computers, virtual machines are set up.

We evaluate the transfer time with an FTP server in the following scenarios:

- Evaluation of transfer time between 2 VMs on the same host
- Evaluation of transfer time between 2 VMs on different hosts
- Evaluation of transfer time between 2 physical machines connected by twisted-pair cable
- Evaluation of transfer time between 3 physical machines connected by a Gigabit switch

Initially we used 2 physical machines which support 2 virtual machines, using the para-virtualization approach (Xen 4.1). In both machines we have installed CentOS 5.5 as Dom0. On the first machine we have installed Scientific Linux 6.0 and Ubuntu 10.04 Server in DomU1 and DomU2 respectively. In the second machine, Ubuntu 10.04 Server is installed in DomU1. First we want to test the transfer time from a client to a server between 2 VMs in the same physical host by using the FTP technique. We will repeat the test using 2 virtual machines in different physical hosts, and finally we will evaluate this time between 2 physical machines connected by the twisted-pair technique. We want to transfer an ISO image (XP.ISO with SP2 = 557 MB). On the machine with Scientific Linux 6.0 we run an FTP client and on Ubuntu 10.04 an FTP server. To realize the transfer of the XP.ISO file from one machine to another we have used the Samba FTP tool (which is part of Scientific Linux or Ubuntu Server). We measure the file transfer time from the start moment at the source machine to completion at the destination machine.
The experiment is repeated again with the Scientific Linux 6.0 client machine as DomU1 in host 1 and Ubuntu 10.04 Server as DomU1 in host 2 which is
The architecture of all the machines is X86 - 64 bit,                 used as FTP server.
RAM 4 GB. CPU Quad-Core, supported with VT
and Hyper-threading technology.                                       Finally we will evaluate the time transferred between
                                                                      2 physical machines, respectively host 1 and host 2.
     III. EXPERIMENTAL EVALUATION:                                    Host 2 will serve as FTP server and Host 1 as a FTP
                                                                      client. The results are presented in table 1:
The evaluation is separated in three phases:
                                                                        TABLE1 THE EVALUATION OF TIME TRANSFERRED OF
     Evaluation of transfer time for FTP and Web                           XP ISO IMAGE BY USING SAMBA FTP TOOL
      servers
                                                                      Time     transferred   Time      transferred   Time    transferred
     Evaluation of CPU consumption for FTP and                       between 2 VM on        between 2 VM on         between 2 Physical
      Web servers                                                     the same Host          different Hosts         Hosts
                                                                      (Host 1)
     Evaluation of memory utilization for FTP and
                                                                      48 sec                 86 sec                  36 sec
      Web serves




                                                                101                                   http://sites.google.com/site/ijcsis/
                                                                                                      ISSN 1947-5500
                                                   (IJCSIS) International Journal of Computer Science and Information Security,
                                                   Vol. 9, No. 6, June 2011




In Figure 2 we have three computers connected by a Gigabit switch. All computers communicate with each other over Fibre Channel. This communication increases network performance (as is well known, Fibre Channel offers more bandwidth and higher speed than UTP cable). The third computer is a clone of the second computer. Now we want to evaluate the transfer time between two VMs on different hosts (i.e., between DomU1 in host 1 and DomU1 in host 3). We then transfer the image from computer 1 to computer 3. The results are presented in Table 2.

 TABLE 2. TRANSFER TIME OF THE XP ISO IMAGE USING SAMBA FTP

Transfer time between 2 VMs on      Transfer time between 2 physical
different hosts (DomU1 in Host 1    hosts (Host 1 and Host 3)
and DomU1 in Host 3)
83 sec                              32 sec

Let us analyze Table 1 and Table 2.

In Table 1 the transfer time between two VMs on the same host is small. This is because the transfer of the 557 MB image file from one VM to the other is performed over the interfaces of the same computer architecture (in reality we are inside the same computer). If we look at the transfer time between two VMs on different hosts in Table 1, the total transfer time is 86 sec. The reasons are:

•  The transfer speed of the file between two computers over the network is slower than over the internal interfaces of the computer architecture (ISA interface).
•  The communication medium between the hosts is UTP Cat 7, which is slower than other media (e.g., fiber, which is used in Table 2).

Table 1 also shows that the transfer time between the two physical hosts is 36 sec (< 86 sec and < 48 sec). In both other cases the main reason is the delay introduced by the hypervisor.

In Table 2 the times decrease slightly. The reason is the communication medium: in Table 2 we use a Gigabit switch with Fibre Channel communication, while in Table 1 we use only UTP cable.

Evaluation of CPU consumption in the FTP server

We want to evaluate the CPU consumption of the FTP server, that is, the percentage of CPU dedicated to our experiment. It is presented as average values measured during the tests of Figure 1 and Figure 2. To monitor CPU consumption we used the xentop command, the statistics in /proc, and the System Monitor in the System Administration menu. Both give an explicit view of CPU consumption for all processes running on the computer, from which we calculate the average value (DomU1 + DomU2 + ...)/n, where n is the number of virtual machines (the same applies to physical machines). To calculate the total CPU consumption we built a script in C which applies the formula:

running-process rate × number of active processes + sleeping-process rate × number of sleeping processes = total CPU consumption        (1)

  TABLE 3. AVERAGE CPU CONSUMPTION DURING THE TRANSFER OF 557 MB, BASED ON FIGURE 1 AND FORMULA (1)

CPU consumption between   CPU consumption between    CPU consumption between
2 VMs on the same host    2 VMs on different hosts   2 physical hosts
(Host 1)
61.4%                     62.2%                      55.1%

  TABLE 4. AVERAGE CPU CONSUMPTION DURING THE TRANSFER OF 557 MB, BASED ON FIGURE 2 AND FORMULA (1)

CPU consumption between 2 VMs on    CPU consumption between 2 physical
different hosts (DomU1 in Host 1    hosts (Host 1 and Host 3)
and DomU1 in Host 3)
61.45%                              55.16%

As Table 3 shows, CPU consumption between two VMs is higher than between two physical machines (61.4% > 55.1%). The reason is that part of the CPU is consumed in maintaining the hypervisor. CPU consumption is not affected by the network communication between the computers; this is why in Table 4 CPU consumption has practically the same values as in Table 3.








Evaluation of memory utilization in the FTP server

We want to evaluate the physical memory utilization while transferring the .iso image from one virtual machine to another inside a physical host. Then we want to evaluate the physical memory utilization while the .iso image is transferred between two virtual machines on different physical hosts. We had used Mem_Access previously [13], and here we use this tool to evaluate memory utilization during the file transfer. We have wrapped the tool in a script written in C, called MemC. For every 10 MB transferred from the server machine to the client machine, the script invokes the tool and calculates the memory utilization. The result is stored as a record in a MySQL database installed on the server machine (mysql Ver 12.21 Distrib 4.0.14, for pc-linux). The average memory utilization is the total sum of the records divided by the number of records. As before, we use the Samba FTP server to transfer data from one machine to the other. Finally we repeat the experiment using a transfer between two physical hosts (Fig. 1). Table 5 gives the results for memory utilization between two virtual machines.

TABLE 5. PHYSICAL MEMORY UTILIZATION DURING THE .ISO FILE TRANSFER USING SAMBA FTP, BASED ON FIGURE 1

Average memory utilization   Average memory utilization     Average memory utilization
between 2 VMs on the same    between 2 VMs on different     between 2 physical hosts
host (Host 1)                hosts (DomU1 in Host 1 is      (Host 1 is the FTP client and
                             the FTP client and DomU1 in    Host 2 is the FTP server)
                             Host 2 is the FTP server)
MAX = 1.06 GB of RAM         Host 1 MAX = 1.06 GB of RAM    Host 1 MAX = 687 MB of RAM
(≈41% of RAM)                Host 2 MAX = 1.08 GB of RAM    (≈17% of RAM)
                             Both (≈41% of RAM)             Host 2 MAX = 711 MB of RAM
                                                            (≈17% of RAM)

If we repeat the experiment based on Figure 2, we obtain the results in Table 6.

TABLE 6. PHYSICAL MEMORY UTILIZATION DURING THE .ISO FILE TRANSFER USING SAMBA FTP, BASED ON FIGURE 2

Average memory utilization between    Average memory utilization between
2 VMs on different hosts (DomU1 in    2 physical hosts (Host 1 and Host 3)
Host 1 and DomU1 in Host 3)
Host 1 MAX = 1.06 GB of RAM           Host 1 MAX = 687 MB of RAM (≈17% of RAM)
Host 2 MAX = 1.08 GB of RAM           Host 2 MAX = 711 MB of RAM (≈17% of RAM)
Both (≈41% of RAM)

Tables 5 and 6 show that memory utilization is not affected by the network communication, but it is affected by the presence of the hypervisor (from ≈41% down to ≈17%).

Evaluation of memory utilization and CPU consumption in the Web server

We repeat the above experiment using a Web server instead of the FTP server. Initially we perform the experiment between two VMs inside one physical machine. Then we repeat the test between two physical machines (Fig. 1).

We have installed LAMP (Apache 2, MySQL client and MySQL server) in the Web-server virtual machine and in the Web-server physical machine.

We have built another script, in C++, called MemCP, which gets information from the MemAccess benchmark for every request (from the client machine to the server machine) and from the previous script located in /proc. The script performs its calculation by adding up the results obtained in a module of the Apache 2 installation on our machine; this module is implemented in order to invoke the MemAccess tool for each request. The results obtained give the memory utilization in each virtual machine for every process located on the host computer, and the totals can be presented as percentages. Each request from the client machine to the server machine is generated with the Httperf benchmark, which issues 10 requests per second. The duration of the experiment is 1 minute. One request is equal to 10 MB, corresponding to an .html file located in /home. The results are presented in Table 7.








TABLE 7. MEMORY UTILIZATION IN THE WEB SERVER USING THE HTTPERF BENCHMARK

Memory utilization between   Memory utilization between 2 VMs    Memory utilization between 2
2 VMs on the same host       on different hosts (DomU1 in Host   physical hosts (Host 1 is the
(Host 1)                     1 is the client and DomU1 in Host   client and Host 2 is the Web
                             2 is the Web server)                server)
≈37% of RAM                  ≈37.7% of RAM                       ≈16.6% of RAM

TABLE 8. CPU CONSUMPTION IN THE WEB SERVER USING THE HTTPERF BENCHMARK

CPU consumption between   CPU consumption between 2   CPU consumption between 2
2 VMs on the same host    VMs on different hosts      physical hosts
(Host 1)
57.4%                     57.5%                       53.6%

If we compare Table 7 and Table 5, the memory utilization during the 557 MB file transfer in the FTP server is higher than the memory utilized by the HTTP requests in the Web server (41% > 37%). The reasons are:

•  FTP uses two connections in the client/server communication, one for synchronization and one for data transmission, while the Web server uses only one connection. Each request reserves a small amount of memory.
•  The hypervisor involved in the Web server during the requests from client to server has a smaller effect than in the FTP server.
•  If the computers do not use a hypervisor during the transfers, memory utilization is approximately the same.

If we compare Table 8 and Table 3, CPU consumption during the file transfer in the FTP server is higher than during the HTTP requests in the Web server. The reasons are the same as for memory utilization.

Evaluation of transfer time in the Web server

Finally, we repeat the experiment using the Httperf benchmark. We want to evaluate the time needed for 56 requests from the client machine to the Web-server machine. Every request is 10 MB, so we test under approximately the same conditions as Table 1 (56 requests × 10 MB = 560 MB in the Web case ≈ 557 MB in FTP). The results are given in Table 9.

TABLE 9. REQUEST TIME IN THE WEB SERVER USING THE HTTPERF BENCHMARK

Request time in the Web     Request time in the Web     Request time in the Web
server between 2 VMs on     server between 2 VMs on     server between 2 physical
the same host (Host 1)      different hosts             hosts
38 sec                      66 sec                      33 sec

If we compare Table 9 with Table 1, the request time for the Web server is smaller than the transfer time in FTP (38 sec < 48 sec, 66 sec < 86 sec). The time is not affected as much in the third case, where no hypervisor is used (33 sec < 36 sec).

The reasons are:

•  The Httperf requests are smaller, so they do not cause any degradation of the CPU.
•  The memory utilization of Httperf (Table 7) is smaller than the memory utilization of FTP (which uses the Samba FTP tool).

     IV. CONCLUSIONS

•  The FTP approach consumes more CPU, time and memory than the Web approach.
•  The hypervisor adds extra time in both approaches, but it is significant in the FTP approach.
•  A transfer between two physical machines offers better performance than between two virtual machines.
•  A better communication medium decreases the transfer time between machines and increases CPU performance (Fibre Channel is a better medium than UTP).

In the future we want to test the performance, CPU consumption and memory utilization using:

•  Live migration between two machines
•  Memory-compaction algorithms and their effect on memory utilization
•  The memory-ballooning approach and its effect on CPU consumption
•  An extension of these experiments to a WAN







References:

[1] Hien Nguyen Van, Frédéric Dang Tran, 2009, "Autonomic virtual resource management for service hosting platforms"
[2] Michael Cardosa, Madhukar R. Korupolu, Aameek Singh, 2007, "Shares and Utilities based Power Consolidation in Virtualized Server Environments"
[3] G. Keller, H. Lutfiyya, 2010, "Replication and Migration as Resource Management Mechanisms for Virtualized Environments"
[4] R. Ando, Zong-Hua Zhang, Y. Kadobayashi, Y. Shinoda, 2009, "A Dynamic Protection System of Web Server in Virtual Cluster Using Live Migration"
[5] Takahiro Hirofuchi, Hirotaka Ogawa, Hidemoto Nakada, Satoshi Itoh, Satoshi Sekiguchi, 2009, "A Live Storage Migration Mechanism over WAN for Relocatable Virtual Machine Services on Clouds"
[6] F. F. Moghaddam, M. Cheriet, 2010, "Decreasing live virtual machine migration down-time using a memory page selection based on memory"
[7] Wei Wang, Ya Zhang, Ben Lin, Xiaoxin Wu, Kai Miao, 2010, "Secured and reliable VM migration in personal cloud"
[8] Anton Beloglazov and Rajkumar Buyya, 2010, "Energy Efficient Resource Management in Virtualized Cloud Data Centers"
[9] Andrew Tanenbaum, 2009, "Modern Operating Systems"
[10] Espen Braastad, 2006, "Management of high availability services using virtualization"
[11] Carl Waldspurger, 2006, "Memory Resource Management in VMware ESX Server"
[12] Weiming Zhao, Zhenlin Wang, 2009, "Dynamic Memory Balancing for Virtual Machines"
[13] Jin Heo, Xiaoyun Zhu, Pradeep Padala, Zhikui Wang, 2010, "Memory Overbooking and Dynamic Control of Xen Virtual Machines in Consolidated Environments"
[14] Hai Jin, Li Deng, Song Wu, "Live Virtual Machine Migration with Adaptive Memory Compaction"
[15] Moreira, Miguel Elias M. Campista, Luís Henrique M. K. Costa, and Otto Carlos M. B. Duarte, 2007, "OpenFlow and Xen-Based Virtual Network Migration"

Authors Profile

Igli TAFA is a lecturer in the Computer Engineering Department of the Polytechnic University. He finished his Master's thesis in 2008 and is now a PhD student; his PhD topic is in the area of virtual machines.
Elinda KAJO (MECE) is a lecturer in the Computer Engineering Department of the Polytechnic University. She finished her PhD thesis in 2004 in the area of object-oriented programming.
Elma ZANAJ is a lecturer in the Computer Engineering Department of the Polytechnic University. She finished her PhD thesis in 2009 in the area of computer sensor networks, at the Polytechnic University of Ancona, Italy.
Aleksandër XHUVANI is head of the Computer Software Department at the Polytechnic University. He completed his PhD studies in Bordeaux, France. In 2004 he was awarded the title Prof. Dr.





A Proposal for Common Vulnerability Classification
Scheme Based on Analysis of Taxonomic Features in
             Vulnerability Databases

                    Anshu Tripathi                                                            Umesh Kumar Singh
       Department of Information Technology                                               Institute of Computer Science
         Mahakal Institute of Technology                                                        Vikram University
                   Ujjain, India                                                                    Ujjain, India
           anshu_tripathi@yahoo.com


Abstract— A proper vulnerability classification scheme aids the system security evaluation process. Many vulnerability classification schemes exist, but a standard classification scheme is lacking. The focus of this work is to devise a common classification scheme by effectively combining characteristics derived from the classification schemes of prominent vulnerability databases. In order to identify a balanced set of characteristics for the proposed scheme, a comparative analysis of the existing classification schemes of five major vulnerability databases was carried out. A set of taxonomic features and classes was extracted as a result of this analysis. A common vulnerability classification scheme is then proposed by harmonizing the extracted set of taxonomic features and classes. A mapping of the proposed scheme to the existing classification schemes is also presented, to eliminate inconsistencies across the selected set of databases.

    Keywords- Vulnerability; Classification scheme; Vulnerability databases; Taxonomy; Security evaluation.

                      I.    INTRODUCTION

Proper assessment and mitigation of vulnerabilities is essential in order to ensure system security. Vulnerabilities are "design and implementation errors in information systems that can result in a compromise of the confidentiality, integrity or availability of information stored upon or transmitted over the affected system" [1]. In view of the increasing population of vulnerabilities [2], it is necessary to prioritize them and first remediate those that pose the greatest risk. Vulnerability prioritization requires evaluation of the risk levels posed by the presence of vulnerabilities. Quantitative evaluation of system security in terms of the risk levels due to the presence of vulnerabilities is gaining importance because of its objective and timely result generation. One way to achieve fast security evaluation is to find the potential weak areas of the system. It is essential to focus mitigation efforts on the areas that have a greater number of vulnerabilities in order to meet budget and time constraints. These areas can be identified by proper vulnerability classification, which in turn leads to identifying the root causes of the weaknesses. Vulnerabilities share common properties and similar characteristics in generic aspects like causes, impacts, and locations [3]. Results from previous research [3-6] clearly indicate that quantitative security evaluation of risks on vulnerability datasets partitioned into well-defined classes is a meaningful metric. In [4], the results of categorized vulnerability analysis showed that some vulnerability classes are more severe; this fact can be used to design an optimal security solution by prioritizing the severe classes. A proper classification scheme facilitates the distribution of vulnerabilities and helps in prioritizing mitigation efforts according to severity level. The efficiency of a security evaluation process can be measured by its objectivity and vulnerability coverage. A proper classification scheme plays a major role in this regard by increasing both objectivity and vulnerability coverage.

Taxonomy is a way to classify vulnerabilities in a well-formed structure so that categorization and generalization can be achieved [7]. In our previous work [8], we analyzed prominent published vulnerability taxonomies with respect to standard criteria and highlighted the issues which make them not so usable in today's scenario. This study of past efforts at developing such a taxonomy indicates that these efforts have proved insufficient to address the security issues associated with current software products, due to their theoretical approach or their focus on a limited domain.

There are many different vulnerability databases, set up with different standards and capabilities, that record vulnerabilities and characterize them by several attributes. These databases serve the need for an updated collection of vulnerability data for research. Some of the most popular databases include the National Vulnerability Database (NVD) [9], the Open Source Vulnerability Database (OSVDB) [10], and IBM ISS X-Force [11]. But there are many challenges in extracting common patterns from these vulnerability databases, due to the discrepancies in the way the information is kept. Many different classification schemes are used by the databases to classify vulnerabilities, and a common classification scheme is lacking. A detailed study of the issues involved in this regard can be found in [12]. The objective of this work is to analyze the vulnerability classification schemes of some of the most popular databases and devise a common classification scheme. Main aim of




                                                                     106                               http://sites.google.com/site/ijcsis/
                                                                                                       ISSN 1947-5500
                                                        (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                                            Vol. 9, No. 6, June 2011

proposing a common classification scheme is to provide a stepping stone for security risk analysis through strategic risk mitigation.

The paper is organized as follows. Section 2 provides an overview of the vulnerability classification schemes in major vulnerability databases. Section 3 compares the classification schemes of the databases introduced in Section 2 under generic taxonomic features. Section 4 proposes a common classification scheme based on the comparison in Section 3, by extracting appropriate taxonomic features and classes; a mapping of the proposed scheme to the existing ones is also given. Finally, Section 5 concludes the work with directions for future work.

                    II.   RELATED WORK

There are a number of vulnerability classification schemes adopted by the vulnerability databases maintained by various organizations. In this section we introduce the classification schemes of five major vulnerability databases: IBM ISS X-Force, NVD, SecurityFocus, OSVDB and Secunia.

The IBM ISS X-Force database [11] is one of the world's most comprehensive threat and vulnerability databases. At the end of 2010, it contained 54,604 vulnerabilities, covering 24,607 distinct software products from 12,562 vendors. The X-Force database does not explicitly include any class or category information; in other words, it does not specify a classification scheme. It does, however, inherently support two taxonomic features: impact and severity level. In all, eleven categories are proposed under impact, and risk levels are assigned in three categories: High, Medium and Low. The National Vulnerability Database [9] is managed by the National Institute of Standards and Technology of the United States and is associated with CVE [13]. It has recorded vulnerabilities since 1999, with a total of 46,176 vulnerabilities listed under CVE names. NVD uses CWE [14] as its classification mechanism; each individual CWE represents a single vulnerability type. There are 23 vulnerability types in total in the NVD classification scheme, based on the taxonomic features vulnerability cause and vulnerability impact. The SecurityFocus vulnerability database [15] is a vendor-neutral vulnerability database managed by Symantec Corporation since 2002. It contains more than 40,000 recorded vulnerabilities (spanning more than two decades) affecting more than 105,000 technologies from more than 14,000 vendors. SecurityFocus supports a classification scheme under the taxonomic feature cause: eleven vulnerability categories in total, based on Taimur Aslam's taxonomy of security faults in the Unix operating system [16]. The other taxonomic feature supported by SecurityFocus is exploitation location, with two categories, remote and local. The Open Source Vulnerability Database [10] is an open-source database created in 2002 by the Black Hat Conference community; it currently covers 70,789 vulnerabilities, spanning 32,272 products from 4,735 researchers, over 46 years. OSVDB provides a two-tier vulnerability classification scheme. The first tier comprises the categories Location, Attack Type, Impact, Solution, Exploit, Disclosure and OSVDB. Location includes nine subcategories, Attack Type ten, Impact four, Solution seven, Disclosure eight and OSVDB six. OSVDB supports a rich search feature under every category for trend analysis. Secunia [17] is a private company that provides security defense services and vulnerability analysis. Secunia categorizes vulnerabilities under the features Impact, Criticality Level and Exploitation Location. Vulnerabilities are associated with twelve classes under impact; there are five criticality levels, ranging from extremely critical to not critical; and the attack-vector classification includes three classes.

As we can see, the classification schemes supported by these major vulnerability databases are disparate in terms of classification criteria and dimensionality. Moreover, there is no interoperability among them, so it is challenging to compare or combine information across the databases. A common classification scheme can help in this regard. In the next section these databases are compared and analyzed with respect to generic taxonomic features, in order to extract pertinent information for the development of a common classification scheme.

        III.   EXTRACTION OF TAXONOMIC FEATURES AND CLASSES

One of the objectives of this work is to identify a set of characteristics for a very specific classification scheme: one that can be used effectively in the quantitative security evaluation of a system. This goal requires analyzing the existing schemes to deduce possible common features that will aid in security evaluation. A comparative study provides insight into the pros and cons of the different kinds of classification schemes. This section compares the classification schemes of the major vulnerability databases introduced in the previous section under generic taxonomic features. The taxonomic features identified for the analysis are: cause, impact, exploitation location and severity level. The comparisons, made under various heads, are summarized in Tables II to V. The heads have been numbered for legibility; their correspondence is shown in Table I.

     TABLE I.      CORRESPONDENCE OF COMPARISON HEADS

        No. of Head    Name of Head
        1              Explicit
        2              Dimensionality
        3              Class Code
        4              Class Details
        5              Multivariate
        6              Approximate Population Percentage

A. Vulnerability cause

Grouping vulnerabilities under the taxonomic feature cause helps in understanding the common types of errors and conditions that are responsible for the existence of the majority of vulnerabilities.





Paying attention to common errors and mistakes results in mitigating multiple vulnerabilities at once and also avoids future vulnerabilities caused by the same reason. SecurityFocus classifies vulnerabilities explicitly under the feature cause, and NVD and OSVDB also partially incorporate this feature in their classification schemes. Table II gives the results of the comparative study of classification under the feature cause in these three databases.

B. Vulnerability Impact

Exploitation of vulnerabilities results in degradation of system performance, and different vulnerabilities have different kinds of impact, so classifying vulnerabilities under the feature impact can provide useful insights. The taxonomic feature vulnerability impact is used as a classification criterion in the X-Force, Secunia, NVD and OSVDB databases. Table III gives the results of the comparative study of classification under the feature impact in these databases.

    TABLE II.       COMPARISON OF CLASSIFICATION SCHEMES UNDER
                TAXONOMIC FEATURE VULNERABILITY CAUSE

  VDB            1  2   3       4                              5  6
  SecurityFocus  Y  11  C-SF1   Configuration Error            N  1.19
                        C-SF2   Boundary Condition Error          16.70
                        C-SF3   Environment Error                 0.31
                        C-SF4   Input Validation Error            45.59
                        C-SF5   Design Error                      18.81
                        C-SF6   Race Condition Error              1.10
                        C-SF7   Origin Validation Error           0.50
                        C-SF8   Access Validation Error           5.60
                        C-SF9   Failure to Handle                 10.09
                                Exceptional Conditions
                        C-SF10  Atomicity Error                   0.03
                        C-SF11  Unknown                           0.08
  NVD            P  16  C-N1    Authentication Issues          N  2.48
                        C-N2    Credentials Management            1.01
                        C-N3    Buffer Errors                     11.65
                        C-N4    Cryptographic Issues              1.23
                        C-N5    Path Traversal                    5.38
                        C-N6    Code Injection                    6.05
                        C-N7    Format String Vulnerability       0.53
                        C-N8    Configuration                     0.89
                        C-N9    Input Validation                  6.79
                        C-N10   Numeric Errors                    3.01
                        C-N11   OS Command Injections             0.24
                        C-N12   Race Conditions                   0.56
                        C-N13   Resource Management Errors        4.94
                        C-N14   SQL Injections                    13.17
                        C-N15   Link Following                    1.28
                        C-N16   Design Error                      2.45
  OSVDB          P  04  C-O1    Authentication Management      N  2.18
                        C-O2    Cryptographic                     1.62
                        C-O3    Misconfiguration                  0.89
                        C-O4    Race Condition                    1.39

C. Exploitation Location

Exploitation location is a main feature affecting the risk level of a system, as it determines the attacker community and, in turn, the mitigation strategies. The databases SecurityFocus, OSVDB and Secunia explicitly classify vulnerabilities under this feature, with between 2 and 9 classes. Table IV gives the results of the comparative study of classification under the feature exploitation location in these databases.

D. Severity level

Different vulnerabilities have different levels of impact on the confidentiality, integrity and availability (CIA) of the system, which is measured by the severity level. Severity level information is provided by databases qualitatively or quantitatively. The number of classes for this feature is inconsistent across databases, varying from 3 to 5 in the qualitative case, as shown in column 3 of Table V. OSVDB provides severity ratings only in terms of CVSS scores [18], while SecurityFocus does not include this information. Table V gives the results of the comparative study of classification under the feature severity level in these databases.

    TABLE III.      COMPARISON OF CLASSIFICATION SCHEMES UNDER
                TAXONOMIC FEATURE VULNERABILITY IMPACT

  VDB      1  2   3      4                               5  6
  X-Force  N  11  I-X1   Gain Access                     N  49.25
                  I-X2   Gain Privileges                    4.0
                  I-X3   Bypass Security                    5.75
                  I-X4   File Manipulation                  1.25
                  I-X5   Data Manipulation                  16.42
                  I-X6   Obtain Information                 9.0
                  I-X7   Denial of Service                  12.0
                  I-X8   Configuration                      0.08
                  I-X9   Informational                      0.05
                  I-X10  Other                              1.5
                  I-X11  None                               0.7
  Secunia  Y  12  I-S1   Brute force                     Y  0.21
                  I-S2   Cross site scripting               17.5
                  I-S3   Denial of Service                  13.0
                  I-S4   Exposure of sensitive              14.23
                         information
                  I-S5   Exposure of system                 2.67
                         information
                  I-S6   Hijacking                          0.40
                  I-S7   Manipulation of data               15.87
                  I-S8   Privilege escalation               5.82
                  I-S9   Security bypass                    5.88
                  I-S10  Spoofing                           1.56
                  I-S11  System Access                      21.46
                  I-S12  Unknown                            1.40
  NVD      P  04  I-N1   Permissions, Privileges         N  7.49
                         and Access Control
                  I-N2   Cross Site Request Forgery         1.49
                  I-N3   Cross site scripting               12.60
                  I-N4   Information leak/disclosure        3.22
  OSVDB    P  04  I-O1   Denial of Service               N  11.44
                  I-O2   Information disclosure             18.66
                  I-O3   Infrastructure                     0.15
                  I-O4   Input manipulation                 60.64





    TABLE IV.    COMPARISON OF CLASSIFICATION SCHEMES UNDER
             TAXONOMIC FEATURE EXPLOITATION LOCATION

  VDB            1  2   3      4                          5  6
  SecurityFocus  Y  02  E-SF1  Remote                     N  80
                        E-SF2  Local                         20
  OSVDB          Y  09  E-O1   Physical access            N  0.45
                        E-O2   Local access                  8.50
                        E-O3   Remote/network access         77.42
                        E-O4   Local/Remote                  4.37
                        E-O5   Context dependent             6.36
                        E-O6   Dial up access                0.06
                        E-O7   Wireless Vector               0.27
                        E-O8   Mobile Phone/Hand held        0.15
                               device
                        E-O9   Unknown                       2.39
  Secunia        Y  03  E-S1   Remote                     N  84.0
                        E-S2   Local network                 8.0
                        E-S3   Local system                  8.0

    TABLE V.      COMPARISON OF CLASSIFICATION SCHEMES UNDER
                TAXONOMIC FEATURE SEVERITY LEVEL

  VDB      1  2   3     4                     5  6
  X-Force  Y  03  S-X1  High                  N  34
                  S-X2  Medium                   59
                  S-X3  Low                      07
  NVD      Y  03  S-N1  High                  N  44.6
                  S-N2  Medium                   48.0
                  S-N3  Low                      7.40
  Secunia  Y  05  S-S1  Extremely critical    N  4.0
                  S-S2  Highly critical          19.0
                  S-S3  Moderately critical      39.0
                  S-S4  Less critical            35.0
                  S-S5  Not critical             3.0

    IV.     PROPOSED CLASSIFICATION SCHEME AND MAPPING

The analysis of the classification schemes in the major vulnerability databases in Section III suggests that the main taxonomic features at the highest level of the classification hierarchy should be vulnerability cause, vulnerability impact, exploitation location and severity level. Based on this observation we propose a two-level vulnerability classification scheme. Further, a mapping of the classes in the proposed scheme to the classes in the analyzed vulnerability databases is presented, to resolve the discrepancies. A summary of the complete scheme is presented below.

A. Overview of scheme

The vulnerability cause feature is treated as univariate and classified into the eleven classes specified in column 4 of Table VI. The classes under this feature are based on SecurityFocus's classification scheme. The vulnerability impact feature is treated as multivariate and classified into the nine classes listed in column 4 of Table VI; these classes are based on the classification scheme of the Secunia database and the vulnerability consequence information provided by the X-Force database. Exploitation location is treated as univariate and classified into three classes, remote, local network and local, based on the Secunia database's vulnerability information under the head "where". The severity level feature is treated as univariate and classified into four classes: critical, high, medium and low. The classes under severity level are based on CVSS scores and can be associated with the severity levels defined by the other databases on the basis of CVSS scores alone.

During the harmonization process we merged classes with a minor vulnerability population into the nearest relevant classes, with the objective of producing a classification that helps to focus on the main causes and impact areas. The class Other is included to keep scope for expansion. Table VI presents a concise view of the proposed classification scheme and the mapping information.

B. Discussion

To proactively secure any system it is crucial to focus on the root causes of vulnerabilities; this underlines the importance of classification under the feature vulnerability cause. But as we can see in Table II, only SecurityFocus classifies vulnerabilities explicitly under this feature; NVD and OSVDB include only a few classes associated with cause in their classification schemes. It is obvious from the class details given in column 5 of Table II that the information across these three databases is highly incompatible. Moreover, many taxonomies exist [16, 19-22] that classify vulnerabilities under the feature cause; the classification in these taxonomies is based on the classification given by Landwehr et al. in [19], and SecurityFocus's scheme in particular is based on Taimur Aslam's taxonomy of security faults in the Unix operating system [16]. In the proposed scheme we therefore opted to take the classification given by SecurityFocus as the basis. In all, eleven classes were selected, as specified in column 4 of Table VI, and the feature is listed as univariate. Owing to the incompatibility, the classes of NVD and OSVDB are mapped to the nearest class, as specified in column 5 of Table VI; a few of the classes cannot be mapped at all (see the Comments of Table VI).

The taxonomic feature impact is used by most of the databases, but Secunia is the only one that uses it explicitly as a classification criterion (see Table III). The proposed scheme classifies vulnerabilities under the feature vulnerability impact based on Secunia. X-Force's information about consequences is compatible with Secunia's, but differs in treating the classes as univariate rather than multivariate. NVD and OSVDB include a few classes under the feature impact; after identifying the impact classes from both databases, they were mapped onto the classification based on Secunia and X-Force. In all, nine main categories were identified, as listed in column 4 of Table VI, that cover the main impact classes included in these four databases. The mapping information given in column 5 of Table VI specifies that the classes File Manipulation and Data Manipulation of X-Force are merged into a single category and mapped onto the Manipulation of data class of Secunia.
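The per-database class codes and the common classes they map to (summarized in Table VI) lend themselves to a simple lookup. The sketch below is illustrative only: the function name and the fallback to the class Other for unmapped codes are our own assumptions, and only a few of the Table VI entries are reproduced.

```python
# Excerpt of the Table VI mapping: database-specific class codes
# (C-* = vulnerability cause, E-* = exploitation location) mapped to
# the proposed common class.
COMMON_CLASS = {
    # Vulnerability cause (excerpt)
    "C-SF4": "Input Validation Error",   # SecurityFocus
    "C-N14": "Input Validation Error",   # NVD: SQL Injections
    "C-O3":  "Configuration Error",      # OSVDB: Misconfiguration
    # Attack vector (excerpt)
    "E-S1":  "Remote",                   # Secunia
    "E-O2":  "Local",                    # OSVDB: Local access
}

def harmonize(class_code: str) -> str:
    """Return the proposed common class for a source database's class
    code; unmapped codes fall back to the class 'Other' (our assumption)."""
    return COMMON_CLASS.get(class_code, "Other")
```

With such a table, an NVD "SQL Injections" entry (`C-N14`) and a SecurityFocus "Input Validation Error" entry (`C-SF4`) land in the same common class, which is exactly the interoperability the proposed scheme aims for.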







                TABLE VI.      PROPOSED CLASSIFICATION SCHEME AND MAPPING

  Taxonomic Feature   2   5   4 (Class)                       Mapping
  Vulnerability       11  N   Configuration Error             C-SF1, C-N8, C-O3
  Cause                       Boundary Condition Error        C-SF2, C-N13, C-N10, C-N3
                              Environment Error               C-SF3
                              Input Validation Error          C-SF4, C-N5, C-N9, C-N11, C-N14
                              Design Error                    C-SF5, C-N16
                              Race Condition Error            C-SF6, C-N12, C-O4
                              Origin Validation Error         C-SF7, C-N1, C-N2, C-N6, C-N7, C-O1
                              Access Validation Error         C-SF8, C-N15
                              Failure to Handle               C-SF9
                              Exceptional Conditions
                              Atomicity Error                 C-SF10
                              Other                           C-SF11
                              Comments: Cryptographic issues cannot be mapped directly
                              because they can be associated with different causes (e.g.
                              numeric errors, boundary condition errors), so they must be
                              mapped according to the underlying cause.
  Vulnerability       09  Y   Gain system access              I-X1, I-S6, I-S11
  Impact                      Gain privileges                 I-X2, I-S8, I-N1
                              Bypass security                 I-X3, I-S9
                              Data manipulation               I-X4, I-X5, I-S7, I-O4
                              Exposure of information         I-X6, I-S1, I-S4, I-S5, I-N4, I-O2
                              Denial of service               I-X7, I-S3, I-O1
                              Cross site scripting            I-S2, I-N2, I-N3
                              Spoofing                        I-S10
                              Other                           I-X8, I-X9, I-X10, I-S12, I-O3
                              Comments: Classes with a minor population are merged into
                              the relevant class; the purpose is to focus objectively on
                              the main impact areas.
  Attack Vector       03  N   Remote                          E-S1, E-O3, E-O6, E-O7, E-O8, E-SF1
                              Local Network                   E-S2, E-O4, E-SF1
                              Local                           E-S3, E-O1, E-O2, E-SF2
                              Comments: E-SF1 needs to be categorized depending on the
                              network type (LAN/WAN). The definition of E-O5 is ambiguous.
  Severity Level      04  N   Critical                        S-X1, S-N1, S-S1, S-S2
                              High                            S-X1, S-N1, S-S3
                              Medium                          S-X2, S-N2, S-S4
                              Low                             S-X3, S-N3, S-S5
                              Comments: Mapping based on CVSS scores: Critical (9-10),
                              High (7-8.9), Medium (4-6.9), Low (0-3.9).
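The four-level severity grading is defined directly on CVSS score ranges in the Comments of Table VI (Critical 9-10, High 7-8.9, Medium 4-6.9, Low 0-3.9). As a minimal sketch of the quantitative-to-qualitative mapping (the function name and the input validation are our own, not part of the paper):

```python
def severity_class(cvss_score: float) -> str:
    """Map a CVSS base score (0.0-10.0) to the proposed four-level
    grading: Critical 9-10, High 7-8.9, Medium 4-6.9, Low 0-3.9."""
    if not 0.0 <= cvss_score <= 10.0:
        raise ValueError("CVSS base scores lie in the range 0.0-10.0")
    if cvss_score >= 9.0:
        return "Critical"
    if cvss_score >= 7.0:
        return "High"
    if cvss_score >= 4.0:
        return "Medium"
    return "Low"
```

Because the grading is anchored to CVSS ranges, severity levels reported by NVD, X-Force or Secunia can be reconciled by recomputing the class from the CVSS base score alone, giving both a qualitative and a quantitative view.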
Similarly, Exposure of sensitive information and Exposure of system information of Secunia are merged and mapped onto Obtain information of X-Force. After reviewing the population distribution given in column 7 of Table III, the Configuration and Informational classes of X-Force are mapped onto Others because of their minor population (0.05-0.08); these classes are also covered in the feature cause. Hijacking of Secunia is basically gaining access, and brute force is obtaining information. Spoofing is kept as a separate category because it bypasses security as well as escalates privileges. A single vulnerability can impact in multiple ways, so the feature is listed as multivariate, as specified in column 3 of Table VI.
Attack vector not only determines the exploitation location but also informs about the attacker class, which in turn reflects attack techniques. It is an important feature that affects the severity level and also guides the design of security solution plans. Although all databases cover this feature, high disparity is involved (see Table IV). Classes range from 2 to 9, as listed in column 5 of Table IV. The proposed scheme suggests three levels: local, local network and remote (see column 4 of Table VI). These three levels cover all the significant attack dimensions, and security plans can accordingly be developed depending on the exposure of a machine to LAN, WAN or physical access.
Various vulnerabilities, although listed under the same cause or impact class, can damage a system at different severity levels. So classification under the feature severity level is necessary in order to understand the impact level of different classes, and further of the vulnerabilities in them. Severity level information is included in almost all databases, but the number of levels varies from 3 to 5 (see Table V). To remove this inconsistency in a balanced way, a four-level grading is proposed after analyzing the population distribution specified in column 7 of Table V. Further, a mapping of severity levels to CVSS scores is given in column 6 of Table VI, which resolves the disparity and gives both a qualitative and a quantitative classification. Severity level can be univariate only.

                    V.  CONCLUSION AND FUTURE WORK
The efficiency of a security evaluation process depends on its objectivity and vulnerability coverage. A proper vulnerability classification scheme can be helpful in this regard by increasing both objectivity and vulnerability coverage. Moreover, a proper classification scheme is also helpful in the categorization of newly discovered vulnerabilities and in trend analysis. The effectiveness of a classification scheme mainly depends on the taxonomic features selected as a base for classification. Different classification schemes exist in different vulnerability databases, based on a variety of criteria; a standard classification scheme is lacking. Five major vulnerability databases are selected in this work, and the classification schemes adopted by them are analyzed.
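The four-level grading maps onto CVSS base-score bands as given in the remarks column of Table VI (Critical 9-10, High 7-8.9, Medium 4-6.9, Low 0-3.9). As a minimal illustration, the mapping can be written as a small helper; the function name is hypothetical and not part of the proposed scheme itself:

```python
def severity_level(cvss_score: float) -> str:
    """Map a CVSS base score (0.0-10.0) to the proposed four-level grading."""
    if not 0.0 <= cvss_score <= 10.0:
        raise ValueError("CVSS base scores range from 0.0 to 10.0")
    if cvss_score >= 9.0:
        return "Critical"
    if cvss_score >= 7.0:
        return "High"
    if cvss_score >= 4.0:
        return "Medium"
    return "Low"
```

Such a lookup gives the quantitative side of the classification, while the class labels themselves remain the qualitative side.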



                                                                   110                               http://sites.google.com/site/ijcsis/
                                                                                                     ISSN 1947-5500
                                                                      (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                                                          Vol. 9, No. 6, June 2011

                          ACKNOWLEDGMENT
   We would like to thank the anonymous reviewers who provided helpful feedback on our manuscript.

                            REFERENCES
 [1]  D. Turner, M. Fossi, E. Johnson, T. Mack, J. Blackbird, S. Entwisle, M. K. Low, D. McKinney, and C. Wueest, "Symantec global internet security threat report: Trends for July to December 2007," Symantec, Tech. Rep., 2008.
 [2]  Secunia, Secunia yearly report 2010, http://secunia.com/gfx/pdf/Secunia_Yearly_Report_2010.pdf, 2011.
 [3]  Zhongqiang Chen, Yuan Zhang, Zhongrong Chen, "A Categorization Framework for Common Vulnerabilities and Exposures," The Computer Journal, Advance Access published online on May 7, 2009, http://comjnl.oxfordjournals.org, doi:10.1093/comjnl/bxp040.
 [4]  O. H. Alhazmi, S.-W. Woo, Y. K. Malaiya, "Security Vulnerability Categories in Major Software Systems," Proc. Third IASTED International Conference on Communication, Network, and Information Security, 2006, pp. 138-143.
 [5]  Lutz Lowis, Rafael Accorsi, "On a Classification Approach for SOA Vulnerabilities," Proc. 33rd Annual IEEE International Computer Software and Applications Conference, vol. 2, 2009, pp. 439-444.
 [6]  Somak Bhattacharya, S. K. Ghosh, "Security Threat Prediction in a Local Area Network Using Statistical Model," Proc. IEEE International Parallel and Distributed Processing Symposium, 2007, pp. 425-432.
 [7]  T. Aslam, I. Krsul, and E. H. Spafford, "Use of a Taxonomy of Security Faults," Proc. 19th National Information Systems Security Conf., Baltimore, USA, 1996, pp. 551-560.
 [8]  A. Tripathi, U. K. Singh, "Towards Standardization of Vulnerability Taxonomy," Proc. 2nd International Conference on Computer Technology and Development, Cairo, Egypt, 2010, pp. 379-384.
 [9]  NHS and NIST, National Vulnerability Database (NVD), automating vulnerability management, security measurement, and compliance checking, http://nvd.nist.gov/scap.cfm (Accessed on 10-05-2011).
 [10] http://osvdb.org/ (Accessed on 10-05-2011).
 [11] Internet Security Services Online database X-Force, 2008. [Online] Available: http://www.iss.net/xforce/ (Accessed on 10-05-2011).
 [12] A. Tripathi, U. K. Singh, "Taxonomic Analysis of Classification Schemes in Vulnerability Databases" (Communicated).
 [13] The MITRE Corporation (2008), Common Vulnerabilities and Exposures. [Online] Available: http://cve.mitre.org/ (Accessed on 10-05-2011).
 [14] R. A. Martin, Common Weakness Enumeration (CWE v1.8), National Cyber Security Division of the U.S. Department of Homeland Security, 2010.
 [15] Security Focus Vulnerability Database. [Online] Available: http://www.securityfocus.com (Accessed on 10-05-2011).
 [16] T. Aslam, "A Taxonomy of Security Faults in the Unix Operating System," M.S. thesis, Dept. of Comp. Sci., Purdue Univ., COAST TR 95-09, 1995.
 [17] Secunia. [Online] Available: http://secunia.com (Accessed on 10-05-2011).
 [18] Forum of Incident Response and Security Teams (FIRST), "Common vulnerability scoring system 2.0," 2007. [Online]. Available: http://www.first.org/cvss
 [19] C. E. Landwehr et al., "A Taxonomy of Computer Program Security Flaws," ACM Comp. Surveys, vol. 26, no. 3, Sept. 1994, pp. 211-254.
 [20] K. Jiwnani and M. Zelkowitz, "Maintaining Software with a Security Perspective," Proc. Int'l Conf. Software Maintenance, 3-6 Oct. 2002, pp. 194-203.
 [21] W. Du and A. P. Mathur, "Categorization of Software Errors that Led to Security Breaches," Proc. 21st Nat'l Info. Sys. Sec. Conf., 1998.
 [22] S. Kamara et al., "Analysis of Vulnerabilities in Internet Firewalls," Comp. & Sec., vol. 22, no. 3, 2003, pp. 214-232.



                          AUTHORS PROFILE

Anshu Tripathi holds an M.Tech. degree in Computer Science from Banasthali Vidyapith, Banasthali, INDIA. She is currently pursuing a Ph.D. in Computer Science at the Institute of Computer Science, Vikram University, Ujjain, INDIA. Her research interests include proactive network security, security measurement, and risk analysis.

Umesh Kumar Singh received his Ph.D. in Computer Science from Devi Ahilya University, Indore, INDIA. Presently he is Director of the Institute of Computer Science, Vikram University, Ujjain, INDIA. He served as Professor in Computer Science and Principal at the Mahakal Institute of Computer Sciences (MICS-MIT), Ujjain, and as Engineer (E&T) in the education and training division of CMC Ltd., New Delhi in the initial years of his career. He has authored a book on "Internet and Web Technology" and his various research papers are published in national and international journals of repute. He is a reviewer of the International Journal of Network Security (IJNS) and IJCSIS, and a reviewer and member of the conference committee of the European Conference on Knowledge Management (ECKM) since 2007. He is also a reviewer of the 4th IEEE International Conference on Computer Science and Information Technology and the 2011 3rd International Conference on Machine Learning and Computing. His research interests include Computer Networks, Network Security, Internet & Web Technology, Client-Server Computing and IT-based education.







  Abrupt Change Detection of Fault in Power System
      Using Independent Component Analysis

      1SATYABRATA DAS, 2SOUMYA RANJAN MOHANTY, 3SABYASACHI PATTNAIK
      1Asst. Prof., Department of CSE, College of Engineering Bhubaneswar, Orissa, India-751024
      2Asst. Prof., Department of EE, Motilal Nehru National Institute of Technology, Allahabad, India-211004
      3Prof., Department of I&CT, Fakir Mohan University, Balasore, India-756019
      E-mail: satya.das73@gmail.com, soumya@mnnit.ac.in, spattnaik40@yahoo.co.in


Abstract— This paper proposes a novel approach for fault detection in a power system based on Independent Component Analysis (ICA). The index for detection of a fault is derived from independent components of faulty current samples. The proposed approach is tested on simulated data obtained from MATLAB/Simulink for a typical power system and is compared with existing approaches available in the literature for fault detection in time-series data. The comparison demonstrates the accuracy and consistency of the proposed approach under the considered changing conditions of a typical power system. By virtue of its accuracy and consistency, the proposed approach can also be used in real-time applications.

Index Terms— Digital relaying, distance relay, fault detection, independent component analysis.

                       I. INTRODUCTION
   Every power system is provided with protective relays, which ensure better performance while maintaining minimum disturbance and damage. In the last few years, digital relays have replaced their solid-state-device counterparts due to their fast, accurate and reliable operation. The fault diagnosis unit of digital relays contains a fault detector (FD) unit in addition to fault classification and fault localization units [1]-[2].
   In recent years, a number of methods have become available in the literature for the detection of power system faults. A fault can be detected when the difference between current samples for two consecutive cycles is greater than a threshold value, or by a phasor comparison scheme [3]-[4]. However, these methods have limitations due to the difficulties in modeling the fault resistance. A Kalman filter-based approach [5]-[7] has been proposed in order to detect power system faults. A wavelet-based approach [8] is used to detect abrupt changes in the signal. Synchronized segmentation has been applied for disturbance recognition [9]. Further, the combination of an adaptive whitening filter and the wavelet transform has been used to detect abrupt changes in the signal [10]. However, these methods are sensitive to frequency deviation and to the presence of noise and harmonics.
   In this paper, an algorithm for abrupt change detection is proposed where the index for detection is derived from independent components of current samples. The performance of the proposed method has been tested in the presence of noise, harmonics and frequency variation, and is found to be accurate. Independent component analysis (ICA) is selected for feature extraction because of its reliability in extracting relevant and useful features. Further, the proposed approach is compared with three existing approaches available in the literature. The first of these is a detector based on comparison of a sample value with that of one cycle earlier. The second is a differential approach based on phasor estimation [3], while the third is a moving-sum based detector where the sum over one cycle of faulty current samples is chosen as the index for detection [11].
   The rest of the paper is arranged as follows: Section II gives a brief description of the three approaches used for comparative assessment of the proposed approach, followed by Section III, which gives a brief description of the independent component analysis technique. Next, Section IV presents the discussion of the proposed approach based on independent component analysis, while Section V presents the testing of the proposed approach. Finally, conclusions are given in Section VI.

II. FAULT DETECTION TECHNIQUES USED FOR POWER SYSTEM BASED ON TIME-SERIES DATA
   This section gives a brief description of fault detection techniques for power systems based on time-series data. These three techniques are used to carry out the comparative assessment of the proposed approach under changing conditions of the system. All these approaches are based on deterministic modeling of the faulty current signal obtained from a typical power system.

  A. Sample Comparison (SC)
   The first approach for fault detection is the conventional one. Here, the decision is taken by computing the difference between the current sample of the signal and the corresponding sample one cycle earlier. Under normal conditions, the computed difference comes out to be zero. When there is a fault in the system, the current signal gets distorted and consequently the computed difference becomes significant. If the computed



difference remains greater than a threshold value for three consecutive samples, a fault is reported by the FD unit. Let the discrete current signal be

   i(k) = I_m sin(ωk + θ)                                  (1)

where I_m is the peak of the signal, ω is the discrete angular frequency and θ is the phase angle. Then the index is derived as follows:

   i_change(k) = i(k) − i(k − N)                           (2)
   Index_sc(k) = |i_change(k)|                             (3)

where k is the time instant and N is the window size of one period. If

   Index_sc(k) ≥ (Index_sc)_threshold                      (4)

a fault is reported by the FD unit.

  B. Phasor Comparison (PC)
   This approach for fault detection is based on estimation of the phasor [3]. It is a relatively fast algorithm based on the derivative of the current signal. If the discrete current signal is

   i(k) = I_m sin(ωk + θ)                                  (5)

where I_m is the peak of the signal, ω is the discrete angular frequency and θ is the phase angle, then at any instant k the peak value of the signal can be estimated as

   Î_m(k) = sqrt( (i″(k)/ω²)² + (i′(k)/ω)² )               (6)

where Î_m(k) is the peak estimate of the signal, and i′(k) and i″(k) are the first and second derivatives of the discrete current signal respectively. The peak estimate is the magnitude of the fundamental phasor at the k-th estimate. The magnitude of the current phasor obtained at the k-th instant is compared with that at the (k−3)-th instant. If the difference is more than the threshold value for three successive samples, a fault is reported by the FD. The derivation of the index is as follows:

   I_change(k) = Î_m(k) − Î_m(k − 3)                       (7)
   Index_pc(k) = |I_change(k)|                             (8)

If

   Index_pc(k) ≥ (Index_pc)_threshold                      (9)

then the FD detects the fault. As the method is derivative based, it is found to be sensitive to noise and signal distortions.

  C. One-Cycle-Moving-Sum (OCMS)
   This approach involves the computation of the one-cycle sum of current samples obtained from the power system [11]. It is based on the symmetrical nature of the current waveforms in a power system. In the absence of a fault, the computed sum comes out to be zero. However, on occurrence of a fault in the power system, the corresponding sum will be non-zero or, equivalently, greater than a chosen threshold. For on-line implementation, once a new sample is obtained, the oldest sample is discarded and the sum is recalculated for the new window. Thus, only one addition and one subtraction are required at each step of computation. For large variations in system conditions, such as frequency variations, the window size needs to be made adaptive to generate zero sums in the normal condition. However, such variations are not common in large power systems. Let the discrete current signal be

   i(k) = I_m sin(ωk + θ)                                  (10)

Then the derivation of the index is as follows:

   i_sum(k) = Σ_{l=k−N+1}^{k} i_l                          (11)

where N is the window size for one cycle.

   Index_ocms(k) = |i_sum(k)|                              (12)

If

   Index_ocms(k) ≥ (Index_ocms)_threshold                  (13)

a fault is reported by the FD unit. Also,

   i_sum(k) = i_sum(k − 1) + i(k) − i(k − N)               (14)

Eqn. (14) shows that, for on-line computation, only one addition and one subtraction are required at each step.

            III. INDEPENDENT COMPONENT ANALYSIS
   Since ICA is based on the statistical properties of signals, it works accurately in non-deterministic modeling of the signals [12]. For ICA to be applied, the following assumptions for the mixing and demixing models need to be satisfied:
   1. The source signals s_i(t) are statistically independent.
   2. At most one of the source signals is Gaussian distributed.
   3. The number of observations M is greater than or equal to the number of sources N (M ≥ N).
In addition to blind separation of sources, ICA is also used for representing data as a linear combination of latent variables. There are different approaches for estimating the ICA model, based on the statistical properties of signals. Some of the methods used for ICA estimation are:
   1. maximization of nongaussianity,
   2. minimization of mutual information,
   3. maximum likelihood estimation,
   4. tensorial methods.
A blind source separation algorithm estimates the source signals from observed mixtures. The word 'blind' emphasizes that the source signals, and the way the sources are mixed (i.e., the mixing model parameters), are unknown or known very imprecisely. Independent component analysis is a blind source separation (BSS) algorithm which transforms the observed signals into mutually statistically independent signals. The ICA algorithm has many technical applications, including signal processing, brain imaging, telecommunications and audio signal separation [12]-[14].
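The three classical indices of Section II (eqns. (2)-(3), (6)-(8) and (11)-(12)) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the sampling rate (32 samples per cycle), the synthetic fault (amplitude doubling plus a DC offset at mid-signal) and the use of finite differences for the derivatives in the PC detector are all assumptions made for the example.

```python
import numpy as np

def index_sc(i, N):
    """Sample-comparison index (eqns. 2-3): |i(k) - i(k-N)|, zero for k < N."""
    idx = np.zeros(len(i))
    idx[N:] = np.abs(i[N:] - i[:-N])
    return idx

def index_pc(i, N):
    """Phasor-comparison index (eqns. 6-8), with finite-difference derivatives
    and omega = 2*pi/N rad/sample; compares peak estimates 3 samples apart."""
    w = 2 * np.pi / N
    d1 = np.gradient(i)            # i'(k)
    d2 = np.gradient(d1)           # i''(k)
    I_hat = np.sqrt((d2 / w**2) ** 2 + (d1 / w) ** 2)
    idx = np.zeros(len(i))
    idx[3:] = np.abs(I_hat[3:] - I_hat[:-3])
    return idx

def index_ocms(i, N):
    """One-cycle-moving-sum index (eqns. 11-12): |sum of the last N samples|."""
    csum = np.cumsum(np.insert(i, 0, 0.0))
    idx = np.zeros(len(i))
    idx[N - 1:] = np.abs(csum[N:] - csum[:-N])
    return idx

# Synthetic current: 32 samples per cycle; a hypothetical fault at mid-signal
# doubles the amplitude and adds a DC offset (purely illustrative numbers).
N = 32
k = np.arange(8 * N)
i = np.sin(2 * np.pi * k / N)
i[4 * N:] = 2.0 * np.sin(2 * np.pi * k[4 * N:] / N) + 0.5

sc, pc, ocms = index_sc(i, N), index_pc(i, N), index_ocms(i, N)
# Before the fault all three indices stay near zero; at and after the fault
# they rise well above any small threshold.
```

Note how the SC and PC indices return to zero once the faulty waveform becomes periodic again, whereas the OCMS index stays non-zero as long as the DC offset persists — which is why the paper compares all three against the proposed detector.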



  A. ICA estimation by maximization of nongaussianity
   A measure of nongaussianity is the negentropy J(y), which is the normalized differential entropy. By maximizing the negentropy, the mutual information of the sources is minimized. Also, mutual information is a measure of the independence of random variables. Negentropy is always non-negative, and zero for Gaussian variables [12]:

   J(y) = H(y_gauss) − H(y)                                (15)

The differential entropy H of a random vector y with density p_y(η) is defined as

   H(y) = −∫ p_y(η) log p_y(η) dη                          (16)

In equations (15) and (16), the estimation of negentropy requires the estimation of the probability density functions of the source signals, which are unknown. Instead, the following approximation of negentropy is used:

   J(y_i) = J(w_i^T x) ≈ [ E{G(w_i^T x)} − E{G(y_gauss)} ]²   (17)

Here, E denotes the statistical expectation and G is chosen as non-quadratic. Assume that we observe n linear mixtures x_1, ..., x_n of n independent components:

   x_j = a_{j1} s_1 + a_{j2} s_2 + ... + a_{jn} s_n   for all j   (18)

We assume that each mixture x_j, as well as each independent component s_k, is a random variable instead of a time-dependent signal. Without loss of generality, we can assume that both the mixture variables and the independent components have zero mean. If this is not true, the observable variables x_j can always be centered by subtracting the sample mean, which makes the model zero mean. It is convenient to use vector-matrix notation instead of the sums as in the previous equation. Let us denote by x the random vector whose elements are the mixtures x_1, ..., x_n, and likewise by s the random vector with elements s_1, ..., s_n. Let us denote by A the matrix with elements a_ij. All vectors are taken as column vectors; thus x^T, the transpose of x, is a row vector. With this vector-matrix notation, the above mixing model becomes:

   x = As                                                  (19)

Denoting the columns of matrix A by a_j, the model can also be written as

   x = Σ_{i=1}^{n} a_i s_i                                 (20)

The statistical model in eqn. (20) is called the independent component analysis, or ICA, model. The ICA model is a generative model, and the independent components are latent variables, meaning that they cannot be directly observed. Also, the mixing matrix is assumed to be unknown. All we observe is the random vector x, and we must estimate both A and s under the assumption that the independent components have non-Gaussian distributions. Then, after estimating the matrix A, we can compute its inverse, say W, and obtain the independent components simply by:

   s = Wx                                                  (21)

   FastICA is an efficient algorithm based on fixed-point iteration used for estimation of ICs in time-series data [15]. This approach for IC estimation is 10-100 times faster than the other methods that are used to reduce data dimension.

            IV. PROPOSED FAULT DETECTION METHOD
   This section presents the algorithm of the proposed method for detection of abrupt changes due to the occurrence of a fault in the power system. An abrupt change detector based on independent components of current samples is proposed. The index for detection is derived from independent components of current samples obtained from the data acquisition system. The proposed algorithm has been tested on simulation data and is explained below:
   (i) Data has been obtained from a MATLAB/Simulink model of the interconnected power system considered in this work. In this study, the pre-fault signal can be taken as the non-faulty signal. The signal is first passed through the detection block, followed by the classification block, and finally through the localization block for deciding the logic of the trip signal system. This constitutes the fault diagnosis system.
   (ii) The simulated signal is passed through the first block, where removal of the mean and de-correlation (for removal of second-order dependencies) is done. This constitutes the first level of pre-processing. The output of this block is fed to the next block.
   (iii) In this block, whitening of the data followed by dimension reduction is performed to reduce redundancy in the data. The output of this block is fed to the third block.
   (iv) Now, principal components (PC) of the data are determined and fed to the next block.
   (v) Here, independent components of the data are calculated using the fixed-point iteration of the FastICA algorithm [12], [15].
   (vi) The ICA block returns the demixing or separating matrix W_f along with the independent components s_f of the real-time signal. For the calculation of these variables, the matrix x_f is constructed from real-time signal samples.
   (vii) The stored signals, or pre-fault signals, are used to construct the matrix x_n for the derivation of the index.
   The index is derived as

   Index_proposed(k) = normalised( |W_f(k) · x_n(k) − s_f(k)|² )   (22)

   A fault is detected when Index_proposed(k) is greater than a certain threshold, (Index_proposed)_threshold. The threshold,
using it. The starting point for ICA is the very simple
assumption that the components si are statistically                            ( Index proposed )threshold is   evaluated         by     decision     block   and
independent. We also assume that the independent                               appropriate actions are taken. This information is then passed

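The pipeline of steps (ii)-(vii) and the index of Eq. (21) can be sketched numerically. The sketch below is an illustration only, not the authors' implementation: the synthetic 50 Hz signal, the fault instant k0, the two-channel delay embedding, and all helper names are assumptions; the one-unit fixed-point update follows the standard FastICA rule of [15].

```python
import numpy as np

def whiten(x):
    """Steps (ii)-(iv): remove the mean, then decorrelate/whiten via PCA."""
    xc = x - x.mean(axis=1, keepdims=True)
    d, e = np.linalg.eigh(xc @ xc.T / xc.shape[1])
    v = e @ np.diag(1.0 / np.sqrt(d)) @ e.T      # whitening matrix
    return v @ xc, v

def fastica_one_unit(z, n_iter=200, tol=1e-10):
    """Step (v): one-unit FastICA fixed-point iteration (tanh nonlinearity)
    on whitened data z of shape (channels, samples)."""
    w = np.random.default_rng(0).standard_normal(z.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        wz = w @ z
        w_new = (z * np.tanh(wz)).mean(axis=1) - (1 - np.tanh(wz) ** 2).mean() * w
        w_new /= np.linalg.norm(w_new)
        if abs(abs(w_new @ w) - 1.0) < tol:
            return w_new
        w = w_new
    return w

# Hypothetical 50 Hz current sampled at 1 kHz; abrupt fault injected at sample k0.
fs, f0, n, k0 = 1000, 50, 200, 120
t = np.arange(n) / fs
rng = np.random.default_rng(1)
healthy = np.sin(2 * np.pi * f0 * t) + 0.01 * rng.standard_normal(n)
faulty = healthy.copy()
faulty[k0:] += 2.0 * np.sin(2 * np.pi * f0 * t[k0:] + 0.8)

def embed(sig):
    """Two-channel data matrix: each column holds a sample and its one-step delay."""
    return np.vstack([sig[1:], sig[:-1]])

x_n, x_f = embed(healthy), embed(faulty)    # stored vs. real-time matrices

z, v = whiten(x_f)
w_f = fastica_one_unit(z) @ v               # step (vi): demixing row W_f
mu = x_f.mean(axis=1, keepdims=True)
s_f = w_f @ (x_f - mu)                      # independent component s_f

index = np.abs(w_f @ (x_n - mu) - s_f) ** 2  # Eq. (21)
index /= index.max()                         # normalised
```

Because the stored and real-time samples coincide before the fault, the index is exactly zero in the pre-fault region and jumps once the fault components appear, so comparing it against a small threshold flags the inception.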
                                                                         114                                          http://sites.google.com/site/ijcsis/
                                                                                                                      ISSN 1947-5500
                                                          (IJCSIS) International Journal of Computer Science and Information Security,
                                                          Vol. 9, No. 6, June 2011


to classification block. The flowchart of the proposed algorithm is illustrated in Fig. 1.

            Fig. 1 Flowchart of the proposed algorithm.

    V. COMPARATIVE ASSESSMENT AND TESTING OF THE PROPOSED ALGORITHM

A three-phase transmission line (200 km, 230 kV, 50 Hz) connecting two systems, with an MOV and series capacitor at the middle of the line as shown in Fig. 2, has been considered for comparative assessment of the performance of the existing and proposed algorithms. For clarity of the results, we have demonstrated the comparative assessment of the various algorithms by considering a fault at the same instant under different conditions. The power system model in MATLAB/Simulink is used to obtain the simulation data. At the receiving end, a combination of linear and non-linear loads is used. Depending on the switching of the non-linear load, harmonics appear in the current signal. The testing data are obtained through simulation of the considered power system under different system conditions. A sampling rate of 1 kHz and a full-cycle window of N = 20 samples (50 Hz nominal frequency) have been chosen for testing.

            Fig. 2 A 230 kV, 50 Hz power system.

      TABLE 1 PARAMETERS OF THE POWER SYSTEM MODEL
System voltage              230 kV
System frequency            50 Hz
Voltage of source 1         1.0 pu, 0 degree
Voltage of source 2         1.0 pu, 10 degree
Transmission line length    200 km
Positive sequence           R = 0.0321 ohm/km, L = 0.57 mH/km, C = 0.021 µF/km
Zero sequence               R = 0.0321 ohm/km, L = 1.711 mH/km, C = 0.021 µF/km
Series compensation         70 %, C = 176.34 µF
MOV                         Vref = 40 kV, 5 MJ
Current transformer (CT)    230 kV, 50 Hz, 2000:1 (turns ratio)

Faults of various types are simulated at different locations and the performance of the algorithms is assessed. To demonstrate the potential of the approach, only a few cases of fault occurrence towards the farther end of the line are presented here. Nevertheless, the proposed method responds similarly to other types of power system faults. Single line-to-ground (AG-type) faults at 80% of the line have been created at different inception angles, and the corresponding phase current, tapped at Bus 2, is processed through the different algorithms; the detection indices are computed and normalized for comparison. As the fault detector (FD) is expected to detect the inception of a fault within a few milliseconds, the first few sampling periods are important for judging the performance of the algorithm.

  A. Abrupt change detection without noise

The interconnected power system shown in Fig. 2 is simulated in MATLAB/Simulink. An L-G fault has been created at 0.065 s with the system frequency at 50 Hz. A comparative assessment of the proposed algorithm against the existing algorithms is shown in Fig. 3. Here, all the indices indicate the fault situation with a minimum delay (1-2 samples) after the inception of the fault. In the post-fault region, almost all the algorithms exhibit consistent results.






               Fig. 3 Fault detection without noise.

  B. Abrupt change detection with noise

A phase-to-ground fault is created at 0.065 s with the system operating at the nominal frequency of 50 Hz, and a noise signal of 20 dB SNR is added to the original signal for performance assessment. The normalized indices are shown in Fig. 4. It is observed that the values of the indices determined from the three algorithms discussed in Section II are significant even before the occurrence of the fault. However, the index of the proposed method indicates the instant of fault inception correctly. Index_sc crosses the threshold before the occurrence of the fault, i.e. in the steady state, and may be misinterpreted as a fault even when there is none. Index_ocms also exhibits a non-zero variation, although it sums the total noisy signal over a fixed data window, i.e. 20 samples per cycle. Thus, these indices exhibit variation in the pre-fault region and are not consistent in the post-fault region either. On the contrary, the proposed method shows an almost zero index value in the pre-fault region and a consistent index in the post-fault region.

             Fig. 4 Fault detection with 20 dB noise.

  C. Abrupt change detection with DC offset

A phase-to-ground fault has been created at 0.065 s in the system. The current signal is processed through the different methods described in Sections II and IV. The normalized indices are given in Fig. 5. It is observed that the existing methods fail to sense the change immediately. In the presence of a DC offset, the algorithm based on the accumulated sum of samples in a fixed data window, i.e. the moving sum, deviates from its nominal value of zero, or from a very small threshold value. In fact, even before the occurrence of the fault, the moving-sum algorithm is not a suitable approach for describing the fault occurrence. Similarly, the other existing algorithms also perform poorly in the pre-fault period. On the other hand, the proposed approach does not deviate from the nominal threshold in the pre-fault region. Although the PC-based algorithm shows comparable performance, it is still inconsistent in the post-fault region, since its value becomes equal to the threshold value; this is misinterpreted as fault inception.

           Fig. 5 Fault detection in presence of dc-offset.

  D. Abrupt change detection with harmonics

With the incorporation of a non-linear load at Bus 2, harmonics are generated in addition to the fundamental components of the signal. A phase-to-ground fault has been created at 0.065 s in the system. The current signal is processed through the different methods described in the earlier sections. The normalized indices have been plotted in Fig. 6. A favorable detection of the fault by the proposed algorithm over the existing ones is observed.

           Fig. 6 Fault detection in presence of harmonics.
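Why a fixed-window accumulated sum misbehaves under a DC offset can be seen with a small sketch, assuming a hypothetical decaying DC component; the 20-sample window matches the 1 kHz / 50 Hz setting used here:

```python
import numpy as np

fs, f0, n_win = 1000, 50, 20                 # 1 kHz sampling, 50 Hz, full-cycle window
t = np.arange(400) / fs
i_clean = np.sin(2 * np.pi * f0 * t)         # pre-fault phase current (illustrative)
i_dc = i_clean + 0.5 * np.exp(-t / 0.05)     # same current plus a decaying dc offset

def moving_sum(x, n=n_win):
    """Accumulated sum of samples over a sliding fixed-length data window."""
    c = np.concatenate([[0.0], np.cumsum(x)])
    return c[n:] - c[:-n]

ms_clean = np.abs(moving_sum(i_clean))
ms_dc = np.abs(moving_sum(i_dc))
```

Over exactly one cycle the sinusoid sums to (numerically) zero, so ms_clean stays at zero; the DC offset leaves a large residue in the pre-fault region, which is why the moving-sum index can sit above a near-zero threshold before any fault occurs.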



  E. Abrupt change detection with change in frequency

Frequency variations are common in power systems. Thus, frequency estimation is indispensable for demonstrating the performance of the existing algorithms such as the sample-comparison, phasor, and moving-sum approaches. In this study, the frequency is first estimated by the variable leaky least mean square (VL-LMS) algorithm, which tracks the original frequency change faster than the complex LMS algorithm [16]-[20]. After frequency estimation, the assessment of the existing algorithms is demonstrated. A phase-to-ground fault has been created at 0.065 s in the system operating at a frequency of 52 Hz. The normalized indices have been plotted in Fig. 7. As indicated, the proposed approach remains consistent, against the indices of the existing algorithms, in tracking the point of change.

           Fig. 7 Fault detection with change in frequency.

                          VI. CONCLUSION

Fault detection for relaying applications is a challenging task in the presence of noise, harmonics and frequency changes of the signal. Traditional methods are based on deterministic modeling, i.e. sinusoidal behavior of the current/voltage, and are therefore sensitive to noise. In this paper, a novel fault detection algorithm based on the independent components of the current signal was proposed. The proposed technique does not assume sinusoidal behavior of the current/voltage signal. The performance of the method was assessed through simulation with different fault data and compared with existing techniques. This method has been found to provide very consistent results under all the fault conditions, and it is compatible with any sampling frequency conventionally used for relaying applications.

                            REFERENCES

[1]  A. G. Phadke and J. S. Thorp, Computer Relaying for Power Systems, New York: John Wiley, 1988.
[2]  P. K. Dash, A. K. Pradhan, and G. Panda, "A novel fuzzy neural network based distance relaying scheme," IEEE Trans. on Power Delivery, vol. 15, pp. 902-907, 2000.
[3]  T. S. Sidhu, D. S. Ghotra, and M. S. Sachdev, "An adaptive distance relay and its performance comparison with a fixed data window distance relay," IEEE Trans. on Power Delivery, vol. 17, pp. 691-697, 2002.
[4]  M. S. Sachdev and M. Nagpal, "A recursive least square algorithm for power system relaying and measurement applications," IEEE Trans. on Power Delivery, vol. 6, pp. 1008-1015, 1991.
[5]  F. N. Chowdhury, J. P. Christensen, and J. L. Aravena, "Power system fault detection and state estimation using Kalman filter with hypothesis testing," IEEE Trans. on Power Delivery, vol. 6, pp. 1025-1030, 1991.
[6]  A. Girgis and D. G. Hart, "Implementation of Kalman and adaptive Kalman filtering algorithms for digital distance protection on a vector signal processor," IEEE Trans. on Power Delivery, vol. 4, pp. 141-156, 1989.
[7]  A. Girgis, "A new Kalman filtering based digital distance relaying," IEEE Trans. on Power Apparatus and Systems, vol. 101, pp. 3471-3480, 1982.
[8]  A. Ukil and R. Zivanovic, "Abrupt change detection in power system fault analysis using wavelet transform," International Conference on Power Systems Transients (IPST'05), Montreal, Canada, 2005.
[9]  A. Ukil and R. Zivanovic, "Application of abrupt change detection in power systems disturbance analysis and relay performance monitoring," IEEE Trans. on Power Delivery, vol. 22, no. 1, 2007.
[10] A. Ukil and R. Zivanovic, "Abrupt change detection in power system fault analysis using adaptive whitening filter and wavelet transform," Electric Power Systems Research, vol. 76, pp. 815-823, 2006.
[11] A. K. Pradhan, A. Routray, and S. R. Mohanty, "A moving sum approach for fault detection of power systems," Electric Power Components and Systems, vol. 34, no. 4, pp. 385-399, 2005.
[12] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis, New York: John Wiley & Sons, 2001.
[13] S. Pöyhönen, P. Jover, and H. Hyötyniemi, "Independent component analysis of vibrations for fault diagnosis of an induction motor," Proceedings of the IASTED International Conference on Circuits, Signals, and Systems, Cancun, Mexico, May 19-21, 2003.
[14] G. Gelle, M. Colas, and C. Serviere, "Blind source separation: A tool for rotating machine monitoring by vibration analysis," Journal of Sound and Vibration, vol. 248, no. 5, pp. 865-885, 2001.
[15] A. Hyvärinen, "Fast and robust fixed-point algorithms for independent component analysis," IEEE Trans. on Neural Networks, vol. 10, no. 3, pp. 626-634, 1999.
[16] F. Gustafsson, Adaptive Filtering and Change Detection, New York: John Wiley, 2000.
[17] A. K. Pradhan, A. Routray, and A. Basak, "Power system frequency estimation using least mean square technique," IEEE Trans. on Power Delivery, vol. 20, no. 3, pp. 1812-1816, 2005.
[18] O. J. Tobias and R. Seara, "On the LMS algorithm with constant and variable leakage factor in a nonlinear environment," IEEE Trans. on Signal Processing, vol. 54, no. 9, pp. 3448-3458, 2006.
[19] M. Kamenetsky and B. Widrow, "A variable leaky LMS adaptive algorithm," Proc. Conference on Signals, Systems and Computers, vol. 1, pp. 125-128, November 2004.
[20] S. C. Douglas, "Performance comparison of two implementations of the leaky LMS adaptive filter," IEEE Trans. on Signal Processing, vol. 45, no. 8, pp. 2125-2128, August 1997.

                        AUTHORS BIOGRAPHY

Mr. Satyabrata Das received the degree in Computer Science & Engineering from Utkal University in 1996 and the M.Tech. degree in CSE from ITER, Bhubaneswar. He is a research student in the Dept. of I&CT, Fakir Mohan University, Balasore. Currently, he is an Asst. Professor at College of Engineering Bhubaneswar, Orissa. His interests are in AI, Soft Computing, Data Mining, DSP and Neural Networks.



Soumya Ranjan Mohanty received the Ph.D. degree from Indian Institute of
Technology (IIT), Kharagpur, India. Currently he is an Assistant Professor in
the Department of Electrical Engineering, Motilal Nehru National Institute of
Technology (MNNIT), Allahabad, India. His research area includes digital
signal processing applications in power system relaying and power quality,
pattern recognition and distributed generations.

Dr. Sabyasachi Pattnaik received the M.Tech. degree in CSE from IIT Delhi and the Ph.D. degree in Computer Science from Utkal University. Currently, he is a Professor at Fakir Mohan University, Balasore. His research interests include Data Mining, AI and Soft Computing.





      Modeling and Analyzing the Deep Web: Surfacing Hidden Value

                    SUNEET KUMAR
       Associate Professor, Computer Science Dept.
            Dehradun Institute of Technology,
                      Dehradun, India
                  suneetcit81@gmail.com

                   ANUJ KUMAR YADAV
       Assistant Professor, Computer Science Dept.
            Dehradun Institute of Technology,
                      Dehradun, India
                    anujbit@gmail.com

                    RAKESH BHARATI
       Assistant Professor, Computer Science Dept.
            Dehradun Institute of Technology,
                      Dehradun, India
               goswami.rakesh@gmail.com

                    RANI CHOUDHARY
          Sr. Lecturer, Computer Science Dept.
                         BBDIT,
                     Ghaziabad, India
               ranichoudhary04@gmail.com


Abstract—Focused web crawlers have recently emerged as an alternative to the well-established web search engines. While the well-known focused crawlers retrieve relevant web-pages, there are various applications which target whole websites instead of single web-pages. For example, companies are represented by websites, not by individual web-pages. To answer queries targeted at websites, web directories are an established solution. In this paper, we introduce a novel focused website crawler that employs the paradigm of focused crawling for the search of relevant websites. The proposed crawler is based on a two-level architecture and corresponding crawl strategies with an explicit concept of websites. The external crawler views the web as a graph of linked websites, selects the websites to be examined next, and invokes internal crawlers. Each internal crawler views the web-pages of a single given website and performs focused (page) crawling within that website. Our experimental evaluation demonstrates that the proposed focused website crawler clearly outperforms previous methods of focused crawling which were adapted to retrieve websites instead of single web-pages.

Keywords—Deep Web; link references; searchable databases; site page-views.

                       I.    INTRODUCTION
A. The Deep Web

    Internet content is considerably more diverse, and its volume certainly much larger, than commonly understood. First, though sometimes used synonymously, the World Wide Web (HTTP protocol) is but a subset of Internet content. Other Internet protocols besides the Web include FTP (file transfer protocol), e-mail, news, Telnet, and Gopher (the most prominent among pre-Web protocols). This paper does not consider these non-Web protocols further [1]. Second, even within the strict context of the Web, most users are aware only of the content presented to them via search engines such as Excite, Google, AltaVista, or Northern Light, or search directories such as Yahoo!, About.com, or LookSmart. Eighty-five percent of Web users use search engines to find needed information, but nearly as high a percentage cite the inability to find desired information as one of their biggest frustrations. [2] According to a recent survey of search-engine satisfaction by market-researcher NPD, search failure rates have increased steadily since 1997. [3] The importance of information gathering on the Web and the central and unquestioned role of search engines -- plus the frustrations expressed by users about the adequacy of these engines -- make them an obvious focus of investigation. Our key findings include:
     •    Public information on the deep Web is currently 400 to 550 times larger than the commonly defined World Wide Web.
     •    The deep Web contains 7,500 terabytes of information, compared to 19 terabytes of information in the surface Web.
     •    The deep Web contains nearly 550 billion individual documents, compared to the one billion of the surface Web.
     •    More than 200,000 deep Web sites presently exist.
     •    Sixty of the largest deep-Web sites collectively contain about 750 terabytes of information -- sufficient by themselves to exceed the size of the surface Web forty times.
     •    On average, deep Web sites receive 50% greater monthly traffic than surface sites and are more highly linked to than surface sites; however, the typical (median) deep Web site is not well known to the Internet-searching public.
     •    The deep Web is the largest growing category of new information on the Internet.
     •    Deep Web sites tend to be narrower, with deeper content, than conventional surface sites.
     •    Total quality content of the deep Web is 1,000 to 2,000 times greater than that of the surface Web.



• Deep Web content is highly relevant to every information need, market, and domain.
• More than half of the deep Web content resides in topic-specific databases.

A full ninety-five percent of the deep Web is publicly accessible information, not subject to fees or subscriptions.

B. How Search Engines Work

Search engines obtain their listings in two ways: authors may submit their own Web pages, or the search engines "crawl" or "spider" documents by following one hypertext link to another. The latter returns the bulk of the listings. Crawlers work by recording every hypertext link in every page they index while crawling. Like ripples propagating across a pond, search-engine crawlers are able to extend their indices further and further from their starting points. The surface Web contains an estimated 2.5 billion documents, growing at a rate of 7.5 million documents per day.[4a] The largest search engines have done an impressive job in extending their reach, though Web growth itself has exceeded the crawling ability of search engines.[5][6] Today, the three largest search engines in terms of internally reported documents indexed are Google, with 1.35 billion documents (500 million available to most searches),[7] Fast, with 575 million documents,[8] and Northern Light, with 327 million documents.[9]

Moreover, return to the premise of how a search engine obtains its listings in the first place, whether adjusted for popularity or not: without a linkage from another Web document, a page will never be discovered. The main failing of search engines is that they depend on the Web's linkages to identify what is on the Web. Figure 1 is a graphical representation of the limitations of the typical search engine. The content identified is only what appears on the surface, and the harvest is fairly indiscriminate. Tremendous value resides deeper than this surface content. The information is there, but it is hiding beneath the surface of the Web.

FIGURE 1. SEARCH ENGINES: DRAGGING A NET ACROSS THE WEB'S SURFACE

II.   SEARCHABLE DATABASES: HIDDEN VALUE ON THE WEB

How does information appear and get presented on the Web? In the earliest days of the Web, there were relatively few documents and sites. It was a manageable task to post all documents as static pages. Because all pages were persistent and constantly available, they could be crawled easily by conventional search engines. In July 1994, the Lycos search engine went public with a catalog of 54,000 documents.[10] Since then, the compound growth rate in Web documents has been on the order of more than 200% annually.[11] Sites that were required to manage tens to hundreds of documents could easily do so by posting fixed HTML pages within a static directory structure. However, beginning about 1996, three phenomena took place. First, database technology was introduced to the Internet through such vendors as Bluestone's Sapphire/Web (Bluestone has since been bought by HP) and later Oracle. Second, the Web became commercialized, initially via directories and search engines, but it rapidly evolved to include e-commerce. And, third, Web servers were adapted to allow the "dynamic" serving of Web pages (for example, Microsoft's ASP and the Unix PHP technologies). Figure 2 represents, in a non-scientific way, the improved results that can be obtained by Bright-Planet technology. By first identifying where the proper searchable databases reside, a directed query can then be placed to each of these sources simultaneously to harvest only the results desired, with pinpoint accuracy.

FIGURE 2. HARVESTING THE DEEP AND SURFACE WEB WITH A DIRECTED QUERY ENGINE

Additional aspects of this representation will be discussed throughout this study. For the moment, however, the key points are that content in the deep Web is massive, approximately 500 times greater than that visible to conventional search engines, and of much higher quality throughout.

III.   STUDY OBJECTIVES

To perform the study discussed, we used our technology in an iterative process. Our goals were to:

• Quantify the size and importance of the deep Web.
• Characterize the deep Web's content, quality, and relevance to information seekers.
• Discover automated means for identifying deep Web search sites and directing queries to them.
• Begin the process of educating the Internet-searching public about this heretofore hidden and valuable information storehouse.

A. What Has Not Been Analyzed or Included in Results

This paper does not investigate non-Web sources of Internet content. This study also purposely ignores private intranet information hidden behind firewalls. Many large companies have internal document stores that exceed terabytes of information. Since access to this information is restricted, its scale cannot be defined or characterized. Also, while on average 44% of the "contents" of a typical Web document resides in HTML and other coded information (for example, XML or JavaScript),[12] this study does not evaluate specific information within that code. We do, however, include those codes in our quantification of total content. Finally, the estimates for the size of the deep Web include neither specialized search-engine sources -- which may be partially "hidden" from the major traditional search engines -- nor the contents of major search engines themselves. This latter category is significant. Simply accounting for the three largest search engines and average Web document sizes suggests search-engine contents alone may equal 25 terabytes or more,[13] somewhat larger than the known size of the surface Web.

B. A Common Denominator for Size Comparisons

All deep-Web and surface-Web size figures use both total number of documents (or database records, in the case of the deep Web) and total data storage. Data storage is based on "HTML-included" Web-document size estimates.[11] This basis includes all HTML and related code information plus standard text content, exclusive of embedded images and standard HTTP "header" information. Use of this standard convention allows apples-to-apples size comparisons between the surface and deep Web. The HTML-included convention was chosen because:

• Most standard search engines that report document sizes do so on this same basis.
• When saving documents or Web pages directly from a browser, the file-size byte count uses this convention.

All document sizes used in the comparisons use actual byte counts (1,024 bytes per kilobyte).

IV.   ANALYSIS OF LARGEST DEEP WEB SITES

Site characterization required three steps:

• Estimating the total number of records or documents contained on that site.
• Retrieving a random sample of a minimum of ten results from each site and then computing the expressed HTML-included mean document size in bytes. This figure, times the number of total site records, produces the total site size estimate in bytes.
• Indexing and characterizing the search-page form on the site to determine subject coverage.

Estimating total record count per site was often not straightforward. A series of tests was applied to each site; they are listed below in descending order of importance and confidence in deriving the total document count:

• E-mail messages were sent to the webmasters or contacts listed for all sites identified, requesting verification of total record counts and storage sizes (uncompressed basis); about 13% of the sites provided direct documentation in response to this request.
• Total record counts as reported by the site itself. This involved inspecting related pages on the site, including help sections, site FAQs, etc.
• Documented site sizes presented at conferences, estimated by others, etc. This step involved comprehensive Web searching to identify reference sources.
• Record counts as provided by the site's own search function. Some site searches provide total record counts for all queries submitted. For others that use the NOT operator and allow its stand-alone use, a query term known not to occur on the site, such as "NOT ddfhrwxxct", was issued; this approach returns an absolute total record count. Failing these two options, a broad query was issued that would capture the general site content; this number was then corrected by an empirically determined "coverage factor," generally in the 1.2 to 1.4 range [14].
• A site that failed all of these tests could not be measured and was dropped from the results listing.

V.   ANALYSIS OF STANDARD DEEP WEB SITES

Analysis and characterization of the entire deep Web involved a number of discrete tasks:

• Estimation of the total number of deep Web sites.
• Deep Web size analysis.
• Content and coverage analysis.
• Site page views and link references.
• Growth analysis.
• Quality analysis.

A. Estimation of Total Number of Sites

The basic technique for estimating total deep Web sites uses "overlap" analysis, the accepted technique chosen for two
of the more prominent surface Web size analyses.[5b][15] We used overlap analysis based on search-engine coverage and the deep Web compilation sites noted above. The technique is illustrated in the diagram below:

FIGURE 3. SCHEMATIC REPRESENTATION OF "OVERLAP" ANALYSIS

Overlap analysis involves pair-wise comparisons of the number of listings individually within two sources, na and nb, and the degree of shared listings, or overlap, n0, between them. Assuming random listings for both na and nb, the total size of the population, N, can be estimated. The fraction of the total population covered by na is estimated as n0/nb; an estimate for the total population size can then be derived by dividing this fraction into the total size of na. These pair-wise estimates are repeated for all of the individual sources used in the analysis.

To illustrate this technique, assume, for example, we know our total population is 100. Then if two sources, A and B, each contain 50 items, we could predict on average that 25 of those items would be shared by the two sources and 25 items would not be listed by either. According to the formula above, this can be represented as: 100 = 50 / (25/50).

There are two keys to overlap analysis. First, it is important to have a relatively accurate estimate for total listing size for at least one of the two sources in the pair-wise comparison. Second, both sources should obtain their listings randomly and independently from one another. This second premise is in fact violated for our deep Web source analysis. Compilation sites are purposeful in collecting their listings, so their sampling is directed. And, for search-engine listings, searchable databases are more frequently linked to because of their information value, which increases their relative prevalence within the engine listings.[4b] Thus, the overlap analysis represents a lower bound on the size of the deep Web, since both of these factors will tend to increase the degree of overlap, n0, reported between the pair-wise sources.

B. Deep Web Size Analysis

In order to analyze the total size of the deep Web, we need an average site size in documents and data storage to use as a multiplier applied to the entire population estimate. We randomized our listing of 17,000 search-site candidates. We then proceeded to work through this list until 100 sites were fully characterized. We followed a less intensive process than in the large-sites analysis for determining total record or document count for each site. Exactly 700 sites were inspected in their randomized order to obtain the 100 fully characterized sites. All sites inspected received characterization as to site type and coverage; this information was used in other parts of the analysis.

C. Content Coverage and Type Analysis

Content coverage was analyzed across all 17,000 search sites in the qualified deep Web pool (results shown in Table 1); the type of deep Web site was determined from the 700 hand-characterized sites. Broad content coverage for the entire pool was determined by issuing queries for twenty top-level domains against the entire pool. Because of topic overlaps, total occurrences exceeded the number of sites in the pool; this total was used to adjust all categories back to a 100% basis.

TABLE 1. DISTRIBUTION OF DEEP SITES BY SUBJECT AREA

Subject area: Deep Web coverage
Agriculture: 2.7%
Arts: 6.6%
Business: 5.9%
Computing/Web: 6.9%
Education: 4.3%
Employment: 4.1%
Engineering: 3.1%
Government: 3.9%
Health: 5.5%
Humanities: 13.5%
Law/Politics: 3.9%
Lifestyle: 4.0%
News/Media: 12.2%
People, Companies: 4.9%
Recreation, Sports: 3.5%
References: 4.5%
Science, Math: 4.0%
Travel: 3.4%
Shopping: 3.2%

Hand characterization by search-database type resulted in assigning each site to one of twelve arbitrary categories that captured the diversity of database types. These twelve categories are:
• Topic Databases -- subject-specific aggregations of information, such as SEC corporate filings, medical databases, patent records, etc.
• Internal site -- searchable databases for the internal pages of large sites that are dynamically created, such as the knowledge base on the Microsoft site.
• Publications -- searchable databases for current and archived articles.
• Shopping/Auction.
• Classifieds.
• Portals -- broader sites that include more than one of these other categories in searchable databases.
• Library -- searchable internal holdings, mostly for university libraries.
• Yellow and White Pages -- people and business finders.
• Calculators -- while not strictly databases, many do include an internal data component for calculating results. Mortgage calculators, dictionary look-ups, and translators between languages are examples.
• Jobs -- job and resume postings.
• Message or Chat.
• General Search -- searchable databases most often relevant to Internet search topics and information.

D. Site Page-views and Link References

Netscape's "What's Related" browser option, a service from Alexa, provides site popularity rankings and link reference counts for a given URL.[17] About 71% of deep Web sites have such rankings. The universal power function (a logarithmic growth rate, or logarithmic distribution) allows page-views per month to be extrapolated from the Alexa popularity rankings.[18] The "What's Related" report also shows external link counts to the given URL. A random sampling of 100 deep and 100 surface Web sites for which complete "What's Related" reports could be obtained was used for the comparisons.

E. Growth Analysis

The best method for measuring growth is time-series analysis. However, since the discovery of the deep Web is so new, a different gauge was necessary. Whois searches [19] associated with domain-registration services [16] return records listing the domain owner, as well as the date the domain was first obtained (and other information). Using a random sample of 100 deep Web sites [17b] and another sample of 100 surface Web sites,[20] we issued the domain names to a Whois search and retrieved the date each site was first established. These results were then combined and plotted for the deep vs. surface Web samples.

F. Quality Analysis

Quality comparisons between the deep and surface Web content were based on five diverse subject areas: agriculture, medicine, finance/business, science, and law. The queries were specifically designed to limit total results returned from any of the six sources to a maximum of 200, to ensure complete retrieval from each source.[21] The specific technology configuration settings are documented in the endnotes.[22] The "quality" determination was based on an average of our technology's VSM and mEBIR computational linguistic scoring methods.[23][24] The "quality" threshold was set at our score of 82, empirically determined as roughly accurate from millions of previous scores of surface Web documents.

VI.   CONCLUSION

This study is the first known quantification and characterization of the deep Web. Very little has been written or known of the deep Web. Estimates of size and importance have been anecdotal at best and certainly underestimate its scale. For example, Intelliseek's "invisible Web" site says that, "In our best estimates today, the valuable content housed within these databases and searchable sources is far bigger than the 800 million plus pages of the 'Visible Web.'" They also estimate total deep Web sources at about 50,000 or so.[25] A mid-1999 survey by About.com's Web search guide concluded the size of the deep Web was "big and getting bigger."[26] A paper at a recent library science meeting suggested that only "a relatively small fraction of the Web is accessible through search engines."[27]

The deep Web is about 500 times larger than the surface Web, with, on average, about three times higher quality on a per-document basis, based on our document scoring methods. On an absolute basis, total deep Web quality exceeds that of the surface Web by thousands of times. The total number of deep Web sites likely exceeds 200,000 today and is growing rapidly.[28] Content on the deep Web has meaning and importance for every information seeker and market. More than 95% of deep Web information is publicly available without restriction. The deep Web also appears to be the fastest growing information component of the Web.

REFERENCES

[1]. A couple of good starting references on various Internet protocols can be found at http://wdvl.com/Internet/Protocols/ and http://www.webopedia.com/Internet_and_Online_Services/Internet/Internet_Protocols/.
[2]. Tenth edition of GVU's (graphics, visualization, and usability) WWW User Survey, May 14, 1999. [Formerly http://www.gvu.gatech.edu/user_surveys/survey-1998-10/tenthreport.html.]
[3]. 3a, 3b. "4th Q NPD Search and Portal Site Study," as reported by Search Engine Watch [formerly http://searchenginewatch.com/reports/npd.html]. NPD's Web site is at http://www.npd.com/.
[4]. 4a, 4b. "Sizing the Internet," Cyveillance [formerly http://www.cyveillance.com/web/us/downloads/Sizing_the_Internet.pdf].
[5]. 5a, 5b. S. Lawrence and C. L. Giles, "Searching the World Wide Web," Science 280:98-100, April 3, 1998.
[6]. S. Lawrence and C. L. Giles, "Accessibility of Information on the Web," Nature 400:107-109, July 8, 1999.
[7]. See http://www.google.com.
[8]. See http://www.alltheweb.com and the quoted numbers on its entry page.
[9]. Northern Light is one of the engines that allows a "NOT meaningless" query to be issued to get an actual document count from its data stores. See http://www.northernlight.com. NL searches used in this article exclude its "Special Collections" listing.
[10]. See http://www.wiley.com/compbooks/sonnenreich/history.html.
[11]. 11a, 11b. This analysis assumes there were 1 million documents on the Web as of mid-1994.
[12]. Empirical Bright-Planet results from processing millions of documents provide an actual mean value of 43.5% for HTML and related content. Using a different metric, NEC researchers found HTML and related content, with white space removed, to account for 61% of total page content (see [7]). Both measures ignore images and so-called HTML header content.
[13]. Rough estimate based on 700 million total documents indexed by AltaVista, Fast, and Northern Light, at an average document size of 18.7 KB (see reference [7]) and a 50% combined representation by these three sources for all major search engines. Estimates are on an "HTML-included" basis.
[14]. For example, the query issued for an agriculture-related database might be "agriculture." Then, by issuing the same query to Northern Light and comparing it with a comprehensive query that does not mention the term "agriculture" [such as "(crops OR livestock OR farm OR corn OR rice OR wheat OR vegetables OR fruit OR cattle OR pigs OR poultry OR sheep OR horses) AND NOT agriculture"], an empirical coverage factor is calculated.
[15]. K. Bharat and A. Broder, "A Technique for Measuring the Relative Size and Overlap of Public Web Search Engines,"
[19]. See, for example among many, BetterWhois at http://betterwhois.com.
[20]. The surface Web domain sample was obtained by first issuing a meaningless query to Northern Light, "the AND NOT ddsalsrasve", and obtaining 1,000 URLs. This 1,000 was randomized to remove (partially) the ranking prejudice in the order in which Northern Light lists results.
[21]. An example specific query for the "agriculture" subject area is "agriculture* AND (swine OR pig) AND 'artificial insemination' AND genetics."
[22]. The Bright-Planet technology configuration settings were: max. Web page size, 1 MB; min. page size, 1 KB; no date-range filters; no site filters; 10 threads; 3 retries allowed; 60 sec. Web page timeout; 180 min. max. download time; 200 pages per engine.
[23]. The vector space model, or VSM, is a statistical model that represents documents and queries as term sets and computes the similarities between them. Scoring is a simple sum-of-products computation, based on linear algebra. See further: Salton, Gerard, Automatic Information Organization and Retrieval, McGraw-Hill, New York, N.Y., 1968; and Salton, Gerard, Automatic Text Processing, Addison-Wesley, Reading, MA, 1989.
[24]. See, as one example among many, CareData.com, at [formerly http://www.citeline.com/pro_info.html].
[25]. See the Help and then FAQ pages at [formerly http://www.invisibleweb.com].
[26]. C. Sherman, "The Invisible Web," [formerly http://websearch.about.com/library/weekly/aa061199.htm].
[27]. I. Zachery, "Beyond Search Engines," presented at the Computers in Libraries 2000 Conference, March 15-17, 2000, Washington, DC; [formerly http://www.pgcollege.org/library/zac/beyond/index.htm].
[28]. The initial July 26, 2000, version of this paper stated an estimate of 100,000 potential deep Web search sites. Subsequent customer projects have allowed us to update this analysis, again using overlap analysis, to 200,000 sites. This site number is updated in this paper, but overall deep Web size estimates have not been. In fact, still more recent work with foreign
paper presented at the Seventh International World Wide Web              language deep Web sites strongly suggests the 200,000
Conference, Brisbane, Australia, April 14-18, 1998. The full             estimate is itself low.
paper                is               available               at
http://www7.scu.edu.au/1937/com1937.htm.
[16].              See,               for               example,
http://www.surveysystem.com/sscalc.htm, for a sample size
calculator.
[17.     17a,     17b.     See      http://cgi.netscape.com/cgi-
bin/rlcgi.cgi?URL=www.mainsite.com./dev-scripts/dpd
[formerly                           http://cgi.netscape.com/cgi-
bin/rlcgi.cgi?URL=www.mainsite.com./dev-scripts/dpd]

[18]. See reference 38. Known page-views for the logarithmic
popularity rankings of selected sites tracked by Alexa are used
to fit a growth function for estimating monthly page-views
based on the Alexa ranking for a given URL.



                                                                   124                               http://sites.google.com/site/ijcsis/
                                                                                                     ISSN 1947-5500
                                                              (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                                                        Vol. 9, No. 6, 2011

  Instigation of Orthogonal Wavelet Transforms using
  Walsh, Cosine, Hartley, Kekre Transforms and their
               use in Image Compression
Dr. H. B. Kekre, Sr. Professor, MPSTME, SVKM's NMIMS (Deemed-to-be University), Vileparle(W), Mumbai-56, India.
Dr. Tanuja K. Sarode, Asst. Professor, Thadomal Shahani Engg. College, Bandra(W), Mumbai-50, India.
Sudeep D. Thepade, Associate Professor, MPSTME, SVKM's NMIMS (Deemed-to-be University), Vileparle(W), Mumbai-56, India.
Ms. Sonal Shroff, Lecturer, Thadomal Shahani Engg. College, Bandra(W), Mumbai-50, India.


Abstract—In this paper a novel orthogonal wavelet transform generation method is proposed. To check the advantage of wavelet transforms over the respective orthogonal transforms in image compression, the generated wavelet transforms are applied to color images of size 256x256x3 on each of the color planes R, G, and B separately, and thus the transformed R, G, and B planes are obtained. From each of these transformed color planes, 70% to 95% of the data (in the form of the coefficients having the lowest energy values) is removed and the image is reconstructed. The orthogonal transforms Discrete Cosine Transform (DCT), Walsh Transform, Hartley Transform and Kekre Transform are used for the generation of DCT Wavelets, Walsh Wavelets, Hartley Wavelets, and Kekre Wavelets respectively. From the results it is observed that each wavelet transform outperforms its respective orthogonal transform.

                       I.    INTRODUCTION

The development of wavelets can be linked to several separate trains of thought, starting with Haar's work in the early 20th century [16,17]. Wavelets are mathematical tools that can be used to extract information from many different kinds of data, including images [21,22,24]. Sets of wavelets are generally needed to analyze data fully. A set of "complementary" wavelets will reconstruct data without gaps or overlap, so that the deconstruction process is mathematically reversible with minimal loss. Generally, wavelets are purposefully crafted to have specific properties that make them useful for image processing. Wavelets can be combined, using a "shift, multiply and sum" technique called convolution, with portions of an unknown signal (data) to extract information from that signal. Wavelet transforms are now being adopted for a vast number of applications, often replacing the conventional Fourier transform. They have advantages over traditional Fourier methods in analyzing physical situations where the signal contains discontinuities and sharp spikes [1-4]. In Fourier analysis the local properties of the signal are not detected easily; the STFT (Short Time Fourier Transform) [5] was introduced to overcome this difficulty, but it gives local properties at the cost of global ones. Wavelets overcome this shortcoming of Fourier analysis [6,7] as well as of the STFT. Many areas of physics have seen this paradigm shift, including molecular dynamics, astrophysics, optics, quantum mechanics etc. This change has also occurred in image processing, blood-pressure, heart-rate and ECG analyses, DNA analysis, protein analysis, climatology, general signal processing, speech and face recognition, computer graphics and multifractal analysis. Wavelet transforms are also starting to be used for communication applications. One use of wavelet approximation is in data compression. Like other transforms, wavelet transforms can be used to transform data and then encode the transformed data, resulting in effective compression [8]. Wavelet compression can be either lossless or lossy. Wavelet compression methods are adequate for representing high-frequency components in two-dimensional images.

    So far only the wavelets of the Haar transform have been studied. This paper presents the wavelet generation of other transforms, namely the Walsh transform, DCT, Hartley transform and Kekre transform. The use of these transform wavelets for image compression is also proposed and studied. The experimental results show that better data compression can be achieved using the transform wavelets than using the image transforms themselves.

                  II.   EXISTING TRANSFORMS

This section discusses some of the existing transforms: Walsh, DCT, Hartley and Kekre.

A. DCT
    A discrete cosine transform (DCT) expresses a sequence of finitely many data points in terms of a sum of cosine functions oscillating at different frequencies. In particular, a DCT is a Fourier-related transform similar to the discrete Fourier transform (DFT), but using only real numbers. DCTs are equivalent to DFTs of roughly twice the length, operating on real data with even symmetry. There are eight standard DCT variants, of which four are common. DCTs are important to numerous applications in science and engineering, from lossy compression of audio and images to spectral methods for the numerical solution of partial differential equations. For compression, the cosine functions are much more efficient, whereas for differential equations the cosines express a particular choice of boundary conditions.
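The stated equivalence between a DCT and a DFT of roughly twice the length on evenly mirrored data can be checked with a short sketch (assumed pure-Python code, not from the paper; the common unnormalized DCT-II convention is used):

```python
import cmath
import math

def dct2(x):
    """Unnormalized DCT-II of a real sequence (one common convention)."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N) for n in range(N))
            for k in range(N)]

def dct2_via_dft(x):
    """Same DCT-II obtained from a length-2N DFT of the evenly
    mirrored sequence [x0..x_{N-1}, x_{N-1}..x0]."""
    N = len(x)
    y = list(x) + list(reversed(x))
    # naive DFT of length 2N
    Y = [sum(y[n] * cmath.exp(-2j * cmath.pi * n * k / (2 * N)) for n in range(2 * N))
         for k in range(2 * N)]
    # phase correction for the half-sample shift of the mirrored sequence
    return [(0.5 * cmath.exp(-1j * cmath.pi * k / (2 * N)) * Y[k]).real for k in range(N)]

x = [1.0, 2.0, 3.0, 4.0]
a, b = dct2(x), dct2_via_dft(x)
assert all(abs(u - v) < 1e-9 for u, v in zip(a, b))
```

The half-sample phase factor exp(-i*pi*k/(2N)) appears because the mirrored sequence is symmetric about n = -1/2 rather than n = 0.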




B. Walsh Transform
    The Walsh matrix was proposed by Joseph Leonard Walsh in 1923 [18,19]. Each row of a Walsh matrix corresponds to a Walsh function. A Walsh matrix is a square matrix whose dimensions are a power of 2 and whose entries are either +1 or −1; it has the property that the dot product of any two distinct rows (or columns) is zero [20,23,25]. The sequency ordering of the rows of the Walsh matrix can be derived from the ordering of the Hadamard matrix by first applying the bit-reversal permutation and then the Gray code permutation [9]. The Walsh matrix (and the Walsh functions) are used in computing the Walsh transform and have applications in the efficient implementation of certain signal processing operations.

C. Hartley Transform
    The Hartley transform was proposed by R. V. L. Hartley in 1942 as an alternative to the Fourier transform [10]. It is one of many known Fourier-related transforms. Compared to the Fourier transform, the Hartley transform has the advantages of transforming real functions to real functions (as opposed to requiring complex numbers) and of being its own inverse.

D. Kekre Transform
    The Kekre transform [11] matrix is the generic version of Kekre's LUV color space matrix [12-15]. Most other transform matrices must have dimensions that are a power of 2; this condition is not required for the Kekre transform. Any term of the NxN Kekre transform matrix (with indices x, y = 1, 2, ..., N) is generated as

        K(x,y) = 1,              x ≤ y
               = −N + (x − 1),   x = y + 1
               = 0,              x > y + 1                  (1)

All diagonal elements and the elements above the diagonal are one, while the elements below the diagonal, except the one immediately below the diagonal, are zero.

       III.   GENERATING WAVELET FROM ANY ORTHOGONAL TRANSFORM

    A wavelet transform matrix of size P² x P² can be generated from any orthogonal transform M of size PxP. For example, if we have an orthogonal transform matrix of size 9x9, then its corresponding wavelet transform matrix will have size 81x81; i.e., for an orthogonal matrix of size P, the wavelet transform matrix size will be Q such that Q = P². Consider the orthogonal transform M of size PxP as shown below.
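A minimal sketch (assumed Python/NumPy code, not from the paper) generates the Kekre transform matrix of equation (1) directly from its piecewise definition. Indices here are 0-based, and the subdiagonal entry is taken as -(N - x): this is the value that generalizes Kekre's 3x3 LUV matrix [[1,1,1],[-2,1,1],[0,-1,1]] and keeps the rows mutually orthogonal.

```python
import numpy as np

def kekre_transform(N):
    """Kekre transform matrix (0-based indices): ones on and above the
    diagonal, -(N - x) on the subdiagonal, zeros elsewhere."""
    K = np.zeros((N, N))
    for x in range(N):
        for y in range(N):
            if x <= y:
                K[x, y] = 1.0
            elif x == y + 1:
                K[x, y] = -(N - x)   # e.g. -2, -1 for N = 3
    return K

K3 = kekre_transform(3)              # rows: [1,1,1], [-2,1,1], [0,-1,1]
G = K3 @ K3.T
assert np.allclose(G - np.diag(np.diag(G)), 0)   # rows mutually orthogonal
```

Each row x ≥ 1 contains (N - x) ones after the subdiagonal entry, so its dot product with the all-ones first row is -(N - x) + (N - x) = 0, which is why this choice of subdiagonal value makes the matrix orthogonal.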
        M11    M12    ...    M1(P-1)    M1P
        M21    M22    ...    M2(P-1)    M2P
         .      .     ...       .        .
         .      .     ...       .        .
        MP1    MP2    ...    MP(P-1)    MPP

        Figure 1: PxP orthogonal transform matrix M
First P rows (each column of M repeated P times):

        M11 M11 ... M11 | M12 M12 ... M12 | ... | M1P M1P ... M1P
        M21 M21 ... M21 | M22 M22 ... M22 | ... | M2P M2P ... M2P
         .   .       .  |  .   .       .  |     |  .   .       .
        MP1 MP1 ... MP1 | MP2 MP2 ... MP2 | ... | MPP MPP ... MPP

Next P rows (row 2 of M translated P times):

        M21 M22 ... M2P | 0  0  ... 0    | ... | 0   0   ... 0
        0   0   ... 0   | M21 M22 ... M2P | ... | 0   0   ... 0
         .   .       .  |  .   .       .  |     |  .   .       .
        0   0   ... 0   | 0  0  ... 0    | ... | M21 M22 ... M2P

        ...

Last P rows (row P of M translated P times):

        MP1 MP2 ... MPP | 0  0  ... 0    | ... | 0   0   ... 0
        0   0   ... 0   | MP1 MP2 ... MPP | ... | 0   0   ... 0
         .   .       .  |  .   .       .  |     |  .   .       .
        0   0   ... 0   | 0  0  ... 0    | ... | MP1 MP2 ... MPP

        Figure 2: QxQ wavelet transform generated from PxP orthogonal transform (Q = P²)
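The construction shown in Figure 2 can be sketched as follows (an assumed NumPy sketch, not from the paper): the first P rows repeat every element of each row of M P times, and each remaining row of M is then translated P times.

```python
import numpy as np

def wavelet_from_orthogonal(M):
    """Build the Q x Q wavelet matrix (Q = P^2) of Figure 2 from a
    P x P orthogonal matrix M."""
    P = M.shape[0]
    Q = P * P
    W = np.zeros((Q, Q))
    # First P rows: every element of each row of M repeated P times.
    W[:P] = np.kron(M, np.ones((1, P)))
    # Remaining rows: rows 2..P of M, each translated P times.
    r = P
    for k in range(1, P):
        for s in range(P):
            W[r, s * P:(s + 1) * P] = M[k]
            r += 1
    return W

# With the 2x2 Hadamard matrix this reproduces the 4x4 Haar wavelet matrix.
H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
W4 = wavelet_from_orthogonal(H2)
G = W4 @ W4.T
assert np.allclose(G - np.diag(np.diag(G)), 0)   # rows stay orthogonal
```

Note that orthogonality between the first P rows and the translated rows relies on rows 2..P of M each summing to zero, which holds for the Walsh, DCT, Hartley and Kekre matrices considered here.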




Figure 2 shows the QxQ wavelet transform matrix generated from the PxP orthogonal transform matrix, where Q = P². To generate the wavelet matrix, each column of the orthogonal transform matrix is first repeated P times, giving the first P rows. Then the second row of the orthogonal matrix is translated P times to generate the next P rows. Similarly, each of the remaining rows is translated to generate P rows corresponding to it. Finally we get the wavelet matrix of size QxQ, where Q = P².

                    IV.   PROPOSED METHOD

In this section, image compression using the generated wavelet transforms is proposed.
  Step 1.    Consider an image of size 256x256. The wavelet transform matrix of size 256x256 is generated from an orthogonal matrix of size 16x16.
  Step 2.    The wavelet transform is applied on each of the image planes, i.e. the R-plane, G-plane and B-plane separately. Thus, the transformed R-plane, G-plane and B-plane are obtained.
  Step 3.    From each transformed plane separately, the 70% to 95% of coefficients having the lowest energy values are removed, and the image is then reconstructed.
  Step 4.    The Mean Square Error (MSE) between the reconstructed image and the original image is computed.

                V.     RESULTS AND DISCUSSION

In this section, the results of the proposed wavelet-transform-based image compression are discussed. The proposed method is implemented in MATLAB 7.0 on a Core 2 Duo processor. The DCT, Walsh, Hartley and Kekre wavelets were generated by the method discussed in Section III. Eleven different color images of size 256x256, belonging to different categories, were compressed using the proposed method.

Figure 3 shows the eleven color test images of size 256x256x3 belonging to different categories.

  Figure 3: Eleven original color test images, namely Aishwariya, Balls, Bird, Boat, Flower, Ganesh, Scenary, Strawberry, Tajmahal, Tiger and Viharlake (from left to right and top to bottom), belonging to different categories.

Tables 1, 3, 5 and 7 show the comparison of MSE values obtained from data compressed using the DCT, Walsh, Hartley and Kekre transforms, respectively, applied on all the eleven test images.

Tables 2, 4, 6 and 8 show the comparison of MSE values obtained from data compressed using the DCT wavelet, Walsh wavelet, Hartley wavelet and Kekre wavelet transforms, respectively, applied on all the eleven test images.

  Figure 4: Comparison of average MSE with respect to 95% to 70% of data compressed using DCT wavelet, Walsh wavelet, Hartley wavelet, Kekre wavelet, DCT, Walsh, Hartley and Kekre transforms.

Figures 5, 6, 7 and 8 show the results for the Balls image obtained from the DCT wavelet, Walsh wavelet, Hartley wavelet and Kekre wavelet, respectively, for 70% to 95% of data compressed.
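Steps 1-4 of the proposed method can be sketched per color plane as follows (an assumed NumPy sketch, not the paper's MATLAB implementation; a small 4x4 Walsh matrix stands in for the 256x256 wavelet matrix, and compress_plane_mse is a hypothetical helper name):

```python
import numpy as np

def compress_plane_mse(plane, T, retain_frac):
    """Transform one color plane, keep only the highest-energy fraction
    of coefficients, reconstruct, and return the mean square error."""
    F = T @ plane @ T.T                       # 2-D transform of the plane
    keep = int(round(retain_frac * F.size))
    order = np.argsort(np.abs(F), axis=None)  # flattened indices, ascending magnitude
    Fc = F.ravel().copy()
    Fc[order[:F.size - keep]] = 0.0           # drop the lowest-energy coefficients
    Ti = np.linalg.inv(T)
    rec = Ti @ Fc.reshape(F.shape) @ Ti.T     # reconstruct the plane
    return float(np.mean((plane - rec) ** 2))

rng = np.random.default_rng(0)
plane = rng.random((4, 4))
H = np.array([[1.0, 1.0], [1.0, -1.0]])
T = np.kron(H, H)                             # 4x4 Walsh (Hadamard) matrix
assert compress_plane_mse(plane, T, 1.0) < 1e-18            # keeping all is lossless
assert compress_plane_mse(plane, T, 0.25) >= compress_plane_mse(plane, T, 0.75)
```

Running this per R, G and B plane with retain_frac from 0.05 to 0.30 mirrors the 95% to 70% data-compression settings reported in the tables below.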
                  Table 1: Comparison of MSE values obtained for 95% to 70% data compressed using DCT applied on all eleven images.
                 %data compressed           95            90               85            80            75              70
                  %data retained             5            10               15            20            25              30
                    Aishwariya           16.0803        8.1392           4.542        2.7457         1.7756          1.2044
                       Balls             75.5739       62.2747         50.7583        40.078        30.6593         22.6352
                       Bird              23.4414       19.7856         17.1511        14.846        12.6658         10.6395
                       Boat              63.2849       56.6238         49.9363       43.0231         36.116         29.7334
                      Flower             23.2196       13.3896          8.2092         5.158         3.3155          2.1642
                     Ganesh              66.9069       60.1663         53.5089       47.1965        41.0909          34.991
                     Scenary             32.4582       26.3064         21.5738       17.7571        14.5049         11.6967
                   Strawberry            42.358        30.1477          21.656       15.8716        11.6467          8.5902
                    Tajmahal             49.7616        39.457          30.907       23.5085        17.5614         12.7884
                       Tiger             67.5201       53.9452          42.909       33.8272        26.4423         20.1247
                    Viharlake            42.4999       35.5583         29.7518       24.3079          19.42         15.3702
                     Average             45.7368      36.89035        30.08213      24.39269        19.56349        15.4489




 Table 2: Comparison of MSE values obtained for 95% to 70% data compressed using DCT wavelets applied on all eleven images.
    %data Compressed            95               90            85              80           75                70
      %data retained             5               10            15              20           25                30
         Aishwariya          13.9138          5.4578         2.5735         1.3751        0.7916           0.4853
            Balls            67.3254          51.4344       38.2372        27.1908       18.4033          11.9991
            Bird             15.1267          7.2307         3.5948          1.975        1.2121           0.7956
            Boat             54.5218          44.1746       35.3552        27.2747       19.6998          13.4656
           Flower            18.9017          7.4338         3.2424         1.5961        0.8935           0.5555
           Ganesh            65.2921          56.5384       48.5545        40.7238        33.151          25.9447
           Scenary           29.7195          20.5452       13.5051         8.2269        4.8059           2.7274
         Strawberry          40.4291           27.01        17.4447        10.6523        6.2435           3.5483
          Tajmahal           41.9902          29.0375       19.7188        12.8007        8.0717           5.0573
            Tiger            65.7406          49.9408       37.7845        27.6822       19.5049          13.2931
          Viharlake          38.3256          29.412        21.8512        15.5274       10.4845           6.8427
           Average           41.02605        29.83775       21.98745       15.91136      11.20562         7.701327

Table 3: Comparison of MSE values obtained for 95% to 70% data compressed using Walsh transform applied on all eleven images.
      %data Compressed            95               90            85             80              75              70
         %data retained           5                10            15             20              25              30
          Aishwariya           28.1012          18.5335       13.2878         9.6173         6.9535           4.9937
             Balls             81.3445          72.1406       63.6927        55.4022         47.672          39.9922
             Bird              29.0555          24.9949       21.8778        19.0824         16.3321          13.764
             Boat              66.874           60.8236       55.0104        49.0244         42.8836         36.6938
            Flower             36.1558          26.6554       20.6424        15.9753         12.2055          9.2632
            Ganesh             71.3259          65.8241       60.7583        55.4545         49.8489         43.9028
            Scenary            36.995           30.8505       26.1749        22.1124         18.3582         14.9854
          Strawberry           50.8574          42.5104        35.597        29.5755         23.963          18.9102
           Tajmahal            56.1151          46.753        39.0347        32.1825         26.1253           20.74
             Tiger             76.8846          67.4124       59.7075        52.2118         44.8674         37.3638
           Viharlake           46.1636          40.4746       35.3193        30.1336         25.1187          20.408
           Average            52.71569         45.17936      39.19116        33.70654       28.57529        23.72883

Table 4: Comparison of MSE values obtained for 95% to 70% data compressed using Walsh wavelets applied on all eleven images.
      %data compressed           95               90             85             80             75                70
        %data retained            5               10             15             20             25                30
          Aishwariya          24.2798          14.1835        8.4923         5.1467         3.2017            2.0188
             Balls            71.2873          59.9048        50.0267        40.7423       31.9771            24.201
             Bird              21.37           12.2772        6.7889         3.7529         2.1759            1.3296
             Boat             57.1472          47.5557        39.4294        31.806        24.5572           17.8547
            Flower            29.8175          18.8151        11.6577        6.9575         4.0902            2.3961
            Ganesh            68.7866          61.5798        54.7242        47.9489       41.1375           34.1747
            Scenary           33.2859          24.4968        17.4245        11.8686        7.8334            5.1252
          Strawberry          47.5166          37.5645        29.5413        22.3788       16.4636           11.6458
           Tajmahal           46.4039            34.44        25.3372        18.217        12.5737            8.4354
             Tiger            74.0328           62.677        52.7802        43.7662       35.2198            27.521
           Viharlake          41.6009          33.4509        26.2753        19.7348       14.0675             9.508
           Average            46.86623        36.99503       29.31615       22.93815      17.57251          13.11003

  Table 5: Comparison of MSE values obtained for 95% to 70% data compressed using Hartley transform applied on all eleven images.
       %data compressed           95               90             85             80              75             70
         %data retained            5               10             15             20               25            30
           Aishwariya          17.5702           9.2244        5.3507         3.3036            2.144         1.4455
              Balls            76.1777          62.8777       51.2288         40.6618          31.174        23.0542
              Bird             24.0468          19.8922       17.1642         14.8753         12.6649        10.6047
              Boat              63.743          56.9175       50.3089         43.4315          36.585        30.0655
             Flower             23.245          13.3901        8.1816          5.132           3.3002         2.1621
             Ganesh            67.4761          60.5399       54.0034         47.5902         41.4778        35.3603
             Scenary           33.3664          26.7334       22.0444         18.2306         14.8923        11.9873
           Strawberry          43.5429          31.7365        23.165         17.109          12.8251         9.5586
            Tajmahal            49.833          39.5094        30.807         23.4858         17.5716        12.8477
              Tiger            67.9142          54.7827       44.2022         35.0442         27.6414        21.3417
            Viharlake          43.0531          35.9083       30.1151         24.6871         19.7647        15.5462
            Average            46.36076        37.41019       30.59739       24.86828         20.00373       15.8158




Table 6: Comparison of MSE values obtained for 95% to 70% data compressed using Hartley wavelets applied on all eleven images.
       %data compressed           95               90            85             80              75               70
         %data retained           5                10            15             20              25               30
          Aishwariya           25.213           13.1629        7.0521         4.0416         2.4411            1.5258
             Balls             71.1273          57.2092       45.1996        34.8136         25.8726          18.5723
             Bird              23.6101          13.4562        7.3995         4.0684         2.3198            1.4062
             Boat              57.524           47.4436       38.7197        30.4678         22.8547          16.2068
            Flower             30.4708          17.2874        9.5996         5.2322         2.9442            1.6909
            Ganesh             68.6207          59.8072       51.8952        44.4754         36.9332          29.7399
            Scenary            33.7757          24.0033       16.4779        10.8463         6.9661            4.4146
          Strawberry           47.5595          35.0583        25.241          17.6          12.0036           8.0276
           Tajmahal            45.9356          33.1194       23.4711        16.1064         10.6663           6.9495
             Tiger             71.4263          57.522        46.3082        36.2968         27.6191          20.1696
           Viharlake           40.7793          31.5152       23.8878        17.1808          11.82            7.7949
           Average            46.91294         35.41679      26.84106        20.10266       14.76734         10.59074

Table 7: Comparison of MSE values obtained for 95% to 70% data compressed using Kekre transform applied on all eleven images.
      %data compressed            95              90             85             80              75               70
        %data retained            5               10             15             20              25               30
          Aishwariya          104.7137         98.3782        89.9261        79.8792         71.1881         63.4727
             Balls             95.4663         94.6395        91.7457        85.9287         78.2736         70.4866
             Bird              71.0262         66.9461        61.3682         53.14          44.6532         36.7741
             Boat              96.431           91.469        85.2311        77.9438         70.1285          62.199
            Flower             75.3232          73.098        70.8086        66.3523         61.4276         54.4408
            Ganesh             89.5643         85.6352         79.994        73.6673         66.8255         59.6006
            Scenary            69.4835         66.4225        60.9814        54.9584         48.8438         42.7133
          Strawberry           91.5023         86.9626        82.2165        76.2722         69.2823         61.9791
           Tajmahal            87.7596          81.797        74.3466        66.3947         58.3154         50.4829
             Tiger            103.1722          96.723         90.382        84.1157         77.8424         71.2272
           Viharlake           65.3572         60.0596        54.1362        48.2416         42.1389         36.1324
           Average            86.34541         82.01188      76.46695        69.71763       62.62903        55.40988

Table 8: Comparison of MSE values obtained for 95% to 70% data compressed using Kekre wavelets applied on all eleven images.
       %data compressed           95              90            85              80             75              70
         %data retained            5              10            15              20             25              30
           Aishwariya          41.1364         30.8323        22.567         15.7857       10.5192           6.8614
              Balls            78.2216         69.5023       60.7559         51.6477       42.7972           34.186
              Bird             30.9089         18.9243       11.0784         6.3362         3.7573           2.3177
              Boat             61.2107         51.5525       43.1938         35.063        27.1555          20.0594
             Flower            43.864          33.5713       24.2075         16.0388        9.7355           5.4288
             Ganesh            73.9496         67.8945       61.3198         54.3184       46.8956          39.3169
             Scenary           40.3419         30.1718       22.2297         15.6323       10.7463           7.3131
           Strawberry           58.15          49.6929       41.7717         34.4112       27.2761          20.7789
            Tajmahal           53.9875         41.6384       32.1871         23.958        16.8148          11.1398
              Tiger            86.4872         76.9422       68.3485         60.3057       51.9975          43.7026
            Viharlake          46.5487         39.6689       32.3033         25.1071       18.4207           12.622
            Average            55.8915        46.39922       38.17843       30.78219       24.19234         18.5206

Table 9: Comparison of average MSE values obtained for 95% to 70% data compressed using DCT, Walsh, Hartley, Kekre transforms and their
                                          corresponding wavelets applied on all eleven images.


       %data compressed                95              90                85           80              75             70
        %data retained                  5              10                15           20              25             30
        DCT Wavelets             41.02605        29.83775          21.98745     15.91136        11.20562       7.701327
        Walsh Wavelets           46.86623        36.99503          29.31615     22.93815        17.57251       13.11003
       Hartley Wavelets          46.91294        35.41679          26.84106     20.10266        14.76734       10.59074
        Kekre Wavelets            55.8915        46.39922          38.17843     30.78219        24.19234        18.5206
             DCT                  45.7368        36.89035          30.08213     24.39269        19.56349        15.4489
            Walsh                52.71569        45.17936          39.19116     33.70654        28.57529       23.72883
           Hartley               46.36076        37.41019          30.59739     24.86828        20.00373        15.8158
            Kekre                86.34541        82.01188          76.46695     69.71763        62.62903       55.40988
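The pattern in Tables 6-9 — MSE shrinking as the fraction of retained coefficients grows, with energy-compacting transforms doing better — can be reproduced in outline: transform an image, zero all but the largest-magnitude coefficients, invert, and measure MSE. The sketch below builds an orthonormal DCT from scratch as a stand-in for the transforms compared above; the 64x64 random array is illustrative only, not the authors' test images.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis: entry (k, j) = c_k * cos(pi*(j+0.5)*k/n)."""
    k = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (j + 0.5) * k / n)
    m[0] /= np.sqrt(2)                      # DC row scaling keeps T orthonormal
    return m

def mse_after_compression(img, retain_fraction):
    """Zero all but the largest-magnitude fraction of the 2-D transform
    coefficients, reconstruct, and return the mean squared error.
    Assumes a square image."""
    T = dct_matrix(img.shape[0])
    coeffs = T @ img @ T.T                  # separable 2-D forward transform
    k = max(1, int(round(retain_fraction * coeffs.size)))
    thresh = np.sort(np.abs(coeffs).ravel())[-k]    # k-th largest magnitude
    kept = np.where(np.abs(coeffs) >= thresh, coeffs, 0.0)
    recon = T.T @ kept @ T                  # inverse transform: T is orthonormal
    return float(np.mean((img - recon) ** 2))

rng = np.random.default_rng(0)
img = rng.random((64, 64)) * 255.0          # stand-in "image"
for pct_retained in (5, 10, 15, 20, 25, 30):
    print(pct_retained, mse_after_compression(img, pct_retained / 100))
```

Swapping `dct_matrix` for a Walsh, Hartley or Kekre transform matrix would, under the same thresholding scheme, reproduce the other rows of Table 9 in the same way.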








Figure 4: Comparison of average MSE with respect to 95% to 70% of data compressed using DCT wavelet, Walsh wavelet, Hartley wavelet, Kekre wavelet,
                                                     DCT, Walsh, Hartley and Kekre transform.




                        Figure 5: Results of Balls image obtained from DCT wavelet for 70% to 95% of data compressed.








Figure 6: Results of Balls image obtained from Walsh wavelet for 70% to 95% of data compressed.




Figure 7: Results of Balls image obtained from Hartley wavelet for 70% to 95% of data compressed.








                               Figure 8: Results of Balls image obtained from Kekre wavelet for 70% to 95% of data compressed.

From Table 9, it is observed that the performance of all wavelet transforms is better than that of their respective orthogonal transforms, as indicated by lower MSE values. Figure 4 compares the average MSE with respect to 95% to 70% of data compressed using the DCT wavelet, Walsh wavelet, Hartley wavelet, Kekre wavelet, and the DCT, Walsh, Hartley and Kekre transforms.

                             VI.   CONCLUSION

In this paper, a novel orthogonal wavelet transform generation method is proposed. The proposed method can be used to generate a wavelet transform from any orthogonal transform. To test the efficiency of the wavelet transforms, they are applied to eleven different color images for the purpose of data compression. The orthogonal transforms used in this paper are DCT, Walsh, Hartley and Kekre. From the results, it can be concluded that the wavelet transforms outperform their respective orthogonal transforms, as indicated by lower MSE values.

                               REFERENCES

[1]  K. P. Soman and K. I. Ramachandran, "Insight into Wavelets: From Theory to Practice", Prentice-Hall India, pp. 3-7, 2005.
[2]  Raghuveer M. Rao and Ajit S. Bopardikar, "Wavelet Transforms: Introduction to Theory and Applications", Addison Wesley Longman, pp. 1-20, 1998.
[3]  C. S. Burrus, R. A. Gopinath, and H. Guo, "Introduction to Wavelets and Wavelet Transforms", Prentice-Hall International, Inc., New Jersey, 1998.
[4]  Amara Graps, "An Introduction to Wavelets", IEEE Computational Science and Engineering, vol. 2, no. 2, Summer 1995, USA.
[5]  Julius O. Smith III and Xavier Serra, "An Analysis/Synthesis Program for Non-Harmonic Sounds Based on a Sinusoidal Representation", Proceedings of the International Computer Music Conference (ICMC-87, Tokyo), Computer Music Association, 1987.
[6]  S. Mallat, "A Theory for Multiresolution Signal Decomposition: The Wavelet Representation", IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 11, pp. 674-693, 1989.
[7]  G. Strang, "Wavelet Transforms Versus Fourier Transforms", Bull. Amer. Math. Soc., vol. 28, pp. 288-305, 1993.
[8]  P. P. Kanjilal, "Adaptive Prediction and Predictive Control", IET, p. 210, 1995.
[9]  C. Yuen, "Remarks on the Ordering of Walsh Functions", IEEE Transactions on Computers, vol. C-21, p. 1452, 1972.
[10] R. V. L. Hartley, "A More Symmetrical Fourier Analysis Applied to Transmission Problems", Proc. IRE, vol. 30, pp. 144-150, 1942.
[11] H. B. Kekre, Sudeep D. Thepade, "Image Retrieval using Non-Involutional Orthogonal Kekre's Transform", International Journal of Multidisciplinary Research and Advances in Engineering (IJMRAE), Ascent Publication House, Volume 1, No. I, 2009. Abstract available online at www.ascent-journals.com
[12] H. B. Kekre, Sudeep D. Thepade, "Image Blending in Vista Creation using Kekre's LUV Color Space", SPIT-IEEE Colloquium and Int. Conference, SPIT, Andheri, Mumbai, 04-05 Feb 2008.
[13] H. B. Kekre, Sudeep D. Thepade, "Boosting Block Truncation Coding using Kekre's LUV Color Space for Image Retrieval", WASET Int. Journal of Electrical, Computer and System Engineering (IJECSE), Vol. 2, No. 3, Summer 2008. Available online at www.waset.org/ijecse/v2/v2-3-23.pdf
[14] H. B. Kekre, Sudeep D. Thepade, "Color Traits Transfer to Grayscale Images", in Proc. of IEEE First International Conference on Emerging Trends in Engg. & Technology (ICETET-08), G. H. Raisoni COE, Nagpur, India. Available on IEEE Xplore.
[15] H. B. Kekre, Sudeep D. Thepade, "Creating the Color Panoramic View using Medley of Grayscale and Color Partial Images", WASET Int. Journal of Electrical, Computer and System Engg. (IJECSE), Volume 2, No. 3, Summer 2008. Available online at www.waset.org/ijecse/v2/v2-3-26.pdf
[16] Dr. H. B. Kekre, Sudeep D. Thepade, Adib Parkar, "A Comparison of Haar Wavelets and Kekre's Wavelets for Storing Colour Information in a Greyscale Image", International Journal of Computer Applications (IJCA), Volume 1, Number 11, December 2010, pp. 32-38. Available at www.ijcaonline.org/archives/volume11/number11/1625-2186
[17] Dr. H. B. Kekre, Sudeep D. Thepade, Adib Parkar, "Storage of Colour Information in a Greyscale Image using Haar Wavelets and Various Colour Spaces", International Journal of Computer Applications (IJCA), Volume 6, Number 7, pp. 18-24, September 2010. Available online at http://www.ijcaonline.org/volume6/number7/pxc3871421.pdf
[18] Dr. H. B. Kekre, Sudeep D. Thepade, Juhi Jain, Naman Agrawal, "IRIS Recognition using Texture Features Extracted from Walshlet Pyramid", ACM International Conference and Workshop on Emerging Trends in Technology (ICWET 2011), Thakur College of Engg. and Tech., Mumbai, 26-27 Feb 2011. Also to be uploaded on the online ACM Portal.
[19] Dr. H. B. Kekre, Sudeep D. Thepade, Akshay Maloo, "Face Recognition using Texture Features Extracted from Walshlet Pyramid", ACEEE International Journal on Recent Trends in Engineering and Technology (IJRTET), Volume 5, Issue 1, www.searchdl.org/journal/IJRTET2010
[20] Dr. H. B. Kekre, Sudeep D. Thepade, Juhi Jain, Naman Agrawal, "Performance Comparison of IRIS Recognition Techniques using Wavelet Pyramids of Walsh, Haar and Kekre Wavelet Transforms", International Journal of Computer Applications (IJCA), Number 2, Article 4, March 2011. http://www.ijcaonline.org/proceedings/icwet/number2/2070-aca386
[21] Dr. H. B. Kekre, Sudeep D. Thepade, Akshay Maloo, "Face Recognition using Texture Features Extracted from Haarlet Pyramid", International Journal of Computer Applications (IJCA), Volume 12, Number 5, December 2010, pp. 41-45. Available at www.ijcaonline.org/archives/volume12/number5/1672-2256
[22] Dr. H. B. Kekre, Sudeep D. Thepade, Juhi Jain, Naman Agrawal, "IRIS Recognition using Texture Features Extracted from Haarlet Pyramid", International Journal of Computer Applications (IJCA), Volume 11, Number 12, December 2010, pp. 1-5. Available at www.ijcaonline.org/archives/volume11/number12/1638-2202
[23] Dr. H. B. Kekre, Sudeep D. Thepade, Akshay Maloo, "Performance Comparison of Image Retrieval Techniques using Wavelet Pyramids of Walsh, Haar and Kekre Transforms", International Journal of Computer Applications (IJCA), Volume 4, Number 10, August 2010 Edition, pp. 1-8. http://www.ijcaonline.org/archives/volume4/number10/866-1216
[24] Dr. H. B. Kekre, Sudeep D. Thepade, Akshay Maloo, "Query by Image Content using Color Texture Features Extracted from Haar Wavelet Pyramid", International Journal of Computer Applications (IJCA), special edition on "Computer Aided Soft Computing Techniques for Imaging and Biomedical Applications", Number 2, Article 2, August 2010. http://www.ijcaonline.org/specialissues/casct/number2/1006-41
[25] Dr. H. B. Kekre, Sudeep D. Thepade, "Image Retrieval using Color-Texture Features Extracted from Walshlet Pyramid", ICGST International Journal on Graphics, Vision and Image Processing (GVIP), Volume 10, Issue I, Feb. 2010, pp. 9-18. Available online at www.icgst.com/gvip/Volume10/Issue1/P1150938876.html

                            AUTHORS PROFILE

Dr. H. B. Kekre has received B.E. (Hons.) in Telecomm. Engineering from Jabalpur University in 1958, M.Tech (Industrial Electronics) from IIT Bombay in 1960, M.S. Engg. (Electrical Engg.) from University of Ottawa in 1965 and Ph.D. (System Identification) from IIT Bombay in 1970. He has worked as Faculty of Electrical Engg. and then HOD Computer Science and Engg. at IIT Bombay. For 13 years he was working as a professor and head in the Department of Computer Engg. at Thadomal Shahani Engineering College, Mumbai. Now he is Senior Professor at MPSTME, SVKM's NMIMS. He has guided 17 Ph.D.s, more than 100 M.E./M.Tech and several B.E./B.Tech projects. His areas of interest are Digital Signal Processing, Image Processing and Computer Networking. He has more than 270 papers in National/International Conferences and Journals to his credit. He was Senior Member of IEEE. Presently he is Fellow of IETE and Life Member of ISTE. Recently 11 students working under his guidance have received best paper awards. Two of his students have been awarded Ph.D. from NMIMS University. Currently he is guiding ten Ph.D. students.

Dr. Tanuja K. Sarode has received B.Sc. (Mathematics) from Mumbai University in 1996, B.Sc.Tech. (Computer Technology) from Mumbai University in 1999, the M.E. (Computer Engineering) degree from Mumbai University in 2004, and Ph.D. from Mukesh Patel School of Technology, Management and Engineering, SVKM's NMIMS University, Vile-Parle (W), Mumbai, India. She has more than 12 years of experience in teaching. Currently she is working as Assistant Professor in the Dept. of Computer Engineering at Thadomal Shahani Engineering College, Mumbai. She is a life member of IETE and a member of the International Association of Engineers (IAENG) and the International Association of Computer Science and Information Technology (IACSIT), Singapore. Her areas of interest are Image Processing, Signal Processing and Computer Graphics. She has 90 papers in National/International Conferences/Journals to her credit.

Sudeep D. Thepade has received the B.E. (Computer) degree from North Maharashtra University with Distinction in 2003 and M.E. in Computer Engineering from University of Mumbai in 2008 with Distinction, and has currently submitted his thesis for Ph.D. at SVKM's NMIMS, Mumbai. He has more than 08 years of experience in teaching and industry. He was a Lecturer in the Dept. of Information Technology at Thadomal Shahani Engineering College, Bandra (W), Mumbai for nearly 04 years. Currently he is working as Associate Professor in Computer Engineering at Mukesh Patel School of Technology Management and Engineering, SVKM's NMIMS, Vile Parle (W), Mumbai, India. He is a member of the International Association of Engineers (IAENG) and the International Association of Computer Science and Information Technology (IACSIT), Singapore. He is a member of the International Advisory Committee for many International Conferences and a reviewer for various International Journals. His areas of interest are Image Processing Applications and Biometric Identification. He has about 110 papers in National/International Conferences/Journals to his credit, with a Best Paper Award at the International Conference SSPCCIN-2008, Second Best Paper Award at the ThinkQuest-2009 National Level paper presentation competition for faculty, Best Paper Award at the Springer international conference ICCCT-2010 and second best research project award at 'Manshodhan-2010'.

Ms. Sonal Shroff has received B.Sc. (Physics) from University of Mumbai in 1996 and B.Sc.Tech. (Computer Technology) from University of Mumbai in 1999. She has more than 10 years of experience in teaching. Currently she is working as Lecturer in the Dept. of Computer Engineering at Thadomal Shahani Engineering College. She is a life member of ISTE. Her areas of interest are Image Processing, Signal Processing and Computer Graphics.

 Analysing Assorted Window Sizes with LBG and
KPE Codebook Generation Techniques for Grayscale
              Image Colorization
    Dr. H. B.Kekre                   Dr. Tanuja K. Sarode                 Sudeep D. Thepade                     Ms. Supriya Kamoji
Sr. Professor,MPSTME,                   Asst. Professor,                    Asst. Professor,                        Sr.Lecturer,
 NMIMS Deemed-to-be                 Thadomal Shahani Engg.                    MPSTME,                          Fr.Conceicao Rodrigues
University,Vileparle (W),                   College,                     NMIMS Deemed-to-be                       College of Engg,
   Mumbai-56, India.                Bandra (W), Mumbai-50,              University,Vileparle (W),                   Bandra (W),
                                             India.                        Mumbai-56, India.                      Mumbai-50, India.
Abstract—This paper presents the use of assorted window sizes and their impact on colorization of grayscale images using Vector Quantization (VQ) codebook generation techniques. The problem of coloring a grayscale image has no exact solution; the attempt here is to minimize the human effort needed in manually coloring grayscale images. Human interaction is required only to find a reference image of a similar type; the job of transferring color from the reference image to the grayscale image is done by the proposed techniques. The vector quantization algorithms Linde-Buzo-Gray (LBG) and Kekre's Proportionate Error (KPE) are used to generate a color palette in RGB and in Kekre's LUV color space. For colorization, a source color image is taken as the reference image and divided into non-overlapping pixel windows. Initial clusters are formed using the VQ algorithms LBG and KPE and are used to generate the color palette. The grayscale image to be colored is also divided into non-overlapping pixel windows, and every pixel window of the gray image is compared with the color palette to get the nearest color values; the best match is found using least mean squared error. To test the performance of these algorithms, a color image is converted into a grayscale image and the same grayscale image is recolored back; finally the MSE between the recolored image and the original image is computed. The experiment is conducted in both RGB and Kekre's LUV color space for pixel windows of size 1x2, 2x1, 2x2, 2x3, 3x2, 3x3, 1x3, 3x1, 2x4, 4x2, 1x4 and 4x1. However, Kekre's LUV color space gives outstanding performance. Among the different pixel windows, KPE with a 1x2 and LBG with a 2x1 pixel window perform well with respect to image quality.

   Keywords- Colorization, Pixel Window, Color Palette, Vector Quantization (VQ), LBG, KPE.

                          I.   INTRODUCTION

    Colors always provide clearer information than grayscale digital images. Colorization is the art of adding color to a monochrome image or movie. The colors we perceive in an object are determined by the nature of the light reflected from the object. Due to the structure of the human eye, all colors are seen as variable combinations of the three basic colors Red, Green and Blue (RGB). The task of coloring a grayscale image involves assigning RGB values to an image which varies along only the luminance value. Since different colors may have the same luminance but vary in hue and saturation, the problem of coloring grayscale images needs human interaction [1].

    A grayscale image is represented by only the luminance values that can be matched between the two images. Because a single luminance value could represent entirely different parts of an image, the remaining values within the pixel's neighborhood are used to guide the matching process. Once a pixel is matched, the color information is transferred but the original luminance value is retained [2].

    The details in a color image can be utilized for analysis and study of a particular image in applications like medical tomography, information security, image segmentation, etc. Coloring of old black-and-white movies and of rare images of monuments and celebrities is one of the best applications, giving a good feel and understanding.

    In the case of pseudo-coloring [3], where the mapping of luminance values to color values is automatic, the choice of color map is commonly determined by human decision. The main concept of colorization techniques exploits textural information. The work of Welsh et al., inspired by color transfer [4] and by image analogies [5], examines the luminance values in the neighborhood of each pixel in the target image and adds to its luminance the chromatic information of the pixel from a source image with the best matching neighborhood. This technique works on images where differently colored regions give rise to distinct textures; otherwise, the user must specify rectangular swatches indicating corresponding regions in the two images.

    Color traits transfer to grayscale images [6] presents a novel coloring technique where a color palette is prepared using pixel windows of some degree taken from the reference color image. For every window of the grayscale image, the palette is searched for equivalent color values which could be used to color the grayscale window [19].

    In this paper, adjacent pixels are grouped together to form a grid (pixel window). The Vector Quantization algorithms LBG and KPE are applied on the different pixel window sizes 1x2, 2x1, 2x2, 2x3, 3x2, 3x3, 1x3, 3x1, 2x4, 4x2, 1x4 and 4x1, and a codebook of size 512 is obtained. Depending on minimum Euclidean distance, the LUV




components of the reference image are transferred to the input gray image.

             II. KEKRE'S LUV COLOR SPACE [15,21]

    In the proposed technique Kekre's LUV color space is used, where L gives the luminance and U and V give the chromaticity values of a color image. Negative values of U indicate prominence of the red component in the color image and negative values of V indicate prominence of the green component. The RGB-to-LUV and LUV-to-RGB conversion matrices are given in equations (1) and (2) respectively.

    [L]   [ 1  1  1]   [R]
    [U] = [-2  1  1] * [G]                     (1)
    [V]   [ 0 -1  1]   [B]

    [R]   [1 -2  0]   [L/3]
    [G] = [1  1 -1] * [U/6]                    (2)
    [B]   [1  1  1]   [V/2]

                III. VECTOR QUANTIZATION

    Vector Quantization (VQ) [7],[8] is an efficient but lossy technique for compression of data and has been successfully used in various applications like pattern recognition [11], speech recognition and face detection [12],[13], image segmentation [14], speech data compression [16], content based image retrieval (CBIR) [17],[18], etc.

    Vector Quantization can be defined as a mapping function that maps a k-dimensional vector space to a finite set CB = {C1, C2, C3, ..., CN}. The set CB is called the codebook, consisting of N codevectors, where each codevector Ci = {ci1, ci2, ci3, ..., cik} is of dimension k. The key to VQ is a good codebook, which can be generated in the spatial domain by clustering algorithms.

    In the color transfer phase, the image is divided into non-overlapping blocks and each block is converted to a training vector Xi = (xi1, xi2, ..., xik). The codebook is then searched for the nearest codevector Cmin by computing the squared Euclidean distance of vector Xi with all the codevectors of the codebook CB, as presented in equation (3). This method is called exhaustive search (ES).

    d(Xi, Cmin) = min{d(Xi, Cj)}, 1 <= j <= N              (3)
    where d(Xi, Cj) = sum over p = 1, ..., k of (xip - cjp)^2

It is obvious that if the codebook size is increased to reduce the distortion, the searching time will also increase. The following subsections describe the VQ codebook generation algorithms.

A. Linde, Buzo and Gray Algorithm (LBG) [7,8]

    In this algorithm the centroid, computed as the average of the training set, is taken as the first codevector. As shown in Figure 1, two vectors v1 and v2 are generated by adding a constant error to this codevector. The Euclidean distances of all the training vectors from v1 and v2 are computed, and two clusters are formed based on closeness to v1 or v2. This procedure is repeated for every cluster. The shortcoming of this algorithm is that the cluster elongation is at +135° to the horizontal axis in the two-dimensional case, resulting in inefficient clustering.

               Figure 1. LBG for the two-dimensional case.

B. Kekre's Proportionate Error (KPE) Algorithm [9,10]

    Here, to generate the two vectors v1 and v2, a proportionate error is added to the codevector; the magnitudes of the elements of the codevector decide the error ratio. Thereafter the procedure is the same as that of LBG. While adding the proportionate error, a safeguard is also introduced so that neither v1 nor v2 goes beyond the training vector space, eliminating the disadvantage of LBG. Fig. 2 shows the cluster elongation after adding the proportionate error.

       Figure 2. Orientation of the line joining two vectors v1 and v2 after
              addition of proportionate error to the centroid.

              IV. PROPOSED COLORING TECHNIQUE

    Since the coloring problem always requires human interaction, a reference image of the same class and with the same features as the input grayscale image is chosen. The color transfer algorithm is discussed for the LUV color space for different m x n pixel grid sizes. The main steps of the algorithm for color transfer are:
     •   Convert the RGB components of the source color image into the respective Kekre's LUV color components.
     •   Divide the image into blocks of m x n pixels. Hence an m x n x 3 dimensional training vector set corresponding to the LUV components of each pixel is obtained. On this set the LBG and KPE algorithms are applied and a color palette, i.e. a codebook of size 512, is generated.
     •   Divide the input gray image into m x n blocks of pixels. Each block (pixel window) is searched for the nearest codevector of the color palette. While searching, only the luminance is compared.
     •   Once the nearest match is obtained, the gray image pixel window is replaced by the LUV codevector.
     •   The final colored image in the LUV domain is then converted into the RGB plane, and the MSE of the original color image and the recolored image is calculated.
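As an illustration, the color-space conversions used in the first and last steps (equations (1) and (2)) can be sketched in Python. This is our own sketch, not code from the paper; NumPy is assumed, and the function names are hypothetical:

```python
import numpy as np

# Kekre's LUV here is the integer-matrix transform of equations (1)-(2),
# not CIE LUV. Matrices are written row-wise as in the paper.
RGB2LUV = np.array([[ 1,  1, 1],
                    [-2,  1, 1],
                    [ 0, -1, 1]], dtype=float)

LUV2RGB = np.array([[1, -2,  0],
                    [1,  1, -1],
                    [1,  1,  1]], dtype=float)

def rgb_to_kekre_luv(rgb):
    """Map an (..., 3) RGB array to Kekre's LUV per equation (1)."""
    return np.asarray(rgb, dtype=float) @ RGB2LUV.T

def kekre_luv_to_rgb(luv):
    """Invert via equation (2): RGB = M * [L/3, U/6, V/2]."""
    scaled = np.asarray(luv, dtype=float) / np.array([3.0, 6.0, 2.0])
    return scaled @ LUV2RGB.T
```

A round trip through both helpers reproduces the original RGB values exactly, which is a quick way to confirm that equation (2) is the inverse of equation (1).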
                         V. RESULTS

    The algorithms discussed above are implemented using MATLAB 7.0 on a Pentium IV, 1.66 GHz, with 1 GB RAM. To test the performance of these algorithms we have converted each color image to a grayscale image and recolored the same gray image back. Finally the MSE between the original image and the colored image is computed. Five color images of size 128x128x3 belonging to different classes are used.
    Figures 3 to 6 show the results of LBG and KPE for the Zebra, Book, Cartoon and Face images, considering the same image as the reference image. Figures 7 and 8 show the results of LBG and KPE for the Scenery and Dog images, considering a different image as the reference image.

   Figure 3. Reconstruction of the Zebra grayscale image using a similar source
   image for pixel window 1x2: (a) original image, (b) gray image, (c) 1x2 grid
   LBG (MSE 178.4), (d) 1x2 grid KPE (MSE 73.85).

   Figure 4. Reconstruction of the Book grayscale image using a similar source
   image for pixel window 1x2: (a) original image, (b) gray image, (c) 1x2 grid
   LBG (MSE 73.8), (d) 1x2 grid KPE (MSE 53.32).

   Figure 5. Reconstruction of the Cartoon grayscale image using a similar source
   image for pixel window 1x2: (a) original image, (b) gray image, (c) 1x2 grid
   LBG (MSE 1260), (d) 1x2 grid KPE (MSE 1023).

   Figure 6. Reconstruction of the Face grayscale image using a similar source
   image for pixel window 1x2: (a) original image, (b) gray image, (c) 1x2 grid
   LBG (MSE 92), (d) 1x2 grid KPE (MSE 81).

   Figure 7. Reconstruction of the Scenery grayscale image using a different
   source image: (a) original image, (b) reference image, (c) gray image,
   (d) 1x2 grid LBG (MSE 990), (e) 1x2 grid KPE (MSE 709).

   Figure 8. Reconstruction of the Dog grayscale image using a different source
   image: (a) original image, (b) reference image, (c) gray image, (d) 1x2 grid
   LBG (MSE 303), (e) 1x2 grid KPE (MSE 285).

    Various images, each of size 128x128 pixels, were used to build the color palette, and their grayscale equivalents were colored using the color palette for various pixel windows. Fig. 9 shows a bar chart of the average mean squared error obtained across all five images, with respect to the initial few pixel windows, for the RGB and Kekre's LUV color spaces. It is observed that Kekre's LUV color space gives less MSE compared to the RGB color space. Hence in Table I only the Kekre's LUV color space results for the different images using 12 varying pixel window sizes (1x2, 2x1, 2x2, 2x3, 3x2, 3x3, 1x3, 3x1, 2x4, 4x2, 1x4, 4x1) are given.

   Figure 9. Average MSE across various grid sizes for different color spaces.
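The MSE values quoted in the figures above and in Table I compare the original color image with its recolored version. A minimal sketch of this criterion (the helper name is ours, assuming NumPy arrays of identical shape):

```python
import numpy as np

def mse(original, recolored):
    """Mean squared error over all pixels and color channels."""
    original = np.asarray(original, dtype=float)
    recolored = np.asarray(recolored, dtype=float)
    return float(np.mean((original - recolored) ** 2))
```

A lower MSE means the recolored image is numerically closer to the original; it says nothing by itself about perceptual quality, which is why the paper also shows the recolored images.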





         Table I. Results of LBG and KPE for five color images of size 128x128x3 from different categories.

            Input     VQ                                                   Grid Sizes
           Images     Alg.
                               1x2     2x1       2x2          2x3    3x2         3x3       2x4      4x2      1x3          3x1       1x4       4x1
           Image1    LBG      92.70    87.11    107.42    116.75    114.65       2302     5576     5652     107.32        106      122.29     114.
                     KPE      81.59    77.49    89.141    90.75     95.83        2122     5557     5663      87.32       89.09      91.41     87.3
           Image2    LBG      1260     1244      1056      1493      1420        6350     6928     9595      1251        1243       1532      1392
                     KPE      1023     1153      1363      1233      1150        6243     6725     9359      1278        1138       1013      1093
           Image3    LBG      178.4     147     440.90    721.73    675.15       2003     2486     4236       373         286      610.66     465.7
                     KPE      73.85    76.12    225.10    461.69    441.62       1823     2264     4663     143.5         142      273.38     226.0
           Image4    LBG      73.89    76.64    107.12    123.82    131.86       2340     3400     3664       90          97       116.92     128.9
                     KPE      53.32    52.73    79.094    103.43    95.439       2813     3371     3701     65.3          65       73.93      76.09
           Image5    LBG      1203     1244      1244      1406      1388        6340     7833     7916     1246         1240      1274       1266
                     KPE      1178     1174      1182      1414      1399        6384     7935     7968     1193         1175      1209       1212
            Average LBG       561.5    559.7      591     772.26    745.9        3867     5244     6212      613          594       731        673
            Average KPE       481.9    506.6     587.8    660.5     636.3        3877     5170     6270     553.4        521.8     532.1      538.8
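For reference, the binary-split codebook generation of Section III that produced these palettes can be sketched as follows. This is an illustrative simplification, not the authors' code: LBG splits each codevector with a constant error, while the KPE-like variant scales the error with the codevector's magnitude; the safeguard keeping v1 and v2 inside the training space is omitted. NumPy is assumed and the function name is ours:

```python
import numpy as np

def split_codebook(training, size, proportionate=False, eps=1e-3, iters=10):
    """Grow a codebook by repeated binary splits (LBG-style sketch).

    proportionate=False adds a constant error (LBG); True scales the
    error by the codevector's own magnitudes (KPE-like).
    """
    codebook = [training.mean(axis=0)]  # centroid is the first codevector
    while len(codebook) < size:
        new = []
        for c in codebook:
            err = eps * (np.abs(c) + 1.0) if proportionate else eps
            new.extend([c + err, c - err])  # the two split vectors v1, v2
        cb = np.array(new)
        # a few rounds of nearest-neighbour reassignment and re-centering
        for _ in range(iters):
            d = ((training[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            for j in range(len(cb)):
                members = training[labels == j]
                if len(members):
                    cb[j] = members.mean(axis=0)
        codebook = list(cb)
    return np.array(codebook)
```

With a power-of-two target such as the paper's 512, each pass exactly doubles the codebook until the desired size is reached.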


From the data given in Table I, it is seen that the performance gradually decreases as the pixel window size increases. Further, the MSE for unidirectional pixel windows is lower than that for bidirectional ones. Pixel window sizes 1x2 and 2x1 show better results as compared to larger pixel window sizes.
Fig. 10 shows the comparison of the average mean squared error obtained across all images in Kekre's LUV color space for the top five pixel windows. It can be seen from the chart that KPE performs well with respect to LBG. Also, performance deteriorates as the pixel window size increases and becomes bidirectional.

             Figure 10. Average MSE across various grid sizes.

                         VI. CONCLUSION

  In this paper, the idea of colorization of grayscale images using VQ codebook generation techniques is presented, using two well known codebook generation algorithms, namely LBG and KPE. For both algorithms 12 assorted pixel window sizes are considered for preparing the color palettes. As the quality of colorization depends on the source color image and the grayscale image to be colorized, the grayscale versions of 5 color images are recolored using in total 48 variations of the proposed technique, with 2 color spaces (RGB and Kekre's LUV), 12 pixel window sizes and 2 codebook generation techniques (LBG and KPE). The comparison of the original color images and the recolored images has shown that Kekre's LUV color space outperforms the RGB color space. Further, it can be observed from the results that unidirectional pixel windows give better colorization than bidirectional pixel window sizes. KPE performs better than LBG for colorization. In all, the best performance is shown by KPE with the 1x2 window size in Kekre's LUV color space.

                          REFERENCES
  [1]  V. Karthikeyani, K. Duraisamy, P. Kamalakkannan, "Conversion of grayscale image to color image with and without texture synthesis", IJCSNS International Journal of Computer Science and Network Security, Vol. 7, No. 4, April 2007.
  [2]  E. Reinhard, M. Ashikhmin, B. Gooch and P. Shirley, "Color transfer between images", IEEE Computer Graphics and Applications, 21(5), pp. 34-41.
  [3]  Rafael C. Gonzalez and Paul Wintz, "Digital Image Processing", Addison-Wesley, May 1987.
  [4]  A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless and D. H. Salesin, "Image Analogies", in the proceedings of ACM SIGGRAPH 2002, pp. 341-346.
  [5]  G. Di Blasi and R. D. Reforgiato, "Fast colourization of gray images", in proceedings of Eurographics Italian Chapter, 2003.
  [6]  H. B. Kekre, Sudeep D. Thepade, "Color traits transfer to gray scale images", in Proc. of IEEE International Conference on Emerging Trends in Engineering and Technology, ICETET 2008, Raisoni College of Engg., Nagpur.
  [7]  R. M. Gray, "Vector quantization", IEEE ASSP Magazine, pp. 4-29, April 1984.
  [8]  Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design," IEEE Trans. Commun., vol. COM-28, no. 1, pp. 84-95, 1980.
  [9]  H. B. Kekre, Tanuja K. Sarode, "New Fast Improved Codebook Generation Algorithm for Color Images using Vector Quantization," International Journal of Engineering and Technology, vol. 1, no. 1, pp. 67-77, September 2008.
  [10] H. B. Kekre, Tanuja K. Sarode, "An Efficient Fast Algorithm to Generate Codebook for Vector Quantization," First International Conference on Emerging Trends in Engineering and Technology, ICETET-2008, held at Raisoni College of Engineering, Nagpur, India, July 2008. Available online at IEEE Xplore.
  [11] Ahmed A. Abdelwahab, Nora S. Muharram, "A Fast Codebook Design Algorithm Based on a Fuzzy Clustering Methodology", International Journal of Image and Graphics, vol. 7, no. 2, pp. 291-302, 2007.
  [12] Chin-Chen Chang, Wen-Chuan Wu, "Fast Planar-Oriented Ripple Search Algorithm for Hyperspace VQ Codebook", IEEE Transactions on Image Processing, vol. 16, no. 6, June 2007.
  [13] C. Garcia and G. Tziritas, "Face detection using quantized skin color regions merging and wavelet packet analysis," IEEE Trans. Multimedia, vol. 1, no. 3, pp. 264-277, Sep. 1999.
  [14] H. B. Kekre, Tanuja K. Sarode, Bhakti Raul, "Color Image Segmentation using Kekre's Fast Codebook Generation Algorithm Based on Energy Ordering Concept", ACM International Conference on Advances in Computing, Communication and Control (ICAC3-2009), 23-24 Jan 2009, Fr. Conceicao Rodrigues College of Engg., Mumbai. Available on online ACM portal.




  [15] H. B. Kekre, Sudeep D. Thepade, "Image Blending in Vista Creation using Kekre's LUV Color Space", in Proc. of SPIT-IEEE Colloquium, Mumbai, Feb 4-5, 2008.
  [16] H. B. Kekre, Tanuja K. Sarode, "Speech Data Compression using Vector Quantization", WASET International Journal of Computer and Information Science and Engineering (IJCISE), vol. 2, no. 4, pp. 251-254, Fall 2008. Available: http://www.waset.org/ijcise.
  [17] H. B. Kekre, Tanuja K. Sarode, Sudeep D. Thepade, "Image Retrieval using Color-Texture Features from DCT on VQ Codevectors obtained by Kekre's Fast Codebook Generation", ICGST-International Journal on Graphics, Vision and Image Processing (GVIP), Volume 9, Issue 5, pp. 1-8, September 2009. Available online at http://www.icgst.com/gvip/Volume9/Issue5/P1150921752.html.
  [18] H. B. Kekre, Tanuja K. Sarode, Sudeep D. Thepade, "Color-Texture Feature based Image Retrieval using DCT applied on Kekre's Median Codebook", International Journal on Imaging (IJI). Available online at www.ceser.res.in/iji.html.
  [19] H. B. Kekre, Sudeep D. Thepade, Nikita Bhandari, "Colorization of Greyscale Images using Kekre's Biorthogonal Color Spaces and Kekre's Fast Codebook Generation", CSC Advances in Multimedia: An International Journal (AMIJ), Volume 1, Issue 3, pp. 49-58. Available at www.cscjournals.org/csc/manuscript/journals/AMIJ/volume1/Issue3/AMU-13.pdf.
  [20] H. B. Kekre, Sudeep D. Thepade, Adib Parkar, "A Comparison of Haar Wavelets and Kekre's Wavelets for Storing Colour Information in a Greyscale Image", International Journal of Computer Applications (IJCA), Volume 1, Number 11, December 2010, pp. 32-38. Available at www.ijcaonline.org/archives/volume11/number11/1625-2186.
  [21] H. B. Kekre, Sudeep D. Thepade, Archana Athawale, Adib Parkar, "Using Assorted Color Spaces and Pixel Window Sizes for Colorization of Grayscale Images", ACM International Conference and Workshop on Emerging Trends in Technology (ICWET 2010), Thakur College of Engg. and Tech., Mumbai, 26-27 Feb 2010.
  [22] H. B. Kekre, Sudeep Thepade, Adib Parkar, "A Comparison of Kekre's Fast Search and Exhaustive Search for Various Grid Sizes used for Coloring a Grayscale Image", Second International Conference on Signal Acquisition and Processing (ICSAP 2010), IACSIT, Bangalore, pp. 53-57, 9-10 Feb 2010.

                           Author Biographies

Dr. H. B. Kekre has received B.E. (Hons.) in Telecomm. Engineering from Jabalpur University in 1958, M.Tech (Industrial Electronics) from IIT Bombay in 1960, M.S.Engg. (Electrical Engg.) from University of Ottawa in 1965 and Ph.D. (System Identification) from IIT Bombay in 1970. He has worked as Faculty of Electrical Engg. and then HOD Computer Science and Engg. at IIT Bombay. For 13 years he worked as professor and head of the Department of Computer Engg. at Thadomal Shahani Engineering College, Mumbai. Now he is Senior Professor at MPSTME, SVKM's NMIMS. He has guided 17 Ph.D.s, more than 100 M.E./M.Tech and several B.E./B.Tech projects. His areas of interest are Digital Signal Processing, Image Processing and Computer Networking. He has more than 270 papers in National/International Conferences and Journals to his credit. He was Senior Member of IEEE. Presently he is Fellow of IETE and Life Member of ISTE. Recently 11 students working under his guidance have received best paper awards. Two of his students have been awarded Ph.D. from NMIMS University. Currently he is guiding ten Ph.D. students.

Dr. Tanuja K. Sarode has received B.Sc. (Mathematics) from Mumbai University in 1996, B.Sc.Tech. (Computer Technology) from Mumbai University in 1999, M.E. (Computer Engineering) from Mumbai University in 2004, and Ph.D. from Mukesh Patel School of Technology, Management and Engineering, SVKM's NMIMS University, Vile-Parle (W), Mumbai, INDIA. She has more than 12 years of experience in teaching. Currently she is working as Assistant Professor in the Dept. of Computer Engineering at Thadomal Shahani Engineering College, Mumbai. She is a life member of IETE and a member of the International Association of Engineers (IAENG) and the International Association of Computer Science and Information Technology (IACSIT), Singapore. Her areas of interest are Image Processing, Signal Processing and Computer Graphics. She has 90 papers in National/International Conferences/Journals to her credit.

Sudeep D. Thepade has received the B.E. (Computer) degree from North Maharashtra University with Distinction in 2003 and M.E. in Computer Engineering from University of Mumbai in 2008 with Distinction, and has currently submitted his thesis for Ph.D. at SVKM's NMIMS, Mumbai. He has more than 08 years of experience in teaching and industry. He was Lecturer in the Dept. of Information Technology at Thadomal Shahani Engineering College, Bandra (W), Mumbai for nearly 04 years. Currently he is working as Associate Professor in Computer Engineering at Mukesh Patel School of Technology Management and Engineering, SVKM's NMIMS, Vile Parle (W), Mumbai, INDIA. He is a member of the International Association of Engineers (IAENG) and the International Association of Computer Science and Information Technology (IACSIT), Singapore. He is a member of the International Advisory Committee for many International Conferences and a reviewer for various International Journals. His areas of interest are Image Processing Applications and Biometric Identification. He has about 110 papers in National/International Conferences/Journals to his credit, with a Best Paper Award at the International Conference SSPCCIN-2008, Second Best Paper Award at the ThinkQuest-2009 National Level paper presentation competition for faculty, Best Paper Award at the Springer International Conference ICCCT-2010 and Second Best Research Project Award at Manshodhan-2010.

Supriya Kamoji has received B.E. in Electronics and Communication Engineering with Distinction from Karnataka University in 2001. She is currently pursuing M.E. from Thadomal Shahani College of Engineering, Mumbai, India. She has more than 8 years of teaching experience. Currently she is working as a Senior Lecturer in Fr. Conceicao Rodrigues College of Engineering, Mumbai, India. She is a lifetime member of the Indian Society for Technical Education (ISTE). Her areas of interest are Image Processing, Computer Organization and Architecture, and Distributed Computing.







      Evolving Fuzzy Classification Systems from
                   Numerical Data

   Pardeep Sandhu, Department of Electronics & Communication, Maharishi Markandeshwar University, Mullana, Haryana, INDIA (er.pardeepsandhu@gmail.com)
   Shakti Kumar, Computational Intelligence Laboratory, Institute of Science and Technology, Klawad, Haryana, INDIA (shaktik@gmail.com)
   Himanshu Sharma, Department of Electronics & Communication, Maharishi Markandeshwar University, Mullana, Haryana, INDIA (himanshu.zte@gmail.com)
   Parvinder Bhalla, Computational Intelligence Laboratory, Institute of Science and Technology, Klawad, Haryana, INDIA (parvinderbhalla@gmail.com)


Abstract — Fuzzy classifiers are an important class of fuzzy systems. Evolving fuzzy classifiers from numerical data has assumed a lot of significance in the recent past. This paper proposes a method of evolving fuzzy classifiers using a three-step approach. In the first step, we applied a modified Fuzzy C–Means clustering technique to generate membership functions. In the second step, we generated a rule base using the Wang and Mendel algorithm. The third step was used to reduce the size of the generated rule base; this way the rule explosion issue was successfully tackled. The proposed method was implemented using MATLAB. The approach was tested on four very well known multi-dimensional benchmark classification data sets: the Iris, Wine, Glass and Pima Indian Diabetes data sets. The performance of the proposed method was very encouraging. We further applied our algorithm to a Mamdani type control model for a quick fuzzy battery charger data set. This integrated approach was able to evolve the model quickly.

   Keywords — Linguistic rules, Fuzzy classifier, Fuzzy logic, Rule base.

                     I.   INTRODUCTION

   The theory of fuzzy sets and fuzzy logic was introduced by Lotfi A. Zadeh through his seminal paper in 1965 [1]. Both fuzzy set theory and fuzzy logic act as a powerful methodology for dealing with imprecision and nonlinearity in an efficient way [2], [3]. As far as the need for fuzzy set theory is concerned, there are numerous situations in which the classical set theory of 0's and 1's is not sufficient to describe human reasoning. For such situations we need a more appropriate theory that can also define membership grades between '0' and '1', thereby providing better results in terms of human reasoning. Fuzzy set theory attempts to do this.

   Further, this theory of fuzzy logic leads to the development of fuzzy logic based systems: systems that are capable of making a decision on the basis of knowledge or intelligence provided to them through linguistic rule bases. When a particular combination of inputs is given to the system, it makes a decision and processes those inputs on the basis of the knowledge embedded into it in the form of linguistic rules. As the intelligence of these systems depends upon the linguistic rule base, these systems are also called Fuzzy Rule Based Systems (FRBSs) [4]. These systems have been successfully applied to a wide range of problems from different areas presenting uncertainty and vagueness in different ways [5], [6], [7]. FRBSs can be categorized as knowledge based systems and data driven systems, corresponding to the two ways of providing knowledge to a system. In the first type, called knowledge driven modeling, the rule base is provided by an expert who has complete knowledge of the domain, while in the second type, called data driven models, the rule base is generated from available numerical data [8].

   To automatically generate the rule base in data driven systems, a number of classical approaches like Hong and Lee's algorithm [9], the Wang and Mendel algorithm [4], [6], [10], [11], [12], the online learning algorithm [13], the multiphase clustering approach [14], and soft computing techniques like artificial neural networks [15], [16], [17], genetic algorithms [18], [19], swarm intelligence based techniques [20], ant colony optimization [21], particle swarm optimization [22], biogeography based optimization [23] and the Big Bang–Big Crunch optimization technique [24] are available in the literature [25].

   This paper is based on an integrated approach that makes use of a modified Fuzzy C–Means clustering approach (FCM) [26] and the Wang and Mendel method [6]. The approach was implemented in MATLAB for the fuzzy classification problems [27] of the Iris data of Fisher [28], the Wine data, the Glass data, the Pima Indian Diabetes (PID) data and the Battery Charger data (a control problem) [29]. A system was evolved using a set of training examples and its performance was then evaluated using a test data set. The system performances were evaluated in terms of Average Classification Rate (for the classification problems) and Mean Square Error (for the control problem).

   The paper is organized as follows: Section II introduces fuzzy logic based systems. Section III discusses the proposed integrated approach and the WM method for rule base generation. Section IV presents the result analysis along with a comparative study for the above mentioned standard data sets, and Section V includes conclusions.




              II.       FUZZY RULE BASED SYSTEMS

   Fuzzy logic is a mathematical approach to emulate the human way of thinking and learning [30]. This logic is an extension of classical set theory which says a fuzzy set is a class of objects with a continuum of grades of membership. Such a set is characterized by a membership function mapping the elements of a domain, space or universe of discourse 'U' to the interval [0, 1]. If 'U' is a collection of objects denoted by x, then a fuzzy set 'A' in the universe of discourse 'U' can be defined as a set of ordered pairs as shown in equation (1) [5], [8]:

   A = {(xi, μA(xi)) | xi ∈ U}                                    (1)

   Here xi refers to the ith element of the set and μA(xi) is the membership grade of xi in set 'A'.

   Fuzzy Logic Based Systems or Fuzzy Rule Based Systems (FRBS) are intelligent systems that are based on a mapping of input spaces to output spaces, where the way of representing this mapping is known as fuzzy linguistic rules. These intelligent systems provide a framework for representing and processing information in a way that resembles human communication and reasoning processes.

                     Figure 1. Fuzzy Logic System

   Each fuzzy rule based system typically possesses a fuzzy inference system (shown in Figure 1) composed of four major modules: the fuzzification module, inference engine, knowledge base and defuzzification module [31]. The fuzzification module performs the transformation of crisp inputs into fuzzy domain values. It is mainly done to find the belongingness of data points to different membership functions. The fuzzification can be performed either with the help of domain experts or directly from the available numerical data. These fuzzy domain values are then processed by the inference engine, which is composed of composition, implication and aggregation processes. The method of processing the inputs is supplied by the knowledge base and rule base module, as it contains the knowledge of the application domain and the procedural knowledge. Finally, the processed output of the inference engine is transformed from the fuzzy domain to the crisp domain by the defuzzification module.

   One of the biggest challenges in the field of modeling fuzzy rule based systems is the design of the rule base, as it is characterized by a set of IF–THEN linguistic rules. This rule base can be defined either by an expert or extracted from numerical data using any of the computerized techniques mentioned in section I. A rule in the fuzzy domain can be represented by equation (2):

   Rule: IF antecedent … THEN consequent …                        (2)

   The antecedent part provides the input variable conditions using IF statements and the consequent provides the output using THEN statements. For example, if X and Y are the input and output universes of discourse of a fuzzy system with a rule base of size 'N', then the ith rule will be of the form shown by equation (3):

   Rule i: IF x is Ai THEN y is Bi                                (3)

   where x and y represent input and output fuzzy linguistic variables respectively, and Ai ∈ X and Bi ∈ Y (1 ≤ i ≤ N) are fuzzy sets representing linguistic values of x and y [5].

   In Mamdani type systems the consequent is represented using fuzzy sets, while in Sugeno type systems it is a fuzzy singleton. Also, in TSK type systems it is a function of the inputs [23].

                     III.     PROPOSED APPROACH

   We first broke the system identification problem into three sub-problems and solved these one by one as follows:
   1. Classify all the relevant input and output domains into various membership functions using the modified FCM method [26].
   2. Apply the Wang and Mendel algorithm [6] for creating a fuzzy rule base, evolved as a combination of rules generated from numerical examples and linguistic rules supplied by human experts.
   3. Keep the number of rules to a bare minimum. We used a rule reduction technique as proposed in [32], [33] to keep the rule base as compact as possible.

   The backbone of this approach is the Wang and Mendel algorithm [6], which has proved to be very effective. Suppose the given set of desired input–output data pairs is:

   (x1(1), x2(1); y(1)), (x1(2), x2(2); y(2)), …                  (4)

   Here x1, x2 are inputs and y is the output. The problem formulation consists of generating fuzzy rules and using these rules to determine a mapping from the inputs (x1, x2) to the output (y).

   The following steps present our integrated approach:

   Step 1: Divide the input and output spaces into fuzzy regions:

   We divide the input spaces into the desired number of membership functions using modified FCM [26]. Assume that the domain intervals of inputs x1, x2 and output y (equation (4)) are [x1-, x1+], [x2-, x2+] and [y-, y+]; here, the domain interval means that the values of a particular variable will lie in this interval. Each of these input and output spaces is partitioned into (2N+1) regions. The number N can be different for each of the variables; e.g., for N = 2 a variable will have five membership functions [6].
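The (2N+1)-region partitioning of Step 1 can be sketched as follows. The paper places the membership functions with a modified FCM [26], whose details are not reproduced here; as a simplification, this sketch spaces (2N+1) triangular sets evenly over the domain interval, in the spirit of the original Wang–Mendel partitioning [6]. The interval [4.0, 8.0] is an illustrative domain, not a fitted one.

```python
# Sketch of Step 1 under the simplifying assumption of evenly spaced
# triangular membership functions (the paper derives the placement from
# a modified FCM instead).

def make_partitions(lo, hi, n):
    """Centers of (2n+1) membership functions spanning [lo, hi]."""
    k = 2 * n + 1
    step = (hi - lo) / (k - 1)
    return [lo + i * step for i in range(k)]

def membership(x, centers):
    """Membership grade of x in each triangular fuzzy set.

    Each set peaks (grade 1) at its own center and falls linearly
    to 0 at the neighboring centers.
    """
    grades = []
    for i, c in enumerate(centers):
        left = centers[i - 1] if i > 0 else c
        right = centers[i + 1] if i < len(centers) - 1 else c
        if x == c:
            g = 1.0
        elif left < x < c:
            g = (x - left) / (c - left)
        elif c < x < right:
            g = (right - x) / (right - c)
        else:
            g = 0.0
        grades.append(g)
    return grades

# N = 2 gives five regions; a value midway between two centers
# belongs to both neighboring sets with grade 0.5.
centers = make_partitions(4.0, 8.0, 2)   # [4.0, 5.0, 6.0, 7.0, 8.0]
grades = membership(5.5, centers)        # [0.0, 0.5, 0.5, 0.0, 0.0]
```

With real data, the modified FCM of [26] would move the centers toward the data clusters; the rule-generation steps that follow do not depend on how the centers are chosen.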





A number of other methods are also available to divide the input and output spaces into fuzzy regions.

   Step 2: Generate fuzzy rules from the given input–output data pairs:

   In this step, first the degrees of a given data pair (x1(i), x2(i); y(i)) in the different fuzzy membership functions are determined. Second, the data pair (x1(i), x2(i); y(i)) is assigned to the region with the maximum degree, and one rule is obtained from each data pair.

   Step 3: Assign a degree to each rule:

   A degree can be assigned to each generated rule using the formula of equation (5):

   Drule = μA(x1) · μB(x2) · μC(y)                                (5)

   That is, the product of the membership grade of input x1 in fuzzy set 'A', the membership grade of input x2 in fuzzy set 'B' and the membership grade of output y in fuzzy set 'C'. Also, if an expert is available and assigns a degree of belief 'm' in the correctness of a particular data pair, then that degree must be multiplied with the above expression.

   Step 4: Create a combined fuzzy rule base:

   The combined fuzzy rule base takes rules either from those generated from numerical data or from linguistic rules (we assume that a linguistic rule also has a degree, assigned by the human expert, that reflects the expert's belief in the importance of the rule). If there is more than one rule having the same antecedents but different or same consequents, then the rule with the maximum degree is selected. In this way, both numerical and linguistic information are represented by a common framework: the combined fuzzy rule base.

   Step 5: Determine a mapping based on the combined fuzzy rule base:

   A defuzzification strategy is used to determine the output control for given inputs. This step performs the same operation as the defuzzification module of a fuzzy inference system.

   Step 6: Rule reduction:

   This step removes redundant rules from the rule base. Its main objective is to deal with the rule explosion issue, which, if left untackled, may lead to an unmanageably large number of rules in the rule base.

   This procedure can easily be extended to general multi-input multi-output cases. The approach can therefore be viewed as a very general 'model-free trainable fuzzy system' for a wide range of applications, where model-free means that no mathematical model is required for the problem and trainable means that the system learns from examples and expert rules and can adaptively change the mapping when new examples and expert rules become available.

                     IV.   RESULT ANALYSIS

   This section presents the performances obtained by our integrated approach, which uses modified Fuzzy C–Means clustering [26] and the Wang and Mendel algorithm [6] to evolve fuzzy rule based systems. We applied our approach to four very well known classification data sets from the machine learning repository and one control data set. In each experiment, the input and output domain intervals are fuzzified using the modified FCM approach. The training data samples are selected from the available data sets in correspondence with the peaks of the input membership functions. This sequence is used to train the systems, which are then tested using the testing data sets.

A. Example 1: Iris Data Classification Problem

   The proposed approach has been applied to the Iris data classification problem. The Iris data set is a widely used benchmark for classification and pattern recognition studies [27], [28]. The data set contains 150 samples (50 samples for each species) with four attributes as inputs (Sepal Length, Sepal Width, Petal Length and Petal Width) and three classes of iris plants as output: Iris Setosa, Iris Versicolor and Iris Virginica. All the input variables are measured in centimeters while the output is the type of iris plant. The learning sequence includes 24 data samples while the system is tested on all 150 data samples. By applying the proposed method to the learning sequence, a set of 24 classification rules (one rule per training data sample) is obtained. From this combined rule base, the redundant rules are then removed using the rule reduction algorithm [32], [33], and the final rule base comprising 4 rules is shown in Table I.

      TABLE I.      CLASSIFICATION RULE BASE FOR IRIS DATA CLASSIFIER

      Sepal Length | Sepal Width | Petal Length | Petal Width | Class
      SL–L         | SW–M        | PL–L         | PW–L        | Setosa
      SL–M         | SW–L        | PL–M         | PW–M        | Versicolor
      SL–M         | SW–L        | PL–H         | PW–M        | Virginica
      SL–M         | SW–L        | PL–H         | PW–H        | Virginica

      Here, L – Low, M – Medium, H – High

      TABLE II.     CLASSIFICATION RATES FOR IRIS DATA CLASSIFIER
                          (PROPOSED APPROACH)

      Number of Rules | Setosa  | Versicolor | Virginica | Average Rate
      4               | 98.00%  | 100.00%    | 94.00%    | 97.33%
      3               | 98.00%  | 100.00%    | 90.00%    | 96.00%

   Table II shows the class wise classification rates along with the effect of variations in the size of the rule base.
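Steps 2–4 above admit a compact sketch. The fuzzifier below is a hypothetical three-set (L/M/H) stand-in for the membership functions produced in Step 1, not the paper's fitted ones; `generate_rule_base` emits one rule per training pair, scores it with equation (5), and resolves same-antecedent conflicts by keeping the highest-degree rule, as Step 4 prescribes.

```python
# Sketch of Steps 2-4 of the Wang-Mendel procedure for two inputs and
# one output. Any fuzzifier returning {label: grade} per variable works;
# the toy one below is an illustrative placeholder over [0, 1].

def fuzzify(v):
    """Hypothetical L/M/H fuzzifier over the unit interval."""
    return {"L": max(0.0, 1 - 2 * v),
            "M": max(0.0, 1 - 2 * abs(v - 0.5)),
            "H": max(0.0, 2 * v - 1)}

def best_label(grades):
    """Step 2: the region of maximum membership."""
    return max(grades, key=grades.get)

def generate_rule_base(data, fx1, fx2, fy):
    rule_base = {}  # antecedent labels -> (consequent label, degree)
    for x1, x2, y in data:
        g1, g2, gy = fx1(x1), fx2(x2), fy(y)
        a1, a2, c = best_label(g1), best_label(g2), best_label(gy)
        # Equation (5): degree = product of the winning grades.
        degree = g1[a1] * g2[a2] * gy[c]
        key = (a1, a2)
        # Step 4: on conflicting antecedents, keep the higher degree.
        if key not in rule_base or degree > rule_base[key][1]:
            rule_base[key] = (c, degree)
    return rule_base

# Two pairs share the antecedents ("L", "H") but imply different
# outputs; only the higher-degree rule survives in the combined base.
rules = generate_rule_base([(0.1, 0.9, 0.25), (0.1, 0.9, 0.9)],
                           fuzzify, fuzzify, fuzzify)
```

An expert's degree of belief 'm' (Step 3) would simply multiply `degree` before the comparison.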




Table III presents a comparative analysis of different algorithms against the proposed integrated approach for the Iris data set. The parameters taken for comparison include the number of input fuzzy sets, the number of rules, and the classification rates. The table clearly demonstrates that a high performance Iris data classifier can be designed using a shorter learning sequence and a compact set of rules, as shown in Table I.

      TABLE III.    COMPARISON OF THE PROPOSED APPROACH WITH OTHER APPROACHES (IRIS DATA)

      Algorithm                                       | Number of Input Fuzzy Sets | Number of Rules | Classification Rates (Testing Data)
      Hong and Lee's Algorithm [9]                    | 8    | 6.21  | 95.57%
      Particle Swarm Optimization [22]                | —    | —     | 96.80%
      α–Cut based Fuzzy Learning Algorithm [34]       | 8.21 | 3     | 96.21%
      Fuzzy Classifier Ensembles based Algorithm [35] | —    | —     | 90.70%
      Genetic Algorithm [36]                          | —    | 10.10 | 90.67%
      LEM–2 Method [37]                               | —    | —     | 92.30%
      Proposed Approach                               | 10   | 4     | 97.33%

B. Example 2: Wine Data Classification Problem

   The Wine data set is also one of the most well known data sets in the machine learning literature [27]. The data were obtained from the chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The chemical analysis determines the quantities of thirteen constituents found in each of the three types of wines: Alcohol, Malic Acid, Ash, Alcalinity of Ash, Magnesium, Phenols, Flavanoids, Non-Flavanoid Phenols, Proanthocyanins, Color Intensity, Hue, OD280/OD315 of Diluted Wines and Proline. The data set contains 178 samples (59 samples for Class '1', 71 samples for Class '2' and 48 samples for Class '3' wine). Out of these thirteen attributes, the following six are used to model the Wine data classifier: Alcohol, Ash, Flavanoids, Hue, OD280/OD315 and Proline [38]. The training data set contains 28 data samples and the testing data set contains 178 samples.

   In this case, the proposed approach successfully generated 28 rules, which were reduced to 7 rules by applying the rule reduction algorithm [32], [33], as shown in Table IV. The performance of the evolved Wine data classifier is shown in Table V in terms of classification rates. Table V also shows the variation in the classification rate obtained by varying the number of rules. Table VI shows the comparison of the proposed approach with other approaches.

      TABLE IV.     CLASSIFICATION RULE BASE FOR WINE DATA CLASSIFIER

      Alcohol | Ash | Flavanoid | Hue | OD | Proline | Class
      M       | L   | M         | M   | M  | H       | 1
      M       | M   | M         | M   | M  | M       | 1
      M       | M   | M         | M   | M  | H       | 1
      L       | L   | L         | M   | L  | M       | 2
      L       | M   | M         | M   | M  | L       | 2
      M       | L   | M         | M   | M  | L       | 2
      M       | L   | L         | L   | L  | M       | 3

      Here, L – Low, M – Medium, H – High

      TABLE V.      CLASSIFICATION RATES FOR WINE DATA CLASSIFIER
                          (PROPOSED APPROACH)

      Number of Rules | Class '1' | Class '2' | Class '3' | Average Rate
      7               | 100.00%   | 100.00%   | 95.83%    | 98.87%
      6               | 100.00%   | 98.59%    | 95.83%    | 98.30%
      5               | 96.61%    | 97.18%    | 95.83%    | 96.62%
      4               | 96.61%    | 95.77%    | 95.83%    | 96.06%

      TABLE VI.     COMPARISON OF THE PROPOSED APPROACH WITH OTHER APPROACHES (WINE DATA)

      Algorithm                                    | Number of Input Attributes Used | Number of Rules | Classification Rates (Testing Data)
      Evolutionary Approach [38]                   | 6  | 5 | 98.90%
      eClass Classifier [39]                       | 13 | 7 | 95.90%
      SANFIS Learning Algorithm [40]               | 13 | 3 | 99.43%
      Hyper–Cone Membership Function Approach [41] | —  | — | 92.95%
      IPCA Algorithm [42]                          | —  | — | 87.60%
      Proposed Approach                            | 6  | 7 | 98.87%
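Once evolved, a compact linguistic rule base such as those of Tables I and IV is applied by fuzzifying each attribute of a sample and firing the best-matching rule. Below is a minimal sketch using the four Iris rules of Table I; the L/M/H grades of the sample are illustrative placeholders, not outputs of the paper's fitted membership functions.

```python
# Sketch of classification with a compact linguistic rule base: score
# each rule by the product of its antecedent grades and return the
# class of the best-matching rule.

IRIS_RULES = [  # (SL, SW, PL, PW) -> class, transcribed from Table I
    (("L", "M", "L", "L"), "Setosa"),
    (("M", "L", "M", "M"), "Versicolor"),
    (("M", "L", "H", "M"), "Virginica"),
    (("M", "L", "H", "H"), "Virginica"),
]

def classify(sample_grades, rules):
    """sample_grades: one {label: grade} dict per attribute."""
    best_class, best_score = None, -1.0
    for antecedents, cls in rules:
        score = 1.0
        for grades, label in zip(sample_grades, antecedents):
            score *= grades.get(label, 0.0)
        if score > best_score:
            best_class, best_score = cls, score
    return best_class

# A typical Iris-setosa sample: strongly Low petal length and width.
sample = [{"L": 0.9, "M": 0.1}, {"M": 0.8, "L": 0.2},
          {"L": 1.0}, {"L": 1.0}]
print(classify(sample, IRIS_RULES))  # -> Setosa
```

Only the first rule of Table I has a nonzero product for this sample, so it wins; ties between several partially fired rules are resolved by the maximum product, mirroring the max-degree selection of Step 4.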





C. Example 3: Glass Data Classification Problem

   The Glass data set [27] is a nine-dimensional data set with 214 samples from seven classes, also taken from the Irvine Machine Learning Repository. This data set has been chosen here because it involves many classes. The nine input attributes are: Refractive Index (RI), Sodium (Na), Magnesium (Mg), Aluminum (Al), Silicon (Si), Potassium (K), Calcium (Ca), Barium (Ba) and Iron (Fe). Of these nine attributes, the last two, Barium (Ba) and Iron (Fe), are excluded in this paper due to the very small variations in their sample points. The output classes indicate different types of glasses: class '1' is Building_windows_float_processed, class '2' is Building_windows_non_float_processed, class '3' is Vehicle_windows_float_processed, class '4' is Vehicle_windows_non_float_processed, class '5' is Containers, class '6' is Tableware and class '7' is Headlamps. Although the original data set defines seven classes, it doesn't contain any data sample from class '4'. The learning sequence contains 56 data samples while the testing sequence is composed of all 214 data samples. For this classifier, the proposed method first generated a rule base of 37 rules, which was reduced to 20 rules (shown in Table VII) using the rule reduction algorithm [32], [33]. The class wise classification results of the modeled Glass data classifier on the given test data set are specified in Table VIII. Table IX shows a comparison of Glass classifiers for different algorithms. The results show that a classification rate of 71.49% can be achieved with a smaller training data set and a lesser number of rules.

      TABLE VII.    CLASSIFICATION RULE BASE FOR GLASS DATA CLASSIFIER

      RI | Na | Mg | Al | Si | K | Ca | Class
      M  | L  | H  | M  | H  | M | L  | 1
      H  | H  | H  | L  | M  | L | M  | 1
      H  | M  | H  | L  | M  | L | M  | 1
      L  | L  | H  | M  | H  | M | L  | 2
      L  | M  | H  | M  | H  | L | L  | 2
      L  | H  | H  | M  | M  | L | L  | 2
      M  | L  | H  | M  | M  | M | L  | 2
      M  | M  | H  | M  | M  | M | L  | 2
      H  | L  | L  | L  | H  | L | H  | 2
      H  | L  | L  | H  | L  | M | H  | 2
      H  | H  | L  | L