Journal of Computer Science and Information Security Volume 9 No. 6 June 2011

IJCSIS Vol. 9 No. 6, June 2011
ISSN 1947-5500
International Journal of
Computer Science
& Information Security
© IJCSIS PUBLICATION 2011
Editorial
Message from Managing Editor
Journal of Computer Science and Information Security (IJCSIS ISSN 1947-5500) is an open
access, international, peer-reviewed, scholarly journal with a focused aim of promoting and
publishing original high quality research dealing with theoretical and scientific aspects in all
disciplines of Computing and Information Security. The journal is published monthly, and articles
are accepted for review on a continual basis. Papers that provide both theoretical analysis
and carefully designed computational experiments are particularly welcome.
The IJCSIS editorial board consists of several internationally recognized experts and guest editors.
Wide circulation is assured because libraries and individuals worldwide subscribe to and reference
IJCSIS. The Journal has grown rapidly to its current level of over 1,100 articles published and
indexed, with distribution to librarians, universities, research centers, researchers in computing,
and computer scientists.
Other field coverage includes: security infrastructures, network security: Internet security, content
protection, cryptography, steganography and formal methods in information security; multimedia
systems, software, information systems, intelligent systems, web services, data mining, wireless
communication, networking and technologies, innovation technology and management. (See
monthly Call for Papers)
IJCSIS is published using an open access publication model, meaning that all interested readers
will be able to freely access the journal online without the need for a subscription. We wish to
make IJCSIS a first-tier journal in the computer science field, with a strong impact factor.
On behalf of the Editorial Board and the IJCSIS members, we would like to express our gratitude
to all authors and reviewers for their sustained support. The acceptance rate for this issue is 35%.
I am confident that the readers of this journal will explore new avenues of research and academic
excellence.
Available at http://sites.google.com/site/ijcsis/
IJCSIS Vol. 9, No. 6, June 2011 Edition
ISSN 1947-5500 © IJCSIS, USA.
Journal Indexed by (among others):
IJCSIS EDITORIAL BOARD
Dr. M. Emre Celebi,
Assistant Professor, Department of Computer Science, Louisiana State University
in Shreveport, USA
Dr. Yong Li
School of Electronic and Information Engineering, Beijing Jiaotong University,
P. R. China
Prof. Hamid Reza Naji
Department of Computer Engineering, Shahid Beheshti University, Tehran, Iran
Dr. Sanjay Jasola
Professor and Dean, School of Information and Communication Technology,
Gautam Buddha University
Dr Riktesh Srivastava
Assistant Professor, Information Systems, Skyline University College, University
City of Sharjah, Sharjah, PO 1797, UAE
Dr. Siddhivinayak Kulkarni
University of Ballarat, Ballarat, Victoria, Australia
Professor (Dr) Mokhtar Beldjehem
Sainte-Anne University, Halifax, NS, Canada
Dr. Alex Pappachen James, (Research Fellow)
Queensland Micro-nanotechnology center, Griffith University, Australia
Dr. T.C. Manjunath,
ATRIA Institute of Tech, India.
TABLE OF CONTENTS
1. Paper 19051107: Additive Model of Reliability of Biometric Systems with Exponential Distribution of
Failure Probability (pp. 1-4)
Zoran Ćosić, Director, Statheros d.o.o., Kaštel Stari, Croatia
Jasmin Ćosić, IT Section of Police Administration, Ministry of Interior of Una-sana canton, Bihać, Bosnia and
Hercegovina
Miroslav Bača, Professor, Faculty of Organisational and Informational science, Varaždin, Croatia
2. Paper 26051129: IP Private Branch eXchange of Saint Joseph University, Macao: A Design Case Study
(pp. 5-11)
A. Cotão, R. Whitfield, J. Negreiros
Information Technology Department, University of Saint Joseph, Macau, China
3. Paper 30051137: A Novel and Secure Data Sharing Model with Full Owner Control in the Cloud
Environment (pp. 12-17)
Mohamed Meky and Amjad Ali
Center of Security Studies, University of Maryland University College, Adelphi, Maryland, USA
4. Paper 17051101: Performances Evaluation of Inter-System Handover between IEEE802.16e and
IEEE802.11 Networks (pp. 18-24)
Abderrezak Djemai, Mourad Hadjila, Mohammed Feham
STIC laboratory, University of Tlemcen, Algeria
5. Paper 24051124: Recognizing the Electronic Medical Record Data from Unstructured Medical Data Using
Visual Text Mining Techniques (pp. 25-35)
Prof. Hussain Bushinak, Faculty of Medicine, Ain Shams University, Cairo, Egypt
Dr. Sayed AbdelGaber, Faculty of Computers and Information, Helwan University, Cairo, Egypt
Mr. Fahad Kamal AlSharif, College of Computer Science, Modern Academy, Cairo, Egypt
6. Paper 31051149: Creating an Appropriate Programming Language for Student Compiler Project (pp. 36-
39)
Elinda Kajo Mece, Department of Informatics Engineering, Polytechnic University of Tirana, Tirana, Albania
7. Paper 31051152: The History of Web Application Security Risks (pp. 40-47)
Fahad Alanazi, Software Technology Research Laboratory, De Montfort University, Leicester, LE1 9BH UK
Mohamed Sarrab, Software Technology Research Laboratory, De Montfort University, Leicester, LE1 9BH UK
8. Paper 31051156: Improving the Performance of Translation Wavelet Transform using BMICA (pp. 48-56)
Janett Walters-Williams, School of Computing & Information Technology, University of Technology, Jamaica,
Kingston 6, Jamaica W.I.
Yan Li, Department of Mathematics & Computing, Centre for Systems Biology, University of Southern Queensland,
Toowoomba, Australia
9. Paper 31051157: Hole Filing IFCNN Simulation by Parallel RK(5,6) Techniques (pp. 57-64)
S. Senthilkumar and Abd Rahni Mt Piah,
Universiti Sains Malaysia, School of Mathematical Sciences, Pulau Pinang-11800, Penang, Malaysia
10. Paper 31051160: Location Estimation and Mobility Prediction Using Neuro-fuzzy Networks In Cellular
Networks (pp. 65-69)
Maryam Borna, Department of Electrical Engineering, Iran University of Science and Technology, Tehran, Iran
Mohammad Soleimani, Department of Electrical Engineering, Iran University of Science and Technology, Tehran,
Iran
11. Paper 31051161: A Fuzzy Clustering Based Approach for Mining Usage Profiles from Web Log Data (pp.
70-79)
Zahid Ansari 1, Mohammad Fazle Azeem 2, A. Vinaya Babu 3 and Waseem Ahmed 4
1,4 Dept. of Computer Science Engineering, P.A. College of Engineering, Mangalore, India
2 Dept. of Electronics and Communication Engineering, P.A. College of Engineering, Mangalore, India
3 Dept. of Computer Science Engineering, Jawaharlal Nehru Technological University, Hyderabad, India
12. Paper 31051168: Inception of Hybrid Wavelet Transform using Two Orthogonal Transforms and It’s use
for Image Compression (pp. 80-87)
Dr. H.B. Kekre, Senior Professor, Computer Engineering Department, SVKM’s NMIMS (Deemed-to-be University),
Vile Parle(W), Mumbai, India.
Dr. Tanuja K. Sarode, Assistant Professor, Computer Engineering Department, Thadomal Shahani Engineering
College, Bandra(W), Mumbai, India.
Sudeep D. Thepade, Associate Professor, Computer Engineering Department, SVKM’s NMIMS (Deemed-to-be
University), Vile Parle(W), Mumbai, India
13. Paper 31051177: A Model for the Controlled Development of Software Complexity Impacts (pp. 88-93)
Ghazal Keshavarz, Computer department, Science and Research Branch, Islamic Azad University, Tehran, Iran
Nasser Modiri, Computer department, Islamic Azad University, Zanjan, Iran
Mirmohsen Pedram, Computer department, Tarbiat Moallem University, Karaj, Iran
14. Paper 31051178: A Hierarchical Overlay Design for Peer to Peer and SIP Integration (pp. 94-99)
Md. Safiqul Islam & Syed Ashiqur Rahman, Computer Science and Engineering Department, Daffodil International
University, Dhaka, Bangladesh
Rezwan Ahmed, American International University – Bangladesh, Dhaka, Bangladesh
Mahmudul Hasan, Computer Science and Engineering Department, Daffodil International University, Dhaka,
Bangladesh
15. Paper 31051199: Evaluation of CPU Consuming, Memory Utilization and Time Transfering Between
Virtual Machines in Network by using HTTP and FTP techniques (pp. 100-105)
Igli TAFA, Elinda KAJO, Elma ZANAJ, Ariana BEJLERI, Aleksandër XHUVANI
Polytechnic University of Tirana, Information Technology Faculty, Computer Engineering Department, Tiranë,
Albania
16. Paper 17051103: A Proposal for Common Vulnerability Classification Scheme Based on Analysis of
Taxonomic Features in Vulnerability Databases (pp. 106-111)
Anshu Tripathi, Department of Information Technology, Mahakal Institute of Technology, Ujjain, India
Umesh Kumar Singh, Institute of Computer Science, Vikram University, Ujjain, India
17. Paper 19051108: Abrupt Change Detection of Fault in Power System Using Independent Component
Analysis (pp. 112-118)
Satyabrata Das, Asstt Prof., Department of CSE, College of Engineering Bhubaneswar, Orissa, India-751024
Soumya Ranjan Mohanty, Asstt Prof., Department of EE, Motilal Nehru National Institute of Technology,
Allahabad, India-211004
Sabyasachi Pattnaik, Prof., Department of I&CT, Fakir Mohan University, Balasore, India -756019
18. Paper 19051113: Modeling and Analyze the Deep Web: Surfacing Hidden Value (pp. 119-124)
Suneet Kumar, Associate Professor; Computer Science Dept., Dehradun Institute of Technology, Dehradun, India
Anuj Kumar Yadav, Assistant Professor; Computer Science Dept., Dehradun Institute of Technology, Dehradun,
India
Rakesh Bharati, Assistant Professor; Computer Science Dept., Dehradun Institute of Technology, Dehradun, India
Rani Choudhary, Sr. Lecturer; Computer Science Dept., BBDIT, Ghaziabad, India
19. Paper 24051122: Instigation of Orthogonal Wavelet Transforms using Walsh, Cosine, Hartley, Kekre
Transforms and their use in Image Compression (pp. 125-133)
Dr. H. B. Kekre, Sr. Professor, MPSTME, SVKM’s NMIMS (Deemed-to-be University), Vileparle (W), Mumbai-56,
India.
Dr. Tanuja K. Sarode, Asst. Professor, Thadomal Shahani Engg. College, Bandra (W), Mumbai-50, India.
Sudeep D. Thepade, Associate Professor, MPSTME, SVKM’s NMIMS (Deemed-to-be University), Vileparle (W),
Mumbai-56, India.
Ms. Sonal Shroff, Lecturer, Thadomal Shahani Engg. College, Bandra (W), Mumbai-50, India
20. Paper 25051125: Analysing Assorted Window Sizes with LBG and KPE Codebook Generation
Techniques for Grayscale Image Colorization (pp. 134-138)
Dr. H. B. Kekre, Sr. Professor, MPSTME, SVKM’s NMIMS (Deemed-to-be University), Vileparle (W), Mumbai-56,
India.
Dr. Tanuja K. Sarode, Asst. Professor, Thadomal Shahani Engg. College, Bandra (W), Mumbai-50, India.
Sudeep D. Thepade, Associate Professor, MPSTME, SVKM’s NMIMS (Deemed-to-be University), Vileparle (W),
Mumbai-56, India.
Ms. Supriya Kamoji, Sr. Lecturer, Fr. Conceicao Rodrigues College of Engg, Bandra (W), Mumbai-50, India
21. Paper 27051132: Evolving Fuzzy Classification Systems from Numerical Data (pp. 139-147)
Pardeep Sandhu, Maharishi Markandeshwar University, Mullana, Haryana, India
Shakti Kumar, Institute of Science and Technology, Klawad, Haryana, India
Himanshu Sharma, Maharishi Markandeshwar University, Mullana, Haryana, India
Parvinder Bhalla, Institute of Science and Technology, Klawad, Haryana, India
22. Paper 27051134: A Low-Power CMOS Implementation of a Cellular Neural Network for Connected
Component Detection (pp. 148-152)
S. El-Din, A.K. Abol El-Seoud, and A. El-Fahar
Electrical Engineering Department, University of Alexandria, Alex, Egypt.
M. El-Sayed Ragab, School of Electronics, Comm. and Computer Eng., E-JUST. , Alexandria, Egypt.
23. Paper 30041170: An Integrated Framework for Content Based Image Retrieval (pp. 153-157)
Ritika Hirwane, SOIT, RGPV, Bhopal
Prof. Nishchol Mishra, SOIT, RGPV, Bhopal
24. Paper 30051135: A Novel Approach for Intranet Mailing For Providing User Authentication (pp. 158-163)
ASN Chakravarthy †, Sri Sai Aditya Institute of Science & Technology, Suram Palem,E.G.Dist , Andhra Pradesh,
India
A.S.S.D. Toyaza ††, Sri Sai Aditya Institute of Science & Technology, Suram Palem,E.G.Dist , Andhra Pradesh,
India
25. Paper 31011189: Visualization of Fluid Flow Patterns in Horizontal Circular Pipe Ducts (pp. 164-170)
Olagunju, Mukaila, Department of Computer Science, Kwara State Polytechnic, Ilorin, Nigeria
Taiwo, O. A (Ph.D), Department of Mathematics, University of Ilorin, Nigeria.
26. Paper 31031183: Iris Image Pre-Processing and Minutiae Points Extraction (pp. 171-174)
Archana R. C., J. Naveenkumar, Prof. Dr. Suhas.H.Patil
Computer Engineering, VDUCOE, Pune, Maharashtra, India
27. Paper 31051142: Establishing Relationships among Chidamber And Kemerer’s Suite of Metrics using
Field Experiments (pp. 175-182)
Ezekiel U Okike, Department of computer Science, University of Ibadan, Ibadan, Nigeria
Adenike O. Osofisan, Department of computer Science, University of Ibadan, Ibadan, Nigeria
28. Paper 31051143: Risk Assessment of Authentication Protocol: Kerberos (pp. 183-187)
Pathan Mohd. Shafi, Smt. Kashibai Navale College of Engineering, Pune
Dr. Abdul Sattar, Royal Institute of Technology and Science, R. R. Dist.
Dr. P. Chenna Reddy, JNTU College of Engineering, Pulivendula.
29. Paper 31051144: TOTN: Development of A Tourism – Specific Ontology For Information Retrieval In
Tamilnadu Tourism (pp. 188-193)
K. R. Ananthapadmanaban, Research Scholar, Sri Chandrasekarendra SaraswathiViswa Mahavidyalaya University,
Enathur, Kanchipuram-631 561
Dr. S. K. Srivatsa, Senior Professor, St.Joseph‘s College of Engg., Jeppiaar Nagar, Chennai-600 064
30. Paper 31051145: Unified Fast Algorithm for Most Commonly used Transforms using Mixed Radix and
Kronecker Product (pp. 194-202)
Dr. H.B. Kekre, Senior Professor, Department of Computer Science, Mukesh Patel School of Technology
Management and Engineering, Mumbai, India
Dr. Tanuja Sarode, Associate Professor, Department of Computer Science, Thadomal Shahani College of
Engineering, Mumbai, India
Rekha Vig, Asst. Prof. and Research Scholar, Dept. of Elec. and Telecom., Mukesh Patel School of Technology
Management and Engineering, Mumbai, India
31. Paper 31051147: A Framework for Identifying Software Vulnerabilities within SDLC Phases (pp. 203-
207)
Zeinab Moghbel, Department of Computer Engineering, Science and Research Branch, Islamic Azad University,
Tehran, Iran
Nasser Modiri, Department of Computer Engineering, Zanjan Branch, Islamic Azad University, Zanjan, Iran
32. Paper 31051148: Text Clustering Based on Frequent Items Using Zoning and Ranking (pp. 208-214)
S. Suneetha, Dr. M. Usha Rani, Department of Computer Science, SPMVV, Tirupati
Yaswanth Kumar.Avulapati, Dept of Computer Science, S.V.University, Tirupati
33. Paper 31051162: Steganography based on Contourlet Transform (pp. 215-220)
Sushil Kumar, Department of Mathematics, Rajdhani college, University of Delhi, New Delhi, India
S.K. Muttoo, Department of Computer Science, University of Delhi, Delhi, India
34. Paper 31051164: A Comparative Study of Proposed Improved PSO Algorithm with Proposed Hybrid
Algorithm for Multiprocessor Job Scheduling (pp. 221-228)
K. Thanushkodi, Akshaya College of Engineering and Technology Coimbatore, India
K. Deeba, Department of Computer Science and Engineering, Kalaignar Karunanidhi Institute of Technology,
Coimbatore, India
35. Paper 31051165: SCAM – Software Component Assessment Model (pp. 229-234)
Hasan Tahir, Aasia Khannum, Ruhma Tahir
Department of Computer Engineering, College of Electrical & Mechanical Engineering, National University of
Sciences and Technology (NUST), Islamabad, Pakistan
36. Paper 31051170: Selected Problems on Mobile Agent Communication (pp. 235-239)
Yinka A. Adekunle and Sola S. Maitanmi
Department of Computer Science & Mathematics, Babcock University, Ilisan Remo, Ogun State, Nigeria.
37. Paper 31051175: KGS Based Control for Parking Brake Cable Manufacturing System (pp. 240-248)
Geeta Khare, S.S.J.P, Asangaon
Dr. R.S. Prasad, R.R.I.M.T., Lucknow
38. Paper 31051183: Performance Appraise of Assorted Thresholding Methods in CBIR using Block
Truncation Coding (pp. 249-255)
Dr. H.B. Kekre, Sudeep D. Thepade, Shrikant Sanas
Computer Engineering Department, MPSTME, SVKM’s NMIMS (Deemed-to-be University), Mumbai, India
39. Paper 31051193: Performance Analysis of Cryptographic Algorithms Like ElGamal, RSA, and ECC for
Routing Protocols in Distributed Sensor Networks (pp. 256-263)
Suresha , Department of CSE, Reva Institute of Technology and Management, Bangalore ,Karnataka, India
Dr.Nalini.N , Prof. and Head, Department of CSE, Nitte Meenakshi Institute of Technology, Bangalore, Karnataka,
India
40. Paper 310511100: Exaggerate Self Quotient Image Model For Face Recognition Enlist Subspace Method
(pp. 264-269)
S. Muruganantham, Assistant professor, S.T.Hindu college, Nagercoil, India -629003
T. Jebarajan, Principal, Kings College of Engineering, Chennai, India - 600105
41. Paper 10000000: PHCC: Predictive Hop-by-Hop Congestion Control Protocol for Wireless Sensor
Networks (pp. 270-274)
Shahram Babaie, Eslam Mohammadi, Saeed Rasouli Heikalabad, Hossein Rasouli
Technical and Engineering Dept., Tabriz Branch, Islamic Azad University, Tabriz, Iran
42. Paper 31051158: Secured Right Angled or Ant Search Protocol for Reducing Congestion Effects and
Detecting Malicious Node in Mobile Ad hoc Networks by Multipath Routing (pp. 275-283)
Lt. Dr. S Santhosh Baboo, P.G. Research Dept of Com. Science, D G Vaishnav College, Arumbakkam,
Chennai – 106.
V J Chakravarthy, Research Scholar, Dravidian University
43. Paper 31051190: Side Lobe Reduction Of A Planar Array Antenna By Complex Weight Control Using
SQP Algorithm And Tchebychev Method (pp. 284-289)
A. Hammami, R. Ghayoula, and A. Gharsallah
Unité de recherche: Circuits et systèmes électroniques HF, Faculté des Sciences de Tunis, Campus Universitaire
Tunis EL-manar, 2092, Tunisie
44. Paper 22051119: Image Compression Algorithm- A Review (pp. 290-295)
Marcus karnan, Tamilnadu College of Engineering, Coimbatore, India
M.S.Tamilselvi, Research Scholar, Dept of Computer Science & Engg., ANNA University, Coimbatore, India
45. Paper 31051189: Compensation of Nonlinear Distortion in OFDM Systems Using an Efficient Evaluation
Technique (pp. 296-299)
Dr. (Mrs.). R. Sukanesh, Professor / Department of ECE / TCE, Madurai – 15, India.
R. Sundaraguru, Research Scholar, Anna University, Chennai-25, India.
46. Paper 31051191: Performance Prediction of Single Static String Algorithms on Cluster Configurations
(pp. 300-306)
Prasad J. C., Research Scholar, Dept. of CSE, Dr.MG.R University, Chennai
cum Asst. Professor, Dept of CSE, FISAT, Angamaly, India
K. S. M. Panicker, Professor, Dept of CSE, Federal Institute of Science and Technology [FISAT], Angamaly, India
47. Paper 31051181: Explicit Solution of Hyperbolic Partial Differential Equations by an Iterative
Decomposition Method (pp. 307-309)
Adekunle, Y.A., Department of Computer Science and Mathematics, Babcock University, Ilisan-Remo Ogun State,
Nigeria
Kadiri, K.O., Department Electrical/Electronics Engineering, Federal Polytechnic, Offa, Kwara State, Nigeria
Odetunde, O.S., Department of Mathematical Science, Olabisi Onabanjo University, Ago-Iwoye, Ogun State,Nigeria
48. Paper 31051182: Numerical Approximation of Generalised Riccati Equations By Iterative Decomposition
Method (pp. 310-312)
Adekunle, Y.A., Department of Computer Science and Mathematics, Babcock University, Ilisan-Remo Ogun State,
Nigeria
Kadiri, K.O., Department Electrical/Electronics Engineering, Federal Polytechnic, Offa, Kwara State, Nigeria
Odetunde, O.S., Department of Mathematical Science, Olabisi Onabanjo University, Ago-Iwoye, Ogun State,Nigeria
49. Paper 26051128: Automated Face Detection and Feature Extraction Using Color FERET Image Database
(pp. 313-318)
Dewi Agushinta R. 1, Fitria Handayani S. 2
1 Information System, 2 Informatics
Gunadarma University, Jl. Margonda Raya 100 Pondok Cina, Depok 16424, Indonesia
50. Paper 31051186: Simplified Neural Network Design for Hand Written Digit Recognition (pp. 319-322)
Muhammad Zubair Asghar 1, Hussain Ahmad 1, Shakeel Ahmad 1, Sheikh Muhammad Saqib 1, Bashir Ahmad 1 and
Muhammad Junaid Asghar 2
1 Institute of Computing and Information Technology, Gomal University, D.I.Khan, Pakistan
2 Faculty of Pharmacy, Gomal University, D.I.Khan, Pakistan
51. Paper 31051186: Diagnosis of Skin Diseases using Online Expert System (pp. 323-325)
Muhammad Zubair Asghar 1, Muhammad Junaid Asghar 2, Sheikh Muhammad Saqib 1, Bashir Ahmad 1, Shakeel
Ahmad 1 and Hussain Ahmad 1
1 Institute of Computing and Information Technology, Gomal University, D.I.Khan, Pakistan
2 Faculty of Pharmacy, Gomal University, D.I.Khan, Pakistan
52. Paper 31051194: Cloud based Data Warehouse 2.0 Storage Architecture: An extension to traditional
Data Warehousing (pp. 326-329)
Kifayat Ullah Khan, Sheikh Muhammad Saqib, Bashir Ahmad, Shakeel Ahmad and Muhammad Ahmad Jan
Institute of Computing and Information Technology Gomal University, D.I.Khan, Pakistan
(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 9, No. 6, June 2011
Additive model of reliability of biometric systems
with exponential distribution of failure probability
Zoran Ćosić (Author)
director
Statheros d.o.o.
Kaštel Stari, Croatia
zoran.cosic@statheros.hr
Jasmin Ćosić (Author)
IT Section of Police Administration
Ministry of Interior of Una-sana canton
Bihać, Bosnia and Hercegovina
jacosic@gmail.com
Miroslav Bača (Author)
professor
Faculty of Organisational and Informational science
Varaždin, Croatia
miroslav.baca@foi.hr
Abstract— Approaches to the reliability analysis of biometric systems are the subject of numerous
scientific papers. Most of them consider the reliability of component software applications. System
reliability, considering both the technical and the software part, is of crucial importance for users
and for manufacturers of biometric systems.
In this paper, the authors develop a mathematical model to analyse the reliability of biometric systems
with respect to the dependence of components with an exponential distribution of failure probability.
Keywords- Additive model, Biometric system, reliability, exponential distribution, UML
I. INTRODUCTION
The general biometric system, according to Wyman [1] and shown in Figure 1, consists of 5 elements which
are found in all biometric systems today.
Figure 1. [1]
Each subsystem consists of elements that contribute to the overall system quality.
The data collection subsystem consists of a biometric sample, the method of sampling, and the sensors used
for sampling. The signal processing subsystem consists of feature extraction, quality control and
comparison of samples. The decision subsystem consists of the decision mechanisms and storage subsystems.
The model described in Figure 1 can be schematised as shown in Figure 2.
Figure 2
The schematic presentation of a biometric system [1] is a simplified representation of the system in
Figure 1 and shows the serial configuration of the dependence of the system components.
II. THE DEFINITION OF THE RELIABILITY OF BIOMETRIC
SYSTEMS
Designers and producers of biometric systems are motivated to use already constructed components and
modules. A component-based system is expected to be highly reliable, no matter who the producer is. Most
existing reliability models are described so generally that they do not consider the particularities of
components and modules. In this paper the authors describe a methodology based on a mathematical model
which takes into account both component reliability and connection reliability. Using the UML methodology
to describe the system and its inner interactions simplifies the approach for researchers.
UML [2] is also becoming a standard in the process of designing and manufacturing systems, so the
production of component systems benefits from the UML representation.
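The serial configuration noted for Figure 2 means that the system works only if every subsystem works.
As a minimal sketch, assuming independent subsystem failures (an assumption made here for illustration
only; the model in this paper also accounts for connection reliability), the reliability of a serial
chain is the product of the subsystem reliabilities. The function name, the subsystem labels and the
numeric values below are hypothetical and do not come from the paper.

```python
from math import prod


def series_reliability(component_reliabilities):
    """Reliability of a serial chain: the chain works only if every component works.

    Assumes independent component failures (an illustrative assumption).
    """
    values = list(component_reliabilities)
    for r in values:
        if not 0.0 <= r <= 1.0:
            raise ValueError("reliabilities must lie in [0, 1]")
    return prod(values)


# Hypothetical per-subsystem reliabilities (illustrative values only).
subsystems = {
    "data_collection": 0.99,
    "signal_processing": 0.98,
    "decision": 0.97,
}
system_reliability = series_reliability(subsystems.values())
print(f"Serial system reliability: {system_reliability:.4f}")
```

Because the factors are all at most 1, the serial system can never be more reliable than its least
reliable subsystem, which is why the reliability of each component matters in such a configuration.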
An assessment of generic biometric system reliability in UML [3] is given in Figure 3, which shows a Use
Case diagram:
Figure 3 [3]
q1 and q2 represent the probabilities that users u1 and u2 will access the system using some of its
functionality.
P11 and P12 represent the probabilities that user u1 will use the functionalities f1 and f2, and P21 and
P22 represent the same probabilities for user u2.
The probability of execution of use case x is defined by the expression:
(1)
where m is the number of users.
If a non-uniform distribution can be assigned to the sequence diagrams of a given use case, then (1) can
be expressed as:
(2)
where $f_j(k)$ is the frequency of the k-th transition of the sequence diagram in the j-th case, and
$P(k_j)$ represents the probability of the default scenarios.
Figure 4. Sequence diagram [3]
Sequence diagrams play an important role in assessing the reliability of the system because they give
information on how many components are involved in the execution of a scenario.
Through sequence diagrams it is simple to count the periods of availability of components in the given
scenarios, as shown in Figure 3. The probability of failure of components with known busy periods can be
given by the following expression:
(3)
where:
- $\theta_{ij}$ - probability of failure of component i in scenario j
- $bp_{ij}$ - occupancy time of component i in scenario j
The expression (3) applies only if the following conditions are met:
- Independence of failure: the probability of failure of one component does not depend on other components
- Regularity of failure: the probability of failure of one component is equal throughout the occupation
period of the component
Every moment of occupancy of any system component can also be shown by considering the method being
executed at that moment in the scenario.
If the component failure probability is replaced by a set of method failure probabilities, then
equation (3) becomes:
(4)
During system operation, components interact and exchange information. It is then necessary to take into
consideration the occupation period of the component:
(5)
where $\theta_{ij}$ is the probability of failure of system components and bp is the busy period of the
system.
(6)
where $\Psi_{lmj}$ is the probability of failure of the connections between the components; the
expression also uses the number of interactions between system components.
(7)
The reliability of the system, taking into account the probability of failure, can be expressed as:
(7a)
(7b)
III. RELIABILITY PREDICTION OF THE COMPONENTS SERIAL
DEPENDENCE WITH EXPONENTIAL DISTRIBUTION
Based on the above assumptions [4], [5], system reliability can be calculated according to the law of
exponential distribution:
(8)
where:
Rs - system reliability
λ - intensity of system failure
t - required time of reliable operation of the system
Proof:
In a serial dependence of all parts of the system, the failure of any part of the system may cause the
failure of the entire system.
The failure intensity function λs in the case of a serial dependence of system elements is calculated by
the expression:
(9)
where λi is the failure intensity of the i-th part of the system.
The failure intensity function λs is equal to the ratio between the number of failures in the time
interval and the number of working elements in the system up to the beginning of this interval:
(10)
where:
λs - failure intensity function of the system
Δt - failure time of a system element
n2(Δt) - number of failures in a certain time interval Δt
n1(t − Δt) - number of elements that have not failed at the end of the interval Δt, or until t − Δt
The intensity of element failure is calculated by the expression:
$\lambda_{EL} = \frac{1/\hat{\Theta}}{n} = \frac{1}{\hat{\Theta} \cdot n}$ (11)
where:
n - number of usable parts, for the confidence interval (1 − α) = 0.75
$\hat{\Theta}$ - lower confidence limit for the mean time between failures
The total intensity of failure, taking into consideration the number of elements that have not failed in
a given time, is calculated by the formula:
(12)
where:
nEL - number of elements of subsystems that have not failed
The mean time between failures MTBFs can be calculated using the expression:
(13)
The calculation of the reliability [6] of an element is based on empirical data on the time of
functioning and eventual failure of the element.
The problem [7], [8] becomes more complex when no information about failures exists, that is, when the
part has worked perfectly and only information about its operation is available.
If one assumes that the exponential distribution applies to the given part, it is possible to determine
the upper confidence limit for the intensity of failure, in the cases of continuous operation or of one
or more failures.
The lower confidence limit for the mean time between failures $\hat{\Theta}$, for the confidence
interval (1 − α), is calculated using the formula:
$$\hat{\Theta} \ge \frac{2 t_r}{\chi^2_{\alpha,\,2r+2}} = \frac{2 t_r}{\chi^2_{0.25,\,2}} \qquad (14)$$
Where:
tr – total time of system operation
r – number of elements that have failed
χ²(α, 2r+2) – random variable with a χ² distribution with 2r + 2 degrees of freedom

IV. SPECIAL CASE OF NOT-FAILURE SYSTEM
Considering (8), (10) and (14) we obtain an expression for the reliability of the whole system:
(15)
In further calculations:
(16)
Taking into account (11) we obtain an expression for the reliability of each element of the system:
(17)

V. CONCLUSION AND FURTHER RESEARCH
The reliability of technical systems is a subject for many research scientists; accordingly, analyses are available of different reliability models of biometric systems, which in most cases take the software into account as one of the components of the system. This study developed a mathematical model of a generic biometric system that considers user, hardware and software influence and assumes serial dependence of the components and an exponential distribution of the failure probability of the system components. The scientific contribution is expressed through the definition of a mathematical model for reliability prediction of a biometric system in the special case where it is not possible to get failure data, or in the case of a perfectly working system. The results of the calculations can prevent real failure events by defining preventive maintenance.
This mathematical model allows the prediction of system reliability in the early phase of designing its components. The subject of further research will be to create an integrated mathematical model of the reliability of complex systems that considers the parallel and combined dependency of system components with different distributions of failure probability. In accordance with this, the authors will define a mathematical model for system recovery probability, then for system readiness to use.

REFERENCES
[1] Zasnivanje otvorene ontologije odabranih segmenata biometrijske znanosti – Markus Schatten – Master's thesis – FOI 2007
[2] Modelling biometric systems in UML – Miroslav Bača, Markus Schatten, Bernardo Golenja, JIOS 2007 FOI Varaždin
[3] A Bayesian Approach to Reliability Prediction and Assessment of Component
[4] Based Systems – KH. Singhy, V. Cortellessa, B. Cukic, E. Gunely, V. Bharadwaj
[5] Department of Statistics, Lane Department of Computer Science and Electrical Engineering West Virginia University, Proceedings of the 12th International Symposium on Software Reliability Engineering (ISSRE'01) 1071-9458/01 2001 IEEE
[6] Simulacijsko modeliranje pouzdanosti tehničkog sustava brodskog kompresora – Zoran Ćosić – Master's thesis – 2007
[7] Teorija pouzdanosti tehničkih sistema, Vujanović Nikola, Vojnoizdavački novinski centar, Beograd 2005
[8] The Impact of Error Propagation on Software Reliability Analysis of Component-based Systems – Petar Popic, Master thesis, West Virginia 2005

AUTHORS PROFILE
Zoran Ćosić is CEO at Statheros ltd and a business consultant in the field of business process standardization. He received a BEng degree at the Faculty of nautical science, Split (HR) in 1990 and an MSc degree at the Faculty of nautical science, Split (HR) in 2007; currently he is a PhD candidate at the Faculty of informational and Organisational science, Varaždin, Croatia. He is a member of various professional societies and program committees. He is author or co-author of more than 20 scientific and professional papers. His main fields of interest are informational security, biometrics and privacy, and business process reengineering.
Jasmin Ćosić received his BE (Economics) degree from the University of Bihać, B&H in 1997. He completed his study in the Information Technology field (dipl.ing. Information Technology) in Mostar, University of Džemal Bijedić, B&H. Currently he is a PhD candidate at the Faculty of Organization and Informatics in Varaždin, University of Zagreb, Croatia. He works in the Ministry of the Interior of Una-sana canton, B&H. He is an ICT Expert Witness, a member of the Association of Informatics of B&H, and a member of IEEE and ACM. His areas of interest are Digital Forensic, Computer Crime, Information Security and DBM Systems. He has presented and published over 20 conference proceedings and journal articles in his research area.
Miroslav Bača is currently an Associate professor at the University of Zagreb, Faculty of Organization and Informatics. He is a member of various professional societies and program committees, and a reviewer for several international journals and conferences. He is also the head of the Biometrics centre in Varaždin, Croatia. He is author or co-author of more than 70 scientific and professional papers and two books. His main research fields are computer forensics, biometrics and privacy. He is a professor at the Faculty of informational and Organisational science, Varaždin, Croatia.
(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 9, No. 6, 2011
IP Private Branch eXchange of Saint Joseph
University, Macao: a Design Case Study
A. Cotão, R. Whitfield, J. Negreiros
Information Technology Department
University of Saint Joseph
Macau, China
Abstract- The main goal of this research is to present the specification project of a digital telephone system, an IP PBX, for the new campus of Saint Joseph University (USJ), Macao. Given that the new USJ campus at Green Island, Macao, was planned for September 2012 with the latest technologies available to achieve energy savings and to contribute to environmental sustainability, the prototype was designed using VoIP (Voice over IP) to follow this trend. It is also expected to conclude that there is a financial reason for preferring a VoIP phone system over a conventional one. Choosing this platform eliminates the need for conventional telephone wiring on the new campus, which represents considerable savings in cost and logistics, for instance. Further, good open source VoIP software is already available, such as AsteriskNOW©. Finally, the internal USJ connection to the Catholic University of Lisbon, Portugal, adds another financial reason for this project, since international calls can be quite expensive. The aim of this paper is therefore to analyze the setup, testing and implementation of a prototype, including an explanation of the difficulties, technical lessons and recommendations on the tested IP PBX.

Keywords- Voice Over IP (VoIP), Private Branch eXchange (PBX), AsteriskNOW©, PSTN, IP handsets, IP Providers.

I. INTRODUCTION
USJ is the result of a joint venture between the Catholic University of Portugal and the Diocese of Macao [1]. It was created in the NAPE area of Macao but, with the expansion of new courses and a significant increase in the number of students, the available facilities rapidly became a serious concern, affecting future growth and the overall quality of the environment. To overcome this issue, a new campus is being developed and is under construction at Green Island, Macao. This new campus has been designed to use all available new technologies in order to minimize running costs and to maintain future environmental sustainability. It is expected that the new technologies will take full advantage of solar power for heating, cooling and natural lighting, as well as rainwater collection for re-use.
Historically, some organizations and companies have used separate telephone and computer data communications networks [2]. VoIP combines both networks to greatly reduce capital and operating costs. USJ wants to adopt this approach and, thus, the technical development of the IP PBX becomes the ambition. It is believed that VoIP is the future of corporate telephone systems because it can significantly lower telecommunications costs (a significant operating cost), while the business voice network should be integrated with the data one [3]. The required software will be installed on standard computer hardware, according to the following items: (A) Understanding of the general benefits and technical trends of small and medium IP PBXs; (B) Configuration of an appropriate hardware and software combination for IP PBXs; (C) Evaluation of the final prototype; (D) Design and specification proposal for a production system suitable for use at the new USJ campus.
Concerning the structure of this paper, section two summarizes the basic concepts and relationships of PSTN (Public Switched Telephone Network), VoIP (Voice over Internet Protocol), PBX (Private Branch eXchange) and IP PBX for non-technical readers who would like to understand them. This includes business managers who are not aware of the possibility of lowering communication costs within their organizations. Section three describes the specification and design of the IP PBX server, the IP phones connected to it, and the VoIP gateway that connects the IP PBX to the Public Switched Telephone Network (PSTN) and other VoIP providers. Section four shows the setup and implementation, while section five illustrates the testing phase, including the evaluation of efficiency and reliability. Finally, the last section recommends the main technical specifications of an IP PBX for the future campus of USJ.

II. PSTN, VOIP AND IP PBXS
PSTN stands for Public Switched Telephone Network and is also known, according to [4], as POTS (Plain Old Telephone System). Born in 1876 with Alexander Graham Bell, it is a circuit-switched network where telephone handsets are interconnected through single or multiple hub exchanges (cross-connect switches). Including the mobile system, it is the main telecommunication network worldwide, with 5.4 billion subscribers (800 million fixed line and 4.6 billion mobile).
At first, all phones were connected to one another in a meshed network but, as the number of users grew, this layout became impractical. Hence, a new layout (hierarchical and star topologies) was developed to accommodate this steady growth. Once this link is established, the human voice is converted into analog form and sent through copper twisted pair cables to the central office (CO), where it is converted to digital
signals at the Digital Cross-Connect Switches [5, 6]. These digital signals can take different paths depending on whether it is a local, national or international call. If it is a local call connected to the same switch, the analog signal is routed directly to the destination number. Otherwise, the very first switch converts the analog signal to digital for routing. Later, it is converted back to analog form to be sent to the destination number.
VoIP stands for Voice over IP or Voice over Internet Protocol. It is another way to make phone calls because, instead of using the conventional PSTN network, the call is carried over the Internet infrastructure. VoIP started in 1995 when Vocaltec© released its first Internet phone software. It used the H.323 protocol and ran on any PC with the usual microphone and speakers. Vocaltec© enjoyed initial success but, due to the lack of broadband, it did not survive for long. In 2003, a major step was taken by Skype©. Unlike previous Internet experiences, Skype© used its own protocol. Since it offered good voice quality, it became a major commercial reference for VoIP. In the first half of 2010, according to the Skype© Web site, users made a total of 6.4 billion calls to landlines and mobile phones.
A VoIP call usually starts from a typical PC, an IP telephone, or an analog telephone connected to an Analog Telephone Adapter (ATA). The analog voice is converted to a succession of 0s and 1s and, afterwards, the digital signal is broken into smaller chunks called packets to be sent to their destination through the Internet [7, 8]. At the destination, the process is reversed: the packets are reassembled in the proper order and converted to an analog signal that is understandable by the human ear.
VoIP technology uses three protocols: SIP (Session Initiation Protocol, an IP signaling procedure to establish, modify and terminate VoIP calls), RTP (Real-time Transport Protocol, a standardized packet format for delivering audio/video over the Internet) and RTCP (RTP Control Protocol) to monitor transmission statistics and Quality of Service (QoS) status.
The connection between VoIP and PSTN is done through a communication protocol named ENUM, which stands for telephone number mapping [9]. Basically, ENUM translates telephone numbers into a format that can be used by the Internet. Bear in mind that PSTN is a circuit switched network that uses telephone numbers, while the Internet is a packet switched network that uses Uniform Resource Identifiers (URIs) for addressing each device. The ENUM protocol enables circuit switched traffic to be carried on a packet switched network by matching a circuit address (telephone number) to a network address (URI). Hence, ENUM links PSTN and Internet, providing a means for Internet connected phones to receive calls from and make calls to the PSTN network.
Calls between subscribers of the same VoIP provider (VoipBuster© or Vonage©, for instance) are usually free. Calls between subscribers of different VoIP providers should also have no associated costs. On the other hand, when calls originate from VoIP providers and terminate at the PSTN, the costs involved equal the interconnection costs charged by the VoIP provider for the gateway usage that connects the Internet to the local PSTN. To minimize these interconnection charges, VoIP subscribers use the Internet up to the nearest PSTN termination point.
Once again, a key distinction between PSTN and the Internet is the identification of the destination. Within the PSTN, lines, rather than devices, are identified, that is, your home phone number is the phone line to your place (in fact, you can attach different devices to this line, such as faxes and answering machines). By contrast, the Internet identifies each specific device by its IP address.
IP PBX stands for Internet Protocol Private Branch eXchange and, basically, is an IT framework which uses the common LAN (Local Area Network), Internet, PSTN and VoIP providers for communication purposes (see Figure 1).

Figure 1. IP PBX and Internet integration.

From the viewpoint of the end-user, there are no differences. For organizations, the change lies in the system setup and in phone calls, which are less costly. Basically, an IP PBX consists of a server with special hardware for PSTN interfaces, SIP phones and VoIP gateways. Indeed, SIP phone, VoIP phone and IP phone are different names for the same device [10]. It is this device that connects the user with the IP PBX. All the calls received are then routed to their destination automatically, according to a set of rules (the dial plan) that must be programmed beforehand.
An IP PBX server operates similarly to a proxy server, that is, the SIP clients (soft phones or handset IP phones) register with it and, when they wish to make a phone call, they request the IP PBX to establish the connection. Internally, the IP PBX has a directory of all telephones/users with their corresponding SIP addresses. Therefore, it is possible to connect an internal call (located in the same LAN) or route an external call through either a VoIP gateway or a VoIP service provider. The connections with external parties are done through SIP trunks and, optionally, via a VoIP gateway, where the SIP trunk bonds the IP PBX to an ITSP (Internet Telephony Service Provider). Notably, physically moving a SIP phone does not affect its relationship to the IP PBX.

III. SYSTEM SPECIFICATION AND DESIGN
According to Figure 2, the IP PBX prototype will run Asterisk©, whose hardware equipment and software package specifications are as follows: (A) Computer hardware (Intel
Pentium© Dual CPU, RAM of 3GB, NVIDIA© GeForce 9500
GT graphics card, Realtek© RTL8168C network card, 120GB
Maxtor© 6Y120PO ATA HD with a DVD drive); (B) Software
(AsteriskNOW© 1.5 based on Linux CentOS©, MySQL© and
Apache Web server); (C) Connections (local extensions for the
same IP PBX LAN, four remote extensions, links to Macao
PSTN through a VoIP gateway and, as expected, connection to
the Portugal PSTN network through four different VoIP
providers to minimize call costs).
Figure 3. On the left, the GXP2000© is an IP handset suitable for both small and large business organizations and it can be connected directly to any LAN (it handles up to four simultaneous VoIP calls). It has a dual 10M/100Mbps Ethernet port, an intuitive user interface, a large back-lit LCD display with multiple language support and privacy protection. On the right, the Siemens© Gigaset C470 IP is a cordless IP phone which can connect to any IP PBX, via a LAN, as well as to a local PSTN.

In more detail, the three local extensions will follow this pattern: 1001 (ATA Linksys SPA3102), 1002 (soft-phone CounterPath’s X-Lite©) and 1005 (Grandstream© GXP2000 IP Phone) for Macao calls. Regarding external connections, there will be one VoIP gateway to access the Macao PSTN landline and four VoIP providers to minimize international call charges: G9 Telecom© to receive and make phone calls from and to Portuguese nomadic phone numbers, SMSDiscount© to connect to the Portuguese PSTN land lines, VoIPCheap© to connect to the Macanese and Hong Kong PSTN land lines and, finally, SmartVoIP© to connect to the Portuguese mobile network (see Figure 2).
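As a rough illustration of the register-then-route behaviour described in Section II, applied to the three local extensions listed above, the following Python sketch models the IP PBX directory as a dictionary. It is a toy model and not a SIP implementation: the class name `ToyPbx`, its methods and the private IP addresses are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPbx:
    """Toy model of the IP PBX directory (hypothetical, not real SIP)."""
    directory: dict = field(default_factory=dict)  # extension -> current IP address

    def register(self, extension: str, address: str) -> None:
        # A client (soft-phone, ATA or IP handset) announces where it can be reached.
        self.directory[extension] = address

    def route(self, dialled: str) -> str:
        # Internal call: the dialled number is a registered extension.
        if dialled in self.directory:
            return f"internal -> {self.directory[dialled]}"
        # Anything else has to leave through a trunk (VoIP gateway or provider).
        return "external -> trunk"


pbx = ToyPbx()
pbx.register("1001", "192.168.1.11")  # ATA Linksys SPA3102 (address assumed)
pbx.register("1002", "192.168.1.12")  # X-Lite soft-phone (address assumed)
pbx.register("1005", "192.168.1.15")  # Grandstream GXP2000 (address assumed)

print(pbx.route("1005"))           # call to a local extension
print(pbx.route("0085328123456"))  # call to an outside number

# Moving a phone only changes the address it registers from;
# the extension number, and therefore the dial plan, stays the same.
pbx.register("1002", "192.168.1.40")
print(pbx.route("1002"))
```

Keeping the directory keyed by extension rather than by address is what lets a phone move without any change on the caller's side.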
Figure 4. On the left, the IP soft-phone CounterPath’s X-Lite© can be used to make and receive voice/video calls [11]. The minimum specifications required to connect and operate with the IP PBX are the G.711 audio codec, the SIP protocol and caller ID/voicemail. In the center, the ATA Linksys© SPA3102 is an adapter that connects analog telephones and fax machines to the IP PBX through a computer data network. It can also connect to any local PSTN. On the right, the ATA Linksys© PAP2T links one or two analog phones to the IP PBX in order to create one or two extra extensions.
Figure 2. The IP PBX layout (512Kbps is the minimal recommended Internet bandwidth).

Regarding the specifications of the IP PBX clients, there are different brands, models and types of IP phones available in the market, such as soft-phones, ATAs (Analog Telephone Adapters) and handset IP phones. With the exception of the first option, which can be downloaded free of charge, the remaining ones are not available in Macao. For the purpose of testing, one Grandstream© GXP2000 (standard desktop IP phone handset) and one ATA Linksys© SPA3102 (it works as an analog handset extension and also as a VoIP gateway to connect the IP PBX to the Macao PSTN) were purchased from an online USA site. The ATA Linksys© PAP2T and the Siemens© Gigaset C470 IP phone (used as remote extensions located in Portugal) were purchased from an online Portuguese site (see Figures 3 and 4); they can work with more than one VoIP provider as long as they are properly configured. Moreover, both are able to work as an extension of the IP PBX prototype.

IV. SYSTEM SETUP AND IMPLEMENTATION
To start, the Asterisk©NOW IP PBX prototype must be integrated both with the Macao CTM provider (with a broadband Internet connection) and with USJ’s LAN. The ATA Linksys© SPA3102 will provide the link between the PSTN network and all USJ analog phones. Basically, it will work as a VoIP Internet gateway as well as an extension of the IP PBX. Hence, the bond between the IP PBX and the Macao residential PSTN is established through this VoIP gateway. Secondly, the ATA Linksys© PAP2T connects two analog handsets, which work as two different extensions of the IP PBX. Third, the IP PBX server and all its local extensions (ATA, handset IP phone and soft-phone) will be integrated with the CTM Macao network through an ADSL modem and a broadband router (see Figure 5). This router implements NAT (Network Address Translation) and firewall functionality in order to protect the local USJ LAN from Internet intrusion. It also provides a DHCP service to allocate private IP addresses to the local LAN equipment. Fourth, the IP PBX LAN router has to be configured to allow UDP (User Datagram Protocol) data packets to pass through it and to be forwarded to the right IP address. For this purpose, several UDP port ranges were set up: 5004 to 5037 (Real-time Transport Protocol, RTP), 5039 to 5082 (Session Initiation Protocol, SIP) and 10000 to 20000 (extra RTP ports). In brief, SIP ports are used for signaling the
connection between two IP phones (the telephone ring) and, when the called IP phone answers, the RTP protocol starts being used, since it is mainly responsible for transporting the audio packets [12].

Figure 5. The overall IP PBX network diagram.

Notice that the IP PBX server computer must be configured with a private static IP address to simplify the configuration of the local VoIP extensions and of port forwarding. This configuration is accomplished at the router level and is based on the Port Range Forwarding procedure, as Figure 6 shows.

Figure 6. Example of a port range forwarding router configuration.

The Asterisk©NOW download (version 1.5) comes as an integrated distribution that includes a Linux distribution (a stripped down version of CentOS©), the MySQL© database, the Apache© Web server, the PHP© Web programming language, the Asterisk© IP PBX server and the FreePBX Asterisk© administration packages. The installation sequence is summarized in Table 1. On the configuration side, this includes the creation of Asterisk©NOW extensions, trunks, inbound/outbound routes, Follow me, Disa and conference functions, Backup/Restore and DID (Direct Inward Dial).
Extensions are used for internal calls that only involve the IP PBX. Trunks are used for external calls that are routed through VoIP gateways or VoIP providers [13]. They cover calls from and to outside parties, that is, PSTN numbers, nomadic numbers (VoIP numbers) or other IP PBXs. Once again, extensions are all those numbers assigned to soft-phones, ATAs or IP phones directly connected to the IP PBX (configured in the IP PBX itself and in each IP handset). For instance, the common telephone number that people have on their office desk usually holds a sub-number with three or four digits. It is this sub-number that allows them to call their office colleagues or third PSTN parties, as long as these extensions are defined in the office PBX.

TABLE I. INSTALLATION OF ASTERISK©NOW SOFTWARE PACKAGE.
Step | Action
1 | Splash screen and choose install.
2 | Format hard drive.
3 | Accept default disk partitioning.
4 | Choose the time zone.
5 | Input the “root” password.
6 | When the installation finishes, remove the bootable CD and reboot the PC.
7 | Configure the firewall setting to “Enabled” and “Permissive”.
8 | Choose a static IP instead of DHCP in the network configuration.
9 | The IP PBX configuration finishes by rebooting, after which the login screen is displayed.

The trunk lines allow the IP PBX to connect with external parties, that is, they link it to the PSTN and to VoIP providers. In this case, the available trunk connects to the Macao CTM PSTN through a VoIP gateway (the ATA Linksys© SPA3102). It can be used for both outgoing and incoming calls. Alternatively, the VoIP trunks allow the system to call external parties (VoIP and PSTN telephone numbers in other countries) through local VoIP providers using the following Internet infrastructure: (A) G9 Telecom© for incoming/outgoing phone calls to Portugal nomadic numbers; (B) VoIPCheap© to make land line and mobile phone calls to Hong Kong and Macau; (C) SMSDiscount© to make land line calls to Portugal; (D) SmartVoIP© for outgoing mobile calls to Portugal. For instance, if someone is using one of the extensions connected to the IP PBX and needs to phone a CTM© number or a Macao mobile one, the call will be established through the Macao PSTN trunk (or through the VoIPCheap© trunk, if the PSTN trunk is already in use). However, if he/she wants to call a PSTN land line in Portugal, this call will be established through the SMSDiscount© trunk.
The IP PBX decides which trunk to use through the Outbound Routes, which define what to do when an external telephone call arrives at the IP PBX or when someone dials an external phone number (a strategy definition for which trunk should be used to establish any particular connection).
With the present prototype, only two trunk lines have inbound routes to be configured: the Macao PSTN line and the Portugal nomadic number. The remaining trunks have no associated telephone number and cannot receive telephone calls (they are only used for outbound calls). According to Table 2, all phone calls received from the Macao PSTN trunk line were set to be forwarded to extension 1005 (IP phone Grandstream© GXP2000 handset). Similarly, all incoming calls from Portugal nomadic numbers were routed to the same extension.

TABLE II. CREATION AND CONFIGURATION OF INBOUND ROUTES WITHIN ASTERISK©NOW.
Step | Action
1 | Within the Inbound Routes menu, choose the Add Incoming Route option.
2 | Add the description (for instance, the device model SPA3102) and the CTM PSTN number.
3 | Choose the destination (set destination menu) for incoming phone calls on this trunk route. Keep in mind that for this trial product, all
the incoming calls will be sent to the extension 1005 of the Grandstream© IP phone GXP2000.
4 | Click the Submit/Apply button.

The defined dial plan for all outbound routes is summarized in Table 3. Note that the international country codes of Macao, Portugal and Hong Kong are 853, 351 and 852, respectively.

TABLE III. DIAL PLAN FOR THE IP PBX PROTOTYPE SYSTEM.
Trunk | Inbound route | Outbound route
Macao-PSTN (in case of failure, VoIPCheap© will be used) | Destination = Extension 1005 | 008531.; 0085328XXXXXX; 008536XXXXXXX; 008538XXXXXXX; 00853999; 00852XXXXXXXX
G9 Telecom© | | 003513XXXXXXXX
SMSDiscount© | | 003512XXXXXXXX; 003517XXXXXXXX; 003518XXXXXXXX
SmartVoIP© (in case of failure, SMSDiscount© will be used) | | 003519XXXXXXXX

In line with Table 4, different dial plans with different VoIP provider trunks were configured to minimize call costs.

TABLE IV. RATES CHARGED BY THE MAIN VOIP PROVIDERS OF USJ (CHARGE RATE IN EUROS/MINUTE).
Destination | G9 Telecom© | VoIPCheap© | SMSDiscount© | SmartVoIP©
Hong Kong (Land Line) | 0.100 | 0.000 | 0.010 | 0.000
Hong Kong (Mobile) | 0.250 | 0.000 | 0.005 | 0.000
Macao (Land Line) | 0.250 | 0.020 | 0.030 | 0.030
Macao (Mobile) | 0.250 | 0.030 | 0.030 | 0.030
Portugal (Land Line) | 0.016 | 0.000 | 0.000 | 0.000
Portugal (Mobile) | 0.106 | 0.100 | 0.065 | 0.060

The Follow me function is applied whenever the user is not able to receive a phone call at his/her extension and wants to forward that call to another extension or even to an external phone number (either a PSTN or a VoIP number). In this particular case, the IP PBX was set up to apply it every weekday (after working hours) and on those days when someone is out of the office (it forwards all the calls from the present extension to the user's mobile).
The Disa function is applied whenever a user needs to make a costly phone call (an international one to New Zealand, for instance). To avoid being personally charged for it, the user calls a Disa-enabled phone number of the office. If the requested password is correct, the IP PBX sets up the line automatically, according to the cheapest defined route (the dial plan of the outbound routes). The password is required for safety purposes.
The conference function connects users from different places to minimize the cost of phone calls [14]. There are two ways to set up this procedure [15]: (A) The participants are informed in advance of the date/time of the conference and, at that time, they call the conference phone number; (B) The participants are informed in advance of the date/time of the conference and, at the pre-defined date/time, the IP PBX administrator pulls them into the conference through the FreePBX Flash panel (the users' phones will ring and, after they answer, the conference call is already set up).
The Direct Inward Dial (DID) feature redirects PSTN phone numbers that share a single prefix while the last two, three or four digits vary. Thus, each number of the block corresponds to a different extension. To implement DID, a special kind of trunk must first be requested from the PSTN provider. On this particular line, as each call is started, the suffix digits are passed to the IP PBX so that it can decide which extension to route the call to. Usually, PSTN telephone numbers are obtained as a block of numbers, for instance, from 28831000 to 28831009 (a block of ten numbers, in this case). This block of numbers is then configured to match the extensions defined in the IP PBX. Typically, the first number 28831000 is a direct line for the receptionist while the remaining ones are DIDs (28831001 signifies organization extension 1001, 28831002 means organization extension 1002 and so on). Hence, any external user who wants to make a direct phone call to extension 1005 just needs to dial the telephone number 28831005.
Naturally, the soft-phones are configured directly on each laptop, while the handset IP phones and ATAs have to be configured in a different way, depending on the brand and model. Extension soft-phones and other handsets need to be registered at the IP PBX, each with a different IP address (a password must be supplied as part of the registration process). If an extension is turned off or disconnected from the network, for instance, the IP PBX will divert calls to voicemail or to another pre-defined function. Extensions on the same LAN can also be hard coded with the IP address of the IP PBX. Outside extensions are different, depending on whether or not the IP PBX has a public IP address. In this case study, the DNS (Domain Name System) is used to obtain the required IP address. Although five IP phones were installed (CounterPath's X-Lite©, Linksys ATA SPA3102©, Linksys© ATA PAP2T, Grandstream© GXP2000 and Siemens© Gigaset C470 IP), Table 5 only shows the four main steps for the first appliance.

TABLE V. CONFIGURATION OF IP COUNTERPATH’S X-LITE©.
Step | Action
1 | Download the CounterPath’s X-Lite© software (http://www.counterpath.com/x-lite.html).
2 | On the SIP Accounting Settings menu, add a new account.
3 | Fill the following fields: Display name, Extension number, Password, Authorization user name and Domain (the IP address of the IP PBX).
4 | Configuration concludes after the Ready message.
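To make the routing rules of this section more concrete, the following Python sketch shows one possible way to express part of the outbound dial plan of Table III, including the two failover rules, together with the DID example block 28831000 to 28831009. It is only an illustration under stated assumptions and not Asterisk or FreePBX configuration: the Macao patterns are simplified to the single prefix 00853, only a subset of the table is modelled, and the names `OUTBOUND_ROUTES`, `select_trunk`, `did_to_extension` and the `unavailable` argument are hypothetical.

```python
# Simplified subset of Table III: (dialled prefix, primary trunk, fallback trunk).
OUTBOUND_ROUTES = [
    ("00853", "Macao PSTN", "VoIPCheap"),     # Macao numbers (patterns simplified)
    ("003513", "G9 Telecom", None),           # Portuguese nomadic numbers
    ("003512", "SMSDiscount", None),          # Portuguese land lines
    ("003517", "SMSDiscount", None),
    ("003518", "SMSDiscount", None),
    ("003519", "SmartVoIP", "SMSDiscount"),   # Portuguese mobiles
]

DID_BASE = 28831000  # first number of the example block (receptionist line)


def select_trunk(dialled, unavailable=frozenset()):
    """Return the trunk for a dialled number, or None if nothing usable matches."""
    # Try longer prefixes first so that the most specific route wins.
    for prefix, primary, fallback in sorted(OUTBOUND_ROUTES, key=lambda r: -len(r[0])):
        if dialled.startswith(prefix):
            for trunk in (primary, fallback):
                if trunk is not None and trunk not in unavailable:
                    return trunk
            return None
    return None


def did_to_extension(did_number):
    """Map a DID of the example block (28831001-28831009) to extension 1001-1009."""
    offset = int(did_number) - DID_BASE
    if 1 <= offset <= 9:
        return str(1000 + offset)
    return None  # the receptionist line and foreign numbers are not DIDs here


print(select_trunk("00351912345678"))                             # Portuguese mobile
print(select_trunk("00351912345678", unavailable={"SmartVoIP"}))  # failover case
print(did_to_extension("28831005"))                               # direct call to 1005
```

Sorting by prefix length is a design choice of this sketch; a real dial plan engine applies its own pattern-matching precedence.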
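The cost argument behind Table IV can also be checked with a few lines of arithmetic. The sketch below, again in Python, transcribes the per-minute rates of Table IV and picks the cheapest provider for a destination. The function names `cheapest_providers` and `call_cost` are hypothetical, and the sketch ignores everything except the per-minute rate (connection fees, free-minute limits and similar conditions are not covered by the table).

```python
# Rates in euros per minute, transcribed from Table IV.
RATES = {
    "Hong Kong (Land Line)": {"G9 Telecom": 0.100, "VoIPCheap": 0.000, "SMSDiscount": 0.010, "SmartVoIP": 0.000},
    "Hong Kong (Mobile)":    {"G9 Telecom": 0.250, "VoIPCheap": 0.000, "SMSDiscount": 0.005, "SmartVoIP": 0.000},
    "Macao (Land Line)":     {"G9 Telecom": 0.250, "VoIPCheap": 0.020, "SMSDiscount": 0.030, "SmartVoIP": 0.030},
    "Macao (Mobile)":        {"G9 Telecom": 0.250, "VoIPCheap": 0.030, "SMSDiscount": 0.030, "SmartVoIP": 0.030},
    "Portugal (Land Line)":  {"G9 Telecom": 0.016, "VoIPCheap": 0.000, "SMSDiscount": 0.000, "SmartVoIP": 0.000},
    "Portugal (Mobile)":     {"G9 Telecom": 0.106, "VoIPCheap": 0.100, "SMSDiscount": 0.065, "SmartVoIP": 0.060},
}


def cheapest_providers(destination):
    """Return (lowest rate, sorted list of providers charging that rate)."""
    rates = RATES[destination]
    best = min(rates.values())
    return best, sorted(name for name, rate in rates.items() if rate == best)


def call_cost(destination, provider, minutes):
    """Cost in euros of a call, using only the per-minute rate of Table IV."""
    return RATES[destination][provider] * minutes


rate, providers = cheapest_providers("Portugal (Mobile)")
print(providers, rate)
print(round(call_cost("Portugal (Mobile)", "G9 Telecom", 10), 2))
print(round(call_cost("Portugal (Mobile)", "SmartVoIP", 10), 2))
```

For Portuguese mobiles the lowest rate in the table belongs to SmartVoIP©, which agrees with the trunk chosen for the 003519 pattern in Table III.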
V. SYSTEM TESTING
The next step is to evaluate the efficiency and reliability of the IP PBX prototype: telephone calls among extensions located in the same or in different IP PBX LANs, inbound/outbound connections (via SIP and PSTN trunks), voicemail, Follow me, Disa and conference capabilities. As shown below, the tests of Asterisk©NOW include its start up procedure (see Table 6), IP PBX access (see Table 7) and Secure Shell access (see Table 8), client registration (see Figure 7), and PSTN and SIP trunks (see Table 9).
TABLE VI. IP PBX SERVER START UP TEST RESULTS. IT WAS OBSERVED THAT THE SERVER REQUIRED 29 SECONDS TO SHUTDOWN.
Date | Start Time | Duration (Seconds) | Start Up Errors?
2010.10.17 | 16:13:00 | 83 | No
2010.10.17 | 16:23:00 | 83 | No
2010.10.17 | 16:34:30 | 83 | No
2010.10.17 | 16:44:00 | 83 | No
2010.10.17 | 16:55:00 | 83 | No

TABLE VII. CHECK RESULTS OF THE IP PBX SERVER ACCESS THROUGH THE FREEPBX GUI INTERFACE.
Date | Access Time (hh:mm:ss) | Login accepted? | Access to IP PBX? | Test Result
2010.10.17 | 16:15:00 | Yes | Yes | Pass
2010.10.17 | 16:25:00 | Yes | Yes | Pass
2010.10.17 | 16:36:30 | Yes | Yes | Pass
2010.10.17 | 16:47:00 | Yes | Yes | Pass
2010.10.17 | 16:58:00 | Yes | Yes | Pass

TABLE VIII. TEST RESULTS OF IP PBX SERVER ACCESS VIA SSH CLIENT INTERFACE.
Date | Access Time (hh:mm:ss) | Login accepted? | Access to IP PBX? | Test Result
2010.10.17 | 16:18:00 | Yes | Yes | Pass
2010.10.17 | 16:28:00 | Yes | Yes | Pass
2010.10.17 | 16:39:30 | Yes | Yes | Pass
2010.10.17 | 16:50:00 | Yes | Yes | Pass
2010.10.17 | 17:02:00 | Yes | Yes | Pass

TABLE IX. EVALUATION OF PSTN AND SIP TRUNKS. ONCE AGAIN, NO IRREGULARITIES WERE FOUND.
Time | Trunk registered: PSTN | G9 Telecom | VoIPCheap | SMSDiscount | SmartVoIP
18:56 | Yes | Yes | Yes | Yes | Yes
19:05 | Yes | Yes | Yes | Yes | Yes
19:13 | Yes | Yes | Yes | Yes | Yes
19:20 | Yes | Yes | Yes | Yes | Yes
19:28 | Yes | Yes | Yes | Yes | Yes

Figure 7. Snapshot of the IP phones (IP PBX clients) extensions (1001, 1002, 1005, 1007, 1025, 1026 and 1029) registered after the start up procedure. No abnormalities were found.

Afterwards, the phone calls between extensions of the same/different IP PBX LAN followed. Figures 8 and 9 show the call status information between extensions 1001-1002 and 1029-1025; the voice quality was considered excellent in both cases, according to the excellent, good, reasonable and bad scale. The same result was found for calls between other extensions, such as 1002 to 1005 and 1005 to 1001.

Figure 8. Snapshot of the in boundary call (Macao-Macao) between 1001 and 1002 extensions.

Figure 9. Snapshot of the out boundary call (Macao-Maputo, Mozambique) between 1029 and 1025 extensions.

Finally, Tables 10 and 11 exhibit some trial results of the inbound and outbound calls using PSTN and SIP trunks. Likewise, the voice mail (from extension 1001 to 1002 and 1025), Follow me and Disa (see Table 12) tests occurred with no major problem.

TABLE X. APPRAISAL RESULTS OF THE OUTBOUND CONNECTION TO MACAO USING PSTN TRUNKS.
Start Time (hh:mm) | Duration (hh:mm) | Origin: PSTN | Origin: Place | Destination: Fix Line or Mobile | Destination: Place | Voice Quality
20:08 | 02:59 | 1001 | Macao | Mobile | Macao | Good
20:15 | 05:23 | 1005 | Macao | Mobile | Macao | Good

TABLE XI. ASSESSMENT RESULTS OF THE OUTBOUND CONNECTION TO OTHER COUNTRIES USING SIP TRUNKS.
Start Time  Duration  Origin        Destination                Voice
                      PSTN   Place  Fix Line or Mobile  Place  Quality
20:47       02:23     1001   Macao  Fix Line            Macao  Good
20:29       02:51     1002   Macao  Mobile              Macao  Good

TABLE XII. EXAMINATION RESULTS OF THE DISA FUNCTION.
Access (hh:mm)  Disa Active?  Line given by IP PBX?  Call Success?  Test Result
22:27  No  No  No  Pass
22:30  Yes  Yes  Pass

The conference function was tested with the extensions 1001 (Macao), 1025 and 1026 (Portugal) and 1029 (Mozambique). According to Figure 10, the overall quality was considered quite good.

Figure 10. Snapshot of the conference call among extensions 1001, 1025, 1026 and 1029.

VI. FINAL THOUGHTS
For personal use, a well-known VoIP application is Skype©. This application allows audio and video communications at very low cost (from Skype© to PSTN telephones) or even at no cost at all (from Skype© to Skype©). The motivation of this project is therefore to assemble, implement and configure a VoIP phone system for USJ needs based on Linux© and other FOSS (Free Open Source Software) technologies [16]. The previous results indicate that it is possible to set up an IP PBX for USJ, including IP phones.
Regarding future work, deployment of the production equipment currently depends on the completion of the new campus. Still, two technical lessons from the past should be highlighted: (A) to have two Internet broadband lines, one for the IP PBX server and another for the remaining Internet data traffic; (B) to plan the SIP and RTP ports carefully so that both protocols work together without conflicts.
Finally, based on the projected number of USJ staff and students by 2012, the following components are recommended: (A) an IP PBX server (2 Intel Xeon© 7500 series processors, 16GB RAM, motherboard with standard graphics and dual 100/1000 Ethernet NIC cards, RAID 5 redundancy with 1TB per HD, UPS and a Digium/ATA gateway to connect to the Macao PSTN); (B) for the clients, appliances from the Polycom©, Cisco© and Linksys© brands are highly recommended; (C) Asterisk©NOW as the core software.

REFERENCES
[1] USJ (University of Saint Joseph), “About University of Saint Joseph History”, available at http://www.usj.edu.mo/?content_left&col=1&id=15 [accessed Mar 09].
[2] Boavida, F., Monteiro, E., “Computer Networks Engineering”, 4th Edition, ISBN 972-722-203-x, FCA-Lidel, 2007, pp. 554.
[3] Vestias, M., “CISCO Networking”, 4th Edition, FCA-Lidel, ISBN 978-972-722-506-4, 2010, 648 p.
[4] Hamdi, M., Verscheure, O., Hubaux, J.-P., Dalgic, I., Wang, P., “Voice Service Interworking for PSTN and IP Networks”, Communications Magazine, IEEE, Vol. 37 (5), ISSN 0163-6804, 1999, pp. 104-111.
[5] Obara, H., Yasushi, T., “An Efficient Contention Resolution Algorithm for Input Queuing ATM Cross-Connect Switches”, International Journal of Digital & Analog Cabled Systems, Vol. 2 (4), 1989, pp. 261-267.
[6] Holtmanns, S., Horn, G., Moeller, W., “Identity Management in Mobile Communication Systems”, in Selected Topics in Communication Networks and Distributed Systems, Sudip Misra & Isaac Woungang (Eds), World Scientific Publishing, 2010, pp. 709-730.
[7] Gratz, J., “Voice Over Internet Protocol”, Science & Technology, 6 Minn. J.L., 2004, 443 pp.
[8] Prabhakaran, K., “Advanced Link State Protocol”, in Computer and Network Technology: Proceedings of the International Conference on ICCNT 2009, Zhou & Mahadevan (Eds), World Scientific Publishing, 2009, pp. 89-92.
[9] Neustar, “What Is ENUM?”, available at http://www.enum.org/what.html [accessed Oct 10].
[10] Barbeau, M., Boone, P., Kranakis, E., “Wimax/802.16 Broadband Wireless Networks”, in Selected Topics in Communication Networks and Distributed Systems, Sudip Misra & Isaac Woungang (Eds), World Scientific Publishing, 2010, pp. 79-111.
[11] Blueface, “CounterPath X-Lite Softphone Specifications”, available at http://www.blueface.ie/helpandadvice/specification/xlite.aspx [accessed Sep 10].
[12] Sharma, S., “Hello Expired Time Based Greedy Routing Scheme for Mobile Ad Hoc Networks”, in Computer and Network Technology: Proceedings of the International Conference on ICCNT 2009, Zhou & Mahadevan (Eds), World Scientific Publishing, 2009, pp. 45-49.
[13] Smarter, “Linksys SPA3102 Voice Gateway with Router - VoIP gateway”, available at http://www.smarter.com/bridges-routers/linksys-spa3102-voice-gateway-with-router-voip-gateway/pd--ch-2--pi-770317.html [accessed Sep 10].
[14] Hallberg, B., “Networking”, 5th Edition, McGraw-Hill Professional Publishing, 2009, p. 415.
[15] Asterisk, “Forum-AsteriskNOW Support”, available at http://forums.digium.com/viewforum.php?f=14&sid=feeaa4f3fbe8e9fc11706bb68efd5cf1&start=2200 [accessed Sep 10].
[16] Chava, K., How, J., “Integration of Open Source and Enterprise IP PBXs, Testbeds and Research Infrastructure for the Development of Networks and Communities” (3rd International Conference), ISBN 978-1-4244-0739-2, 2007, pp. 1-6.

AUTHORS PROFILE
António Cotão holds a Master's degree in Information Technology from the University of Saint Joseph, Macau, China, and currently works in the logistics and financial department of a Portuguese pharmaceutical company (Hovione).
Richard Whitfield is a full professor at the University of Saint Joseph, Macau, China, and is currently one of those responsible for the construction of the new USJ campus. He holds a doctoral degree in Manufacturing from the University of Melbourne, Australia.
João Negreiros has been an associate professor at the University of Saint Joseph since 2011 and holds a doctoral degree in Information Technology from the New University of Lisbon, Portugal.
(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 9, No. 6, June 2011
A Novel and Secure Data Sharing Model with Full
Owner Control in the Cloud Environment
Mohamed Meky and Amjad Ali
Center of Security Studies
University of Maryland University College
Adelphi, Maryland, USA
mmeky@faculty.umuc.edu and aali@umuc.edu
Abstract- Cloud computing is a rapidly growing segment of the IT industry that will bring new service opportunities with significant cost reduction in IT capital expenditures and operating costs, on-demand capacity, and pay-per-use pricing models for IT service providers. Among these services are Software-as-a-Service, Platform-as-a-Service, Infrastructure-as-a-Service, Communication-as-a-Service, Monitoring-as-a-Service, and Storage-as-a-Service. Storage-as-a-Service provides data owners with a cost-effective service to store massive data and handles efficient routine data backup by utilizing the vast storage capacity offered by a cloud computing infrastructure. However, shifting data storage to cloud computing infrastructure introduces several security threats to data, as cloud providers may have complete control over the computing infrastructure that underpins the services. These security threats include unauthorized data access, compromised data integrity and confidentiality, and less direct control over data for the data owner. The current literature proposes several approaches for storing and sharing data in cloud environments. However, these approaches are applicable only to specific data formats or to specific encryption techniques. In this paper, unlike previous studies, we introduce a secure and efficient model that allows data owners to have full control over data sharing in the cloud environment. In addition, it prevents cloud providers from revealing data to unauthorized users. The proposed model can be used in different IT areas, with different data and encryption techniques, to provide secure data sharing for fixed and mobile computing devices.

Keywords- cloud computing; cloud storage; data sharing model; data access control; data owner full control; cloud storage as a service; data encryption

I. INTRODUCTION
Cloud computing is a rapidly growing segment of the IT industry that will bring new service opportunities with significant cost reduction and increased operating efficiency for IT vendors. Cloud computing includes three major models: Software-as-a-Service, Platform-as-a-Service, and Infrastructure-as-a-Service [1]. Additional models are evolving as the concept of cloud computing develops new services such as Storage-as-a-Service, Communication-as-a-Service, and Monitoring-as-a-Service. An important characteristic of cloud computing is pay-per-use [2]. Customers pay for cloud services only when they use them. Several cloud services are available to the public, such as the Google App Engine [3] and Microsoft Live Mesh [4]. Storage-as-a-Service, such as Amazon Simple Storage Service [5], gives data owners a cost-effective service to store massive data and handles efficient routine data backup by utilizing the vast storage capacity offered by a cloud computing infrastructure. In addition, it gives customers the ability to expand and reduce IT resources as needed. However, with the development of cloud computing, deployment of IT systems and data storage is shifted to off-premises third-party IT infrastructures, i.e., cloud computing platforms. Shifting data storage to cloud computing infrastructure introduces several security threats to data, as cloud providers may have complete control over the computing infrastructure that underpins the services. These security threats include unauthorized data access, compromised data integrity and confidentiality, and less direct control over data for data owners. To overcome these threats, we present a secure and efficient model that allows data owners to have full control to grant or deny data sharing in the cloud environment. In addition, the proposed model ensures data integrity and confidentiality, and prevents cloud providers from revealing data to unauthorized users. The proposed model can be used in several applications such as remote file storage, data publication, on-demand data access, and online educational programs. Each application can use its own data format and encryption technique to provide secure data sharing in the cloud. In addition, the proposed model requires low computing power (e.g. symmetric encryption) and a one-step authentication to accept or deny a data access request. Therefore, it can be used with low computing power devices such as mobile devices. The remainder of this paper is organized as follows. In section II, we survey and analyze the related work. Section III describes the details of our proposed model, followed by the security analysis in section IV, and finally, section V concludes the paper.

II. RELATED WORK
Deployment of storage as a cloud computing service, where data storage is shifted to off-premises third-party infrastructure, introduces special security threats. Therefore, data owners have to establish the following special security requirements to safeguard the data in the midst of untrusted cloud environments:
A. Ensuring Data Integrity and Confidentiality
The cloud storage providers should not have the capability of compromising the integrity and confidentiality of the data stored in the cloud. Confidentiality means keeping users' data secret in the cloud systems, while data integrity means preserving information integrity, i.e., no data loss or modification by unauthorized users [6].

B. Controlling Data Access and Sharing
The data owner should be the only authority that grants data access to authorized users.

C. Authentication
Authentication is used to verify the claimed identity of the data owner, user, or other entity [7], such as the cloud provider.
To meet these security requirements, data owners have to enforce authorization access policies that prevent revealing data information to cloud service providers or unauthorized users. Previous studies proposed several approaches for storing and sharing data in cloud environments. However, these approaches are applicable only to specific data formats or to specific encryption techniques. For example, the model introduced in [8] applies the publisher policy model presented in [9] to secure storage of Extensible Markup Language (XML) data in the cloud by adding a special secure co-processor to the storage machine, as part of the cloud infrastructure, to enable efficient encryption of the stored XML documents. Although the mechanism published in [8] may enforce the owner's policies on XML documents, the cloud providers have access to plain XML data. Reference [10] introduced a model for securing data sharing on the cloud. In that model, data sharing is achieved by having the cloud provider re-encrypt the data for the authorized users. Although the model illustrated in [10] can enforce sharing policies specified by data owners and prevent unauthorized access to data, it works only with one encryption technique (progressive elliptic curve encryption) and requires the cloud provider to re-encrypt the encrypted data before forwarding it to authorized users. Reference [11] introduced a model to outsource very large blocks of data by encrypting each block of data with a different encryption key. However, the model published in [11] fails to demonstrate how a user will ensure data confidentiality after receiving data from the cloud. In addition, whenever a user's access right is revoked, the data block group needs to be fragmented and several data blocks need to be re-encrypted. Our model is more secure and more efficient than the model presented in [11] and immune to eavesdropping attacks since, in our model, a user is not allowed to communicate with the cloud provider. In summary, our model gives the data owner full control to grant or deny data sharing in the cloud using efficient and secure procedures. In addition, it prevents cloud providers from revealing data contents to unauthorized users. The proposed model can be used in several applications (e.g. remote file storage, data publication, online educational programs), with different data and encryption techniques, to provide secure data sharing for both fixed and mobile computing devices.

III. THE PROPOSED MODEL
In this section, we explain our proposed access model based on the scenario illustrated in Figure 1 and the notations listed in Table I. As shown in Figure 1, a data owner, who stores his encrypted data in the cloud, receives a data access request from a user. After successfully authenticating the user and checking the policies relevant to the user, the data owner sends a control message to the user and a data access permit to the cloud storage provider. The data access permit has the relevant information that allows the cloud storage provider to apply the data owner's policy and provide specific data to the user. Meanwhile, the control message sent by the data owner allows the user to decrypt and authenticate the data that will be delivered by the cloud storage provider. As shown in step 4 in Figure 1, the user compares the information received from the data owner with the information received from the cloud provider. If there is a match, the user is assured that the received information is valid and authentic.
In the proposed model, a cloud storage provider has no knowledge of the data encryption algorithm and decryption key. This way, data owners keep control over data integrity and confidentiality in the cloud. Meanwhile, data owners control the user access policy and reveal the relevant information that grants users access and protects data against any modification.

Figure 1. Secure Data Sharing Model with Full Control in the Cloud

TABLE I. MODEL'S NOTATIONS
Notation   Description                                          Comments
O-ID       Data owner ID
C-ID       Cloud storage provider ID
U-ID       User ID
D-ID       Shared data ID
SU         User secret anonymity                                Published by data owner
SC         Cloud provider secret anonymity                      Published by data owner
du         Secret encryption key for exchanging messages        Published by data owner
           between data owner and the user
dc         Secret encryption key for exchanging messages        Published by data owner
           between data owner and the cloud provider
XOR        Logical exclusive-or operation
ks         A one-time session key to be used with XOR           Generated by data owner
           operation when transferring message from the
           cloud provider to the user
h(.)       A one-way secure hash function such as SHA-1
||         A concatenation operator
{.}k       Encryption operator using encryption key, k
EN         Encryption algorithm used for encrypting the         Chosen by the data owner
           shared data                                          based on data type
ENC{data}  Encrypted data                                       Sent by cloud provider
kd         Encryption key used for encrypting the shared data   Chosen by the data owner
h(data)    Hash value of the shared data                        Calculated at the data owner

To execute the proposed model, the data owner first needs to complete the following tasks:
a) Issue two secret anonymities, SC and SU, for the cloud service provider and the user.
b) Issue two secret symmetric encryption keys, dc and du, for the cloud service provider and the user.
c) Use a secure channel, such as Diffie-Hellman key agreement [12], to exchange SC and dc with the cloud provider, and submit SU and du to the user.
In addition, we assume that the data owner encrypts the data with a suitable encryption algorithm, relevant to the data type, and submits the encrypted data to the cloud service provider through a secure channel. The proposed model has the following five steps:

1. User Requests Data Access from the Data Owner
A user who would like to access data, defined by D-ID, generates a nonce, Nu, and prepares a message m1 = {U-ID || D-ID || Nu} to be sent to the data owner. The user then sends a data access request message = {U-ID, {m1 || h(m1 || SU)}du} to the data owner.

2. Data Owner Authenticates and Sends Control Message to the User
Upon receiving the data access request from the user, the data owner executes the following steps:
a) Decrypt the received message, using the symmetric secret key, du (that is relevant to U-ID), and obtain m1 = {U-ID || D-ID || Nu} and h(m1 || SU).
b) Verify the format of U-ID and D-ID from the decrypted message m1. If there is no match, the data owner terminates the connection. Otherwise, the data owner continues.
c) Compute h(m1 || SU) and check whether it equals the received h(m1 || SU). If there is a match, the data owner accepts the authenticity of the user.
After authenticating the user, the data owner generates a nonce, Nd, and a one-time session key, ks, and prepares two special messages, m2 and m3, to be sent to the user and the cloud provider respectively. The message m2 = {C-ID || D-ID || Nu || Nd || EN || kd || h(data) || ks || OP} contains the following parameters: cloud provider identification, C-ID, shared data identification, D-ID, message nonces, Nu and Nd, the encryption algorithm, EN, encryption key, kd, and data hash value, h(data), that are relevant to the data (D-ID), a one-time session key, ks, and an optional field, OP. The optional field, OP, could be used to extend the capability of the proposed model. For example, the optional field could hold the time when the data should be accessed (e.g. for downloading a test in an online educational program) or a special access policy that could be related to Mandatory Access Control (MAC) or Role Based Access Control (RBAC) [13]. After preparing the message m2, the data owner sends the control message = {O-ID, {m2 || h(m2 || SU)}du} to the user. Upon receiving the control message, {O-ID, {m2 || h(m2 || SU)}du}, the user authenticates and checks the integrity of the received message as follows:
a) Decrypt the received message, using the symmetric secret key, du, and obtain m2 = {C-ID || D-ID || Nu || Nd || EN || kd || h(data) || ks || OP} and h(m2 || SU).
b) Compare the values of D-ID and Nu, obtained from m2, to the values sent in message m1. If there is a match, the user continues.
c) Compute h(m2 || SU) and check whether it equals the received h(m2 || SU). If there is a match, the user authenticates the data owner.
d) Keep C-ID, ks, and Nd for processing the cloud provider message, m4, in step 5.

3. Data Owner Sends a Data Access Permit to the Cloud Provider
In addition to sending the control message to the user, the data owner prepares a message m3 = {D-ID || U-ID || Nu || Nd || ks || OP} and sends a data access permit message = {O-ID, {m3 || h(m3 || SC)}dc} to the cloud provider.

4. Cloud Provider Sends the Encrypted Data to the User
Upon receiving the data access permit message, {O-ID, {m3 || h(m3 || SC)}dc}, the cloud provider executes the following steps:
a) Decrypt the received message, using the symmetric secret key, dc (that is relevant to O-ID), and obtain m3 = {D-ID || U-ID || Nu || Nd || ks || OP} and h(m3 || SC).
b) Verify the format of D-ID from the decrypted message m3. If there is no match, the cloud provider terminates the connection. Otherwise, the cloud provider continues.
c) Compute h(m3 || SC) and check whether it equals the received h(m3 || SC). If there is a match, the cloud provider accepts the authenticity of the data owner.
d) Extract ks from m3 and prepare a message m4 = {D-ID || U-ID || Nu || Nd || OP || ENC{data}} XOR ks.
e) Send a message = {C-ID, m4 || h(m4 || ks)} to the user defined by U-ID, obtained from message m3, as shown in Figure 1.
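The hash-based checks in steps 1 to 4 can be illustrated with a minimal sketch. It is not the authors' implementation: SHA-256, the `|` field separator, the cyclic repetition of ks in the XOR mask, and all function names are assumptions made for illustration. The symmetric encryption layers {.}du and {.}dc are left out because the Python standard library does not provide a block cipher.

```python
import hashlib
import secrets

SEP = b"|"  # assumed field separator standing in for the paper's || operator


def concat(*fields: bytes) -> bytes:
    """Join fields with the assumed separator."""
    return SEP.join(fields)


def keyed_hash(message: bytes, secret: bytes) -> bytes:
    """h(message || secret); SHA-256 is an assumption (the paper says 'such as SHA-1')."""
    return hashlib.sha256(message + secret).digest()


def xor_mask(data: bytes, key: bytes) -> bytes:
    """XOR data with key, repeating the key cyclically (assumed key expansion)."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def build_request(u_id: bytes, d_id: bytes, s_u: bytes) -> tuple[bytes, bytes]:
    """Step 1: m1 = U-ID || D-ID || Nu together with h(m1 || SU)."""
    n_u = secrets.token_hex(8).encode()  # fresh nonce Nu
    m1 = concat(u_id, d_id, n_u)
    return m1, keyed_hash(m1, s_u)


def owner_accepts(m1: bytes, received_hash: bytes, s_u: bytes) -> bool:
    """Step 2 a)-c): check the message format, then recompute and compare the hash."""
    if len(m1.split(SEP)) != 3:
        return False
    return secrets.compare_digest(keyed_hash(m1, s_u), received_hash)


def cloud_reply(c_id, d_id, u_id, n_u, n_d, op, enc_data, k_s):
    """Step 4 d)-e): mask m4 with ks and attach h(m4 || ks), hashed over the masked m4."""
    m4 = xor_mask(concat(d_id, u_id, n_u, n_d, op, enc_data), k_s)
    return c_id, m4, keyed_hash(m4, k_s)


s_u = secrets.token_bytes(16)            # secret anonymity SU, shared in advance
m1, m1_hash = build_request(b"user-7", b"doc-42", s_u)
print(owner_accepts(m1, m1_hash, s_u))   # request built with the right SU
print(owner_accepts(m1, m1_hash, b"x"))  # hash recomputed with a different secret
```

In this sketch the request is accepted only when the hash was computed with the same SU that the data owner holds, which is the property steps 2 c) and 4 c) rely on.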
5. User Verifies the Received Data from the Cloud Provider
Upon receiving a message {C-ID, m4 || h(m4 || ks)} from the cloud provider, the user retrieves the one-time session key, ks, received from the data owner in m2, and executes the following steps:
a) Compute m4 XOR ks and obtain m4 = {D-ID || U-ID || Nu || Nd || OP || ENC{data}}.
b) Compute h(m4 || ks) and compare it with the received h(m4 || ks). If there is a match, the user continues.
c) Compare the values of C-ID, D-ID, Nu, and Nd, received from the cloud provider, to the values obtained from message m2, received from the data owner. If there is a match, the user authenticates the received message.
d) Decrypt the received encrypted data, ENC{data}, with the encryption key, kd, received from the data owner in m2.
e) Compute h(data) and compare it with h(data) obtained from the data owner in message m2. If there is a match, the user is assured of the integrity and confidentiality of the received data.

IV. SECURITY ANALYSIS OF THE PROPOSED MODEL
This section illustrates how the proposed model achieves the security requirements for storing data in cloud environments and how it offers enhanced resiliency to security threats.

A. Security Requirements Achieved

1) Ensuring data integrity and confidentiality
In the proposed model, since the data is stored in encrypted form on the cloud and the data owner keeps the encryption key and algorithm information, the cloud storage provider does not have the capability of compromising the integrity and confidentiality of the data stored in the cloud infrastructure.

2) Controlling data access and sharing
In the proposed model, since the data owner is the only authority that authenticates the user and issues the data encryption information (algorithm and key) to authorized users, cloud providers cannot grant data access to unauthorized users.

3) Authentication
Authentication is the act of establishing or confirming that claims made by or about the subject are true and authentic [14]. In the proposed model, authentication is achieved by using a hash code that contains a secret anonymity (SU or SC) and is encrypted with a secret encryption key (du or dc), as shown in Figure 1. For example, the data owner appends a secret user's anonymity, SU, to the exchanged message, m2, before computing its hash code, h(m2 || SU). The data owner then encrypts the exchanged message, {m2 || h(m2 || SU)}, with the secret symmetric key, du, and sends it to the user.

B. Resilience Against Security Threats
This subsection shows how the proposed model is resilient to security threats such as unauthorized data access attacks, information disclosure during sharing, and other security attacks.

1) Unauthorized data access attack
Since data owners keep the encryption information (key and algorithm) and check the identity of users, unauthorized data access is not possible in our model. In general, unauthorized data access attacks occur by one of the following methods:
1. The attacker acquires data from the cloud storage provider. In our model, the user does not initiate any messages with the cloud provider to gain data access. Even if the cloud provider sends data to an unauthorized user, that user cannot decrypt the received message, since the encryption information (key and algorithm) is not known to unauthorized users or to the cloud providers. Therefore, it is not possible for unauthorized users to learn the encryption information without the help of the data owner.
2. The attacker acquires data access from the data owner. To get data access permission from the data owner, the attacker must know the user anonymity, SU, and the encryption key, du. It is not possible for the attacker to guess both parameters and access the data.

2) Information disclosure during sharing attack
Since data is always in its encrypted form, there is no way data can be decrypted before it is delivered to authorized users. This ensures that the entire sharing process will not disclose information to cloud providers and unauthorized users. To acquire data during sharing, an attacker must have the decryption key and algorithm. Since this information is kept with the data owner, cloud storage providers and unauthorized users cannot decrypt the data.

3) Data owner/user's identity guessing attack
As shown in Figure 1 and Figure 2, the user/data owner appends a secret user's anonymity to the exchanged message (m1/m2) before computing its hash code, and then encrypts the exchanged message with the secret symmetric key, du. Both secrets (SU and du) are known only to the data owner and the authorized user. At the receiving side, the data owner/user decrypts the message and appends the same secret anonymity, SU, to the message before calculating its hash code to check the message's authenticity. Since the hash code provides authentication and the encryption provides confidentiality to the messages exchanged between data owner and user, the adversary cannot guess the user's anonymity from the exchanged messages and therefore cannot imitate the user's identity to create a new data access request. Similarly, the adversary cannot imitate a data owner and send a fake data access message to a user.
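To make the checks that this analysis relies on concrete, the following sketch walks through the user-side verification of step 5. It is an illustration under stated assumptions, not the authors' code: SHA-256, the `|` separator, the cyclic XOR mask, the choice to hash m4 as transmitted (still masked), and the names `verify_cloud_message` and `toy_cipher` are all hypothetical. The data cipher EN is passed in by the caller, and the toy XOR stand-in used in the demo is not a real cipher.

```python
import hashlib
import secrets

SEP = b"|"  # assumed field separator standing in for the paper's || operator


def xor_mask(data: bytes, key: bytes) -> bytes:
    """XOR data with key, repeating the key cyclically (assumed key expansion)."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def digest(*parts: bytes) -> bytes:
    """Hash of the concatenated parts; SHA-256 is an assumption."""
    return hashlib.sha256(b"".join(parts)).digest()


def verify_cloud_message(c_id, m4, received_hash, k_s, expected, decrypt, data_hash):
    """Run the step 5 checks; return the plaintext, or None if any check fails.

    expected is (C-ID, D-ID, Nu, Nd) as received from the data owner in m2.
    decrypt stands in for algorithm EN with key kd and is supplied by the caller.
    """
    # b) hash check, computed here over m4 as transmitted (an assumption)
    if not secrets.compare_digest(digest(m4, k_s), received_hash):
        return None
    # a) remove the XOR mask and split into the six fields of m4
    fields = xor_mask(m4, k_s).split(SEP, 5)
    if len(fields) != 6:
        return None
    d_id, _u_id, n_u, n_d, _op, enc_data = fields
    # c) identifiers and nonces must match what the data owner sent in m2
    if (c_id, d_id, n_u, n_d) != expected:
        return None
    # d) decrypt the data, then e) compare its hash with h(data) from m2
    data = decrypt(enc_data)
    if not secrets.compare_digest(digest(data), data_hash):
        return None
    return data


# Demo with made-up identifiers; toy_cipher is a placeholder, not a real cipher.
k_s, k_d = secrets.token_bytes(16), secrets.token_bytes(16)
toy_cipher = lambda blob: xor_mask(blob, k_d)
data = b"exam paper"
m4 = xor_mask(SEP.join([b"doc-42", b"user-7", b"nu", b"nd", b"", toy_cipher(data)]), k_s)
expected = (b"cloud-1", b"doc-42", b"nu", b"nd")
result = verify_cloud_message(
    b"cloud-1", m4, digest(m4, k_s), k_s, expected, toy_cipher, digest(data)
)
```

With these inputs `result` holds the recovered data. Changing a byte of `m4`, supplying a nonce that differs from the one in m2, or supplying a different h(data) makes the function return `None`.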
Figure 2. Securing transmission between the data owner and the user

4) Cloud provider's identity guessing attack
As shown in Figure 1 and Figure 3, the data owner uses a cloud provider's anonymity, SC, and encryption key, dc, to provide authentication, by hash code, and confidentiality, by encryption, when sending messages to the cloud provider. Therefore, the adversary cannot guess the cloud provider's anonymity from the exchanged messages. Similarly, the adversary cannot imitate a data owner and send fake data access permit messages, m3, to the cloud provider.

Figure 3. Securing transmission between the data owner and the cloud service provider

5) Impersonation attack
An impersonation attack involves an adversary who attempts to impersonate a data owner, a user, or a cloud provider.
a) An adversary cannot imitate a data owner to grant a user data access without knowing the user secrets (SU, du), the cloud provider secrets (SC, dc), and the data encryption information (encryption algorithm, data encryption key).
b) Without knowing the secrets (SU, du), an adversary cannot imitate a user to decrypt the message m2 and then get data access.
c) Since the cloud provider does not know the data encryption algorithm, EN, the data encryption key, kd, and the message encryption key, ks (issued by the data owner to the authorized user), an adversary cannot imitate a cloud provider to provide users with fake data.

6) Replay attack
A replay attack is a method in which an adversary tries to replay messages obtained in previous communications. For example, an adversary might replay a used message m1 to the data owner to request data access and then receive the message m2 from the data owner. However, the adversary cannot derive the correct data information (data ID, data encryption algorithm, and data encryption key) from m2, since he or she cannot decrypt m2 without knowing the secrets SU and du. In addition, the adversary will not be able to decrypt m4, received from the cloud service provider, since he or she cannot recover the one-time encryption key, ks, issued by the data owner in message m2.

V. CONCLUSION
This paper has introduced a secure and efficient model that offers the data owner full control to grant or deny data sharing in the cloud environment. In addition, it prevents cloud providers from revealing data to unauthorized users. The proposed model can be used in several applications such as remote file storage, data publication, on-demand music access, and online educational programs. Each application can use its own data format and encryption technique to provide secure data sharing in the cloud. In addition, since the proposed model requires low computing power (e.g. symmetric encryption) and a one-step authentication to accept or deny a data access request, it can be used with mobile or fixed devices. Security analysis has demonstrated that the proposed model meets cloud security requirements and is resilient to several security threats.

REFERENCES
[1] T. Sridhar, “Cloud computing – a primer, Part 1: models and technologies,” The Internet Protocol Journal, vol. 12 (3), pp. 2–19, September 2009.
[2] J. W. Rittinghouse and J. F. Ransome, “Cloud computing: implementation, management, and security,” Boca Raton: CRC Press, 2010.
[3] Google Inc., “Google app engine,” 2011, retrieved in March 2011 from http://appengine.google.com
[4] Microsoft Inc., “Microsoft live mesh,” 2011, retrieved in March 2011 from http://www.mesh.com
[5] Amazon Inc., “Simple storage service,” 2011, retrieved in March 2011 from http://aws.amazon.com/s3
[6] M. Zhou, R. Zhang, W. Xie, W. Qian, and A. Zhou, “Security and privacy in cloud computing: a survey,” Sixth international conference on semantics, knowledge and grids, pp. 105-112, 2010.
[7] C. Kaufman, R. Perlman, and M. Speciner, “Network security: private communication in a public world,” Upper Saddle River, New Jersey: Prentice Hall Press, 2002.
[8] K. Hamlen, M. Kantarcioglu, L. Khan, and B. Thuraisingham, “Security issues for cloud computing,” International Journal of Information Security and Privacy, vol. 4 (2), pp. 39-51, 2010.
[9] E. Bertino, B. Carminati, E. Ferrari, B. Thuraisingham, and A. Gupta, “Selective and authentic third party distribution of XML documents,” IEEE Transactions on Knowledge and Data Engineering, vol. 16 (10), pp. 1263-1278, 2004.
[10] G. Zhao, C. Rong, J. Li, F. Zhang, and Y. Tang, “Trusted data sharing over untrusted cloud storage providers,” 2nd IEEE international conference on cloud computing technology and science, pp. 97-103, 2010.
[11] W. Wan and Z. Li, “Secure and efficient access to outsourced data,” 16th ACM conference on computer and communication security, 2009.
[12] W. Diffie and M. Hellman, “New directions in cryptography,” IEEE Transactions on Information Theory, vol. 22 (6), pp. 644-654, 1976.
[13] M. Ciampa, “Security+ Guide to Network Security Fundamentals,” Boston, MA: Course Technology, Cengage Learning, 2009.
[14] R. Zhang and L. Liu, “Security models and requirements for healthcare application clouds,” IEEE 3rd International Conference on Cloud Computing, 2010.
16 http://sites.google.com/site/ijcsis/
ISSN 1947-5500
(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 9, No. 6, June 2011
AUTHORS PROFILE
Mohamed Meky is an IT professional who has a unique combination of
teaching, research, leadership, and industrial experience. He has published
several articles, developed many courses, and led different industrial projects
in the IT field. His current research interest is in the security area.
Amjad Ali is the Director of the Center for Security Studies and a Professor of
Cybersecurity at University of Maryland University College. He played a
significant role in the design and launch of UMUC’s cybersecurity programs.
He teaches graduate level courses in the area of cybersecurity and technology
management. He has served as a panelist and a presenter in major conferences
and seminars on the topics of cybersecurity and innovation management. In
addition, he has published articles in the cybersecurity area.
Performances Evaluation of Inter-System Handover
between IEEE802.16e and IEEE802.11 Networks
Abderrezak Djemai (1), Mourad Hadjila (2), Mohammed Feham (3)
STIC laboratory, University of Tlemcen, Algeria
(1) djemai_tlm@yahoo.fr, (2) mhadjila_2002@yahoo.fr, (3) m_feham@mail.univ-tlemcen.dz
Abstract- This article presents the mechanisms to be implemented for analyzing the performance of the inter-system handover between WiFi and WiMAX networks. A handover entity is required so that the mobile terminal supports both technologies, enabling it to make heterogeneous transfers. In this paper, we propose the development of a software platform able to manage the interoperability between WiMAX and WiFi with uninterrupted communication.
Keywords- Networks, Wireless, WiFi, WiMAX, Handover, Packets.
I. INTRODUCTION
Wireless data networks have grown explosively since the end of the 1990s as a means of connecting to the Internet. The wireless environment differs from the world of wired networks in many respects, particularly in the lower layers, namely the physical and data link layers.
Routing data to and from wireless mobile equipment is a crucial problem, especially between two different networks. Interruptions can make communications unusable or hard to understand (for example, in the case of a videoconference). New protocols and network mechanisms must therefore be defined to minimize or eliminate interruption times.
The last decade was marked by the emergence of many wireless technologies such as Bluetooth 802.15 or WiFi (Wireless Fidelity) 802.11.
The most recent technology undergoing major development in the field of wireless transmission is WiMAX (Worldwide Interoperability for Microwave Access) [1]. Introduced in June 2001, WiMAX is now the most requested broadband access network thanks to its improved data rate and range.
The remainder of this paper is organized as follows: Section II presents a brief description of WiFi and WiMAX technologies. Section III is devoted to the concepts of handover. Section IV describes the simulation modules. Section V is reserved for the simulation results of the WiFi-WiMAX and WiMAX-WiFi handovers, and Section VI concludes this paper.
II. WIFI AND WIMAX TECHNOLOGIES DESCRIPTION
A. WiFi Overview
WiFi is a high-rate wireless transmission technology used to connect laptops or any type of peripheral over a range of several tens of meters indoors to several hundred meters in open space.
WiFi networks offer a multitude of functionalities which come from the fixed and mobile communications worlds. These functionalities make them more reliable and provide several services to users.
The principal functionalities of a WiFi network are:
• Fragmentation and reassembly, which avoid the problems of transmitting large volumes of data and thus decrease the error rate.
• Mobility management.
• Variation of the transmission rate according to the radio environment.
• Assurance of a good quality of service.
Figure 1 illustrates the WiFi network topology.
Figure 1. WiFi network topology
B. WiMAX Overview
WiMAX (Worldwide Interoperability for Microwave Access) is a radio solution for WMAN networks. It is based on the IEEE 802.16 standard, validated in 2001 by the IEEE standardization body.
The initial version of the standard works in the 10-66 GHz band and requires a line of sight (LOS) between the transmitter and the receiver. However, the 802.16a extension works in the 2-11 GHz band, better adapted to the
regulations, and allows transmission in non-line-of-sight (NLOS) conditions.
WiMAX would be an alternative to wired broadband technologies. It would reinforce the connection in terms of capacity, rate and coverage. Its theoretical transmission capacity is 70 Mbps for a range of 50 km. In practice, it allows a transmission rate of 10 Mbps for a range of 20 km.
Figure 2 shows the WiMAX network architecture.
Figure 2. WiMAX network architecture
C. Comparison between WiMAX and WiFi
Table I summarizes the differences between WiFi and WiMAX technologies.
TABLE I. TECHNICAL SPECIFICITIES OF WIFI AND WIMAX TECHNOLOGIES
Parameter | WiFi 802.11 | WiMAX 802.16 | Difference
Range | About 300 meters maximum | Up to 45 km; cells of 5 to 10 km | The physical layer of 802.16 tolerates delays (reflections) through the implementation of a 256-point FFT (Fast Fourier Transform), as against 64 for 802.11
Coverage | Optimized for short range, indoor use | Long range, optimized for outdoor use | 802.16 has better penetration through obstacles over longer distances
Adaptability | Designed for LANs and for a dozen users; fixed band size (20 MHz) | Designed to support up to 100 users; band sizes varying from 15 to 20 MHz | The 802.11 MAC protocol uses CSMA/CA while 802.16 uses TDMA. 802.16 can use all the available frequencies whereas 802.11 is limited
Bit rate | 2.7 bps/Hz, or up to 54 Mbps in 20 MHz | 5 bps/Hz, or up to 100 Mbps in 20 MHz | Higher frequency coupled with error correction, providing better use of spectrum
Quality of Service (QoS) | Quality of service support (802.11e) | Integrated in the MAC at different layers | 802.11 avoids collisions of messages via CSMA/CA. 802.16: same frequency but spread over time (TDMA)
III. HANDOVER
The handover [2] is the mechanism which ensures the continuity of the connection of an MSS (Mobile Subscriber Station) as it moves from the coverage area of one Base Station (BS) to another.
The 802.16e standard supports three types of handover:
• The Hard Handover,
• The MDHO (Macro Diversity Handover),
• The FBSS (Fast Base Station Switching).
The Hard Handover is mandatory; the other two are optional.
A. Hard Handover
During the Hard Handover, the MSS communicates with only one BS at a time. The link with the old BS is released before the new one is established. The handover is carried out from the moment the signal of the neighboring cell becomes stronger than that of the current BS.
Figure 3 shows the Hard Handover execution.
Figure 3. Hard Handover Execution
B. Macro Diversity Handover (MDHO)
When Macro Diversity Handover [3] is supported by the MSS and the BS, the diversity set is maintained at both the MSS and the BS. The diversity set is the list of base stations participating in the handover procedure whose signal level is higher than a certain threshold. This list is defined for each MSS associated with the network. During Macro Diversity Handover, the MSS taking part in the handover procedure communicates with all the base stations belonging to the diversity set. In the downlink direction, two or more base stations transmit data to the MSS, which creates diversity in reception. In the uplink direction, the transmissions from the MSS are received by several base stations.
The following figure illustrates the architecture of Macro Diversity Handover.
Figure 4. Macro Diversity Handover
C. Fast Base Station Switching (FBSS)
The principle is similar to that of MDHO in that the concept of a diversity set still applies. The difference is that the mobile subscriber station chooses one base station from the diversity set to become its principal base station. The principal base station is the only base station with which the mobile subscriber station exchanges traffic in both the uplink and the downlink, including management messages. It is also with this BS that the MSS is registered and synchronized, and from which it receives its control information in the downlink. However, with each transmitted frame, the MSS can change the principal base station, as shown in Figure 5.
Figure 5. Fast Base Station Switching
IV. NECESSARY SIMULATION MODULES
Neighbor Discovery (ND), the Media Independent Handover (MIH) module and the mobility management module (MIPv6) are the key elements used in the simulation code.
A. Neighbor Discovery (ND)
The ND module provides layer 3 movement detection. In the network, the BS periodically sends RA (Router Advertisement) messages to inform the Mobile Nodes (MNs) of the network prefix. The ND agent located in the MN receives these RAs and determines whether the message contains a new prefix, in which case it informs the interface manager. A timer is associated with each prefix. When the prefix expires, a notification is sent to the interface manager. The implementation also supports RS (Router Solicitation) messages, which allow an MN to discover a new BS after a handover.
B. Media Independent Handover (MIH IEEE 802.21)
Carrying out a handover between heterogeneous access networks transparently from the point of view of the mobile user (without interruption or degradation) requires taking into account concepts such as continuity of service, quality of service, and the discovery and selection of the network [4], [5].
The IEEE 802.21 working group therefore created a basic architecture which defines a Media Independent Handover Function (MIHF) that helps mobile systems carry out a handover without service interruption between heterogeneous networks such as IEEE 802.3 (wired LAN), IEEE 802.11x (wireless LAN), IEEE 802.16e (mobile WiMAX network), GPRS and UMTS (3G mobile network).
The IEEE 802.21 standard [4] develops an architecture that enables transparent service continuity when the mobile node (MN) moves between two heterogeneous networks at the data link level.
A set of functions to optimize the handover is defined in the mobility management protocol stack, MME (Mobility Management Entity), of the network elements, and a new entity called MIHF (Media Independent Handover Function) is created. It works on layer 3 and can communicate between local and remote interfaces, which can be in contact via another MIHF.
This is illustrated in Figure 6.
Figure 6. Overall picture of design of MIH [6], [7]
C. Mobility Management Module (MIPv6)
MIPv6 describes the mobility management of IPv6 terminals. This mobility allows an IPv6 terminal to remain reachable whatever its location in the Internet, and its connections to remain active in spite of its movement.
Figure 7 contains several actors:
• The mobile node (MN): the IPv6 terminal which can move.
• The home agent (HA): network equipment which manages mobility in the manner of an HLR in cellular networks.
• The correspondent node (CN): an IPv6 terminal with which the MN has or will have an active connection.
The MN can connect to two types of networks:
• The home network is the network the MN originates from, where it is addressable by its home address.
• A visited network is a network to which the MN moves. On arriving in such a network, the MN obtains, through the IPv6 autoconfiguration mechanism [8], [9], a topologically correct IPv6 address called the temporary address (Care-of Address).
The basic principle of Mobile IPv6 is that the MN is always addressable by its home address, whether it is in its home network or in a visited network.
If the MN is in its home network, packets are routed in the standard way, based on the tables of the routers. The MN is neither more nor less than a “fixed” IPv6 terminal.
If the MN moves to a visited network, it obtains a temporary address on this network, i.e. one belonging to the prefix used on this link. It registers its new position with the home agent by means of a message called Binding Update (BU), which carries both its home address and its temporary address, and awaits a confirmation from the home agent in the form of a message called Binding Acknowledgment (BA). The home agent acts as a proxy and intercepts all the packets intended for the home address in order to forward them to the new MN position, i.e. its primary temporary address.
The MN announces its new position to the correspondent with which it was communicating, again using the BU and BA messages, in order to optimize communications (packets are no longer sent to the home address and then forwarded by the home agent to the primary temporary address, but are sent directly from the correspondent node to the mobile node).
Figure 7. Basic mechanism of IPv6 mobility
Figure 8 shows the optimization of the routing between the correspondent and the mobile. If another correspondent CN wants to communicate with the MN, it sends its first packet to the home address of the MN, where the HA plays its role of proxy and forwards the packet to the MN. After receiving a forwarded packet, the MN can choose to announce its current location to the correspondent, thus allowing direct communication between the CN and the MN.
Figure 8. Optimization of the routing between the Correspondent and the Mobile
V. SIMULATION AND RESULTS
The results shown in this part were obtained with the NS2 simulator. NS2 is a software tool for simulating data networks and has become a reference standard in this field. The software runs under both Unix and Windows. The simulator is composed of an application programming interface in TCL and a core written in C++ in which the majority of the network protocols are implemented.
A. Scenario of Simulation
In this part we consider a simple topology including a multi-interface node supporting two technologies, WiFi and WiMAX. The mobile node (MN) establishes a connection with the CN (Correspondent Node).
We suppose that the MN initially uses the WiMAX interface and switches the traffic to the WiFi interface when it becomes available.
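The interface choice described in this scenario can be summarized as a simple preference rule. The following Python sketch is only an illustration under stated assumptions: the function name `select_interface`, the interface labels and the availability flags are hypothetical and are not taken from the simulation code.

```python
from typing import Optional


def select_interface(wifi_available: bool, wimax_available: bool) -> Optional[str]:
    """Pick the interface for the MN, preferring WiFi whenever it is available."""
    if wifi_available:
        return "wifi"
    if wimax_available:
        return "wimax"
    return None  # no coverage on either interface


# Hypothetical timeline: the MN starts under WiMAX coverage only,
# then enters the WiFi cell while still covered by WiMAX.
timeline = [(False, True), (False, True), (True, True)]
choices = [select_interface(wifi, wimax) for wifi, wimax in timeline]
```

In this sketch the traffic stays on WiMAX until the WiFi flag becomes true, which mirrors the scenario above; a real decision would also depend on signal thresholds and MIH events.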
Figure 9 describes the five essential components of our scenario:
• Router 0 (CN)
• Router 1 (Gateway)
• WiMAX base station (BS 802.16)
• WiFi access point (AP 802.11)
• Mobile node (MN)
Figure 9. Topology of the scenario (2000m X 2000m)
B. Parameter Setting and Configuration of the Networks
Before the simulator can be used, the topology of the network and the requirements of each node must be described in a TCL file which is then read by the simulator. The parameters and the configurations defined in this file are the following:
1) Simulation parameters:
Table II represents the configuration of the simulation parameters.
TABLE II. SIMULATION PARAMETERS
Parameters | Signification
Trafic_start = 5 s | Traffic start
Trafic_stop = 70 s | Traffic end
Simulation_stop = 70 s | Simulation end
Seed | RNG (Random Number Generator) seed fixed at 1 for all simulated scenarios
2) WiMAX Network parameters:
Table III describes the configuration of the WiMAX network parameters.
TABLE III. WIMAX NETWORK PARAMETERS
Parameters | Signification
Channel/WirelessChannel | Channel type: Wireless
Propagation/TwoRayGround | Radio propagation model: 802.16
Phy/WirelessPhy/OFDM | Network interface type: 802.16
Mac/802_16 | MAC layer type 802.16
Queue/DropTail/PriQueue | Queue interface type
LL | Link layer type 802.16
Antenna/OmniAntenna | Antenna model
50 | Maximum queue size
adhocrouting | Used routing protocol. In this case DSDV
Table IV gives the WiMAX base station parameters.
TABLE IV. PARAMETERS OF WIMAX BASE STATION
Parameters | Signification
WiMAX cell coverage | 1000 m
Pt | 0.025 W
RXThresh | 1.26562 x 10^-13 W
CSThresh | 0.8 x [1.26562 x 10^-13] W
3) WiFi Network parameters:
Table V describes the configuration of the WiFi network parameters.
TABLE V. WIFI NETWORK PARAMETERS
Parameters | Signification
Channel/WirelessChannel | Channel type: Wireless
Propagation/TwoRayGround | Radio propagation model: 802.11
Phy/WirelessPhy | Network interface type: 802.11
Mac/802_11 | MAC layer type 802.11
Table VI represents the configuration of the WiFi access point.
TABLE VI. PARAMETERS OF THE ACCESS POINT WIFI
Parameters | Signification
WiFi coverage | 20 m
Pt | 0.025 W
freq | 2.412 GHz
RXThresh | 6.12277 x 10^-9 W
CSThresh | 0.9 x [6.1227 x 10^-9] W
C. Performance evaluation of the Handover
This part contains the results of the simulated scenarios and the analysis of the influence of the metric used in the execution of the vertical handover between WiFi and WiMAX in two directions: WFWXHO (handover from WiFi to WiMAX) and WXWFHO (handover from WiMAX to WiFi).
This metric is the lost packets rate, given by:
$$\text{Lost Packets Rate} = \frac{\text{number of lost packets}}{\text{total number of generated packets}} \qquad (1)$$
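Equation (1) can be evaluated directly once the two packet counts have been extracted from the simulation traces. The Python sketch below is a minimal illustration; the function name `lost_packets_rate` and the sample counts are assumptions chosen for the example and are not results from the paper.

```python
def lost_packets_rate(lost_packets: int, generated_packets: int) -> float:
    """Equation (1): number of lost packets over total generated packets."""
    if generated_packets <= 0:
        raise ValueError("generated_packets must be positive")
    if not 0 <= lost_packets <= generated_packets:
        raise ValueError("lost_packets must lie between 0 and generated_packets")
    return lost_packets / generated_packets


# Hypothetical counts, used only to illustrate the calculation.
rate = lost_packets_rate(lost_packets=25, generated_packets=1000)
print(f"Lost packets rate: {rate:.3f}")
```

The rate is a fraction between 0 and 1; multiplying it by 100 gives a percentage if the figures are plotted that way.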
1) Handover WiFi-WiMAX (WFWXHO): The simulated scenario consists of transferring traffic between router 0 (CN) and the MN, which moves linearly from the WiFi network to the WiMAX network.
Figure 10 shows the simulation model of the WiFi-WiMAX handover.
Figure 10. Simulation model (WFWXHO)
Figure 11 depicts the evolution of the lost packets rate according to the simulation time.
Figure 11. Evolution of lost packets rate (WFWXHO)
From this figure we deduce that:
• For a short simulation time, the number of lost packets is very high.
• When the simulation time is increased, the number of lost packets falls. This means that when the MN moves from its home network (WiFi) to a visited network (WiMAX), it first communicates with its home agent (with BU and BA messages) to register its new position and thus ensure that packets are redirected to it.
• Examining the generated trace files, we find that packets are dropped during the time needed to establish the new location, when the mobile no longer receives packets from the old base station.
2) Handover WiMAX-WiFi (WXWFHO): In this part, we suppose that the mobile is initially connected to the WiMAX network; when it leaves the coverage area, it switches the traffic to the WiFi interface.
Figure 12 illustrates the simulation model of the WiMAX-WiFi handover.
Figure 12. Simulation model (WXWFHO)
Figure 13 shows the evolution of the lost packets rate according to the simulation time.
Figure 13. Evolution of lost packets rate (WXWFHO)
3) Comparative Study: In this part, we compare the two types of handover, WFWXHO and WXWFHO, according to the simulation time and the speed of the MN.
Figure 14 presents the evolution of the lost packets rate according to simulation time.
Figure 14. Comparison between WFWXHO and WXWFHO according to time.
For various simulation times, the number of lost packets during the execution of the WFWX handover is higher than that of the WXWF handover.
Figure 15 shows the evolution of the lost packets rate according to the speed of the mobile.
Figure 15. Comparison between WFWXHO and WXWFHO according to speed.
When mobility is low (10-11 m/s), the number of lost packets in the WiFi-WiMAX scenario is lower than that of the WiMAX-WiFi scenario, but the opposite holds when the speed is increased.
VI. CONCLUSION
In this paper, we calculated the lost packets rate to evaluate the performance of the inter-system handover between the two wireless networks WiFi and WiMAX in the two directions WFWXHO and WXWFHO.
The implementation of modules such as MIH (module developed by IEEE 802.21) and MIPv6 (mobility management module) is necessary to support the vertical handover.
The IEEE 802.21 standard is a standard under development that offers handover and ensures interoperability between heterogeneous networks.
Handover for Mobile IPv6 (MIPv6) was accepted as a more or less effective handover solution for UDP-type applications such as voice, to solve the problem of lost packets.
Following the analysis of these practical challenges, a development architecture was proposed to simulate a scenario supporting various types of applications between a WiFi access point and an 802.16e access point.
As a perspective for this work, it would be interesting to consider other simulation scenarios, which could illustrate the effect of the load of the mobile nodes on the performance of the vertical handover between WiFi and WiMAX, or other types of applications such as FTP, TELNET…
REFERENCES
[1] J. G. Andrews, A. Ghosh, R. Muhamed, “Fundamentals of WiMAX: Understanding Broadband Wireless Networking,” Prentice Hall PTR, Feb. 2007.
[2] Sik Choi, Gyung-Ho Hwang, Taesoo Kwon, Ae-Ri Lim, and Dong-Ho Cho, “Fast Handover Scheme for Real Time Downlink Services in IEEE 802.16e BWA System,” May 2005.
[3] Institute of Electrical and Electronics Engineers, IEEE standard for local and metropolitan area networks - Part 21: Media Independent Handover, in: IEEE Std 802.21-2008, vol. 21, 2009, pp. C1-C301.
[4] C. Baudoin, R. Dhaou, F. Arnal, M. Salhani, A-L. Beylot, “Analyse d’applicabilité de standards de télécoms terrestres aux systèmes de télécommunications par satellite, scénario 4G,” Rapport de contrat, IRT-06-09-01, Institut National Polytechnique de Toulouse, Sep. 2006.
[5] IEEE 802.21, DCN 2105-0240-01-0000-Joint_Harmonized_MIH_Proposal_Draft_Text.doc, May 2005.
[6] NIST, The Network Simulator NS-2 NIST add-on-IEEE802.21 model (based on IEEE P802.21/D03.00)-Draft 1.0, http://w3.antd.nist.gov/seamlessandsecure/files/mobility/doc/MIH-module.pdf (January 2007)
[7] Yoon Young An, Byung Ho Yae1, Kang Won Lee, You Ze Cho, and Woo Young Jung, “Reduction of Handover Latency Using MIH Services in MIPv6,” IEEE Proceedings of the 20th International Conference on AINA’06, June 2006.
[8] G. Cizault, “IPv6 Théorie et pratique,” Paris, O'Reilly, 1998.
[9] R. Koodli (Ed.), “Fast Handover for Mobile IPv6,” IETF RFC 4068, Jul. 2005.
Recognizing The Electronic Medical Record Data
From Unstructured Medical Data Using Visual
Text Mining Techniques
Prof. Hussain Bushinak, Faculty of Medicine, Ain Shams University, Cairo, Egypt
Dr. Sayed AbdelGaber, Faculty of Computers and Information, Helwan University, Cairo, Egypt
Mr. Fahad Kamal AlSharif, College of Computer Science, Modern Academy, Cairo, Egypt
Abstract: Computer systems and communication technologies have made a strong and influential presence in the different fields of medicine. The cornerstone of a functional medical information system is the Electronic Health Records (EHR) management system. EHR implementation and adoption face different barriers that slow down its deployment in different organizations. This research focuses on resolving the most common barriers, which are data entry, unstructured clinical data, and changes to the physician workflow. This research proposes a solution which uses text mining and natural language processing techniques. This solution was tested and verified in four real-world clinical organizations. The suggested solution achieved a correctness and precision of 91.88%.
Keywords: Electronic Health Record, Text mining, Unstructured Medical Data, Medical Data Entry, Health Information Technology.
I. INTRODUCTION
The paper-based medical record is woefully inadequate for meeting the needs of modern medicine. It arose in the 19th century as a highly personalized "lab notebook" that clinicians could use to record their observations and plans so that they could be reminded of pertinent details when they next saw that same patient. There were no bureaucratic requirements, no assumptions that the record would be used to support communication among varied providers of care, and remarkably few data or test results to fill up the record’s pages. The record that met the needs of clinicians a century ago has struggled mightily to adjust over the decades to accommodate new requirements as health care and medicine have changed, which led to the emergence of Health Information Technology (HIT) [1].
HIT allows comprehensive management of medical knowledge and its secure exchange among health care consumers and providers. Broad use of HIT will:
1. Help to eliminate the manual tasks of extracting data from charts or filling out specialized datasheets.
2. Help to derive data directly from the electronic record, making research-data collection a by-product of routine clinical record keeping.
3. Help to move from a paper-based health care system to secure electronic medical records, which will save lives and reduce health care costs.
4. Help in early detection of infectious disease through advanced data collection, fusion and processing techniques, which would be at the forefront in spotting the emergence of new diseases and crucial to tracking the spread of known diseases [2].
II. ELECTRONIC HEALTH RECORD, DEFINITION AND MODELS
EHR is defined as a longitudinal electronic record of patients' health information generated by one or more encounters in any care delivery setting. This information includes, but is not limited to, patient demographics, progress notes, examination details like symptoms and findings, medications, vital signs, past medical history, immunizations, laboratory data, and radiology reports. The EHR automates and streamlines the clinician's workflow. The EHR has the ability to generate a complete record of a clinical patient encounter as well as to support other activities directly or indirectly related to care via interface, including evidence-based decision support, quality management, and outcomes reporting. The EHR is a repository of patient data in digital form, stored and exchanged securely and accessible by multiple authorized users. [2][3][4]
There are many EHR architectural models in use all over the world. The two most popular EHR models are:
1. Central Repository Model
The center of this EHR model is the repository, which is fed by the existing applications in different care locations such as hospitals, clinics, and family physician practices. The feed from these applications is messaging based on pre-agreed standards. The messaging needs to be based on well-defined standards, for
example the HL7 Reference Information Model (RIM), for which XML could be used as the recommended Implementation Technology Specification (ITS). [5]
Figure 1. EHR Central Repository Model
The event-driven messages that need to be sent and stored in the repository will essentially be event-based summaries as shown in figure 2. The event-based summaries stored in the repository can be queried and retrieved by different clinicians who are treating the patients in different scenarios and in different clinical settings. Retrieval of and access to data in the repository is subject to establishing that the clinicians legitimately access the data for treating only those patients who are in their care. The retrieval is done through messaging, which can be either synchronous or asynchronous depending on the urgency, complexity, and importance of the data being retrieved. [5]
Figure 2. EHR Message Events
2. Managed Services Model
The managed services model is based on hosting applications for different care providers and care settings in a data center by a consortium, which may consist of a group of infrastructure providers, system integrators, and application providers. The hosted applications can be used to provide an effective EHR by building a common repository using a shared database, or by providing a common user interface to all hosted applications and extracting data from these systems using a portal whose authentication and authorization mechanism can also be controlled at the data center level, as shown in figure 3. [5]
Figure 3. Shared Services Model
III. BARRIERS OF THE ELECTRONIC HEALTH RECORD IMPLEMENTATION
Implementation of EHR faces different barriers, and these barriers vary from one environment to another. Hereafter, the main focus is on the general barriers that exist in most EHR implementation attempts. These barriers are:
1. Financial Barriers
Financial barriers are divided into the following points:
• High costs: These costs are divided into two main parts, initial cost and ongoing cost. [6]
• Under-developed business case: This barrier arises because the return on investment of EHR is uncertain, financial benefits are only achieved in the long run, and the main objective and benefit of EHR is to provide a high-quality medical service for citizens. [6]
2. Technological Barriers
Technological barriers are divided into four points: [7]
• Inadequate technical support
• Inadequate data exchange
• Security and privacy
• Lack of standards
3. Physicians' Attitudinal and Behavioral Barriers in data entry:
Many health information system projects fail due to attitudes, behaviors, barriers in data entry and lack of systematic consideration of human-centered computing issues such as usability, workflow, organizational change, and process reengineering. There are two major factors that
lead to sluggish performance of an EHR system: the complexity of the
Graphical User Interface (GUI) and the system response time. This forces
clinicians to see fewer patients and have longer workdays, largely because of
the extra time needed to use the system. [8]

In 2004, Lisa Pizziferri and others concluded that the benefits of using an
EHR system can be achieved and accepted by physicians only if the physicians
do not need to sacrifice their time with patients or other activities during
clinic sessions. Physicians recognize the quality improvements achieved by
EHRs, but their time should be saved by decreasing the time required for data
entry in EHR systems. [9]

4. Organizational Change Barriers

This category contains the following points:
- Design of and alignment with workflow and office integration: 54.2 percent
  of the 5000 respondents reported that they are worried about slower
  workflow and lower productivity, according to the American Academy of
  Family Physicians survey results (American Academy of Family Physicians
  2004). [10]
- Migration from paper-based systems
- Staff training

5. The format of Clinical Data stored in EHR systems
Generally speaking, there are two main forms of stored data: structured data
and unstructured data.
- Structured data: Structured data is data that has a relational data model
  and enforces composition from atomic data types. Structured data is managed
  by technology that allows for querying and reporting against predetermined
  data types and understood relationships, like patient demographics,
  laboratory tests, etc. [11]
- Unstructured data: Unstructured data consists of any data stored in an
  unstructured format at an atomic level. That is, in the unstructured
  content, there is no conceptual definition and no data type definition; in
  textual documents, a word is simply a word. [11]

Unstructured data consists of two basic categories:
- Bitmap objects: Inherently non-language based, such as X-rays, radiology,
  video, or audio files.
- Textual objects: Based on a written or printed language, such as clinical
  reports, nursing notes, and examination sheets. [11]

Using unstructured data for storing clinical data has the following
limitations:
- The data is not consumable at a semantic level without a compatible
  interface or application.
- A technology cannot gain insight into the context of the information
  unless it can actually read it.

6. Barriers of using unstructured data in Electronic Health Record:
Aggregation of information across all the records in a large repository could
bring benefits for clinical research. When physicians work with structured
data, they can receive alerts about drugs that interact badly with each
other, which enables them to enhance the treatment process and avoid
medication errors; this cannot be done with unstructured data [12].

IV. SURVEYING THE SOLUTIONS OF EHR DATA ENTRY BARRIERS

In October 2010, Ergin Soysal, Ilyas Cicekli, and Nazife Baykal designed and
developed an ontology-based information extraction system for radiological
reports. [15]
The main goal of this technique is to extract the information available in
free-text Turkish radiology reports and convert it into a structured
information model, using manually created extraction rules and a domain
ontology. This technique extracts data from the radiological reports, which
are free text written by physicians, and inserts it as structured data into
the EHR. [13]

However, this technique has the following drawbacks:
- It concentrates mainly on abdominal radiology reports.
- It does not use a large and trusted repository of medical expressions,
  which may reduce the quality of the information extraction process.
  Consequently, wrong clinical information may be recorded.

In September 2010, Adam Wright, Elizabeth S. Chen, and Francine L. Maloney
developed a technique for identifying associations between medications,
laboratory results and problems. They developed a
knowledge base of medication-problem and laboratory result-problem
associations in an automated fashion. It was based on two data mining
techniques: frequent item set mining and association rule mining. This
technique was able to identify a large number of clinically accurate
associations. A high proportion of high-scoring associations were adjudged
clinically accurate when evaluated against the gold standard (89.2% for
medications with the best-performing statistic, chi square, and 55.6% for
laboratory results using interest) [14]. However, this technique has the
following drawbacks:
- The researchers assumed that patients' data was structured.
- Building the knowledge base concentrated only on the patient's problems,
  medications, and laboratory results, which means that other data, such as
  the patient's history, diagnosis, and procedures, are not taken into
  account.
- Data entry is done through a traditional GUI, so this solution did not
  enhance the physician workflow.

In September 2010, a system for handling misspellings in drug information
system queries was developed by Christian Senger, Jens Kaltschmidt, Simon
P.W. Schmitt, Markus G. Pruszydlo and Walter E. Haefeli. This system
attempted to solve the problem of drug data entry in a Drug Information
System (DIS). The researchers evaluated correctly spelled and misspelled drug
names from all queries of the University Hospital of Heidelberg. The results
showed that the search engines of a DIS should be equipped with
error-tolerant search capabilities. Auto-completion lists might expedite
searches but might fail regularly due to the high frequency of typographic
errors already in the initial letters. The system improved DIS data entry by
using spelling correction tools to make the drug information understandable
and available, but it concentrated only on the DIS, without examination,
history, and procedure data [16].

In August 2010, a technique was developed by Yong-gang Cao, James J. Cimino,
John Ely and Hong Yu for the automated identification of diseases and
diagnoses in clinical records. This technique presents an approach for
prototyping a diagnosis classifier based on a popular computational
linguistics platform [18]. This technique has the following limitations:
- It focuses only on the disease keywords to be extracted and ignores other
  important parts such as operations, symptoms, findings, etc.
- It does not use spelling correction.
- There is no clear structured data model to store the data extracted from
  the clinical report.
- It does not use a large and trusted data source for medical expressions,
  like the Unified Medical Language System (UMLS).

In July 2010, another technique, for automatically extracting information
needs from complex clinical questions, was developed by Yong-gang Cao, James
J. Cimino, John Ely and Hong Yu. They built a fully automated system,
AskHERMES, which helps clinicians extract and articulate multimedia
information from literature to answer their ad hoc clinical questions. This
system automatically retrieves, extracts, and integrates information from the
literature and other information resources and attempts to formulate this
information as answers in response to ad hoc medical questions posted by
clinicians, all of which can be achieved within a time frame that meets their
demands [17]. This technique succeeds in clinical question answering and in
identifying the category of the question, but with respect to EHR system
adoption it has the following limitations:
- It extracts the clinical information to identify the question category,
  not to store this information in the EHR repository.
- It works only on question answering, not on the data entry process.
- It does not enhance the physician workflow during the examination process.

Although the previous techniques attempted to solve the EHR data entry
barrier, they have the following limitations:
- These techniques concentrate on specific parts of the data, such as
  diseases, and leave out the rest.
- The medical expression repository used does not contain all the
  expressions or the semantic relations between them.
- Some of these techniques store the EHR data as free text (unstructured
  data form).
- The physician workflow is modified, which in turn leads to more physical
  and mental effort and reduces the physician's productivity.

V. BRIDGING THE UNSTRUCTURED DATA TO STRUCTURED EHR

The suggested idea is to convert the unstructured free-text clinical data to
structured EHR data without modifying the workflow of physicians or adding
any additional physical or mental effort to them. Figure (4) shows the
algorithm of the suggested technique.
Figure 4 Objective Technique Steps

Step 1: Optical Character Recognition (OCR)
The physician writes his/her diagnoses as usual on a pen pad, paper, or
tablet PC. If the clinical report is written on paper, it needs to be
scanned. The clinical report data is stored as an image of free handwritten
text, which can then be processed. This image is passed through an OCR tool
to convert it to machine-encoded text. The details of this step are
represented in figure (5).

Figure 5 OCR and Handwriting input and output

Step 2: Spelling Corrector
Machine-encoded text may include spelling errors, which may yield wrong
information during the extraction process. So all misspelled words are
corrected before moving to the next step. This step requires a medical
dictionary that contains most medical expressions in their different forms,
such as verbs, adjectives, nouns, etc. Figure (6) represents the details of
this step.

Figure 6 Spell Check input and output

Step 3: Text mining with Natural Language Processing Techniques
In this step, the resulting data is cleaned and partitioned into statements
to be classified and coded. Using text mining and NLP, all medical data is
classified and coded in the form of multiple statements, and unwanted words
are removed. This step consists of: [19]
- Text preprocessing,
- Part of speech tagging,
- Statements segmentation,
- Noun phrase extraction.
Each of these components is described in the following.
1. Text preprocessing: Also called tokenization or text normalization. It
   includes the following steps: [19]
   - Throw away unwanted stuff (e.g., unwanted brackets and tags).
   - Word boundaries: white space and punctuation.
   - Stemming (lemmatization): This is optional. English words like 'look'
     can be inflected with morphological suffixes to produce 'looks, looking,
     looked'. They share the same stem 'look'. Often (but not always) it is
     beneficial to map all inflected forms to the stem. This is a complex
     process, since there can be many exceptional cases (e.g., department vs.
     depart, be vs. were). The most commonly used stemmer is the Porter
     Stemmer; however, there are many others.
   - Stop word removal: the most frequent words often do not carry much
     meaning.
   - Capitalization, case folding: often it is convenient to lower-case every
     character.
2. Part of speech tagging: A Part-Of-Speech Tagger (POS Tagger) is a piece of
   software that reads text in some language and assigns a part of speech,
   such as noun, verb, or adjective, to each word (and other token). [19]
3. Statements segmentation: This part divides the clinical text into several
   statements. [19]
4. Noun phrase extraction: In this part, all noun
phrases are extracted and the complex noun
phrase is decomposed into smaller noun phrases.
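
As a rough illustration of Steps 2 and 3, the sketch below corrects misspelled tokens against a small dictionary and then applies the preprocessing steps listed above (removing brackets and tags, finding word boundaries, case folding, stop word removal). It is only a sketch under stated assumptions: `MEDICAL_DICTIONARY` and `STOP_WORDS` are toy stand-ins, since the paper does not list its dictionary or stop words, the spelling corrector uses Python's generic `difflib` matching rather than the tool used in the experiment, and stemming and part of speech tagging are left out.

```python
import difflib
import re

# Toy stand-ins (assumptions): the paper's medical dictionary and stop word
# list are not given, so a handful of illustrative entries is used here.
MEDICAL_DICTIONARY = ["enlarged", "prostate", "tender", "bladder",
                      "enuresis", "nocturnal"]
STOP_WORDS = {"there", "is", "with", "of", "the", "a", "an"}


def correct_spelling(word, dictionary=MEDICAL_DICTIONARY, cutoff=0.8):
    """Return the closest dictionary entry, or the word itself if none is close."""
    matches = difflib.get_close_matches(word.lower(), dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def segment_statements(text):
    """Split text into statements after sentence-ending punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def preprocess(statement):
    """Drop tags and brackets, tokenize, case-fold, spell-correct, remove stop words."""
    cleaned = re.sub(r"<[^>]*>|[\[\](){}]", " ", statement)
    tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", cleaned)
    tokens = [t.lower() for t in tokens]
    tokens = [correct_spelling(t) for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]
```

Splitting on sentence-ending punctuation is a simplification: abbreviations and decimal numbers in real clinical text would need a more careful segmenter.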
Figure 7 Text mining and NLP tasks

Step 4: Unified Medical Language System (UMLS) Coding
To identify the clinical information, a large repository of clinical
expressions is needed, from which the matching clinical expressions can be
extracted. UMLS is used for this purpose. The UMLS is a compendium of many
controlled vocabularies in the biomedical sciences, created in 1986. It
provides a mapping structure among these vocabularies and allows translating
among the various terminology systems. It may be viewed as a comprehensive
thesaurus and ontology of biomedical concepts. [20]

UMLS consists of the following components: [20]
- Metathesaurus, the core database of the UMLS, a collection of concepts and
  terms from the various controlled vocabularies and their relationships.
- Semantic Network, a set of categories and relationships that are used to
  classify and relate the entries in the Metathesaurus.
- Specialist Lexicon, a database of lexicographic information to be used in
  natural language processing.
- A number of supporting software tools.
Morphologically analyzed words are compared to the UMLS entries to find the
best-matched expression according to its morphological position. Each noun
phrase that matches a clinical expression entry in the UMLS is put as a pair
that contains the noun phrase with its UMLS clinical code.

Figure 8 UMLS expressions coding

The pseudo code of the UMLS coding algorithm can be:
For each Statement S in Statements   // in physician sheet
Begin
    For each noun-phrase N in S
    Begin
        If N exists in UMLS then
            Extract N and C   // where C is the UMLS code
            Put N with C as pair <N, C>
        End if
    End
End

Step 5: Classify EHR Components
The suggested technique is applied to the physician's examination sheet. The
examination sheet contains the following classes:
- History
- Examination
- Diagnosis
- Procedure
Each part is treated as a class, and all coded clinical data produced by the
previous steps is classified into one of these classes.

The first step in the classification process is building a collective set of
features that is typically called a dictionary. The UMLS clinical expressions
in dictionary form are the basis for creating a spreadsheet of numeric data
corresponding to the previously defined classes.

TABLE (1): CLASSES DICTIONARY
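
To make Steps 4 and 5 more concrete, the sketch below mirrors the coding loop of Step 4 and then builds a small class-by-code table of the kind that the classes dictionary describes. Everything named here is an assumption for illustration: `TOY_UMLS` is a hypothetical in-memory lookup with made-up placeholder codes (no real UMLS identifiers are reproduced), and the per-class code lists passed to `build_class_dictionary` are invented inputs, since the paper does not say which data the dictionary was built from.

```python
# Hypothetical stand-in for a UMLS lookup. The codes are made-up
# placeholders, not real UMLS concept identifiers.
TOY_UMLS = {
    "enlarged prostate": "CODE-0001",
    "tender base of the bladder": "CODE-0002",
    "nocturnal enuresis": "CODE-0003",
}


def code_statements(statements, umls=TOY_UMLS):
    """For each statement (a list of noun phrases), return its <phrase, code> pairs."""
    coded = []
    for noun_phrases in statements:              # For each statement S
        pairs = []
        for phrase in noun_phrases:              # For each noun phrase N in S
            key = phrase.strip().lower()
            if key in umls:                      # If N exists in UMLS
                pairs.append((phrase, umls[key]))    # Put N with C as pair <N, C>
        coded.append(pairs)
    return coded


def build_class_dictionary(codes_by_class):
    """Build a class-by-code table: 1.0 if the code was seen in the class, else 0.0."""
    columns = sorted({c for codes in codes_by_class.values() for c in codes})
    table = {
        cls: [1.0 if col in codes else 0.0 for col in columns]
        for cls, codes in codes_by_class.items()
    }
    return columns, table
```

Exact, case-insensitive matching is used here for brevity; the morphological matching described in Step 4 is not reproduced.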
Each row defines a class and each column represents a UMLS code. Each cell in
the spreadsheet represents a measurement of the feature corresponding to the
column for the class corresponding to the row. The dictionary of words covers
all the possibilities, and the number of words corresponds to the number of
columns. All cell values range between zero and one, depending on whether the
words were encountered in the class or not. The form of the classes'
dictionary is shown in table (1).

The second step is measuring the similarity between the extracted expressions
and the defined classes, then classifying each expression into the most
similar class. The cosine algorithm was selected to calculate the similarity
between the extracted clinical phrases and the predefined classes. The steps
of the cosine similarity algorithm are:
- Compute the similarity of the new clinical phrase to all classes in the
  dictionary.
- Select the class that is most similar to the new clinical phrase.
- The class which occurs most frequently is the similar one.

Figure 9: Computing similarity scores for New Clinical Phrase

For cosine similarity, only positive words shared by the compared phrases are
considered. Frequency of word occurrence is also valued. The clinical phrase
is compared with each class by the following equations: [21]

$$\mathrm{Norm}(P) = \sqrt{\sum_{j} W(j)^{2}}$$

where $W(j)$ is the weight of word $j$ in the phrase or class $P$, and

$$\mathrm{Cosine}(P_1, P_2) = \frac{\sum_{j} w_{P_1}(j) \cdot w_{P_2}(j)}{\mathrm{Norm}(P_1) \cdot \mathrm{Norm}(P_2)}$$

where $w_{P_i}(j)$ is the weight of word $j$ in $P_i$.

The cosine similarity of two classes ranges from 0 to 1. Because the weights
are non-negative, the angle between two term frequency vectors cannot be
greater than 90°; consequently, when the cosine value is close to 1, the
clinical phrase is more similar to the compared class.

Step 6: Storing data in EHR Repository
The classified clinical phrase is stored in its class inside the EHR database
with its matched UMLS code. For example, suppose a physician wrote the
following:

There is enlarged prostate with tender base of the bladder.

This statement contains two findings. The statement is compared with each
class: the cosine scores for this statement against each defined class are
calculated according to the previous equations. The winning class is the one
with the highest score. The data is stored in the winning class with its UMLS
codes as pairs inside the EHR repository:
< enlarged prostate, Finding>
< tender base of the bladder, Finding>
The EHR is thus put in a structured form for analysis and data mining
operations, or as a resource for a decision support system.

VI. THE EXPERIMENTAL STUDY
The aim of the experiment is to demonstrate the success of the suggested
technique in real-world cases. For any experiment, there are some hypotheses;
the hypotheses of this experiment are:
- The physician has little experience in using computers.
- The physician's handwriting is readable.
- The medical abbreviations used are standard.
- The experiment is applied during the examination session.

The equipment required to implement the experiment is:
- An electronic pen pad.
- A laptop or personal computer.
- Windows Vista or later
- SQL Server 2008
- Microsoft Office 2007 or later (for applying OCR on the pen pad)
- .NET Framework 4
- UMLS database system
- Medical dictionary (for spelling correction)
The implementation of the experimental study goes through the following
steps:
Step 1: At the nurse's office, the patient's demographic data is recorded
using the following screen.
Figure 10: EHR demographics form

Step 2: The physician uses the pen pad to write the diagnosis.

Figure 11: Pen pad to Computer Form

The physician has the freedom to erase, add, or modify any part of his/her
diagnosis. This step lets him/her work as usual without any additional
effort. The data is recorded directly on the computer, which helps the
physician retrieve it easily, either in its original form or as structured
data.
Step 3: After the physician finishes handwriting, he/she presses the OCR
button to convert the diagnosis from image form to machine-encoded text, as
shown in the following figure:

Figure 12: Applying OCR on the diagnosis sheet

Step 4: After the OCR is done, the system checks and corrects the spelling
errors of the examination data according to the installed medical dictionary,
through an interaction session with the physician.

Figure 13: Applying spell check on the examination text

Step 5: After the spelling correction is done, the physician presses the
"insert into EHR" button to convert the diagnosis data from unstructured to
structured form. Conversion is done through the following steps:
- Text preprocessing: All brackets, unwanted stuff, and word boundaries are
  removed.
- Parts of speech tagging: Assigning parts of speech to each word.
- Statements segmentation: The examination text is split into multiple
  statements.
- Phrase tagging: Each phrase is tagged with the suitable code to identify
  all phrases contained in the diagnosis sheet.
The output of this step is the examination words with their parts of speech;
this output has the following format:
(TOP (S (NP (DT A) (ADJP (NP (CD 15) (NNS years)) (JJ old)) (JJ female) (NN patient)) (VP (VBZ complains) (PP (IN from) (NP (JJ nocturnal) (NN enuresis))) (PP (IN since) (NP (NN birth)))) (. . .)))
(TOP (S (NP (NP (JJ Plain) (NN X-ray)) (PP (IN of) (NP (DT the) (NN abdomen)))) (VP (VBD was) (ADJP (JJ free))) (. .)))
(TOP (S (NP (JJ Abdominal) (NN ultra) (NN sonography)) (VP (VBD was) (ADJP (JJ free))) (. .)))
(TOP (S (NP (PRP he)) (VP (VBZ has) (NP (NP (NNP Enuresis)) (SBAR (S (NP (DT The) (NN patient)) (VP (MD should) (VP (VB receive))))) (: :) (NP (NP (NNP R1) (NNP Uipam) (NN tablet)) (NP (NP (CD one) (NN tablet)) (NP (RB twice) (RB daily)) (PP (IN for) (NP (CD three) (NNS months))))))) (. .)))
(TOP (S (PP (IN R2) (NP (NNP Dipripam) (CD 20) (NN mg) (NN capsule))) (NP (NP (CD one) (NN tablet)) (NP (RB twice) (RB daily)) (PP (IN for) (NP (CD three) (NNS months)))) (. .)))
(TOP (S (NP (DT R3) (NNP Depavit) (NNP B12) (NN ampule)) (. .)))

Figure 14: Output of Text mining technique

Noun Phrase Extraction:
All noun phrases are extracted and compounded. Noun phrases are divided into
smaller noun phrases, such as the following:
o A 15 years old female patient
o 15 years
o Nocturnal enuresis since birth
o Birth
o Plain X-ray of the abdomen
o Plain X-ray
o The abdomen
o Abdominal ultra sonography
o Enuresis
o The patient
o R1 Uipam tablet
o One tablet twice daily for three months
o One tablet
o Twice daily
o Three months
o Dipripam 20 mg capsule
o One tablet twice daily for three months
o One tablet
o Twice daily
o Three months
o R3 Depavit B12 ampule

Step 7: All noun phrases are coded with UMLS codes. The output of this step
is represented in table (2).

TABLE (2): NOUN PHRASES WITH THEIR UMLS CODES.

Each statement gets a score according to the UMLS codes and the classes'
dictionary declared in table (1). Table (3) shows the statements and their
scores.

TABLE (3): STATEMENTS' SCORE.

Step 8: According to the scores shown in table (3), the statements are
classified into their classes. The predefined classes are:
- History
- Examination
- Diagnosis
- Procedure
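
The cosine-similarity classification applied in Step 8 (and defined in Step 5) can be sketched as follows. This is a minimal illustration, not the authors' implementation: statements and classes are assumed to be represented as sparse vectors mapping a code to a non-negative weight, and the function names `norm`, `cosine`, and `classify` are hypothetical.

```python
import math


def norm(weights):
    """Euclidean norm of a sparse weight vector {code: weight}."""
    return math.sqrt(sum(w * w for w in weights.values()))


def cosine(p1, p2):
    """Cosine similarity; only codes present in both vectors contribute."""
    denom = norm(p1) * norm(p2)
    if denom == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    shared = set(p1) & set(p2)
    return sum(p1[c] * p2[c] for c in shared) / denom


def classify(statement, class_dictionary):
    """Score a statement vector against every class and return the best one."""
    scores = {cls: cosine(statement, vec) for cls, vec in class_dictionary.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

With non-negative weights every score falls between 0 and 1, so picking the maximum score corresponds to choosing the winning class described in Step 6.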
The classifier uses the cosine similarity algorithm to classify each
statement according to the class dictionary. Table (4) shows the score of
each statement relative to the nearest class.

TABLE (4): COS SIMILARITY SCORES FOR EACH CLASS.

Step 9: After determining the winning class for each statement, each noun
phrase with its UMLS code is saved inside the EHR in the winning class as a
paired tag. Table (5) shows this format.

TABLE (5): DATA THAT INSERTED INSIDE THE EHR

Step 10: The extracted information is compared with the physician's manual
results to determine the precision of the suggested technique.

VII. RESULTS DISCUSSION

The experimental study was conducted in four medical departments. In each
department, 10 diagnosis sheets were tested. The tested departments are:
- Surgical Oncology
- Surgery Urology
- Cardiology
- General Surgery
Table (6) shows the overall precision percentage in each of the tested
departments.

TABLE (6): RESULTS OF THE EXPERIMENTAL STUDY.

Department           Overall Precision
Surgical Oncology    92.96%
Surgery Urology      91.55%
Cardiology           92.33%
General Surgery      88.61%
Overall precision    91.36%

Some factors affect the results, such as the quality of the physician's
handwriting. The effect of this factor is clear in the result of experiment
four (General Surgery), which has the lowest precision percentage (88.61%);
the overall precision of 91.36% is the mean of the four departmental values.
A high-precision OCR tool can minimize the effect of this factor, but it may
be expensive. The results indicate that the suggested technique succeeds with
a high percentage in a real-world experiment, which means that this technique
can be applied in real life in the future.

VIII. CONCLUSION
The suggested technique succeeded in working as a bridge between unstructured
and structured medical data. The medical data is stored inside the EHR system
in its right position without any additional physical or mental effort by the
physician, which in turn satisfies the main objective of this research.

REFERENCES
[1] Institute of Medicine, “Review of the Adoption and Implementation of
    Health IT Standards by the DHHS Office of the National Coordinator for
    Health Information Technology”,
    http://www.iom.edu/Activities/Workforce/HealthITStandards.aspx
[2] Richard Dick, Elaine B. Steen, and Don Detmer, “The Computer Based
    Patient Record: An Essential Technology for Health Care”, National
    Academy Press, 1997.
[3] See HIMSS web page for the consensus definition of an electronic health
    record. http://www.himss.org/ASP/topics_ehr.asp.
[4] J.H. van Bemmel and M.A. Musen, “Handbook of Medical Informatics”,
    Springer, 1997.
[5] K. Ananda Mohan, “National Electronic Health Record Models”, Tata
    Consultancy Services (TCS), 2004.
[6] Miller, R. H. and Sim, Ida, “Physicians’ Use Of Electronic Medical
    Records: Barriers And Solutions”, Health Affairs, 2004.
[7] Waegemann, “EHR vs. CPR vs. EMR. Healthcare Informatics”, 2003.
[8] Himali Saitwal, Xuan Feng, Muhammad Walji, Vimla Patel, Jiajie Zhang,
    “Assessing performance of an Electronic Health Record (EHR) using
    Cognitive Task Analysis”, Elsevierhealth, 2010.
[9] Lisa Pizziferri, Anne F. Kittler, Lynn A. Volk, Melissa M. Honour, Sameer
    Gupta, Samuel Wang, Tiffany Wang, Margaret Lippincott, Qi Li and David W.
    Bates, “Primary care physician time utilization before and after
    implementation of an electronic health record: A time-motion study”,
    Elsevierhealth, 2004.
[10] American Academy of Family Physicians, “Family Practice Management
    Monitor”, AAFP pushes for affordable EMR system, 2004.
[11] Oleh Hrycko, “Electronic Discovery in Canada: Best Practices and
    Guidelines”, CCH, 2007.
[12] Angus Roberts, Robert Gaizauskas, Mark Hepple, George Demetriou, Yikun
    Guo, Ian Roberts, Andrea Setzer, “Building a semantically annotated
    corpus of clinical texts”, Elsevierhealth, 2009.
[13] Hanna M. Seidling, Marilyn D. Paterno, Walter E. Haefeli, David W.
    Bates, “Coded entry versus free-text and alert overrides: What you get
    depends on how you ask”, Elsevierhealth, 2010.
[14] Adam Wright, Elizabeth S. Chen and Francine L. Maloney, “An automated
    technique for identifying associations between medications, Laboratory
    results and problems”, Elsevierhealth, 2010.
[15] Ergin Soysal, Ilyas Cicekli, Nazife Baykal, “An ontology based
    information extraction system for radiological reports”, Elsevierhealth,
    2010.
[16] Christian Senger, Jens Kaltschmidt, Simon P.W. Schmitt, Markus G.
    Pruszydlo, Walter E. Haefeli, “Misspellings in drug information system
    queries: Characteristics of drug name spelling errors and strategies for
    their prevention”, Elsevierhealth, 2010.
[17] Yong-gang Cao, James J. Cimino, John Ely, Hong Yu, “Automatically
    extracting information needs from complex clinical questions”,
    Elsevierhealth, 2010.
[18] Dina Demner-Fushman, James G. Mork, Sonya E. Shooshan, Alan R. Aronson,
    “UMLS content views appropriate for NLP processing of the biomedical
    literature vs. clinical text”, Elsevierhealth, 2009.
[19] Malgorzata Marciniak, Agnieszka Mykowiecka, “Aspects of Natural Language
    Processing”, Springer, 2009.
[20] Catherine R. Selden, Betsy L. Humphreys, “Unified Medical Language
    System: Current Bibliographies in Medicine”, National Institutes of
    Health, 1990.
[21] Jiawei Han, Micheline Kamber, “Data mining: concepts and techniques”,
    Diana Cerra, 2006.
Creating an Appropriate Programming Language for
Student Compiler Project
Elinda Kajo Mece
Department of Informatics Engineering
Polytechnic University of Tirana
Tirana, Albania
ekajo@fti.edu.al
Abstract— Finding an appropriate and simple source language, to Compiler frameworks are widely used as a simple tool for
be used in implementing student compiler project, is one of implementing new languages based on existing ones. The
challenges, especially in cases when the students are not familiar complexity begins to increase if the differences between the
with high level programming languages. This paper presents a existing language and the new one become significant [4].
new programming language intended principally for beginners
and didactic purposes in the course of compiler design. SimJ, a
That is why we used Java as a base language for SimJ. For this
reduced form of the Java programming language, is designed for purpose we have chosen Polyglot [4,5] as a compiler
a simple and faster programming. More readable code, no framework for creating compiler for languages similar to Java.
complexity, and basic functionality are the primary goals of
SimJ. The language includes the most important functions and II. THE POLYGLOT FRAMEWORK
data structures needed for creating simple programs found
generally in beginners programming text books. The Polyglot
compiler framework is used for the implementation of SimJ. Polyglot is an extensible Java compiler toolkit designed for
Keywords- compiler design; new programming language; polyglot
framework
I. INTRODUCTION
A compiler course takes a significant place in computer science curricula. This course is always associated with an implementation project. Being a multidimensional course, it requires the students to be familiar with high-level programming languages, among other things. The first encounter with these high-level languages is almost always considered confusing because of their complexity. This becomes more obvious in object-oriented languages like Java [8]. Object-orientation [15] makes it hard to learn Java step by step from basic principles, because right from the beginning the learner has to define at least one public class with a method with signature public static void main(String[] args). So the teacher has two choices here: trying to explain most of the concepts involved (classes, methods, types, arrays, etc.) or just providing the surrounding program text and letting the learner add code to the body of the method main.
SimJ is a simple, Java-based programming language. It is conceived and designed to ease the teaching of basic programming to beginners. We believe that they should easily learn the basic concepts before they are exposed to more complex programming issues. It is much simpler for a new programmer to write println("Hello world") instead of writing a confusing line like System.out.println("Hello world"). This simple but concise example shows the importance of the first encounter with programming languages. The role of SimJ is to make this encounter less “painful”.
experimentation with new language extensions. The base polyglot compiler, jlc ("Java language compiler"), is a mostly-complete Java front end [1]; that is, it parses [1,2] and performs semantic checking on Java source code. The compiler outputs Java source code. Thus, the base compiler implements the identity translation. Language extensions are implemented on top of the base compiler by extending the concrete and abstract syntax and the type system [4].
After type checking the language extension, the abstract syntax tree (AST) [1,14] is translated into a Java AST and the existing code is output into a Java source file which can then be compiled with javac.
Polyglot supports the easy creation of compilers for languages similar to Java. The Polyglot framework is useful for domain-specific languages, exploration of language design, and for simplified versions of Java for pedagogical use. As mentioned above, the last of these uses is the focus of this paper.
A Polyglot extension is a source-to-source compiler that accepts a program written in a language extension and translates it to Java source code [4,5]. It may also invoke a Java compiler such as javac to convert its output to bytecode [13]. A SimJ-oriented view of this process, including the eventual compilation to Java bytecode, is shown in figure 1.
Figure 1. The Polyglot Compiler Framework Architecture
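As a rough illustration of the translation of an extension AST into a Java AST, the following Java sketch rewrites a tiny SimJ-like tree into a Java tree without modifying the input. The Node, Call, Block and SimJToJavaPass types are hypothetical names introduced only for this example; they are not part of Polyglot's API, and the mapping of print and println onto System.out is an assumption about what such a translation could produce.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical, minimal immutable AST; these types are NOT Polyglot's API. */
interface Node {
    String toSource();
}

/** A call statement such as print("hi") or System.out.print("hi"). */
final class Call implements Node {
    final String target;   // e.g. "print" or "System.out.print"
    final String argument; // argument kept as raw source text for brevity

    Call(String target, String argument) {
        this.target = target;
        this.argument = argument;
    }

    @Override
    public String toSource() {
        return target + "(" + argument + ");";
    }
}

/** A sequence of statements, stored in an unmodifiable list. */
final class Block implements Node {
    final List<Node> statements;

    Block(List<? extends Node> statements) {
        this.statements = Collections.unmodifiableList(new ArrayList<Node>(statements));
    }

    @Override
    public String toSource() {
        StringBuilder sb = new StringBuilder();
        for (Node s : statements) {
            sb.append(s.toSource()).append('\n');
        }
        return sb.toString();
    }
}

/** A pass that returns a new tree and never mutates the tree it is given. */
final class SimJToJavaPass {
    static Node rewrite(Node node) {
        if (node instanceof Call) {
            Call c = (Call) node;
            // Assumed mapping: SimJ print/println become System.out.print/println.
            if (c.target.equals("print") || c.target.equals("println")) {
                return new Call("System.out." + c.target, c.argument);
            }
            return c;
        }
        if (node instanceof Block) {
            List<Node> out = new ArrayList<Node>();
            for (Node s : ((Block) node).statements) {
                out.add(rewrite(s));
            }
            return new Block(out);
        }
        return node;
    }

    public static void main(String[] args) {
        List<Node> stmts = new ArrayList<Node>();
        stmts.add(new Call("println", "\"Hello world\""));
        Node simjAst = new Block(stmts);
        Node javaAst = rewrite(simjAst);
        System.out.print(javaAst.toSource());
    }
}
```

Running main prints System.out.println("Hello world"); while the original SimJ tree is left untouched, which mirrors the non-destructive style of rewriting in a very reduced form.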
36 http://sites.google.com/site/ijcsis/
ISSN 1947-5500
(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 9, No. 6, 2011
The first step in compilation is parsing input source code to produce an AST. Polyglot includes an extensible parser generator, PPG [5], which allows the implementer to define the syntax of the language extension (SimJ in our case) as a set of changes to the base grammar for Java [7]. The extended AST may contain new kinds of nodes either to represent syntax added to the base language or to record new information in the AST.
The core of the compilation process is a series of compilation passes applied to the abstract syntax tree. Both semantic analysis and translation [1] to Java may comprise several such passes. The pass scheduler selects passes to run over the AST of a single source file, in an order defined by the extension, ensuring that dependencies between source files are not violated. Each compilation pass, if successful, rewrites the AST, producing a new AST that is the input to the next pass. A language extension may modify the base language pass schedule by adding, replacing, reordering, or removing compiler passes. The rewriting process is entirely functional; compilation passes do not destructively modify the AST.
Compilation passes do their work using objects that define important characteristics of the source and target languages. A type system object acts as a factory for objects representing types and related constructs such as method signatures [4,5]. The type system object also provides some type checking functionality. A node factory [4] constructs AST nodes for its extension. In extensions that rely on an intermediate language, multiple type systems and node factories may be used during compilation. After all compilation passes complete, the usual result is a Java AST. A Java compiler such as javac is invoked to compile the Java code to bytecode.
III. SIMJ PROGRAMMING LANGUAGE
SimJ (which stands for Simple Java) is a simplified version of the Java programming language conceived especially for beginners. The language is very simple, easy to learn and very similar to Java. Previous work has been done in this field (e.g. the J0 programming language [12]), but these languages are quite different from Java in syntax [7]. We think that similarity with Java is very important in order to allow programmers to switch to Java without any syntax problems when they feel ready to explore its full potential and advanced features.
Figure 2 shows an example of the same code written in Java and in SimJ. This example shows, as mentioned above, that the code in SimJ is clearly more readable than the one in Java. Generally, programming courses and textbooks for beginners include many programs that require input from the user during their execution. In Java this part is definitely neither simple nor easy to implement at the beginning level. We address this problem by removing the complex part and leaving only the “understandable” one (i.e. readLine()).
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class A {
        public static void main(String[] args) {
            try {
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(System.in));
                System.out.print("Your name:");
                String name = reader.readLine();
                System.out.print("\nHello, " + name + "!");
            }
            catch (IOException ioexception) {
                System.out.println(ioexception);
            }
        }
    }

    class A {
        main() {
            print("Your name:");
            String name = readLine();
            print("\nHello, " + name + "!");
        }
    }
Figure 2. Example code written in Java and SimJ
The simplified versions of the printing methods are an obvious choice, since these methods are almost always used in simple programs. It is also important to mention that, compared to Java, the structure of the program is unchanged, thus preserving its object-oriented character.
Another important goal of this language is to help the teaching of compiler design [1].
The SimJ language specification [3,10,11], shown in figure 3, is simple and short, and is equipped with the fundamental and most used parts of a programming language at the beginning level [7,9]. Related work (e.g. MiniJava [1]) shows that simplicity is the primary characteristic of these languages.
As mentioned previously, we think that similarities with Java are important, but such languages should also keep their own identity. In MiniJava, for example, System.out.println(), the same as in Java, is defined to do the printing, but System.out itself has no defined meaning in the language. With SimJ we try to address these problems by creating a simple but well defined language that, syntactically speaking, is not merely a reduced copy of the parent language but has its own identity.
    Program ::= MainClass ( Class )*
    MainClass ::= "class" Identifier "{" "main" "(" ")" "{" Statement "}" "}"
    Class ::= "class" Identifier "{" ( Variable )* ( Method )* "}"
    Variable ::= Type Identifier ";"
    Method ::= Type Identifier "(" ( Type Identifier ( "," Type Identifier )* )? ")" "{" ( Variable )* ( Statement )* "return" Expression ";" "}"
    Type ::= "boolean"
        | "int"
        | "char"
        | "string"
        | "int" "[" "]"
        | Identifier
    Statement ::= "{" ( Statement )* "}"
        | "if" "(" Expression ")" Statement "else" Statement
        | "while" "(" Expression ")" Statement
        | "for" "(" Expression ";" Expression ";" Expression ")" Statement
        | "switch" "(" Expression ")" "{" ( "case" Expression ":" Statement "break" ";" )* "default" ":" Statement "}"
        | "print" "(" Expression ")" ";"
        | "println" "(" Expression ")" ";"
        | "readLine" "(" ")" ";"
        | "readInt" "(" ")" ";"
        | Identifier "=" Expression ";"
        | Identifier "[" Expression "]" "=" Expression ";"
    Expression ::= Expression ( "||" | "&&" | "<" | ">" | "!=" | "==" | "+" | "-" | "*" | "/" ) Expression
        | Expression "[" Expression "]"
        | Expression "." Identifier "(" ( Expression ( "," Expression )* )? ")"
        | <INTEGER>
        | <STRING>
        | <CHARACTER>
        | "true"
        | "false"
        | Identifier
        | "this"
        | "new" "int" "[" Expression "]"
        | "new" Identifier "(" ")"
        | "!" Expression
        | "(" Expression ")"
    Identifier ::= <IDENTIFIER>
Figure 3: SimJ language specification
This is an important point that helps to reduce possible ambiguities and makes the language more understandable.
SimJ includes the basic building blocks of a programming language. From this point of view it is quite similar to Java [7,8]. We have implemented the basic primitive data types (figure 3):
• boolean – true or false
• int – integers
• char – characters
• string – sequence of characters (for simplicity, string is considered a primitive data type in SimJ)
• int[] – array of integers
The most commonly used control flow statements [8,9] are implemented in SimJ (figure 3). Their syntax is the same as in Java, since they have no redundant complexity to be removed:
• if else
• for
• while
• switch
Principal operators [8,9] are also present in SimJ. These include: addition, subtraction, multiplication, division, logical and, logical or, logical not, less than, greater than, not equal, equal.
IV. IMPLEMENTATION
For the implementation of SimJ we have used Polyglot as a framework that improves and simplifies compiler design for languages similar to Java. This process consists of creating a new language extension. Extensions (in our case SimJ) usually have the following subpackages [5]:
• ext.simj.ast – AST nodes specific to the SimJ language.
• ext.simj.extension – New extension and delegate objects specific to SimJ.
• ext.simj.types – Type objects and typing judgments specific to SimJ.
• ext.simj.visit – Visitors specific to SimJ.
• ext.simj.parse – The parser and lexer for the SimJ language.
In addition, our extension defines the class ext.simj.ExtensionInfo [5], which contains the objects that define how the language is to be parsed and type checked. There is also a class ext.simj.Version [5], which specifies the version number of SimJ. The Version class is used as a check when extracting extension-specific type information from .class files.
The design process of SimJ includes the following tasks [5]:
• Syntactic differences between SimJ and Java are defined based on the Java grammar found in polyglot/ext/jl/parse/java12.cup.
• Any new AST nodes that SimJ requires are defined based on the existing Java nodes found in polyglot.ast (interfaces) and polyglot.ext.jl.ast (implementations).
• Semantic differences between SimJ and Java are defined. The Polyglot base compiler (jlc) implements most of the static semantics of Java as defined in the Java Language Specification [7].
• Translation from SimJ to Java is defined. The translation produces a legal Java program that can be compiled by javac.
We implement SimJ by creating a Polyglot extension with the characteristics described above. Implementation follows these steps [5]:
• build.xml is modified and a target for SimJ is added. This is done based on the skeleton extension found in polyglot/ext/skel. Running the customization script polyglot/ext/newext copies the skeleton to polyglot/ext/simj, and substitutes our language's name at all the appropriate places in the skeleton.
• A new parser is implemented using PPG. This is done by modifying
polyglot/ext/simj/parse/simj.ppg using the SimJ syntax.
• The required new AST nodes are implemented. The node factory polyglot/ext/simj/ast/SimJNodeFactory_c.java is modified in order to produce these nodes.
• Semantic checking for SimJ is implemented based on its rules.
• The translation from SimJ to Java is implemented based on the translation defined above. This is implemented as a visitor pass that rewrites the AST into an AST representing a legal Java program.
V. CONCLUSIONS
Our motivation for creating SimJ was to provide a simple, understandable and easy to learn programming language similar to Java that improves the learning of basic programming structures and serves as a source language exemplar for student compiler projects. We found that the existing approaches did not fully address the problem of a simplified, Java-like structured language that is not merely a reduced copy of Java. Our language is simple but improves on existing solutions by merging their advantages and trying to avoid their weak points.
Having used the Polyglot framework to build the compiler, we conclude that it is an effective and easy way to produce compilers for Java-like languages such as SimJ. It is simple and has a well defined structure, thus offering the possibility to generate a base skeleton for new language extensions to which we can add the desired specifications.
Our language, SimJ, is a well structured, simplified version of the Java programming language that is not merely a reduced copy of it. SimJ could be used by beginners who want to learn Java but do not know anything about object-oriented programming. It is also a good choice for learning compiler design because of its well defined and easy to implement structure.
REFERENCES
[1] Appel, A.W., Palsberg, J. (2002). Modern Compiler Implementation in Java (2nd ed.). Cambridge University Press.
[2] Metsker, S.J. (2001). Building Parsers with Java. Addison Wesley.
[3] Slonneger, K., Kurtz, B.L. (1995). Formal Syntax and Semantics of Programming Languages, A Laboratory Based Approach. Addison Wesley.
[4] Nystrom, N., Clarkson, M.R., Myers, A.C. (2003). Polyglot: An Extensible Compiler Framework for Java. Retrieved January 20, 2007, from http://techreports.library.cornell.edu:8081/Dienst/UI/1.0/Display/cul.cs/TR2002-1883.
[5] Cornell University, Department of Computer Science. (2003). How to Use Polyglot. Retrieved January 20, 2007, from http://www.cs.cornell.edu/projects/polyglot/.
[6] Cornell University, Department of Computer Science. (2003). PPG: A Parser Generator for Extensible Grammars. Retrieved January 20, 2007, from http://www.cs.cornell.edu/projects/polyglot/.
[7] Gosling, J., Joy, B., Steele, G., Bracha, G. (2005). The Java Language Specification (3rd ed.). Addison Wesley.
[8] Arnold, K., Gosling, J., Holmes, D. (2005). The Java Programming Language (4th ed.). Addison Wesley Professional.
[9] Kernighan, B.W., Ritchie, D.M. (1988). The C Programming Language (2nd ed.). Prentice Hall.
[10] Clinger, W., Rees, J. (2001). Report on the Algorithmic Language Scheme. Retrieved January 24, 2007, from http://www-swiss.ai.mit.edu/~jaffer/r4rs_toc.html.
[11] Krishnamurthi, Sh. (2006). Programming Languages: Application and Interpretation. Retrieved January 28, 2007, from http://www.cs.brown.edu/~sk/Publications/Books/ProgLangs/.
[12] Cornell University, Department of Computer Science. (2003). J0: A Java Extension for Beginning (and Advanced) Programmers. Retrieved January 20, 2007, from http://www.cs.cornell.edu/Projects/j0/.
[13] Lindholm, T., Yellin, F. (1999). The Java Virtual Machine Specification (2nd ed.). Addison Wesley.
[14] Jones, J. (2003). Abstract Syntax Tree Implementation Idioms. Retrieved February 6, 2007, from http://jerry.cs.uiuc.edu/~plop/plop2003/Papers/.
[15] Ambler, S.J. (2006). Introduction to Object-Orientation and UML. Retrieved February 11, 2007, from http://www.agiledata.org/essays/objectOrientation101.html.
[16] O’Docherty, M. (2005). Object-Oriented Analysis and Design: Understanding System Development with UML 2.0. John Wiley & Sons.
[17] Graver, J.O. (1992). The Evolution of an Object-Oriented Compiler Framework. Retrieved January 30, 2007, from http://cs.ubc.ca/rr/proceedings/spe91-95/spe/vol22/issue7/spe767jg.pdf.
The History of Web Application Security Risks
Fahad Alanazi, Software Technology Research Laboratory, De Montfort University, Leicester, LE1 9BH UK, P0800238x@mydmu.ac.uk
Mohamed Sarrab, Software Technology Research Laboratory, De Montfort University, Leicester, LE1 9BH UK, msarrab@dmu.ac.uk
Abstract: This article refers generally to current web application risks that are causing public concern and piquing the interest of many scientists and organizations as a result of an increase in attacks. The primary concern of many governments, organizations and companies is data loss and theft. Thus, these organizations are seeking to secure their web applications against vulnerabilities. Awareness of the vulnerabilities of web applications leads to recognition of the need for improvements. The three main facets of web security are: confidentiality, integrity and safety of content, and continuity. This paper identifies and discusses ten web application vulnerabilities, detailing the opinions of researchers and OWASP regarding risk assessment and protection.
I. INTRODUCTION
The Internet is a fascinating and multi-faceted technology, opening a window on the world by allowing people across the globe to access information simply and quickly, and allowing them to broadcast their ideas and culture, communicate and access research data from anywhere. It is now even seen as a form of e-government, based on its achievements in the last four years and the acquisition of 300 million users.
However, the Internet lacks geographic borders or national controls, and this has led to concerns about the security of conducting business online. Indeed, there are those who expend considerable effort in seeking to penetrate websites and steal important information from them, justifying apprehension amongst the owners of this information and electronic service providers. Therefore, companies are doing their utmost to maintain the confidentiality, privacy and accuracy (integrity) of the information they hold; systems can now be protected in a number of ways, and some of the programs that have helped in intrusion detection and reducing viruses have somewhat eased the trepidation of network users.
Recently attackers have turned their focus to web applications, which allow surfing, shopping, communication with companies in other countries, etc. This is because these applications rely on databases to facilitate the exchange and distribution of information. These applications have an increasing number of users, increasing their attractiveness to attackers, despite the numerous programmers and developers employed to protect them. This paper will identify and discuss ten web application vulnerabilities which constitute a threat to web application security, assessing information provided by researchers and OWASP regarding risk assessment and protection.
II. INJECTION FLAWS
In 2007 OWASP [30] mentioned numerous injection flaws, including SQL, LDAP, XPath, XSLT, HTML, XML and OS command injection, with SQL being the most common of such injection types. In 2004 OWASP [29] cited the main cause of vulnerability in web applications to be their use of features of the operating system and external programs to implement functions. This enables attackers to exploit information passed in from an HTTP request to inject malicious code as the web application passes that information through.
The attack occurs when data is sent to the interpreter after the user has initiated a command or query. The attacker exploits this situation by injecting malicious code alongside the command or query, which enables full access to the system, bypassing any protection and calling for data from operating systems and databases. OWASP in 2010 [31] described this type of attack as the attacker sending simple text that exploits the syntax of the targeted interpreter. Almost any source of data can be an injection vector, including internal sources. This flaw is typically found in SQL queries, LDAP queries and OS commands [21].
Recommendations
• Avoid using interpreters if possible.
• Validate input.
• Avoid detailed error messages that may be useful to an attacker.
• Reject all script injection (Gregory, 2009).
SQL Injection
SQL injection is common among injection flaws, and yet applications that are vulnerable to it are used in our daily
lives, with users relying on their safety, e.g. for making bookings and paying bills. As the number of such applications increases, so does the sophistication of the attacks that target them. Hackers use many methods to exploit defects in web applications; of these, SQL injection is one of the easiest and most dangerous, potentially damaging the whole system.
SQL injection is an attack in which SQL code is inserted or appended into application user input parameters that are later passed to a back-end SQL server for parsing and execution [8]. SQL injection is a serious threat to any site or application that contains a database; by injecting SQL code alongside the application's own code and having it executed, attackers can gain unauthorized access to private databases containing important and secure information, thus compromising the integrity of sensitive data by allowing for alteration or deletion [2]. SQL injection attacks also affect authentication processes, impinging on the verification of user identity and allowing attackers to connect to the system without the password by injecting into the query language.
Preventing SQL injection
• Escape single quotation marks in string input: each single quotation mark should be replaced by two single quotation marks [10].
• Check input fields for single quotation marks; if a single quotation mark is found, it should be removed.
• Detect and remove TSQL comments such as -- and /**/, because these comments might damage the data.
• Detect and verify TSQL keywords such as SELECT, which might be used to query specific elements.
• Validate input on both the client and the server.
• Use elaborate SQL constructs that might cause errors and impede the execution of injected code.
• Check system records for the number of users that do and do not have an account in the system, and compare these numbers to detect any unauthorized access to the system.
• Use a secure policy for the system by determining permissions, for example limiting some permissions to only reading and writing [16].
III. Cross Site Script (XSS)
Cross site scripting is another intrusion method that manipulates the web browser to display malign code, which then executes in the user’s session. This can be done in a number of ways, typically in Hypertext Markup Language (HTML) [15]. Cross site scripting can be used in a number of ways, from theft of a cookie to taking over an entire session. This is referred to as an intruder guided attack [18]. Insertion of a script into a field can be an efficient attack, but circumventing the filter can be a problem. Cross site scripting uses an array of methods for abuse and intrusion [15].
According to Ciampa [11] a Cross Site Script (XSS) attack is characterized by the use of special engineering, allowing the attacker, through the use of the JavaScript language, to extract important information from the victim before utilizing it. Lopez and Hammerli [24] argue that XSS is targeted at the web application’s site and uses either stored XSS or reflected XSS. The hackers attempt to attack users’ browsers and take control with malicious script. When an attack is successful, the attacker can access important resources in the web application, e.g. cookies.
According to Belapurkar et al. [5] these attacks rely on users inputting information, which means attackers can inject dangerous code whilst inputting data to gain access to the site. XSS often occurs when the web application requires input via a username and password page, as attackers can benefit from this by tricking the user. In addition, any script entered in form fields or in a URL is likely to expose the site to this type of attack. XSS depends on injecting client-side script, leading to account theft and changes to the content on a page. XSS occurs when the web application fails to escape user-submitted content properly before rendering it into HTML [19].
OWASP cited the ability of attackers to use XSS to send malicious code or script to an unsuspecting user, affecting sensitive and important information that the browser has maintained, as well as cookies and session tokens. The malicious script can rewrite the contents of the HTML page because the browser does not know the origin of the script, or whether it can be trusted. OWASP divided this type of attack into two categories:
• Stored: This attack occurs through injection of malicious code or script into the target server, where it is stored permanently in messages, comment forums, databases, etc. When the user requests the stored information, the malicious script is delivered from the server to the user.
• Reflected: This is the most common type of attack and is reflected off the web server, as in an error message. This type of attack tricks the user when they click on links where malicious script or code has been entered.
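A common countermeasure to both categories is to encode user-supplied text before it is written into an HTML page, so that the browser treats it as data rather than markup. The following Java sketch shows the idea; the HtmlEscaper class and its escape method are hypothetical names introduced for illustration, and a production application would more likely rely on a vetted encoding library than on hand-written code.

```java
/** Minimal HTML output encoder; a sketch only, not a complete XSS defence. */
final class HtmlEscaper {

    /** Replaces the characters that are significant in HTML text and attribute values. */
    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Text as an attacker might submit it in a comment form (illustrative only).
        String comment = "<script>alert('x')</script>";
        // The encoded text is displayed by the browser but not executed as script.
        System.out.println("<p>" + escape(comment) + "</p>");
    }
}
```

With this encoding applied at output time, the submitted script is rendered as visible text: main prints <p>&lt;script&gt;alert(&#39;x&#39;)&lt;/script&gt;</p>. Encoding on output complements, rather than replaces, validation of the input itself.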
OWASP highlights the dangers of disclosure. When attackers hijack a user’s session, full control is gained and the attacker can access end user files. The attacker can also redirect the user to other pages or sites and can modify the presentation of content by installing Trojan programs. Therefore, OWASP recommends validating inputs and filtering scripts, because most XSS attacks occur in JavaScript. An XSS attack is dangerous for applications and servers due to the fact that most of these display simple web pages that contain errors such as 500 “internal server error”. These may include information which enables attackers to corrupt the server and the user’s browser by a reflected attack.
In 2007, OWASP [30] referenced cross site scripting as a subset of HTML injection. In this type of attack the victim’s browser is exploited by the attacker through script executed in user sessions. Malicious scripts are usually written in JavaScript, but any scripting language supported by the victim’s browser may be vulnerable to this type of attack. OWASP described web applications as vulnerable to three types of XSS attack:
• Reflected XSS: The easiest for exploiting the page.
• Stored XSS: The most dangerous type; it takes hostile data, stores it within a file or database, then at a later time displays the data to the user without a filter to detect input to the website.
• DOM based XSS: The JavaScript and variables are manipulated rather than HTML elements.
OWASP did not concentrate only on these three types; in addition, there is a possibility of risky and unpredictable browser behaviors which may lead to attack. XSS may affect any components that the browser uses.
JavaScript allows for attack due to its strengths as a programming language, which allow manipulation of the rendered page by adding new elements, manipulating the internal DOM, and changing or deleting the page. Additionally, this type of attack permits use of XMLHttpRequest, because attackers can circumvent the browser and forward the victim’s data to aggressive sites, then create malicious code to force the browser open for a long period of time.
Recommendations
• Encode sensitive data.
• Validate input data for length.
• Do not rely on a blacklist to detect XSS in input.
• Before using any untrusted data, HTML tags should be removed [14].
• Use an XSS filter to detect any malicious code [23].
• Avoid special characters in input boxes, such as < >, " ", %, ; and ), because these characters can help the attacker to acquire sensitive data.
• Limit the data that might be a part of a scripting attack [17].
IV. Buffer Overflow
Buffer overflow is an attack that occurs when web applications have no control over input that might contain commands, encoding or improper formats. The attacker uses buffer overflow by inputting data that overruns the memory space used by the operating system [6]. Dubrawsky [12] argued that buffer overflow happens when the attacker inputs more information into the buffer (a holding area for data) than it can handle. Buffer overflow attacks rely on programming languages such as C and C++.
Buffer overflow occurs when the size of the data written exceeds the memory allocated for a buffer, as a result of failure to limit the inputted information. Furthermore, it occurs when web applications use low-level programming languages, because these languages do not perform automated bounds checking.
Buffer overflow can happen if the length of data is not checked when copying it into the buffer from another source, e.g. a network socket [7]. This supports Wells’ [35] argument that storage flaws affect web application security. According to Wells, security measures which include data encryption must be employed, because web applications could contain sensitive information.
Buffer overflows are in essence a technique used when data is written into a fixed-size memory block, resulting in memory around the destination buffer becoming jammed and over capacity. This would give the intruder access to parts of the processing memory, allowing for the entry of malign code [13]. This involves writing data to places in the memory stack that contain information about the operating system; if this data is accessed and overwritten, this usually results in the machine crashing and the system resetting. The intruder can also make the process memory point to his code, which could result in passwords being accessed or new accounts being created [9]. The best way to overcome this kind of attack is to completely avoid using a memory management system [13].
OWASP [29] referred to web application components being improperly validated in some languages, leading to buffer overflow attacks to access the system. This type of attack is difficult to detect and, once discovered, difficult to eradicate. Buffer overflow can be found in the web application itself or in both the web server and application server products that serve the static and dynamic aspects of the site. It can be found in custom web application code, but detection of buffer overflow flaws is less
likely in custom web applications. If a flaw is discovered in a custom application, the attacker’s ability to exploit it is reduced, because the source code and detailed error messages for the application are normally not available to the attacker.
To determine if the server products are vulnerable there should be a review of all code that accepts input from users via the HTTP request, to ensure that it can properly handle arbitrarily large input and that it provides appropriate size checking on all such inputs.
Buffer overflow was not mentioned in OWASP [30] or [31] because it can be detected by an Intrusion Detection System (IDS), implemented in software, hardware or a combination of both. There are two types of IDS:
• Network intrusion detection systems: These can capture data packets travelling on the network.
• Host-based intrusion detection systems: These can look into the system and application log files to detect any intruder activity.
Recommendations
• Do not use the C and C++ programming languages when building a web application [32].
• Limit input data to prevent long input strings that might include malicious code [17].
V. Insecure Cryptographic Storage
Web applications sometimes use cryptographic functions in order to secure data. This is not an easy thing to do: unless these functions are coded properly, they can only offer a weak form of protection. Applications that do not offer a good level of protection often use inappropriate ciphers. Thus, it is advisable to ensure that everything that is to be encoded is encoded [21].
Recommendations:
• One should use only approved public algorithms. These include AES, RSA and public key cryptography.
• Store private keys with care. Try not to transmit keys over channels that are not guaranteed secure [21].
In 2004 OWASP [29] highlighted this type of attack because most web applications need to store sensitive and important information such as passwords and account records in a file system or database. Web application developers thus resort to encryption to protect this important information. However, some developers have made mistakes whilst integrating
where mistakes have commonly been made: unencrypted critical data; insecure storage of keys, certificates, and passwords; improper storage of secrets in memory; poor sources of randomness; poor choice of algorithms; attempting to invent new encryption algorithms; and failure to include support for encryption key changes and other required maintenance procedures. Therefore, all websites which use encryption to protect sensitive and important information in storage and transit are vulnerable to these kinds of attacks.
Detection of these flaws takes place in the following way: examine tokens, session IDs, cookies and other credentials to see if they are obviously not random. As a means of protection from this type of attack, OWASP recommended asking users to re-enter data rather than storing it. OWASP also proposed, where a need to use encryption exists, using a library that is exposed to public scrutiny and making sure that there are no open vulnerabilities [26, 29].
In 2007 OWASP [30] cited failure to encrypt sensitive information in web applications to be the result of poorly designed cryptography. Many associated cryptographic flaws involve the use of inappropriate ciphers or the misuse of strong ciphers, which may lead to the discovery of sensitive data. As a result OWASP mentioned that all web applications are vulnerable. The most common problems in 2007 were: not encrypting sensitive data; using home grown algorithms; insecure use of strong algorithms; continued use of known weak algorithms (MD5, SHA-1, RC3, RC4, etc.); hard coding keys; and storing keys in unprotected stores. OWASP [31] again stated that the most common flaws relate to not encrypting data; however, due to limited access, precise flaws are difficult to determine.
Recommendations
• Use only public algorithms.
• Avoid using weak algorithms.
• Infrastructure credentials for web applications, such as database credentials, should be securely encrypted [21].
• To protect against insecure storage one must use proper encryption and access control for all data that is stored [17].
VI. Cross Site Request Forgery (CSRF)
Cross Site Request Forgery (CSRF) relies on an XSS attack to input dangerous code to the end user’s browser. This type of attack does not target the site in which the malicious code is planted but tricks the user into accessing other sites. CSRF affects web applications because it allows the attacker to
encryption into their web applications, they have also failed to change the victim’s stored information e.g. password
focus on other aspects of the site. There are several areas [13].Holovaty and Kaplan-Moss [19] show that CSRF occurs
43 http://sites.google.com/site/ijcsis/
ISSN 1947-5500
(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 9, No. 6, June 2011
when a malicious site tricks users into loading a URL from a site on which they are already authenticated, thereby taking advantage of their authenticated status. According to Kategorileri [21], Broken Authentication and Session Management cause privacy violations. These flaws might lead to the hijacking of administrative or user accounts, given that there is no protection for credentials and session tokens throughout a web application’s lifecycle.

A cross site request forgery is an intrusion in the form of a request for a page that appears to be sent by a trusted user. One common example is an image embedded in a page whose source is a link to a PHP script [4, 15, 33]. Such intrusions can be used to gain entry to password-protected parts of a website. If a user who is logged onto a web application can be convinced to load malign JavaScript, that script can take over the user’s session by issuing a forged POST using the user’s existing session [22].

Cross site request forgery can also be initiated by sending a fake HTTP request from the user’s session. The request carries information such as the user’s session cookie and other authorisation information. It is then passed on to a vulnerable web application, which treats the intrusion as a genuine request [31].

In 2007 OWASP [30] mentioned that most web applications rely solely on automatically submitted credentials, such as session cookies, basic authentication credentials, source IP addresses, SSL certificates, or Windows domain credentials. Therefore web applications are at risk. In addition, cross site request forgery has several other names, including Session Riding and One-Click Attacks. All web application frameworks in 2007 were vulnerable to cross site request forgery attacks.

A typical CSRF attack takes place against a forum and directs the user to invoke some function, such as the logout page. Attackers can also force users, without their consent, to make changes to their DSL router. These attacks work because of the user’s authorisation credentials, typically the session cookie; if those credentials were not submitted automatically with the request, the attack would fail.

OWASP mentioned that Cross Site Scripting (XSS) flaws are not required for Cross Site Request Forgery (CSRF) to work. However, any web application with XSS flaws is susceptible to CSRF, because a CSRF attack can exploit XSS flaws to steal any non-automatically submitted credential. Defences against CSRF should therefore include eliminating XSS vulnerabilities in applications, because XSS flaws can circumvent most CSRF defences.

OWASP recommended protecting a web application from this attack by generating and then requiring some type of authorisation token that is not automatically submitted by the browser. OWASP [30] therefore contended that applications failing to use unique tokens in requests were open to this type of attack. In the 2004 and 2007 versions [29, 30], this type of attack is based on sending forged requests by means of images, XSS flaws and other techniques that trick the user. Thus the attacker is able to perform actions and change data, whilst the victim is unable to carry out permitted authorised functions. OWASP remarked that multistep transactions are not safe either, because attackers can forge a series of requests by using JavaScript or multiple tags.

To verify whether the application is vulnerable, each link and form should be checked for an unpredictable token for each user; without one, attackers can predict the details of a particular action. Therefore OWASP recommended that unique tokens be inserted per user session and per request, thus removing the attacker’s ability to predict the URL, HTML request and user session details for a particular action [27].

The conclusions drawn by OWASP in 2010 [31] indicated that where the token is not unique, JavaScript or multiple tags help attackers to exploit the web application by predicting the URL, HTML request and user session details and acquiring sensitive data. In addition, multistep transactions, which can be driven by JavaScript or multiple tags, should be considered unsafe.

Recommendations

• Every form should have a special token [22].

• Fill variables with valid data and escape them [25].

• Encrypt the session [1].

• Use POST rather than GET [34].

• Do not click any link you do not recognise, because it might be used to send malicious requests to other applications the user is logged into [13].

• Use browser tools, such as TG, to avoid and block any change of user authentication by the website [20].

VII. Broken Authentication and Session Management

Another weakness that could make one’s website vulnerable is improper protection of the authentication apparatus, which is described as broken authentication. Broken session management relates to functions such as logout and timeout. Application functions that relate to session management, if not implemented properly, allow intruders to compromise passwords and keys, and consequently to assume the identity of the user [17].

Session management controls access to web applications and their information, and should ideally be capable of protecting administrator credentials, such as username and password details. Organisations can demand
customised authentication, but this can lead to intruder sessions being authorised, although this can be countered by using built-in security systems, such as SSL encryption [28].

In 2007, OWASP [30] identified flaws in the area of authentication and session management as related to the lack of session token protection in the web application. These flaws can result in privacy violations through the hijacking of user or administrative accounts. All web application frameworks were found to be vulnerable to authentication and session management flaws at this time.

Weaknesses usually occur with ancillary authentication functions such as logout, remember me and account update. In 2010, OWASP [31] stated that flaws within authentication and session management enable external attackers, and users who have accounts on the site, to steal information from other accounts and hide their actions. Attackers impersonate users by exploiting leaks in the authentication or session management functions, such as exposed accounts, session IDs and passwords.

Recommendations

• Do not accept new or invalid session identifiers from the URL or in requests.

• Limit or rid your code of custom cookies for authentication or session management purposes.

• Use simple and secure authentication mechanisms.

• Use a strong password policy.

• Start the login process from an encrypted page.

• Make sure all client-side cookies and server-side session state are destroyed on logout.

• Users should enter their old password when changing to a new password.

• Use limited-time-only random numbers to reset access and send a follow-up e-mail as soon as the password has been reset. Beware of self-registered users changing their e-mail address: send a message to the previous e-mail address before enacting the change [21].

• Prevent users from manipulating authentication and session management to bypass security controls [17].

VIII. Insecure Direct Object References

These flaws result from developer error that exposes a direct object reference such as a database key or directory. A direct object reference can occur when a developer leaves access to an object on the server, such as a data file or database key. This can be countered by means of an authorisation check; if the check is not performed, intruders can alter references to these files, causing havoc to these systems [31]. The vulnerability can appear when authorisation checks have been restricted or even stopped, or where programmers use object references directly in the web interface with no validation checks.

Insecure Direct Object Reference allows an attacker to access other objects in the web application without authorization by manipulating direct object references. Furthermore, this type of attack occurs when a reference to an internal implementation object, i.e. a database record, form parameter or URL, is exposed.

OWASP [30] mentioned flaws that can occur when a direct object reference, such as a URL, form parameter or database record, is exposed by a developer. An attacker could access the object without authorization through manipulation of direct object references, unless an access control check has been put in place. OWASP also mentioned that many applications expose internal object references to users, enabling attackers, through parameter tampering, to violate access control policy by changing the references.

In 2010, OWASP [31] mentioned flaws that occur when developers expose to the user references to an internal implementation object such as a database key, directory or file. The attacker can therefore gain access to unauthorized data through manipulation of references, due to the absence of protection or access control checks. These flaws persist in web applications because many applications use the actual name or key of an object when generating web pages and do not verify that the user is authorized for the target object.

Recommendations

• Do not expose private object references to users.

• Validate any private object references.

• Verify authorization to all referenced objects.

• Check input for attack patterns [21].

IX. Insecure Communications

OWASP highlighted the need to protect sensitive communications, because failure to do so allows sensitive data to be exposed. Applications often fail to encrypt network traffic
that exposes an authentication or session token. Therefore encryption should be used for all authenticated connections and web pages that are accessible. All web application frameworks mentioned by OWASP are vulnerable to this flaw.

Such deficiencies enable the attacker to sniff network traffic and capture sensitive and important information, including transmitted credentials or conversations, since every single request can contain a session token or authentication credential [30]. Insecure communications also lead to security breaches when the web application does not encrypt all authenticated connections and sensitive data [21].

Recommendations

• Use SSL for all connections that are authenticated or transmitting sensitive or valuable data.

• Protect communications between infrastructure elements by using protocol-level encryption or transport layer security [21].

• Encrypt data.

X. Failure to Restrict URL Access

According to Kategorileri [21], Failure to Restrict URL Access occurs as a result of a lack of access control checks. This is because the web application usually protects a URL only by not presenting links to it to unauthorized users. Access to URLs should be checked before any images or buttons on the page appear; this requires web applications to perform checks every time these pages are viewed, or intruders will be able to gain access by forging URLs. Automated tools cannot identify whether a page should be accessible to the user, and therefore it is difficult to identify whether an issue exists with access [31]. Scanners are tools that can be used to find hidden URLs, but they are unable to determine whether these functions or pages are meant to be protected by any controls or restrictions. In order to find these hidden pages they use a number of methods, such as fuzzing directory and file names, examining directory listings, and trying to find backup files and folders.

This form of attack is called forced browsing and involves guessing links and brute force techniques to find unprotected pages [30]. Applications frequently allow access control code to evolve and spread, resulting in a complex model that is difficult for developers and security specialists to understand.

In 2010 OWASP [31] identified a further serious threat to web applications: anyone with network access can send a request to a web application. Certain applications do not protect page requests correctly; i.e. no checks are performed before a request to access a sensitive function is granted.

Recommendations

• The design of the application and its architecture should include an access control matrix.

• Use an effective access control mechanism to protect all URLs and business functions.

• Carry out a penetration test on the application to ensure application security.

• Make sure that administration is protected [21].

XI. Insufficient Transport Layer Protection

Insufficient Transport Layer Protection allows an attacker to steal sensitive data or gain access to the web application, due to vulnerabilities that expose communications [3]. This arises from using expired, invalid or incorrect certificates, which leads to applications failing to protect network traffic. These flaws are very dangerous because an application that uses SSL/TLS during authentication but not elsewhere might expose sensitive data, i.e. session IDs of users, leading to account theft [31].

Recommendations

• Use strong algorithms.

• Use SSL for all sensitive pages in the applications.

• Use encryption technologies or SSL with backend and other connections.

• Make sure the server certificate has not expired or been revoked [4].

XII. CONCLUSIONS

This paper presents and discusses ten web application vulnerabilities: Injection Flaw, Cross-Site Scripting (XSS), Buffer Overflow, Insecure Cryptographic Storage, Cross Site Request Forgery (CSRF), Broken Authentication and Session Management, Insecure Direct Object References, Insecure Communications, Failure to Restrict URL Access and Insufficient Transport Layer Protection. It details the opinions of researchers and of OWASP regarding risk assessment and protection. As adopting the OWASP Top Ten is perhaps the most effective first step towards changing the software development culture within an organization into one that produces secure code, the paper provides some recommendations for addressing these ten web application vulnerabilities.
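As one concrete illustration of the recommendations summarised in this paper, the CSRF advice in Section VI (an unpredictable token tied to the user session and checked on every form submission) might be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation taken from OWASP or from any framework: the function names issue_csrf_token and is_valid_csrf_token are hypothetical, and the session is assumed to be a server-side dictionary. The token is issued per session here; issuing a fresh token for every request, as OWASP also suggests, would follow the same pattern.

```python
import hmac
import secrets


def issue_csrf_token(session):
    """Create an unpredictable token and store it in the server-side session.

    `session` is assumed to be a dict-like, server-side store (hypothetical).
    The returned value would be embedded in each form as a hidden field.
    """
    token = secrets.token_urlsafe(32)
    session["csrf_token"] = token
    return token


def is_valid_csrf_token(session, submitted_token):
    """Accept a state-changing request only if it carries the session's token."""
    expected = session.get("csrf_token")
    if not expected or not submitted_token:
        return False
    # Constant-time comparison, so timing does not reveal partial matches.
    return hmac.compare_digest(expected.encode("utf-8"),
                               submitted_token.encode("utf-8"))
```

Because the token is not submitted automatically by the browser, a forged request that carries only the session cookie would fail this check.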
REFERENCES
[1]. Alameda, A. (2008). Foundation Rails 2. United States of America: Springer-Verlag New York, Inc, pp387-388.
[2]. Alqahtani, A. A. (2010). Security and Protection Information in Modern Web Application. Available from: http://coeia.edu.sa [Accessed 07/07/2010].
[3]. Auger, R. (2010). Insufficient Transport Layer Protection. Available from: http://projects.webappsec.org/Insufficient-Transport-Layer-Protection [Accessed 3/09/2010].
[4]. AUUGN (2005). The Conference for Unix, Linux and Open Source Professionals. Available from: http://books.google.co.uk/books?id=iJw5zAu7LncC&printsec=frontcover&dq=AUUGN&hl=en&ei=HvKPTNfSE9CH4AbD3oyPDg&sa=X&oi=book_result&ct=result&resnum=1&ved=0CCoQ6AEwAA#v=onepage&q&f=false [Accessed 25/08/2010].
[5]. Belapurkar, A. et al. (2009). Distributed systems security: issues, processes, and solutions. United Kingdom: John Wiley & Sons Ltd, pp105-106.
[6]. Boyd, C. and Mao, W. (2003). Information security: 6th international conference, ISC 2003, Bristol, UK, October 1-3, 2003: proceedings, Volume 2851. Germany: Springer-Verlag Berlin Heidelberg New York, p367.
[7]. Carey, M. et al. (2008). Nessus network auditing. United States of America: Andrew Williams, p1.
[8]. Clarke, J. (2009). SQL Injection Attacks and Defense. USA: Syngress Publishing, Inc.
[9]. Cole, E. (2002). Hackers beware. United States of America: New Riders Publishing, p248.
[10]. Cumming, A. and Russell, G. (2007). SQL Hacks. USA: O'Reilly Media, Inc.
[11]. Ciampa, M. (2008). Security+ Guide to Network Security Fundamentals. 3rd ed. Canada: Cengage Learning, p85.
[12]. Dubrawsky, I. (2009). CompTIA Security+: Exam SYO 201, Study Guide and Prep Kit. United States of America: LanraColantoni, pp109-110.
[13]. Dwivedi, H. and Clark, C. and Thiel, D. (2010). Mobile Application Security. United States of America: The McGraw-Hill Companies, pp7-266.
[14]. Flanagan, D. (2006). JavaScript: the definitive guide. 5th ed. United States of America: O'Reilly Media, Inc, pp267-268.
[15]. Ford, R. (2007). Infosecurity 2008 threat analysis. United States of America: Arnorette Pedersen.
[16]. Gama, J. and Naughter, P. (2006). Super System: Turbocharge Database Performance. US: Rampant Teach Press, Kittrell, NC, USA.
[17]. Gregory, P. (2009). CISSP Guide to Security Essentials. United States of America: Cengage Learning, p99.
[18]. Grossman, J. and Hansen, R. (2007). XSS attacks: cross-site scripting exploits and defense. United States of America: Syngress Publishing, Inc.
[19]. Holovaty, A. and Kaplan-Moss, J. (2009). The Definitive Guide to Django: Web Development Done Right. United States of America: Springer-Verlag New York, Inc, p345.
[20]. Jakobsson, M. and Ramzan, Z. (2008). Crimeware: understanding new attacks and defenses. United Kingdom: Symantec Press, p156.
[21]. Kategorileri, Y. (2008). OWASP Top Ten 2007 Most Critical Web Application Security Vulnerabilities. Available from: http://serdarbuyuktemiz.blogspot.com/2008/09/owasp-top-ten-2007-most-critical-web.html [Accessed 29/08/2010].
[22]. Laurent, S.S. and Dumbill, E. (2007). Learning Rails. United States of America: O'Reilly Media, Inc.
[23]. Lee, W. (2009). Windows 7: Up and Running: A Quick, Hands-On Introduction. United States of America: O'Reilly Media, Inc, p129.
[24]. López, J. and Hämmerli, B.M. (2008). Critical Information Infrastructures Security: Second International Workshop, CRITIS 2007, Benalmadena-Costa, Spain, October 3-5, 2007. Germany: Springer-Verlag Berlin Heidelberg, p288.
[25]. Makice, K. (2009). Twitter API: up and running. United States of America: O'Reilly Media, Inc, pp98-99.
[26]. McClure, S. and Scambray, J. and Kurtz, G. (2009). Hacking exposed 6: network security secrets & solutions. United States of America: McGraw-Hill Companies, p592.
[27]. Andrews, M. and Whittaker, J.A. (2006). How to break Web software: functional and security testing of Web applications and Web services, Volume 1. US: Pearson Education, Inc, pp66-67.
[28]. Overby, S. (2007). CIO. Available from: http://books.google.co.uk/books?id=1woAAAAAMBAJ&pg=PA68&dq=prevent+Broken+authentication+ans+session+management&hl=en&ei=_TB7TKCUF5GSswbomOSyDQ&sa=X&oi=book_result&ct=result&resnum=7&ved=0CFYEwBg#v=onepage&q&f=false [Accessed 28/08/2010].
[29]. OWASP (2004). The Ten Most Critical Web Application Security Vulnerabilities. Available from: http://ftp.ipv4.heanet.ie/
[30]. OWASP (2007). The Ten Most Critical Web Application Security Vulnerabilities. Available from: http://www.owasp.org/images/e/e8/OWASP_Top_10_2007.pdf [Accessed 26/06/2010].
[31]. OWASP (2010). The Ten Most Critical Web Application Security Vulnerabilities. Available from: http://owasptop10.googlecode.com/files/OWASP%20Top%2010%20-%202010.pdf [Accessed 26/06/2010].
[32]. Peikari, C. and Chuvakin, A. (2004). Security warrior. United States of America: O'Reilly Media, Inc, p167.
[33]. Powell, T.A. (2008). Ajax: the complete reference. United States of America: The McGraw-Hill Companies, p322.
[34]. Shiflett, C. (2005). Essential PHP security. United States of America: O'Reilly Media, Inc, pp26-245.
[35]. Wells, C. (2007). Securing Ajax applications. United States of America: O'Reilly Media, Inc, p51.

AUTHORS PROFILE

Fahad Alanazi is a PhD student at De Montfort University, Faculty of Technology, Software Technology Research Laboratory (STRL). He received his B.Sc in computer science from Tabouk University in Saudi Arabia and also received an MSc in Computer Security from De Montfort University. His main research interests are computer security and computer forensics.

Dr. Mohamed Sarrab received his Ph.D. degree in Computer Science from De Montfort University in 2011. He received his B.Sc in computer science from 7th April University, Libya, and also received an M.Sc in Computer Science from VSB Technical University of Ostrava, Czech Republic. His main research interests are computer security, runtime verification and computer forensics.
(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 9, No.6, 2011
Improving the Performance of Translation Wavelet Transform using BMICA

Janett Walters-Williams
School of Computing & Information Technology
University of Technology, Jamaica
Kingston 6, Jamaica W.I.
jwalters@utech.edu.jm

Yan Li
Department of Mathematics & Computing,
Centre for Systems Biology, University of Southern Queensland, Toowoomba, Australia
liyan@usq.edu.au

Abstract—Research has shown Wavelet Transform to be one of the best methods for denoising biosignals. The Translation-Invariant form of this method has been found to give the best performance. In this paper we utilize this method and merge it with our newly created Independent Component Analysis method, BMICA. Different EEG signals are used to verify the method within the MATLAB environment. Results are then compared with those of the actual Translation-Invariant algorithm and evaluated using the performance measures Mean Square Error (MSE), Peak Signal to Noise Ratio (PSNR), Signal to Distortion Ratio (SDR), and Signal to Interference Ratio (SIR). Experiments revealed that the BMICA Translation-Invariant Wavelet Transform outperformed in all four measures. This indicates that it performed better than the basic Translation-Invariant Wavelet Transform algorithm, producing cleaner EEG signals which can influence diagnosis as well as clinical studies of the brain.

Keywords-B-Spline; Independent Component Analysis; Mutual Information; Translation-Invariant Wavelet Transform

I. INTRODUCTION
The nervous system sends commands and communicates by trains of electric impulses. When the neurons of the human brain process information they do so by changing the flow of electrical current across their membranes. These changing currents (potentials) generate electric fields that can be recorded from the scalp. Studies are interested in these electrical potentials, but they can only be obtained by direct measurement. This requires a patient to undergo surgery for electrodes to be placed inside the head. This is not acceptable because of the risk to the patient [25]. Researchers therefore collect recordings from the scalp, receiving a global description of the brain activity. Because the same potential is recorded from more than one electrode, signals from the electrodes are supposed to be highly correlated. Figure 1 shows how the potentials are collected from the scalp. These are collected by the use of an electroencephalograph and called electroencephalogram (EEG) signals.

Figure 1: Collecting EEG signals

EEG is widely used by physicians and scientists to study brain function and to diagnose neurological disorders. Any misinterpretation can lead to misdiagnosis. These signals must therefore present a true and clear picture of brain activities, as seen in Figure 2. EEG signals are however highly attenuated and mixed with non-cerebral impulses called artifacts or noise [15]. The presence of this noise introduces spikes which can be confused with neurological rhythms. The noise also mimics EEG signals, overlaying them and resulting in signal distortion (Figure 3). Correct analysis is therefore impossible; a true diagnosis can only be made when all the noise is eliminated or attenuated. EEG recordings are therefore a combination of noise and the pure EEG signal, defined mathematically below (using S as the pure EEG signal, N the noise and E representing the recorded signal):

$$E(t) = S(t) + N(t) \tag{1}$$
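The additive model of Eq. (1) can be made concrete with a small synthetic example. The following is only a sketch under stated assumptions: the 10 Hz sine standing in for the pure signal S(t), the 50 Hz line interference plus Gaussian noise standing in for N(t), the sampling rate, and the names make_recording and mse are all illustrative choices, not data or code from the paper's MATLAB experiments. Mean Square Error is used because it is one of the performance measures named in the abstract.

```python
import math
import random


def make_recording(n_samples=512, fs=256.0, noise_std=0.5, seed=0):
    """Return (t, S, N, E) for the additive model E(t) = S(t) + N(t).

    All signal shapes here are assumptions chosen for illustration only.
    """
    rng = random.Random(seed)
    t = [i / fs for i in range(n_samples)]
    # Stand-in for the pure EEG signal S(t): a 10 Hz rhythm.
    S = [math.sin(2.0 * math.pi * 10.0 * ti) for ti in t]
    # Stand-in for the artifacts N(t): 50 Hz line interference plus white noise.
    N = [0.3 * math.sin(2.0 * math.pi * 50.0 * ti) + rng.gauss(0.0, noise_std)
         for ti in t]
    # Recorded signal: sample-by-sample sum of the two components.
    E = [s + n for s, n in zip(S, N)]
    return t, S, N, E


def mse(reference, estimate):
    """Mean Square Error between a reference signal and an estimate of it."""
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
```

A denoising method can then be judged by how far it reduces mse(S, estimate) below the starting value mse(S, E) of the raw recording.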
Figure 2: Clean pure EEG Signal

Figure 3: EEG Signal corrupted with EKG and line signals

Numerous methods have been proposed by researchers to remove artifacts in EEG and are reviewed in [6, 13, 20, 22, 24]. The goal of these methods is to decompose the EEG signals into spatially and temporally distinguishable components. After identification of the components constituting noise, the EEG is reconstructed without them. Methods include Principal Components Analysis (PCA), the use of a dipole model and, more recently, Independent Component Analysis (ICA) and Wavelet Transform (WT). Which method is considered the best is not the topic of this research. Here we focus on improving WT using a new ICA method called B-Spline Mutual Information Independent Component Analysis (BMICA).

The paper is organized as follows: after this introduction of EEG signals and the need to denoise, Section 2 presents the denoising methods utilized in the paper. We then review the reasons for the merger in Section 3 and describe the experiments conducted in Section 4. In Section 5 we present the results, a comparison of these results and a summary. Finally, in Section 6 we present the conclusion.

II. LITERATURE REVIEW
A. Wavelet Transform
Wavelet Transform (WT) is a form of time-frequency analysis that has been used successfully in denoising biomedical signals by decomposing signals in the time-scale space instead of the time-frequency space. It does so using a method called wavelet shrinkage proposed by Donoho and Johnstone [7]. Each decomposed signal is called a wavelet. Figure 4 shows the difference between a wave/signal and a wavelet.

There are two basic types of WT. One type is designed to be easily reversible (invertible); that means the original signal can be easily recovered after it has been transformed. This kind of WT is used for image compression and cleaning (noise and blur reduction). Typically, the WT of the image is first computed, the wavelet representation is then modified appropriately, and then the WT is reversed (inverted) to obtain a new image.

Figure 4. Demonstration of (a) a signal and (b) a wavelet

The second type is designed for signal analysis, for the study of EEG or other biomedical signals. In these cases, a modified form of the original signal is not needed and the WT need not be inverted (it can be done in principle, but requires a lot of computation time in comparison with the first type of WT). WT decomposes a signal into a set of coefficients called the discrete wavelet transform (DWT) according to:

$$C_{j,k} = \sum_{t \in Z} E(t)\, g_{j,k}(t) \tag{2}$$

where $C_{j,k}$ is the wavelet coefficient and $g_{j,k}$ is the scaling function defined in [23] as:

$$2^{-j/2}\, g(2^{-j} t - k) \tag{3}$$

The wavelet and scaling functions depend on the chosen wavelet family, such as Haar, Daubechies and Coiflet. Compressed versions of the wavelet function match the high-frequency components, while stretched versions match the low-frequency components. By correlating the original signal with wavelet functions of different sizes, the details of the signal can be obtained at several scales or moments. These correlations with the different wavelet functions can be arranged in a hierarchical scheme called multi-resolution decomposition. The multi-resolution decomposition algorithm separates the signal into “details” at different moments and wavelet coefficients [19-20]. As the moments increase the amplitude of the discrete details becomes smaller; however, the coefficients of the useful signals increase [27-28].

Considering Eq. (1), the wavelet transform of E(t) produces wavelet coefficients of the noiseless signal S(t) and the coefficients of the noise N(t). Researchers found that wavelet denoising is performed by taking the wavelet transform of the noise-corrupted E(t) and passing the detail coefficients of the wavelet transform through a threshold filter, where the details, if small enough, might be omitted without substantially affecting the main signals. There are two main threshold filters: soft and hard. Research has shown that soft-thresholding has better mathematical characteristics [27-29] and provides smoother results [10]. Once discarded, these coefficients are replaced with zeroes during reconstruction using an inverse
wavelet transform to yield an estimate for the true signal, defined as:

$$\hat{S}(t) = D(E(t)) = W^{-1}\big(\Lambda_{th}(W(E(t)))\big) \tag{4}$$

where $\Lambda_{th}$ is the diagonal thresholding operator that zeroes out wavelet coefficients less than the threshold, th. It has been shown that this algorithm offers the advantages of smoothness and adaptation; however, it may also result in a blur of the signal energy over several transform details of smaller amplitude, which may be masked in the noise. This results in the detail being subsequently truncated when it falls below the threshold. These truncations can result in overshooting and undershooting around discontinuities, similar to the Gibbs phenomena, in the reconstructed denoised signal. Coifman and Donoho [4] proposed a solution by designing a cycle spinning denoising algorithm which
(i) shifts the signal by a collection of shifts, within the range of cycle spinning
(ii) denoises each shifted signal using a threshold (hard or soft)
(iii) inverse-shifts the denoised signal to get a signal in the same phase as the noisy signal
(iv) averages the estimates.
The Gibbs artifacts of different shifts partially cancel each other, and the final estimate exhibits significantly weaker artifacts [4]. This method is called a translation-invariant (TI) denoising scheme. Experimental results in [1] confirm that single TI wavelet denoising performs better than the traditional single wavelet denoising. Research has also shown that TI produces smaller approximation error when approximating a smooth function as well as mitigating Gibbs artifacts when approximating a discontinuous function.

B. Independent Component Analysis
Independent Component Analysis (ICA) is an approach for the solution of the BSS problem [5]. It can be represented mathematically according to Hyvarinen, Karhunen & Oja [12] as:

$$X = As + n \tag{5}$$

where X is the observed signal, n is the noise, A is the mixing matrix and s the independent components (ICs) or sources. (It can be seen that mathematically it is similar to Eq. (1).) The problem is to determine A and recover s knowing only the measured signal X (equivalent to E(t) in Eq. (1)). This leads to finding the linear transformation W of X, i.e. the inverse of the mixing matrix A, to determine the independent outputs as:

$$u = WX = WAs \tag{6}$$

where u denotes the estimated ICs. For this solution to work the assumption is made that the components are statistically independent, while the mixture is not. This is plausible since biological areas are spatially distinct and generate a specific activation; they however correlate in their flow of information [11].
ICA algorithms are suitable for denoising EEG signals because
(i) the signals recorded are the combination of temporal ICs arising from spatially fixed sources
(ii) the signals tend to be transient (localized in time), restricted to certain ranges of temporal and spatial frequencies (localized in scale) and prominent over certain scalp regions (localized in space) [20].

B-Spline Mutual Information Independent Component Analysis (BMICA)
Many Mutual Information (MI) estimators have appeared in the ICA literature; they are very powerful yet difficult to estimate, resulting in unreliable, noisy and even biased estimation. Most algorithms have their estimators based on cumulant expansions because of ease of use [16]. According to our previous research [26], however, B-Spline estimators have been shown to be one of the best nonparametric approaches, second only to wavelet density estimators. In numerical estimation of MI from continuous microarray data, a generalized indicator function based on B-Spline has been proposed to get more accurate estimation of probabilities; hence we have designed a B-Spline defined MI contrast function. Our MI function is expressed in terms of entropy as:

$$I(X, Y) = H(X) + H(Y) - H(X, Y) \tag{7}$$

where

$$H(X) = -\sum_{i} p(x_i) \log p(x_i), \qquad H(X, Y) = -\sum_{i,j} p(x_i, y_j) \log p(x_i, y_j) \tag{8}$$

Eq. (7) contains the term −H(X, Y), which means that maximizing MI is related to minimizing joint entropy. MI is better than joint entropy, however, because it includes the marginal entropies H(X) and H(Y) [13]. Entropy in our design is based on probability distribution functions (pdfs) and our design defines a pdf using a B-Spline calculation resulting in

$$p(x_i) = \frac{1}{N}\sum_{u=1}^{N} \tilde{B}_{i,k}(x_u) \tag{9}$$

where

$$B(x) = \sum_{i=1}^{n+1} D_i\, B_{i-k}^{k}(x) \tag{10}$$
and $D_i$ is calculated based on Cheney and Kincaid (1994).
MI was used to create our fixed-point Independent Component Analysis algorithm called B-Spline Mutual Information Independent Component Analysis (BMICA). BMICA uses prewhitening strategies and possesses the nonlinearity g(u) = tanh(u) and a symmetric orthogonalization. Unmixed signals are determined by:

$$B' = \Big( z\, g(y)' / m - \sum \big(1 - g(y)^2\big) \times I \Big) / m \tag{11}$$

where z is the result of prewhitening and y is the whitened signal determined by

$$y = z' \times B \tag{12}$$

III. REASONS FOR MERGER

WT and ICA in recent years have often been used in Signal Processing [21, 27]. More recently there has been research comparing the denoising techniques of both. It was found that
(i) if noise and signals are of nearly the same amplitude, or the noise is higher, wavelets had difficulty distinguishing them. ICA, on the other hand, looks at the underlying distributions, thus distinguishing each [29].
(ii) ICA gives high performance when datasets are large. It suffers from the trade-off between a small data set and high performance [13]. The larger the set, however, the higher the probability that the effective number of sources will overcome the number of channels (fixed over time), resulting in an overcomplete ICA. This algorithm might not be able to separate noise from the signals.
(iii) ICA algorithms cannot filter noise that overlaps with EEG signals without discarding the true signals as well. This results in data loss. With WT, however, once wavelet coefficients are created, noise can be identified because it concentrates on scale 2^1, decreasing significantly when the scale increases, while EEG concentrates on the 2^2-2^5 scales. Elimination of the smaller scales denoises the EEG signals [1]. WT therefore removes any overlapping of noise and EEG signals that ICA cannot filter out.
Research therefore shows that ICA and wavelets complement each other, removing the limitations of each [21].

IV. EXPERIMENT SETUP

A. Data Sets
There are two types of data that can be used in experiments: real and synthetic. In synthetic data the source signals are known as well as the mixing matrix A. In these cases the separation performance of the unmixing matrix W can be assessed using the known A, and the quality of the unmixed signals y_i can be evaluated using the known source s_i. Biomedical signals, however, produce unknown source signals. In this study therefore we utilize real data collected from four sites.
(i) http://sccn.ucsd.edu/~arno/fam2data/publicly_available_EEG_data.html. All data are real, comprised of EEG signals from both humans and animals. Data were of different types.
    (a) Data set acquired is a collection of 32-channel data from one male subject who performed a visual task.
    (b) Human data based on five disabled and four healthy subjects. The disabled subjects (1-5) were all wheelchair-bound but had varying communication and limb muscle control abilities. The four healthy subjects (6-9) were all male PhD students, age 30, who had no known neurological deficits. Signals were recorded at 2048 Hz sampling rate from 32 electrodes placed at the standard positions of the 10-20 international system.
    (c) Data set is a collection of 32-channel data from 14 subjects (7 males, 7 females) who performed a go-nogo categorization task and a go-no recognition task on natural photographs presented very briefly (20 ms). Each subject responded to a total of 2500 trials. The data is CZ referenced and is sampled at 1000 Hz.
    (d) Five data sets containing quasi-stationary, noise-free EEG signals both in normal and epileptic subjects. Each data set contains 100 single channel EEG segments of 23.6 sec duration.
(ii) http://www.cs.tut.fi/~gomezher/projects/eeg/databases.htm. Data here contains
    (a) Two EEG recordings (linked-mastoids reference) from a healthy 27-year-old male in which the subject was asked to intentionally generate artifacts in the EEG
    (b) Two 35-year-old males where the data were collected from 21 scalp electrodes placed according to the international 10-20 System with additional electrodes T1 and T2 on the temporal region. The sampling frequency was 250 Hz and an average reference montage was used. The electrocardiogram (ECG) for each patient was also simultaneously acquired and is available in channel 22 of each recording.
(iii) http://idiap.ch/scientific-research/resources/. Data here comes from 3 normal subjects during non-feedback sessions. The subjects sat in a normal chair, relaxed arms resting on their legs
(iv) sites.google.com/site/projectbci. Data here is from a 21-year-old right-handed male with no medical conditions. EEG consists of actual random movement of left and right hand recordings with eyes closed. Each row represents one electrode. The order of electrodes is FP1, FP2, F3, F4, C3, C4, P3, P4, O1, O2,
F7, F8, T3, T4, T5, T6, FZ, CZ, PZ. Recording was done at 500 Hz using a Neurofax EEG system.
These four sites produce real signals of different sizes; however, all were 2D signals.

B. Methodology
In this paper we are comparing the merger of BMICA with TIWT with the results of the normal TIWT. In this research the TIWT method for both tests involves the following steps:

1. Signal Collection
This algorithm is designed to denoise both natural and artificially noised EEG signals. They should therefore be mathematically defined based on Eq. (1).

2. Apply CS to signal
The number of time shifts is determined; in so doing signals are forcibly shifted so that their features change positions, removing the undesirable oscillations which result in pseudo-Gibbs phenomena. The circulant shift by h is defined as:

$$S_h(f(n)) = f\big((n + h) \bmod N\big) \tag{13}$$

where f(n) is the signal, S is the time shift operator and N is the number of signals. The time-shift operator S is unitary and therefore invertible, i.e. $(S_h)^{-1} = S_{-h}$.

3. Decomposition of Signal
The signals are decomposed into 5 levels of DWT using the Symmlet family, separating noise and true signals. Symmlets are orthogonal and their regularity increases with the increase in the number of moments [8]. After experiments the number of vanishing moments chosen is 8 (Sym8).

4. Choose and Apply Threshold Value
Denoise using the soft-thresholding method, discarding all coefficients below the threshold value using HardShrink, based on the universal threshold defined by Donoho & Johnstone [7], given as:

$$T = \sqrt{2\sigma^2 \log N} \tag{14}$$

where N is the number of samples and $\sigma^2$ is the noise power.

5. Reconstruction of Signals
EEG signals are reconstructed using inverse DWT.

6. Apply CS
Revert signals to their original time shift and average the results obtained to produce the denoised EEG signals.

The proposed algorithm can be expressed as Avg [Shift - Denoise - Unshift], i.e. using Eq. (13) it is defined as:

$$\operatorname{avg}_{h \in H}\big(S_{-h}\, T\, S_h(f)\big) \tag{15}$$

where H is the range of shifts, T is the wavelet shrinkage denoising operator, h the circular shift, and the maximum of H is the length of the signal N from Eq. (13).

C. Performance Matrix
The analysis of the algorithm performance consisted in estimating (1) the accuracy with which each algorithm was able to separate components, and (2) the speed with which each algorithm was able to reproduce EEG signals. For (1) experiments were mainly aimed at assessing the algorithms' ability to perform ICA (extraction of ICs) and not blind source separation (recovery of original sources). The performance measures that will be used throughout are based on two categories of calculation:
1. Separation Accuracy Measures - Signal to Distortion Ratio (SDR), Signal to Interference Ratio (SIR)
2. Noise/Signal Measures - Mean Square Error (MSE), Peak Signal to Noise Ratio (PSNR), Signal to Noise Ratio (SNR).
Testing on (2) was not executed.

V. RESULTS/DISCUSSION
Experiments were conducted using the above mentioned signals, in Matrix Laboratory (MATLAB) 7.10.0.499 (R2010) on a laptop with AMD Athlon 64x2 Dual-core Processor 1.80GHz. Figure 5 shows one mixed EEG signal set where there are overlays in signals Nos. 6-8 and Nos. 14-18. Figures 6 and 7 show the same signal set after applying TIWT and the BMICA-TIWT merger, showing that the overlays have been minimized, i.e. noise has been removed. With BMICA-TIWT it can be seen that more noise has been eliminated, especially in signals Nos. 14-18.

Figure 5: Raw EEG
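The Avg [Shift - Denoise - Unshift] scheme of Eqs. (13)-(15) from the Methodology can be illustrated with a minimal NumPy sketch. It is not the authors' MATLAB implementation: a single-level Haar transform stands in for the 5-level Sym8 decomposition, the noise level `sigma` is assumed to be supplied by the caller, and the function names (`universal_threshold`, `soft_threshold`, `haar_denoise`, `ti_denoise`) are hypothetical.

```python
import numpy as np


def universal_threshold(n_samples, sigma):
    """Universal threshold of Eq. (14): T = sqrt(2 * sigma^2 * log N)."""
    return np.sqrt(2.0 * sigma ** 2 * np.log(n_samples))


def soft_threshold(coeffs, t):
    """Soft thresholding: zero coefficients with |c| <= t, shrink the rest by t."""
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - t, 0.0)


def haar_denoise(signal, t):
    """Operator T: forward transform, threshold the details, inverse transform.

    Assumption: one-level Haar DWT instead of the paper's 5-level Sym8;
    the signal length must be even.
    """
    signal = np.asarray(signal, dtype=float)
    even, odd = signal[0::2], signal[1::2]
    approx = (even + odd) / np.sqrt(2.0)
    detail = soft_threshold((even - odd) / np.sqrt(2.0), t)
    out = np.empty_like(signal)
    out[0::2] = (approx + detail) / np.sqrt(2.0)
    out[1::2] = (approx - detail) / np.sqrt(2.0)
    return out


def ti_denoise(signal, t, shifts):
    """Eq. (15): average of S_{-h} T S_h f over all shifts h in H."""
    signal = np.asarray(signal, dtype=float)
    shifts = list(shifts)
    acc = np.zeros_like(signal)
    for h in shifts:
        shifted = np.roll(signal, -h)        # S_h f(n) = f((n + h) mod N), Eq. (13)
        denoised = haar_denoise(shifted, t)  # T
        acc += np.roll(denoised, h)          # S_{-h}: undo the shift
    return acc / len(shifts)
```

A call such as `ti_denoise(x, universal_threshold(len(x), sigma), range(len(x)))` averages over every circular shift; a smaller range of shifts trades artifact suppression for speed.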
Figure 6: WT

Figure 7: BMICA-WT

A. Separation Accuracy Measures

SIR
The most common situation in many applications is the degenerate BSS problem, i.e. n < m. This is most likely the case when we try to separate the underlying brain sources from electroencephalographic (EEG) or magnetoencephalographic (MEG) recordings using a reduced set of electrodes. In degenerate demixing, the accuracy of a BSS algorithm cannot be described using only the estimated mixing matrix. In this case it becomes of particular importance to measure how well BSS algorithms estimate the sources with adequate criteria. The most commonly used index to assess the quality of the estimated sources is the Signal to Interference Ratio (SIR) [14]:

$$\mathrm{SIR(dB)} = \frac{1}{n}\sum_{i=1}^{n}\left(\sum_{j=1}^{n}\frac{|p_{ij}|}{\max_k |p_{ik}|} - 1\right) \tag{16}$$

Figure 8: SIR relations between BMICA-WT and TIWT

SIR takes into account the fact that, in general, BSS is able to recover the sources only up to (a permutation and) a gain factor α. It is easy to check that if $\hat{s}_i = \alpha s_i$ the SIR is infinite. By contrast, when the estimated source is orthogonal to the true source, the SIR is equal to zero.
Investigations on the EEG data sets described above showed that BMICA-WT produced higher SIR values than TIWT. This can be seen in Figure 8, where for 18 signal sets BMICA-WT produced a higher SIR 94% of the time. This suggests that when merged with BMICA, TIWT achieved better separation of EEG signals.

SDR
While SIR assesses the quality of the estimated sources, and the Amari Index assesses the accuracy of the estimated mixing matrix, the accuracy of the separation of an ICA algorithm in terms of the signals (i.e. the overall separation performance) is calculated by the total Signal to Distortion Ratio (SDR), defined as:

$$\mathrm{SDR}(x_i, y_i) = \frac{\sum_{n=1}^{L} x_i(n)^2}{\sum_{n=1}^{L} \big(y_i(n) - x_i(n)\big)^2}, \qquad i = 1, \ldots, m \tag{17}$$

where $x_i(n)$ is the original source signal and $y_i(n)$ is the reconstructed signal. The SDR is expressed in decibels (dB). The higher the SDR value, the better the separation of the signal from the noise. If the calculated SDR is found to be below 8-10 dB, the algorithm is considered to have failed separation.
Examination of the experiment results shows that BMICA-WT tends to produce higher SDRs. In Table 1 it can be seen that BMICA-TIWT produces a higher SDR 65% of the time. This indicates that for almost every TIWT test there is a BMICA-TIWT test which produces a more accurate separation of signal and noise.
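As a small illustration of Eq. (17), the SDR of one channel can be computed as below. This is only a sketch: the function names `sdr` and `sdr_db` are hypothetical, and the conversion to decibels as 10·log10 of the ratio is an assumption, since the text states that SDR is reported in dB while Eq. (17) gives the plain energy ratio.

```python
import numpy as np


def sdr(source, estimate):
    """Eq. (17): energy of the source divided by energy of the error."""
    source = np.asarray(source, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    error_energy = np.sum((estimate - source) ** 2)
    if error_energy == 0.0:
        return np.inf  # perfect reconstruction
    return np.sum(source ** 2) / error_energy


def sdr_db(source, estimate):
    """Assumed decibel form: 10 * log10 of the ratio in Eq. (17)."""
    return 10.0 * np.log10(sdr(source, estimate))
```

Under the 8-10 dB rule quoted above, a channel whose `sdr_db` value falls below that band would be counted as a failed separation.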
TABLE I: SDR FOR 19 EEG SIGNAL SETS

BMICA-WT      TIWT
3.54E+03      2.14E+03
-88.843       -1.27E+02
-57.376       -80.281
-112.4126     -121.4977
-564.4613     -640.939
-217.66       -260.2769
-2.48E+03     -3.40E+03
-8.62E+04     -8.57E+04
27.0891       -0.002
4.77E+04      1.39E+03
6.80E+02      7.67E+02
2.38E+03      786.5632
2.73E+02      269.5584
1.83E+03      1.66E+03
1.12E+00      7.08E+02
6.50E+02      9.97E+02
8.55E+02      8.81E+02
4.74E+02      9.95E+02
2.71E+04      2.13E+04

B. Noise/Signal Measures

PSNR
Peak Signal-to-Noise Ratio, often abbreviated as PSNR, is an engineering term for the ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation. Because many signals have a very wide dynamic range, PSNR is usually expressed in terms of the logarithmic decibel scale.

$$\mathrm{PSNR} = 10 \times \log_{10}\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right) \tag{18}$$

Figure 9 shows the relationship between BMICA-TIWT and TIWT for PSNR. Close examination shows that for all 18 signal sets the PSNR for BMICA-TIWT was higher than that of TIWT. BMICA-TIWT therefore produces a reconstructed signal of higher quality and can be considered a better algorithm for denoising.
In this research MAX takes the value of 255. Unlike MSE, which represents the cumulative squared error between the denoised and mixed signal, PSNR represents a measure of the peak error, i.e. when the two signals are identical the MSE will be equal to zero, resulting in an infinite PSNR. The higher the PSNR, therefore, the better the quality of the reconstructed signal, and the algorithm is considered good.

Figure 9: PSNR relations between BMICA-WT and TIWT

MSE
The Mean Square Error (MSE) measures the average of the square of the "error", which is the amount by which the estimator differs from the quantity to be estimated. The difference occurs because of randomness or because the estimator doesn't account for information that could produce a more accurate estimate. MSE thus assesses the quality of an estimator in terms of its variation and unbiasedness. Note that the MSE is not equivalent to the expected value of the absolute error.

$$\mathrm{MSE} = \frac{1}{N}\sum_{y=1}^{N}\big[I(x, y) - I'(x, y)\big]^2 \tag{19}$$

Since MSE is an expectation, it is a scalar, and not a random variable. It may be a function of the unknown parameter θ, but it does not depend on any random quantities. However, when MSE is computed for a particular estimator of θ, the true value of which is not known, it will be subject to an estimation error. In a Bayesian sense, this means that there are cases in which it may be treated as a random variable.

Examination of the experiments shows that BMICA-WT produces smaller MSE than TIWT; see Table 2. MSE is inversely related to PSNR, i.e. when the calculated MSE is equal to zero, PSNR is infinite. A good algorithm will therefore have a small MSE and a large PSNR. Investigations show that BMICA-TIWT produces smaller MSE and larger PSNR than TIWT, making it the better algorithm as it produces results closer to the actual data.
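The relationship between Eqs. (18) and (19) can be checked numerically with a short sketch. The helper names `mse` and `psnr` are hypothetical, the signals are treated as 1-D sample arrays, and MAX defaults to 255 as stated in the text.

```python
import numpy as np


def mse(reference, denoised):
    """Eq. (19): mean of the squared sample-wise differences."""
    reference = np.asarray(reference, dtype=float)
    denoised = np.asarray(denoised, dtype=float)
    return np.mean((reference - denoised) ** 2)


def psnr(reference, denoised, max_value=255.0):
    """Eq. (18): 10 * log10(MAX^2 / MSE); infinite when the signals match."""
    err = mse(reference, denoised)
    if err == 0.0:
        return np.inf
    return 10.0 * np.log10(max_value ** 2 / err)
```

Because PSNR depends on the data only through MSE, ranking two denoisers by larger PSNR or by smaller MSE gives the same ordering for a fixed MAX.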
TABLE II. MSE FOR 19 EEG SIGNAL SETS

BMICA-WT      TIWT
1.33E+03      2.27E+04
44.032        583.0681
21.863        529.9447
25.8048       501.1608
5.404         1.24E+03
2.8685        917.3362
15.7071       1.57E+03
53.7782       3.25E+05
3.67E+03      5.94E+11
1.04E+04      4.24E+04
6.74E+03      4.01E+04
1.90E+04      3.16E+04
1.10E+02      4.33E+03
1.84E+04      2.13E+04
6.06E+03      4.05E+04
2.98E+03      4.08E+04
2.75E+03      3.72E+04
6.32E+03      3.15E+04

VI. CONCLUSIONS

Research has found that WT is the best suited for denoising as far as performance goes because of its properties like sparsity, multiresolution and multiscale nature. Non-orthogonal wavelets such as UDWT and Multiwavelets improve the performance at the expense of a large overhead in their computation [28]. Research also shows that TIWT is considered to be an improvement on WT, removing Gibbs phenomena. In this work we have found that the addition of BMICA to TIWT improves its performance. With the BMICA merger the separation accuracy of TIWT increased, although not 100% of the time for SDR. As far as the noise/signal separation goes, however, the merger produces a better quality reconstructed signal 100% of the time.

REFERENCES

[1] M. Alfaouri, and K. Daqrouq, “ECG Signal Denoising By Wavelet Transform Thresholding,” American Journal of Applied Sciences vol. 5 no. 3, pp 276-281, 2008.
[2] M. Alfaouri, K. Daqrouq, I. N. Abu-Isbeih, E. F. Khalaf, A. Al-Qawasmi and W. Al-Sawalmeh, “Quality Evaluation of Reconstructed Biological Signals”, American Journal of Applied Sciences vol. 6 no. 1, pp 187-193, 2009.
[3] M.I. Bhatti, A. Pervaiz, and M.H. Baig, “EEG Signal Decomposition and Improved Spectral Analysis Using Wavelet Transform”, in Proceedings of the 23rd Engineering in Medicine and Biology Society 2, pp 1862-1864, 2001.
[4] R.R. Coifman, and D.L. Donoho, “Translation-Invariant De-Noising”, Lecture Notes in Statistics: Wavelets and Statistics, pp 125-150, 1995.
[5] P. Comon, “Independent Component Analysis, a New Concept?” Signal Processing, Elsevier, vol. 36 no. 3, pp 287-314, 1994.
[6] R.J. Croft, and R.J. Barry, “Removal of Ocular Artifacts from the EEG: A Review”, Clinical Neurophysiology vol. 30 no. 1, pp. 5-19, 2000.
[7] D.L. Donoho, and I.M. Johnstone, “Adapting to unknown smoothness via Wavelet Shrinkage”, Journal of the American Statistical Association, vol. 90 no. 32, pp. 1200-1224, 1995.
[8] B. Ferguson, and D. Abbott, “Denoising Techniques for Terahertz Response of Biological Samples”, Microelectronics Journal 32, pp 943-953, 2001.
[9] R. Gribonval, E. Vincent, and C. Févotte, “Proposals for Performance Measurement In Source Separation”, In the Proceedings of the 4th International Symposium on Independent Component Analysis and Blind Signal Separation (ICA2003), Nara, Japan, pp 763-768, 2003.
[10] Y.M. Hawwar, A.M. Reza, and R.D. Turney, Filtering (Denoising) in the Wavelet Transform Domain, Department of Electrical Engineering And Computer Science, University of Wisconsin-Milwaukee, 2002. Unpublished.
[11] S. Hoffman, and M. Falkenstien, “The Correction of Eye Blink Artefacts in the EEG: A Comparison of a Two Prominent Methods”, PLoS One 3(8):e3004, 2008.
[12] A. Hyvarinen, J. Karhunen and E. Oja, “Independent Component Analysis”, eds. Wiley & Sons, 2001.
[13] G. Inuso, F. La Foresta, N. Mammone, and F.C. Morabito, “Wavelet-ICA methodology for efficient artifact removal from Electroencephalographic recordings”, in the Proceedings of the International Joint Conference on Neural Networks, pp. 1524-1529, 2007.
[14] V. Krishnaveni, S. Jayaraman, A. Gunasekaran, and K. Ramadoss, “Automatic Removal of Ocular Artifacts using JADE Algorithm and Neural Network”, International Journal of Intelligent Systems and Technologies, vol. 1 no. 4, pp. 322-333, 2006.
[15] T.L. Lee-Chiong, Sleep: A Comprehensive Handbook, eds. John Wiley & Sons, 2006.
[16] M. Lennon, G. Mercier, M.C. Mouchot, and L. Hubert-Moy, “Curvilinear Component Analysis for non-linear dimensionality reduction of hyperspectral images”, in the Proceedings of the SPIE Symposium on Remote Sensing Conference on Image and Signal Processing for Remote Sensing VII 4541, p 157, 2001.
[17] M.C. Motwani, M.C. Gadiya, R.C. Motwani and F.C. Harris Jr., “Survey of Image Denoising Techniques”, In the Proceedings of the Global Signal Processing Expo and Conference (GSPx), pp 27-30, 2004.
[18] V.V.K.D.V. Prasad, P. Siddaiah, and B. Prabhaksrs Rao, “A New Wavelet Based Method for Denoising of Biological Signals”, International Journal of Computer Science and Network Security (IJCSNS), vol. 8, no. 1, pp. 238-244, 2008.
[19] N. Ramachandran, and A.K. Chellappa, “Feature extraction from EEG using wavelets: spike detection algorithm”, In the Proceedings of the 7th International Conference on Mathematics in Signal Processing, 2006.
[20] R. Romo-Vazquez, R. Ranta, V. Louis-Dorr, and D. Maquin, “Ocular Artifacts Removal in Scalp EEG: Combining ICA and Wavelet Denoising”, Physics in Signal and Image Processing (PSISP 07), 2007.
[21] P. Senthil Kumar, R. Arumuganathan, K. Sivakumar, and C. Vimal, “A Wavelet based Statistical Method for De-noising of Ocular Artifacts in EEG Signals”, International Journal of Computer Science and Network Security (IJCSNS) vol. 8 no. 9, pp. 87-92, 2008.
[22] P. Senthil Kumar, R. Arumuganathan, K. Sivakumar, and C. Vimal, “Removal of Ocular Artifacts in the EEG through Wavelet Transform without using an EOG Reference Channel”, International Journal of Open Problems in Computer Science and Mathematics (IJOPCM) vol. 1 no. 3, pp 189-198, 2008.
[23] P. Senthil Kumar, R. Arumuganathan, K. Sivakumar, and C. Vimal, “An Adaptive method to remove ocular artifacts from EEG signals using Wavelet”, Journal of Applied Sciences Research, vol. 5 no. 7, pp. 741-745, 2009.
[24] L. Su and G. Zhao, “Denoising of ECG Signal Using Translation Invariant Wavelet Denoising Method with Improved Thresholding”, In the 27th Annual Conference IEEE Engineering in Medicine and Biology, pp 5946-5949, 2005.
[25] M. Ungureanu, C. Bigan, R. Strungaru, and V. Lazarescu, “Independent Component Analysis Applied in Biomedical Signal Processing”, in proceedings of Measurement Science Review 4(2), 2004.
[26] J. Walters-Williams, and Y. Li, “Estimation of Mutual Information: A Survey”, 4th International Conference on Rough Set and Knowledge Technology (RSKT2009), pp. 389-396, 2009.
[27] W. Zhou, and J. Gotman, “Removal of EMG and ECG Artifacts from EEG Based on Wavelet Transform and ICA”, In the Proceedings of the 26th Annual International Conference on the IEEE EMBS, pp. 392-395, 2004.
[28] W. Zhou, and J. Gotman, “Removing Eye-movement Artifacts from the EEG during the Intracarotid Amobarbital Procedure”, In Epilepsia vol. 46 no. 3, pp. 409-411, 2005.
[29] G. Zouridakis, and D. Iyer, “Comparison between ICA and Wavelet-based Denoising of single-trial evoked potentials”, In the Proceedings of the 26th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 87-90, 2004.

AUTHORS PROFILE

Janett Walters-Williams received the B.S. and M.S. degrees from the University of the West Indies in 1994 and 2001, respectively. She is presently a Doctoral student at the University of Southern Queensland. After working as an assistant lecturer (from 1995) in the Dept. of Computer Studies, in the University of Technology, she has been a lecturer in the School of Computing & Information Technology since 2001. Her research interest includes Independent Component Analysis, Neural Network Applications, signal/image processing, bioinformatics and artificial intelligence.

Yan Li received the B.E., M.E., and Dr. Eng. degrees from Hiroshima Univ. in 1982, 1984, and 1990, respectively. She has been an associate professor at the University of Queensland since 2008. She is the winner of the 2008 Queensland Smart Woman-Smart State Awards in ICT as well as one of the Head of Department awardees for research publications in 2006 and 2008. She is an Australian Reader to assess Australia Research Council Discovery and Linkage Project Proposals and has organized the RSKT 2009 and CME 2010 international conferences. Her research interest includes signal/image processing, independent component analysis, Biomedical Engineering, Blind Signal Separation and artificial intelligence.
(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 9, No. 6, June 2011
Hole Filing IFCNN Simulation by Parallel RK(5,6)
Techniques
(Hole Filing by Parallel RK(5,6))
Sukumar Senthilkumar*
Universiti Sains Malaysia, School of Mathematical Sciences, 11800 USM Pulau Pinang, MALAYSIA
E-mail: ssenthilkumar1974@yahoo.co.in, ssenthilkumar@usm.my

Abd Rahni Mt Piah
Universiti Sains Malaysia, School of Mathematical Sciences, 11800 USM Pulau Pinang, MALAYSIA
E-mail: arahni@cs.usm.my
Abstract- This paper concentrates on employing different parallel RK(5,6) techniques for hole-filing via unique characteristics of improved fuzzy cellular neural network (IFCNN) simulation to improve the performance of image or handwritten character recognition. Results are presented according to the range of template selected for simulation.

Keywords- Parallel 5-order 6-stage numerical integration techniques, Improved fuzzy cellular neural network, Hole filing, Simulation, Ordinary differential equations.

I. INTRODUCTION

Parallel computing techniques are used to carry out computations simultaneously, operating on the principle that large problems can often be divided into smaller ones, which can then be solved concurrently. It is the simultaneous use of multiple computing resources to solve a computational problem easily and quickly. Researchers believe that a possible way of solving many significant computationally intensive problems in science and engineering is by employing parallel algorithms effectively.

From the literature, it is observed that most real time problems are solved by adapting Runge-Kutta (RK) methods, which in turn are applied to compute numerical solutions for various problems modeled in terms of initial value problems, as in Alexander and Coyle [3], Evans [4], Hung [5], Shampine and Watts [6] and Shampine and Gordon [7]. Shampine and Watts [6] developed mathematical codes for the Runge-Kutta fourth order method to solve many numerical problems. A Runge-Kutta formula of fifth order has been developed by Butcher [8-10] to solve many computational problems. Evans and Sanugi [11] developed parallel integration techniques of Runge-Kutta for step by step solution of ordinary differential equations. Ponalagusamy and Ponammal [12-14] developed a new parallel fifth order algorithm to solve a robot arm model, a time varying network for first order initial value problems and a new generalised plasticity equation for compressible powder metallurgy materials, with results on the stability region for the test equation. Keyes et al. [15] provided a survey towards applications requiring memories and processing rates of large-scale parallelism, leading algorithmicist applications of parallel numerical algorithms. Further, they focused on practical medium-granularity parallelism, approachable through traditional programming languages. Gear [16] examined the potential for parallelism in solving real time problems using ordinary differential equations. A survey of the potential for parallelism in Runge-Kutta techniques and parallel numerical techniques for initial value problems for ordinary differential equations is given by Norsett and Jackson [17] and Jackson [18]. Using a fourth order explicit Runge-Kutta method, a parallel mesh chopping algorithm for a class of initial value problems is illustrated by Katti and Srivastava [19]. Harrer et al. [20] introduced explicit Euler, predictor-corrector and fourth-order Runge-Kutta algorithms for simulating cellular neural networks. The RK-Butcher algorithm has been introduced by Bader [21, 22] for finding truncation error estimates, intrinsic accuracies and early detection of stiffness in coupled differential equations that arise in theoretical chemistry problems. Senthilkumar and Piah [23] implemented a parallel Runge-Kutta arithmetic mean algorithm to obtain a solution to a system of second order robot arm. In this paper a new attempt has been made to employ a parallel RK(5,6) algorithm for the hole filing problem under the IFCNN environment. Oliveira [24] introduced a popular sequential RK-Gill algorithm to evaluate the effectiveness factor of immobilized enzymes.

This research work is carried out by the first author under a post doctoral fellow scheme at the School of Mathematical Sciences, Universiti Sains Malaysia, 11800 USM Pulau Pinang, MALAYSIA.
*Corresponding Author.
Computing value is easy in case of implementing VLSI CNN chips, thereby making real-time operations possible. Roska [28] and Roska et al. [29] have presented the first widely used simulation system which allows simulation of a large class of CNN and is especially suited for image processing applications. It also includes signal processing, pattern recognition and solving ordinary and partial differential equations, as in Gonzalez et al. [30]. The hole filing problem has been studied with the existing RK-Butcher fifth order method by Murugesh and Badri [32] via a CNN simulation model. Similarly, the hole filing problem has been analyzed by Murugesan and Elango [50] by means of the existing RK fourth order method under CNN simulation. Dalla Betta et al. [46] presented a CMOS implementation of an analogically programmed cellular neural network. Anguita et al. [31] discussed in detail parameter configurations for hole extraction in cellular neural networks.
Zadeh [35] and Zadeh et al. [36] introduced the concept of fuzzy sets (FSs) theory. Different notions of higher-order FSs have been proposed by different researchers. Recently, the fuzzy cellular neural network (FCNN) model [43-45] has attracted a great deal of interest among researchers from different disciplines. Locally interconnected, regularly repeated analogue (continuous- or discrete-time) circuits with a one-, two- or three-dimensional grid architecture, called CNNs, were introduced by Chua and Yang [25-26] and Chua [27]. Each cell (neuron) in a CNN is a non-linear dynamic system coupled only to its nearest neighbors. Because of this local interconnection property, CNNs have been considered specifically suitable for very-large-scale integration implementations. Shitong et al. [37] proposed improved fuzzy cellular networks to incorporate the novel fuzzy status containing the useful information beyond a white blood cell into its state equation, resulting in enhancing the boundary integrity. Laiho et al. [38] proposed template design for CNNs with 1-bit weights.

sufficiently utilized. FCNN is a locally connected network [37] and the output of a neuron is connected to the inputs of every neuron/cell in its r × r neighborhood, and similarly the inputs of a neuron are only connected to the outputs of every neuron in its r × r neighborhood. It is apparent that feedback (not recurrent) connections are presented in detail. The architecture of IFCNN is shown in Figure 1.

Figure 1. Architecture of IFCNN

The state equation of IFCNN is given by,
This paper is ordered as follows. A brief introduction on dxij −1
improved fuzzy cellular non-linear network is presented in c = xij + ∑ A(i, j; k , l ) ykl +
section 2. Section 3 deals with the performance of hole-filler dt Rx c ( k ,l )∈N r ( i , j )
template design and simulation results. Section 4 discusses
parallel RK(5,6) numerical integration techniques. Finally, ∑
c ( k ,l )∈N r ( i , j )
B(i, j; k , l )ukl
concluding remarks is presented in section 5.
+ I ij + ∧
% ( Af min (i, j; k , l ) + ykl ) +
II. A BRIEF OVERVIEW OF IFCNN c ( k ,l )∈N r ( i , j )
∨
% ( Af max (i, j; k , l ) + ykl ) + (1)
c ( k ,l )∈N r ( i , j )
The capability of the conventional cellular neural network
to solve different kinds of image processing problems and the + ∧
% ( B f min (i, j; k , l )ukl ) +
c ( k ,l )∈N r ( i , j )
capability of fuzzy logic to cope with uncertainty in images
are the inherent features of FCNN [37]. Moreover, it also has ∨
% ( B f max (i, j; k , l )ukl ) +
c ( k ,l )∈N r ( i , j )
inbuilt connections with mathematical morphology. The
unique characteristic of IFCNN is incorporating novel fuzzy ∧
% ( Ff min (i, j; k , l ) xkl ) +
status with feed-forward and feedback templates in FCNN c ( k ,l )∈N r ( i , j )
such that the useful information beyond the region can be ∨
% ( Ff max (i, j; k , l ) xkl )
c ( k ,l )∈N r ( i , j )
and the input equation of C_ij is given by,

$$ u_{ij} = E_{ij} \ge 0, \qquad 1 \le i \le M;\ 1 \le j \le N. \tag{2} $$

The output equation of C_ij is given by,

$$ y_{ij} = f(x_{ij}) = \frac{1}{2}\Big[\,|x_{ij} + 1| - |x_{ij} - 1|\,\Big], \qquad 1 \le i \le M;\ 1 \le j \le N. \tag{3} $$

The constraints/conditions are given by

$$
\begin{aligned}
A_{f\max}(i,j;k,l) &= A_{f\min}(k,l;i,j); \\
A_{f\max}(i,j;k,l) &= A_{f\max}(k,l;i,j); \\
F_{f\max}(i,j;k,l) &= F_{f\min}(k,l;i,j); \\
F_{f\max}(i,j;k,l) &= F_{f\max}(k,l;i,j); \qquad 1 \le i \le M;\ 1 \le j \le N.
\end{aligned}
\tag{4}
$$

$$
\begin{aligned}
|x_{ij}(0)| &\le 1; \qquad 1 \le i \le M;\ 1 \le j \le N. \\
|u_{ij}(0)| &\le 1; \qquad 1 \le i \le M;\ 1 \le j \le N. \\
A(i,j;k,l) &= A(k,l;i,j)
\end{aligned}
$$

In Eqs. (1) - (4), $\tilde{\wedge}$, $\tilde{\vee}$, N_r(i,j), and A are identical to those in FCNN. Comparing (4) with FCNN, the only discrepancy between the equations is the novel fuzzy status. The term

$$
\Big( \tilde{\bigwedge_{C_{kl}\in N_r(i,j)}} \big(F_{f\min}(i,j;k,l) + x_{kl}\big)
+ \tilde{\bigvee_{C_{kl}\in N_r(i,j)}} \big(F_{f\max}(i,j;k,l) + x_{kl}\big) \Big)
\tag{5}
$$

is adhered to Eq. (1) and reflects the required information, where F_fmin(i,j;k,l) and F_fmax(i,j;k,l) indicate the connected weights between cells C_ij and C_kl, respectively. Hence, the complete template, which determines the connection between a cell and its neighbors, consists of the (2r × 1) and (2r × 1) matrices A, B, F_fmin and F_fmax. The symmetric matrices are considered in the above template to congregate the IFCNN’s symmetric requirements.

III. A BRIEF SKETCH ON HOLE-FILLER AND SIMULATION RESULTS

In hole-filling IFCNN simulation [46-50], all the holes in a bipolar image are filled while the image outside the holes remains unaltered. Let R_x = 1, C = 1 and take +1 to represent a black pixel and –1 a white pixel. If the bipolar image U = {u_ij} is input into IFCNN and the holes in the image are enclosed by black pixels, then the initial state values are set to x_ij(0) = 1. The output values are obtained as y_ij(0) = 1, 1 ≤ i ≤ M, 1 ≤ j ≤ N from equation (1). Consider the templates A, B and independent current source I as

$$
A = \begin{bmatrix} 0 & a & 0 \\ a & b & a \\ 0 & a & 0 \end{bmatrix},\quad a > 0,\ b > 0, \qquad
B = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \qquad I = -1
\tag{6}
$$

where the template parameters a and b are to be determined. In order to make the outer edge cells become inner ones, auxiliary cells are normally added along the outer boundary of the image, and their state values are set to zero by circuit realization, resulting in zero output values. The state equation (1) can then be rewritten as

$$
\frac{dx_{ij}}{dt} = -x_{ij}
+ \tilde{\bigwedge_{C(k,l)\in N_r(i,j)}} \big(A_{f\min}(i,j;k,l) + y_{kl}\big)
+ \tilde{\bigvee_{C(k,l)\in N_r(i,j)}} \big(A_{f\max}(i,j;k,l) + y_{kl}\big)
+ 4u_{ij}(t) - I.
\tag{7}
$$

For instance, here the cells C(i+1,j), C(i-1,j), C(i,j+1) and C(i,j-1) are non-diagonal cells. The design of the hole-filler template [31] and its various sub-problems are discussed using CNN simulations [46-50]. Figures 2 and 3 show the hole filling of an image (before and after) by employing a parallel RK(5,6) type-III technique. The settling time Ts and computation time Tc for different step sizes are considered for the purpose of comparison. The settling time Ts is the time
from start of computation until the last cell leaves the interval [-1.0, 1.0], which is based on a specified limit (e.g., |dx/dt| < 0.01). The computation time Tc is the time taken for settling the network and adjusting the cell for proper position once the network is settled. The simulation shows the desired output for every neuron/cell. Specifically, note that +1 and -1 indicate the black and white pixels, respectively. The selected template parameters a and b are restricted to the shaded area shown in Figure 4 for the simulation.
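As an illustration of the hole-filling dynamics described above, the following Python sketch simulates a simplified, non-fuzzy version of the network. It is not the IFCNN of Eq. (1): the fuzzy min/max terms are dropped, the bias is added as +I with I = -1, and forward Euler integration is swapped in for the parallel RK(5,6) schemes. The values a = 1 and b = 2, the step size, the test image, and the names `output` and `fill_holes` are assumptions chosen for this example; the stopping rule reuses the |dx/dt| < 0.01 limit mentioned in the text.

```python
import numpy as np


def output(x):
    """Piecewise-linear output y = 0.5 * (|x + 1| - |x - 1|), written as a clip."""
    return np.clip(x, -1.0, 1.0)


def fill_holes(u, a=1.0, b=2.0, bias=-1.0, dt=0.1, tol=0.01, max_steps=5000):
    """Euler integration of dx/dt = -x + a*(4-neighbour outputs) + b*y + 4*u + bias.

    u is a bipolar image (+1 black, -1 white). Returns the output image and
    the simulated time at which max|dx/dt| dropped below tol (or the time
    limit if it never did).
    """
    u = np.asarray(u, dtype=float)
    x = np.ones_like(u)                      # initial state x_ij(0) = 1
    steps = 0
    for steps in range(1, max_steps + 1):
        y = output(x)
        # Auxiliary border cells with zero output surround the image.
        yp = np.pad(y, 1, mode="constant", constant_values=0.0)
        neighbours = yp[:-2, 1:-1] + yp[2:, 1:-1] + yp[1:-1, :-2] + yp[1:-1, 2:]
        dx = -x + a * neighbours + b * y + 4.0 * u + bias
        x = x + dt * dx
        if np.max(np.abs(dx)) < tol:
            break
    return output(x), steps * dt


# Hypothetical 7x7 test image: a black square ring with a one-pixel white hole.
image = -np.ones((7, 7))
image[2:5, 2:5] = 1.0
image[3, 3] = -1.0

filled, settle_time = fill_holes(image)
```

With these assumed values the enclosed centre pixel ends at +1 (black) while the background, which is connected to the image border, settles to -1 (white).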
Figure 2(a). Original image and hole filled image
Figure 2(b). Original image and hole filled image
Figure 2. Hole filling before and after adapting type-III parallel RK(5,6) technique
Figure 3. Hole filling before and after employing type-III parallel RK(5,6) technique
Figure 4. Range of the template

IV. PARALLEL RUNGE-KUTTA FIFTH ORDER TECHNIQUES: A BRIEF OVERVIEW

A. Parallel Runge-Kutta 5-Order 6-Stage Type-I Technique

A parallel Runge-Kutta 5-order 6-stage type-I technique [12-14] is one of the simplest methods used to solve ordinary differential equations. It is an explicit formula which adapts the Taylor series expansion in order to obtain the approximation. A parallel Runge-Kutta 5-order 6-stage type-I technique is used to determine $y_j$ and $\dot{y}_j$, $j = 1, 2, 3, \ldots, m$ such that

$$ y_{n+1} = y_n + \Big[\frac{7}{90}k_1 + \frac{32}{90}k_3 + \frac{12}{90}k_4 + \frac{32}{90}k_5 + \frac{7}{90}k_6\Big] \tag{8} $$

Thus, the corresponding Butcher array of the parallel Runge-Kutta 5-order 6-stage type-I technique is

$$
\begin{array}{c|cccccc}
0 & & & & & & \\
\frac{2}{5} & \frac{2}{5} & & & & & \\
\frac{1}{4} & \frac{11}{64} & \frac{5}{64} & & & & \\
\frac{1}{2} & \frac{3}{16} & \frac{5}{16} & 0 & & & \\
\frac{3}{4} & \frac{9}{32} & -\frac{27}{32} & \frac{3}{4} & \frac{9}{16} & & \\
1 & -\frac{9}{28} & \frac{35}{28} & 0 & -\frac{12}{7} & \frac{8}{7} & \\
\hline
 & \frac{7}{90} & 0 & \frac{32}{90} & \frac{12}{90} & \frac{32}{90} & \frac{7}{90}
\end{array}
$$
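A Butcher array such as the one above can be turned into a single integration step mechanically. The sketch below is a generic explicit Runge-Kutta step in Python that evaluates the stages one after another; it does not reproduce the two-processor scheduling of the parallel technique. Classical fourth-order Runge-Kutta coefficients are used as a stand-in tableau because they are standard, and the type-I coefficients could be supplied in the same (A, b, c) form. In this sketch the k_i are plain derivative values and the step size h multiplies the weighted sum. The names `rk_step`, `A_RK4`, `B_RK4` and `C_RK4` are assumptions made for this example.

```python
import math


def rk_step(f, t, x, h, A, b, c):
    """One explicit Runge-Kutta step for dx/dt = f(t, x) from a Butcher tableau."""
    k = []
    for i in range(len(b)):
        # Each stage uses only previously computed stages
        # (A is strictly lower triangular for an explicit method).
        xi = x + h * sum(A[i][j] * k[j] for j in range(i))
        k.append(f(t + c[i] * h, xi))
    return x + h * sum(bi * ki for bi, ki in zip(b, k))


# Stand-in tableau: classical fourth-order Runge-Kutta (not the paper's RK(5,6)).
A_RK4 = [
    [0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
B_RK4 = [1 / 6, 1 / 3, 1 / 3, 1 / 6]
C_RK4 = [0.0, 0.5, 0.5, 1.0]

# One step of dx/dt = -x from x(0) = 1 with h = 0.1, compared with exp(-0.1).
x1 = rk_step(lambda t, x: -x, 0.0, 1.0, 0.1, A_RK4, B_RK4, C_RK4)
error = abs(x1 - math.exp(-0.1))
```

Stages whose rows of A do not reference each other (for example k3 and k4 when a43 = 0) have no data dependency, which is what allows them to be evaluated on separate processors.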
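The 6-stage 5th order algorithm with 5 parallel and 2 processors, by selecting a43 = 0 to evaluate k3 and k4 simultaneously, is given by

$$
\begin{aligned}
k_1^{ij} &= \Delta t\, f\big(x_{ij}(t_n)\big), \\
k_2^{ij} &= \Delta t\, f\Big(x_{ij}\big(t_n + \tfrac{2\Delta t}{5}\big) + \tfrac{2}{5}k_1^{ij}\Big), \\
k_3^{ij} &= \Delta t\, f\Big(x_{ij}\big(t_n + \tfrac{\Delta t}{4}\big) + \tfrac{11}{64}k_1^{ij} + \tfrac{5}{64}k_2^{ij}\Big) = k_3^{*ij}, \\
k_4^{ij} &= \Delta t\, f\Big(x_{ij}\big(t_n + \tfrac{\Delta t}{2}\big) + \tfrac{3}{16}k_1^{ij} + \tfrac{5}{16}k_2^{ij}\Big) = k_4^{*ij}, \\
k_5^{ij} &= \Delta t\, f\Big(x_{ij}\big(t_n + \tfrac{3\Delta t}{4}\big) + \tfrac{9}{32}k_1^{ij} - \tfrac{27}{32}k_2^{ij} + \tfrac{3}{4}k_3^{ij} + \tfrac{9}{16}k_4^{ij}\Big) = k_5^{*ij}, \\
k_6^{ij} &= \Delta t\, f\Big(x_{ij}\big(t_n + \Delta t\big) - \tfrac{9}{28}k_1^{ij} + \tfrac{35}{28}k_2^{ij} - \tfrac{12}{7}k_4^{ij} + \tfrac{8}{7}k_5^{ij}\Big) = k_6^{*ij}.
\end{aligned}
\tag{9}
$$

Therefore, the final integration is a weighted sum of the five calculated derivatives, which is given as

$$ \int_{t_n}^{t_{n+1}} f(x(t))\,dt = \frac{\Delta t}{90}\Big[7k_1^{ij} + 32k_3^{ij} + 12k_4^{ij} + 32k_5^{ij} + 7k_6^{ij}\Big]. \tag{10} $$

B. Parallel Runge-Kutta 5-Order 6-Stage Type-II Technique

A parallel Runge-Kutta 5-order 6-stage type-II technique [12-14] is also one of the simplest methods used to solve ordinary differential equations. It is an explicit formula which adapts the Taylor series expansion in order to obtain the approximation. A parallel Runge-Kutta 5-order 6-stage type-II technique determines $y_j$ and $\dot{y}_j$, $j = 1, 2, 3, \ldots, m$ such that

$$ y_{n+1} = y_n + h\Big[\frac{23}{192}k_1 + \frac{125}{192}k_3 - \frac{81}{192}k_5 + \frac{125}{192}k_6\Big]. \tag{11} $$

Thus, the corresponding Butcher array of the parallel Runge-Kutta 5-order 6-stage type-II technique is

$$
\begin{array}{c|cccccc}
0 & & & & & & \\
\frac{1}{3} & \frac{1}{3} & & & & & \\
\frac{2}{5} & \frac{4}{25} & \frac{6}{25} & & & & \\
\frac{1}{2} & \frac{1}{4} & -3 & \frac{15}{4} & & & \\
\frac{2}{3} & \frac{6}{81} & -\frac{90}{81} & -\frac{50}{81} & \frac{8}{81} & & \\
\frac{4}{5} & -\frac{6}{75} & \frac{36}{75} & \frac{10}{75} & \frac{8}{75} & 0 & \\
\hline
 & \frac{23}{192} & 0 & \frac{125}{192} & 0 & -\frac{81}{192} & \frac{125}{192}
\end{array}
$$

Therefore, the final integration is a weighted sum of four calculated derivatives per time step, which is given by

$$ y_{n+1} = y_n + h\Big[\frac{23}{192}k_1 + \frac{125}{192}k_3 - \frac{81}{192}k_5 + \frac{125}{192}k_6\Big]. \tag{12} $$

The 6-stage 5th order algorithm with 5 parallel and 2 processors, by selecting a65 = 0 to evaluate k5 and k6 simultaneously, is given by

$$
\begin{aligned}
k_1^{ij} &= \Delta t\, f\big(x_{ij}(t_n)\big), \\
k_2^{ij} &= \Delta t\, f\Big(x_{ij}\big(t_n + \tfrac{\Delta t}{3}\big) + \tfrac{1}{3}k_1^{ij}\Big), \\
k_3^{ij} &= \Delta t\, f\Big(x_{ij}\big(t_n + \tfrac{2\Delta t}{5}\big) + \tfrac{4}{25}k_1^{ij} + \tfrac{6}{25}k_2^{ij}\Big), \\
k_4^{ij} &= \Delta t\, f\Big(x_{ij}\big(t_n + \tfrac{\Delta t}{2}\big) + \tfrac{1}{4}k_1^{ij} - 3k_2^{ij} + \tfrac{15}{4}k_3^{ij}\Big), \\
k_5^{ij} &= \Delta t\, f\Big(x_{ij}\big(t_n + \tfrac{2\Delta t}{3}\big) + \tfrac{6}{81}k_1^{ij} - \tfrac{90}{81}k_2^{ij} - \tfrac{50}{81}k_3^{ij} + \tfrac{8}{81}k_4^{ij}\Big) = k_5^{*ij},
\end{aligned}
$$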
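$$
k_6^{ij} = \Delta t\, f\Big(x_{ij}\big(t_n + \tfrac{4\Delta t}{5}\big) - \tfrac{6}{75}k_1^{ij} + \tfrac{36}{75}k_2^{ij} + \tfrac{10}{75}k_3^{ij} + \tfrac{8}{75}k_4^{ij}\Big) = k_6^{*ij}.
\tag{12}
$$

Therefore, the final integration is a weighted sum of the four calculated derivatives, which is given by

$$ \int_{t_n}^{t_{n+1}} f(x(t))\,dt = \frac{\Delta t}{192}\Big[23k_1^{ij} + 125k_3^{ij} - 81k_5^{ij} + 125k_6^{ij}\Big]. \tag{13} $$

C. Parallel Runge-Kutta 5-Order 6-Stage Type-III Technique

A parallel Runge-Kutta 5-order 6-stage type-III technique [12-14] is another simple method used to solve ordinary differential equations. It is also an explicit formula which adapts the Taylor series expansion for an approximation. A parallel Runge-Kutta 5-order 6-stage type-III technique determines $y_j$ and $\dot{y}_j$, $j = 1, 2, 3, \ldots, m$ such that

$$ y_{n+1} = y_n + h\Big[-\frac{17}{306}k_1 - \frac{250}{153}k_3 + \frac{442}{255}k_4 + \frac{8192}{9945}k_5 + \frac{31}{234}k_6\Big]. \tag{14} $$

Thus, the corresponding Butcher array of the parallel Runge-Kutta 5-order 6-stage type-III technique is

$$
\begin{array}{c|cccccc}
0 & & & & & & \\
\frac{1}{5} & \frac{1}{5} & & & & & \\
\frac{2}{5} & \frac{39}{160} & \frac{5}{32} & & & & \\
\frac{1}{2} & \frac{1}{24} & -\frac{5}{24} & \frac{2}{3} & & & \\
\frac{3}{16} & \frac{1}{8} & -\frac{3}{16} & \frac{1}{4} & 0 & & \\
1 & -\frac{9}{14} & \frac{15}{14} & \frac{8}{7} & -\frac{12}{7} & \frac{8}{7} & \\
\hline
 & -\frac{17}{306} & 0 & -\frac{250}{153} & \frac{442}{255} & \frac{8192}{9945} & \frac{31}{234}
\end{array}
$$

Therefore, the final integration is a weighted sum of five calculated derivatives per time step, which is given by

$$ y_{n+1} = y_n + h\Big[-\frac{17}{306}k_1 - \frac{250}{153}k_3 + \frac{442}{255}k_4 + \frac{8192}{9945}k_5 + \frac{31}{234}k_6\Big]. \tag{15} $$

The 6-stage 5th order algorithm with 5 parallel and 2 processors, by selecting a54 = 0 to evaluate k5 and k4 simultaneously, is given by

$$
\begin{aligned}
k_1^{ij} &= \Delta t\, f\big(x_{ij}(t_n)\big), \\
k_2^{ij} &= \Delta t\, f\Big(x_{ij}\big(t_n + \tfrac{\Delta t}{5}\big) + \tfrac{1}{5}k_1^{ij}\Big), \\
k_3^{ij} &= \Delta t\, f\Big(x_{ij}\big(t_n + \tfrac{2\Delta t}{5}\big) + \tfrac{39}{160}k_1^{ij} + \tfrac{5}{32}k_2^{ij}\Big), \\
k_4^{ij} &= \Delta t\, f\Big(x_{ij}\big(t_n + \tfrac{\Delta t}{2}\big) + \tfrac{1}{24}k_1^{ij} - \tfrac{5}{24}k_2^{ij} + \tfrac{2}{3}k_3^{ij}\Big) = k_4^{*ij}, \\
k_5^{ij} &= \Delta t\, f\Big(x_{ij}\big(t_n + \tfrac{3\Delta t}{16}\big) + \tfrac{1}{8}k_1^{ij} - \tfrac{3}{16}k_2^{ij} + \tfrac{1}{4}k_3^{ij}\Big) = k_5^{*ij}, \\
k_6^{ij} &= \Delta t\, f\Big(x_{ij}\big(t_n + \Delta t\big) - \tfrac{9}{14}k_1^{ij} + \tfrac{15}{14}k_2^{ij} + \tfrac{8}{7}k_3^{ij} - \tfrac{12}{7}k_4^{ij} + \tfrac{8}{7}k_5^{ij}\Big).
\end{aligned}
\tag{16}
$$

Therefore, the final integration is a weighted sum of the five calculated derivatives, which is given by

$$ \int_{t_n}^{t_{n+1}} f(x(t))\,dt = \Delta t\Big[-\frac{17k_1^{ij}}{306} - \frac{250k_3^{ij}}{153} + \frac{442k_4^{ij}}{255} + \frac{8192k_5^{ij}}{9945} + \frac{31k_6^{ij}}{234}\Big]. \tag{17} $$

V. CONCLUDING REMARKS

In this paper, the hole-filling problem is addressed under the IFCNN model using parallel RK(5,6) techniques and its validity is illustrated by simulation results. It is observed that the hole is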
filled and the outside image remains unaffected, that is, the edges of the images are preserved and are intact. The templates of the cellular neural network are not unique and this is important in its implementation. The significance of this work is to improve the performance of handwritten character recognition because in many language scripts, numerals and in images etc., there are many holes and the CNN described above can be used in addition to the connected component detector. It is also noticed that IFCNN preserves the boundary integrity.

ACKNOWLEDGMENT

The first author would like to extend his sincere gratitude to Universiti Sains Malaysia for supporting this work under its post doctoral fellowship scheme. Much of this work was carried out during his stay at Universiti Sains Malaysia in 2011. He wishes to acknowledge Universiti Sains Malaysia’s financial support.

REFERENCES

[1] M. Korch, “Simulation-based analysis of parallel Runge-Kutta solvers”, LNCS 3732, pp. 1105-1114, 2006.
[2] Z. Jia, “A parallel multiple time-scale reversible integrator for dynamics simulation”, Future Generation Computer Systems, Vol. 19, pp. 415-424, 2003.
[3] R.K. Alexander and J.J. Coyle, “Runge-Kutta methods for differential-algebraic systems”, SIAM Journal of Numerical Analysis, Vol. 27, pp. 736-752, 1990.
[4] D.J. Evans, “A new 4th order Runge-Kutta method for initial value problems with error control”, International Journal of Computer Mathematics, Vol. 139, pp. 217-227, 1991.
[5] C. Hung, “Dissipativity of Runge-Kutta methods for dynamical systems with delays”, IMA Journal of Numerical Analysis, Vol. 20, pp. 153-166, 2000.
[6] L.F. Shampine and H.A. Watts, “The art of a Runge-Kutta code. Part-I”, Mathematical Software, Vol. 3, pp. 257-275, 1977.
[7] L.F. Shampine and M.K. Gordon, “Computer solutions of ordinary differential equations”, W.H. Freeman, San Francisco, p. 23, 1975.
[8] J.C. Butcher, “On Runge processes of higher order”, Journal of Australian Mathematical Society, Vol. 4, p. 179, 1964.
[9] J.C. Butcher, “The Numerical Analysis of Ordinary Differential Equations: Runge-Kutta and General Linear Methods”, John Wiley & Sons, Chichester, 1987.
[10] J.C. Butcher, “On order reduction for Runge-Kutta methods applied to differential-algebraic systems and to stiff systems of ODEs”, SIAM Journal of Numerical Analysis, Vol. 27, pp. 447-456, 1990.
[11] D.J. Evans and B.B. Sanugi, “A parallel Runge-Kutta integration method”, Parallel Computing, Vol. 11, pp. 245-251, 1989.
[12] R. Ponalagusamy and K. Ponammal, “Investigations on robot arm model using a new parallel RK-fifth order algorithm”, International Journal of Computer, Mathematical Sciences and Applications, Vol. 2, pp. 155-164, 2008.
[13] R. Ponalagusamy and K. Ponnammal, “A new parallel RK-fifth order algorithm for time varying network and first order initial value problems”, Journal of Combinatorics, Information & System Sciences, Vol. 33, pp. 397-409, 2008.
[14] R. Ponalagusamy and K. Ponnammal, “New generalised plasticity equation for compressible powder metallurgy materials: A new parallel RK-Butcher method”, International Journal of Nanomanufacturing, Vol. 6, pp. 395-408, 2010.
[15] D.E. Keyes, A. Sameh and V.V. Krishnan, “Parallel Numerical Algorithm”, Kluwer Academic Publishers, 1997.
[16] G.W. Gear, “The potential for parallelism in ordinary differential equations”, Technical Report UIUCDCS-R-86-1246, Computer Science Department, University of Illinois, Urbana, IL, 1986.
[17] K.R. Jackson and S.P. Norsett, “The potential for parallelism in Runge-Kutta methods: Part-I: RK formulas in standard form”, SIAM Journal on Numerical Analysis, Vol. 32, pp. 49-82, 1995.
[18] K.R. Jackson, “A survey of parallel numerical methods for initial value problems for ordinary differential equations”, IEEE Transactions on Magnetics, Vol. 27, pp. 3792-3797, 1991.
[19] C.P. Katti and D.K. Srivastava, “On a parallel mesh chopping algorithm for fourth order explicit Runge-Kutta method”, Applied Mathematics and Computation, Vol. 143, pp. 563-570, 2003.
[20] H. Harrer, A. Schuler and E. Amelunxen, “Comparison of different numerical integrations for simulating cellular neural networks”, In CNNA-90 Proceedings of IEEE International Workshop on Cellular Neural Networks and their Applications, pp. 151-159, 1990.
[21] M. Bader, “A comparative study of new truncation error estimates and intrinsic accuracies of some higher order Runge-Kutta algorithms”, Computers & Chemistry, Vol. 11, pp. 121-124, 1987.
[22] M. Bader, “A new technique for the early detection of stiffness in coupled differential equations and application to standard Runge-Kutta algorithms”, Theoretical Chemistry Accounts, Vol. 99, pp. 215-219, 1988.
[23] S. Senthilkumar and A.R.M. Piah, “Solution to a system of second order robot arm by parallel Runge-Kutta arithmetic mean algorithm”, InTechOpen, pp. 39-50, 2011.
[24] S.C. Oliveira, “Evaluation of effectiveness factor of immobilized enzymes using Runge-Kutta-Gill: How to solve mathematical undetermination at particle center point?”, Bio Process Engineering, Vol. 20, pp. 185-187, 1999.
[25] L.O. Chua and L. Yang, “Cellular neural networks: Theory”, IEEE Transactions on Circuits and Systems, Vol. 35, pp. 1257-1272, 1988.
[26] L.O. Chua and L. Yang, “Cellular neural networks: Applications”, IEEE Transactions on Circuits and Systems, Vol. 35, pp. 1273-1290, 1988.
[27] L. O. Chua, “CNN: A Paradigm for Complexity”, World Scientific Series on Nonlinear Science, Series A, Vol. 31, 1998.
[28] T. Roska, “CNN Software Library”, Hungarian Academy of Sciences, Analogical and Neural Computing Laboratory, [Online]. Available: http://lab.analogic.sztaki.hu/Candy/csl.html, 1.1. 2000.
[29] Roska et al. “CNNM Users Guide”, Version 5.3x, Budapest, 1994.
[30] R.C. Gonzalez, R.E. Woods and S.L. Eddin, “Digital Image Processing using MATLAB”, Pearson Education Asia, Upper Saddle River, N.J, 2009.
[31] M. Anguita, F.J. Fernandez, A.F. Diaz, A. Canas and F.J. Pelayo, “Parameter configurations for hole extraction in cellular neural networks”, Analog Integrated Circuits and Signal Processing, Vol. 32, pp. 149–155, 2002.
[32] V. Murugesh and K. Badri, “An efficient numerical integration algorithm for cellular neural network based hole-filler template design”, International Journal of Computers, Communications and Control, Vol. 2, pp. 367-374, 2007.
[33] K.K. Lai and P.H.W. Leong, “Implementation of time-multiplexed CNN building block cell”, IEEE Proceedings of Microwave, pp. 80-85, 1996.
[34] K.K. Lai and P.H.W. Leong, “An area efficient implementation of a cellular neural network”, NNES '95 Proceedings of the 2nd New Zealand Two-Stream International Conference on Artificial Neural Networks and Expert Systems, pp. 51-54, 1995.
[35] L.A. Zadeh, “Fuzzy sets”, Information and Control, Vol. 8, No. 3, pp. 338-353, 1965.
[36] L.A. Zadeh, K. Fu, K. Tanaka and M. Shimura (eds), “Fuzzy Sets and Their Applications to Cognitive and Decision Processes”, Academic Press, New York, 1975.
[37] W. Shitong, K.F.L. Chung and F. Duan, “Applying the improved fuzzy cellular neural network IFCNN to white blood cell detection”, Neurocomputing, Vol. 70, pp. 1348-1359, 2007.
[38] M. Laiho, A. Paasio, J. Flak and K. A. I. Halonen, “Template design for cellular nonlinear networks with 1-bit weights”, IEEE Transactions on Circuits and Systems-I: Regular Papers, Vol. 55, No. 3, pp. 904-913, 2008.
[39] T. Yang, L. B. Yang, C. W. Wu, L. O. Chua, “Fuzzy cellular neural networks: Theory”, In Proceedings of IEEE International Workshop on Cellular Neural Networks and Applications, pp. 181-186, 1996.
[40] T. Yang, L. B. Yang, “The global stability of fuzzy cellular neural networks”, IEEE Transactions on Circuit and Systems-I, Vol. 43, pp. 880-883, 1996.
[41] T. Yang and L.B. Yang, “Fuzzy cellular neural network: A new paradigm for image processing”, International Journal of Circuit Theory and Applications, Vol. 25, pp. 469-481, 1997.
[42] T. Yang and L.B. Yang, “Application of fuzzy cellular neural networks to Euclidean distance transformation”, IEEE Transactions on Circuits and Systems-I, CAS-44, pp. 242-246, 1997.
[43] A. Kandel, “Fuzzy Techniques in Pattern Recognition”, John Wiley, New York, 1982.
[44] R.R. Yager and L.A. Zadeh (eds), “An Introduction to Fuzzy Logic in Intelligent Systems”, Kluwer, Boston, 1992.
[45] J.A. Nossek, G. Seiler, T. Roska and L.O. Chua, “Cellular neural networks: Theory and circuit design”, International Journal of Circuit Theory and Applications, Vol. 20, pp. 533-553, 1992.
[46] G. F. Dalla Betta, S. Graffi, M. Kovacs and G. Masetti, “CMOS implementation of an analogy programmed cellular neural network”, IEEE Transactions on Circuits and Systems-Part–II, Vol. 40, pp. 206–214, 1993.
[47] C.L. Yin, J.L. Wan, H. Lin and W.K. Chen, “Brief Communication: The cloning template design of a cellular neural network”, Journal of the Franklin Institute, Vol. 336, pp. 903-909, 1999.
[48] L. O. Chua and P. Thiran, “An analytic method for designing simple cellular neural networks”, IEEE Transactions on Circuits and Systems-I, Vol. 38, pp. 1332-1341, 1991.
[49] T. Matsumoto, L.O. Chua and R. Furukawa, “CNN cloning template: hole filler”, IEEE Transactions on Circuits and Systems, Vol. 37, pp. 635-638, 1990.
[50] K. Murugesan and P. Elango, “CNN based hole filler template design using numerical integration technique”, LNCS 4668, pp. 490-500, 2007.

Senthilkumar was born in Neyveli Township, Cuddalore District, Tamilnadu, India on 18th July 1974. He received his B.Sc in Mathematics from Madras University in 1994, M.Sc in Mathematics from Bharathidasan University in 1996, M.Phil in Mathematics from Bharathidasan University in 1999 and M.Phil in Computer Science & Engineering from Bharathiar University in 2000. He also has a PGDCA and PGDCH in Computer Science and Applications and Computer Hardware from Bharathidasan University which he obtained in 1996 and 1997, respectively. He has a doctoral degree in Mathematics and Computer Applications from National Institute of Technology [REC], Tiruchirappalli, Tamilnadu, India. Currently, he is a post doctoral fellow at the School of Mathematical Sciences, Universiti Sains Malaysia, 11800 USM Pulau Pinang, Malaysia. Prior to this appointment, he was a lecturer/assistant professor in the Department of Computer Science at Asan Memorial College of Arts and Science, Chennai, Tamilnadu, India. He has published many good research papers in international conference proceedings and peer-reviewed/refereed international journals with high impact factor. He has made significant and outstanding contributions to various activities related to research work. He is also an associate editor, editorial board member, reviewer and referee for many scientific international journals. His current research interests include advanced cellular neural networks, advanced digital image processing, advanced numerical analysis and methods, advanced simulation and computing and other related areas.

Abd Rahni Mt Piah was born in Baling, Kedah Malaysia on 8th May 1956. He received his B.A. (Cum Laude) in Mathematics from Knox College, Illinois, USA in 1979. He received his M.Sc in Mathematics from Universiti Sains Malaysia in 1986. He obtained his Ph.D in Approximation Theory from the University of Dundee, Scotland UK in 1993. He has been an academic staff member of the School of Mathematical Sciences, Universiti Sains Malaysia since 1981 and at present is an Associate Professor. He was a program chairman and deputy dean in the School of Mathematical Sciences, Universiti Sains Malaysia for many years. He has published various research papers in refereed national and international conference proceedings and journals. His current research areas include Computer Aided Geometric Design (CAGD), Medical Imaging, Numerical Analysis and Techniques and other related areas.
Location Estimation and Mobility Prediction Using Neuro-fuzzy Networks In Cellular Networks

Maryam Borna
Department of Electrical Engineering
Iran University of Science and Technology
Tehran, Iran
maryam.borna@gmail.com

Mohammad Soleimani
Department of Electrical Engineering
Iran University of Science and Technology
Tehran, Iran
soleimani@iust.ac.ir
Abstract- In this paper an approach is proposed for location estimation, tracking and mobility prediction in cellular networks in dense urban areas using neural and neuro-fuzzy networks. In urban areas with high buildings, due to the effects of multipath fading and Non-Line-of-Sight conditions, the accuracy of positioning methods based on direction finding and ranging degrades significantly. Also in these areas, due to high user traffic, there is a need for network resource management. Knowing the next possible position of a user would be helpful in this case. Here, using the fingerprint positioning concept, after choosing appropriate parameters for fingerprinting in GSM cellular networks, MLP and RBF neural networks were used for position estimation. Then, by the use of neuro-fuzzy networks, a tracking and post-processing method is applied to the estimated locations. For mobility prediction, an ANFIS neuro-fuzzy network is implemented.

Keywords-position estimation; neuro-fuzzy; prediction; cellular networks.

I. INTRODUCTION

Positioning in wireless networks is estimating a node's distance with reference to a fixed node or locating it by its geographical coordinates. Positioning is based on parameters used by mobile or fixed nodes for communication such as Received Signal Strength (RSS), Time of Arrival (TOA) and Angle of Arrival (AOA). According to the type of wireless network and transmission protocols, different parameters are used for communication.

Among the several types of wireless networks, cellular phone networks are more widely distributed and have more subscribers, owing to the increasing usage of cell phones for communication, so it can be said that one of the most probable items found in everyone's pocket is a cell phone. Given location information in cellular phone networks, various services can be provided based on the user's location, ranging from commercial and advertising services to routing, navigation and emergency calls. In cellular phone networks these services are referred to as Location Based Services (LBS). Also, by locating a user's exact position, the network's resources can be efficiently managed and allocated, leading to proper handover between cells and reduced co-channel interference.

However, for managing network resource consumption and reducing the costs of location update and call delivery procedures, prediction of the user's next probable location can be helpful. This is done by analyzing some patterns of the user's mobility behavior. Searching for users will therefore be done in smaller groups of cells, avoiding expensive queries to the Home Location Register (HLR). This is also useful in other wireless networks such as Ad-Hoc networks for efficient bandwidth allocation and uninterrupted handover between access points.

The next sections are as follows: section II describes the problem of positioning in dense urban areas and related studies in the literature. In section III the proposed approach of this paper for positioning and mobility prediction is explained in 3 subsections: fingerprint based positioning in subsection A, post processing of the estimated path in subsection B and path prediction in subsection C. Section IV includes the results of evaluating the proposed approach on a database collected from the GSM mobile phone network in the city of Tehran. Results are discussed in section V and the last section concludes the paper.

II. PROBLEM DEFINITION

With developments in cellular phone networks, different methods have been considered for facilitating user positioning, such as Cell-ID, Cell-ID+TA, A-GPS, AOA, … The more the accuracy increases, the more expensive the deployment becomes and the greater the need for hardware and software changes in both the cell phone device and the network infrastructure. Moreover, most of these methods are sensitive to Non Line of Sight (NLOS) communication between transmitter and receiver and to multipath fading, conditions that are typical of dense urban areas. Although every day there are more and more mobile phone devices equipped with GPS receivers with positioning accuracy up to a few meters, in urban areas with high buildings, where it is less likely to have line of sight communication with at least 3 GPS satellites, or inside buildings, where the signals attenuate significantly passing through the walls, positioning accuracy degrades considerably. In such cases there is a need for an auxiliary method to overcome these problems.
The difficulty in these areas is the complexity of the propagation model of electromagnetic waves caused by multipath fading, diffraction and scattering, which makes it hard for geometrical and statistical positioning methods relying on relations between signal parameters and Tx-Rx separation.

Fingerprint based positioning methods are better for the mentioned cases [1] [2]. In these methods, first a database of signal parameters in certain places is collected with no knowledge of the propagation model of the environment, and position estimation is then done upon this information and the possible mapping among them.

One way to find the mapping relations in the fingerprint database is to use Artificial Neural Networks. These networks are able to estimate complex nonlinear functions such as the mapping relations by parallel processing of neurons. Position estimation can be considered as a function approximation problem in neural networks that aims to find the nonlinear mapping between inputs (fingerprints) and outputs or targets (mobile phone's coordinates). In comparison with other database lookup methods like K-Nearest Neighbor (KNN), which uses fingerprint parameters to find the nearest Euclidean neighbors, neural networks are better. On the other hand, since neural networks approximate functions, and fingerprint details are somewhat related to delay and power loss of arrived signals, which in turn depend on Tx-Rx separation, it seems that neural networks combine both features of RSS and TOA-TDOA based systems [3] [4]. Two common models of neural networks in function approximation, multi layer perceptron (MLP) and Radial Basis Function (RBF) networks, are used more [2] [3] [5] [6].

After user localization, the history of the user's travelled places can be considered as a time series, so that by recognizing the mobility pattern the next location can be predicted. In the literature on user path prediction in wireless networks with different standards, Recurrent Neural Networks (RNN), Bayesian Neural Networks (BNN) or neuro-fuzzy networks were employed; some used the user's behavioral pattern over a long period of time and in different situations to learn his mobility pattern and similar users then predict their next

standards and data collection tool. Data collection and pattern learning are done offline before online real-time positioning.

Input parameters of neural networks, the fingerprint of a point, must be measurable, collectable and different from place to place. Here we aim to attain the intended fingerprints from information provided by the mobile phone device without software or hardware changes in the device and without additional signaling between MS and BTS. The data can be obtained from the mobile phone routing table. This table is used for selecting the best cell to reside in and is resorted every few seconds. In the GSM900 standard for mobile phone networks, this table contains a list of 30 radio channels (ARFCN) sorted in descending order based on received power. In addition to received power, other parameters of the currently selected and neighboring cells are available, such as cell name, absolute radio frequency number for broadcasting the cell's status, received power level, received signal quality and timing advance (TA). Also the attributes of the BTS antennas of each cell, like height and installation coordinates, are accessible.

From the mentioned parameters, those were chosen for the fingerprint that have the following properties:

• Being sensitive to spatial changes. Therefore fixed parameters within a cell boundary like radio channel numbers, cell antenna height and similar parameters are not suitable for fingerprinting.
• Parameters should be representative of multipath fading effects in the propagation environment. Received signal level, received signal quality and Timing Advance (TA) are such parameters. However TA is a discrete value of estimated BTS-MS separation with an accuracy of about 550 meters, say if TA=1, MS is in a radius of 550 meters from BTS and TA=2 means MS is in a radius of 550 to 1100 meters from cell antenna. Hence in a cell with radius less than 550 meters-like most urban cells- TA's
location [8] [9] [10]. value isn't much helpful in positioning.
In this paper after gathering enough fingerprints with • For less signaling between mobile phone and
appropriate parameters in GSM cellular phone network, BTS it's better to acquire fingerprint in IDLE
positioning of mobile phone device is done by searching for mode rather that ACTIVE mode. TA and
the best architecture for MLP and RBF neural networks in a Received signal quality are determined in
dense urban area. Afterwards using tracking and prediction ACTIVE mode while RSS is monitored
feature of ANFIS neuro-fuzzy network, the estimated path is periodically even in IDLE mode.
post processed and user's upcoming path is predicted. We chose parameters that fulfill mentioned requirements
namely Received signal strength from cell antenna beside its
III. PROPOSED APPROACH
coordinates for serving cell and two of neighboring cells. So
A. Fingerprint based positioning there are 9 parameters to be recorded in a single fingerprint
In fingerprint based positioning methods first a database beside the coordinates of data gathering location. By
of fingerprints in a certain area is collected. A fingerprint of a collecting these fingerprints in sufficient data points of the
certain point includes particular information like designated area, we have a suitable database for further
geographical coordinates of the point which is a specification analysis by fingerprint based positioning methods. As
of that point. This information includes estimated signal discussed in previous section, for database processing, neural
parameters that are different depending on wireless network networks predominates other methods so is employed here.
Training set tuples are mentioned 9 parameters as input and
66 http://sites.google.com/site/ijcsis/
ISSN 1947-5500
(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 9, No. 6, June 2011
latitude, longitude of respective data point as target or output
for neural networks.
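The structure of a single training tuple can be made concrete with a short sketch. It is only an illustration under stated assumptions: the class and field names are hypothetical, the RSS unit is assumed to be dBm, the ordering of the 9 inputs is our choice, and the numeric values are made up rather than taken from the collected database.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CellObservation:
    """One observed cell: received signal strength plus BTS antenna position."""
    rss_dbm: float  # received signal strength (unit assumed to be dBm)
    bts_lat: float  # BTS antenna latitude in degrees
    bts_lon: float  # BTS antenna longitude in degrees


@dataclass
class Fingerprint:
    """Serving cell, two neighbouring cells, and the GPS position of the data point."""
    serving: CellObservation
    neighbours: Tuple[CellObservation, CellObservation]
    lat: float
    lon: float

    def inputs(self) -> List[float]:
        """Flatten the three cells into the 9 network inputs (3 values per cell)."""
        values: List[float] = []
        for cell in (self.serving, *self.neighbours):
            values.extend([cell.rss_dbm, cell.bts_lat, cell.bts_lon])
        return values

    def target(self) -> List[float]:
        """The 2 network outputs: latitude and longitude of the data point."""
        return [self.lat, self.lon]


# Made-up values, for illustration only.
example = Fingerprint(
    serving=CellObservation(-67.0, 35.781, 51.462),
    neighbours=(
        CellObservation(-74.0, 35.784, 51.470),
        CellObservation(-81.0, 35.776, 51.455),
    ),
    lat=35.780,
    lon=51.463,
)
x, y = example.inputs(), example.target()
print(len(x), len(y))
```

Keeping the BTS coordinates next to each RSS value lets the network associate a signal level with the antenna it came from, which is how the 9 inputs are described in the text.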
One of the problems in neural network design, particularly for MLP networks, is the lack of definite equations for determining the best architecture of the network and the number of neurons in the hidden layers. In the training phase of a NN, while evaluating its ability to learn, its response to new, unseen data should also be considered so that the network generalizes well. For this purpose 84% of the data set was used for training and the remainder for testing the trained network.

To find the best architecture for the MLP NN, an upper limit based on the number of members of the training set is first set for the maximum number of network parameters, i.e. the total number of weights and biases in the neural network. Then, by varying the number of hidden layers and the number of neurons in each layer within the defined range, the architecture yielding the least positioning error for the training set, the testing set and the whole data set is chosen. For the available database the best architecture for the MLP was one hidden layer with 23 neurons. The input layer has 9 neurons and the output layer has 2 neurons, for the estimated latitude and longitude of the mobile phone's location (Fig. 1).

Figure 1. Proposed architecture for MLP neural network

In a standard Radial Basis Function NN a governing design parameter is the radial neuron's spread, which determines its sensitivity to the resemblance between the network's inputs and weights. Searching for the spread parameter resulted in a value of 196, which led to the least error on the testing data set. The number of neurons was set to its maximum, i.e. the number of training set members (Fig. 2).

Figure 2. Proposed architecture for RBF neural network

Another type of RBF network employed here is the Generalized Regression Neural Network (GRNN), which has a fixed structure that differs little from standard RBF networks. The spread parameter for the radial neurons was obtained as in the former case and set to 2.

After position estimation with the designed neural networks, there were rather large errors at a few points, probably caused by an insufficient number of training set members and the unavailability of RSS at some points. To lessen this error we applied post-processing to the estimated path, described in the next subsection.

B. Post-processing the estimated path
In this section we use ANFIS (Adaptive Neuro-Fuzzy Inference System) to process the path of the user's travelled places previously estimated by the NN. The employed neuro-fuzzy network is the neural network equivalent of a Sugeno FIS (Fuzzy Inference System). In comparison with MLP, ANFIS has a fixed architecture and no search for the best structure is needed. It responds faster and consumes fewer computational resources (Fig. 3).

Figure 3. Neuro-fuzzy network for Sugeno's fuzzy inference system

The inputs of the ANFIS network were the estimated path with 5 delays, and the output was the same path with no delays. For the initial FIS fed to ANFIS, we used subtractive clustering in which the influence radius of every cluster was set to 0.5 for all 11 dimensions of the data.

ANFIS is used here to estimate the user's movement function and smooth the NN-estimated path so that the road on which the user is moving can be identified, which is useful for map routing and navigation purposes.

C. Predicting user's next location
The subsequent locations travelled by a mobile phone user can be treated as a time series. Here we used the prediction ability of the ANFIS network. The structure is the same as before. For training the network, the first 20% of the travelled path with 2 delays was selected as input and the same path advanced by one step as output. The remaining 80% of the path was used for testing. In this way the trained network is able to predict the next location from the present and one previous location of the user. Here we've used the path estimated by the neural network in section III and calculated the error with respect to the real path.

IV. EVALUATION OF THE PROPOSED APPROACH
For data collection we've used the TEMS® drive tester tool, which is used for optimization and troubleshooting of mobile phone networks by monitoring their status. It presents the network data received by the mobile phone in a computer interface for further processing and is also able to record the coordinates of data collection points via a GPS receiver.

We've used this tool to collect a fingerprint database of about 250 data points in the GSM communication network in the city of Tehran. For network training and simulation we used the MATLAB® neural network and fuzzy logic toolboxes.
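The training set for next-location prediction (Section III.C) can be illustrated with a small sketch. It only shows how a path is turned into input/target pairs with 2 delays and how the first 20% of the path is separated for training; it does not reproduce ANFIS itself. The function names are hypothetical and the path is a synthetic straight line, not measured data.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (latitude, longitude)


def delay_embed(path: Sequence[Point], delays: int = 2) -> Tuple[List[List[float]], List[Point]]:
    """Each input holds `delays` consecutive points (oldest first); the target is the next point."""
    inputs: List[List[float]] = []
    targets: List[Point] = []
    for t in range(delays - 1, len(path) - 1):
        window = path[t - delays + 1 : t + 1]
        inputs.append([coord for point in window for coord in point])
        targets.append(path[t + 1])
    return inputs, targets


def split_path(path: Sequence[Point], train_fraction: float = 0.2):
    """Use the beginning of the path for training and the rest for testing."""
    cut = int(len(path) * train_fraction)
    return path[:cut], path[cut:]


# Synthetic path of 20 points, for illustration only.
path = [(float(i), float(10 * i)) for i in range(20)]
train, test = split_path(path)
X_train, y_train = delay_embed(train, delays=2)
X_test, y_test = delay_embed(test, delays=2)
print(len(X_train), len(X_test))
```

With 2 delays each input row contains the previous and the present location, so a trained model maps (previous, present) to the next location, as described in the text.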
Simulation results of the trained neural networks showed that the designed MLP network performs better than the RBF networks. RBF networks are faster and easier to design, but are suitable when the training set has very many members.

Fig. 4 displays the estimated location after post-processing with ANFIS, indicating mitigation of high errors. Fig. 5 is the Cumulative Error Probability before and after post-processing, showing alleviation of high errors. In Fig. 6 the same path is displayed on the map by Google Earth®. It can be seen that after ANFIS post-processing the road on which the user is travelling can be identified more accurately.

Figure 4. Position estimation and post-processing result in a road (plot of longitude versus latitude; legend: Real track, Neural Network estimated, ANFIS post-processing, BTS position)

Figure 5. Cumulative Error Probability before and after post-processing (plot of Cumulative Distribution Function (CDF) versus Positioning Error (m); legend: NN estimated, ANFIS post-processing)

Figure 6. Part of the path of Fig. 4 on the map of Tehran

Fig. 7 displays the real, trained and predicted paths by ANFIS. CEP-60% is less than 115 meters, which makes this prediction useful in determining the user's next probable cell to reside in, leading to better management of network resources and successful cell reselection.

Figure 7. Proposed user's mobility prediction with ANFIS (plot titled Tajrish-Zarrabkhane, longitude versus latitude; legend: Real track, Trained Track, Predicted Track)

V. CONCLUSION
In the proposed approach for location estimation in cellular networks by neural networks in dense urban areas, a mean positioning error of less than 80 meters and a CEP-60% of 65 m were obtained in a 3 by 4 km area. Compared with most commercial positioning methods implemented in cellular networks, like E-CGI, E-OTD and AOA, which have a positioning error of 200 m in such conditions, a 50% improvement was achieved. Meanwhile, this method can complement GPS positioning in cases where GPS signals are weak. The method requires no additional signaling and no extra hardware or software installation in either the phone device or the network. We've used a fingerprint database of RSS parameters, which are available in most wireless networks. In comparison with other positioning methods based on neural networks, we've avoided a fixed structure for MLP NNs by searching, with a simple script, for the one that best suits a given database. Applying ANFIS post-processing, which approximates the user's movement function, decreased the high errors. The accuracy of the proposed mobility prediction by ANFIS, relative to the radius of cells in most cities, which is about 100 to 150 m, makes it useful for anticipating the user's next cell, reducing the costs of the location update and paging procedures.
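The error figures quoted above (mean positioning error and CEP-60%) can be computed from pairs of real and estimated coordinates. The sketch below is one plausible reading, not the computation used for the reported results: it assumes the positioning error is the great-circle distance between the two points, and that CEP-60% is the smallest error within which at least 60% of the points fall. The function names and the Earth radius constant are our own choices.

```python
import math
from typing import Sequence, Tuple

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters (assumed constant)


def haversine_m(p: Tuple[float, float], q: Tuple[float, float]) -> float:
    """Great-circle distance in meters between two (latitude, longitude) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def positioning_errors(real: Sequence[Tuple[float, float]],
                       estimated: Sequence[Tuple[float, float]]) -> list:
    """Per-point error in meters between the real and the estimated track."""
    return [haversine_m(r, e) for r, e in zip(real, estimated)]


def cep(errors_m: Sequence[float], probability: float = 0.60) -> float:
    """Smallest error such that at least `probability` of the points have an error no larger."""
    ordered = sorted(errors_m)
    # round() guards against floating-point noise in probability * n before taking the ceiling
    k = math.ceil(round(probability * len(ordered), 9))
    return ordered[max(k, 1) - 1]


def mean_error(errors_m: Sequence[float]) -> float:
    return sum(errors_m) / len(errors_m)
```

Sorting the errors and reading off the value at the requested fraction is the same operation as reading the 0.6 level off the cumulative curve of Fig. 5.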
REFERENCES
[1] Ines Ahriz, Yacine Oussar, Bruce Denby, and Gerard Dreyfus, "Full-Band GSM Fingerprints for Indoor Localization Using a Machine Learning Approach," International Journal of Navigation and Observation, 2010.
[2] Claude Takenga and Kyandoghere Kyamakya, "A Low-cost Fingerprint Positioning System in Cellular Networks," in Second International Conference on Communications and Networking in China, CHINACOM '07, 2007.
[3] Anthony Taok, Nahi Kandil, and Sofiene Affes, "Neural Network for Fingerprinting-Based Indoor Localization Using Ultra-Wideband," Journal of Communications, vol. 4, no. 4, 2009.
[4] M.H Hung, Shi-Shung Lin, Jui-Yu Cheng, and Wu-Lung Chien, "A ZigBee Indoor Positioning Scheme using Signal-Index-Pair Data Preprocess Method to Enhance Precision," in IEEE International Conference on Robotics and Automation (ICRA), 2010.
[5] Aylin Aksu, Joseph Kabara, and Michael B. Spring, "Reduction of Location Estimation Error using Neural Networks," in Proceedings of the first ACM international workshop on Mobile entity localization and tracking in GPS-less environments, MELT'08, 2008.
[6] C Laoudias et al., "Ubiquitous Terminal Assisted Positioning Prototype," in IEEE Wireless Communications and Networking Conference, WCNC, 2008.
[7] Hani Kaaniche and Farouk Kamoun, "Mobility Prediction in Wireless Ad Hoc Networks Using Neural Networks," Journal of Telecommunications, vol. 2, no. 1, 2010.
[8] Sherif Akoush and Ahmed Sameh, "Mobile User Movement Prediction Using Bayesian Learning for Neural Networks," in IWCMC '07 Proceedings of the 2007 international conference on Wireless communications and mobile computing, 2007.
[9] J Amar Prathap Singh and M Karnan, "Intelligent location management for UMTS networks using Fuzzy Neural Networks," Journal of Engineering and Technology Research, vol. 2, 2010.

AUTHORS PROFILE
Maryam Borna has received her Bachelor of Science in Electrical Engineering with major of Telecommunications from Shahed University, Tehran, Iran and her Master of Science in IT Engineering with major of Secure Communications from Dept. of Electrical Engineering of Iran University of Science and Technology, Tehran, Iran. Her research interests include mobile phone networks, neural networks, microstrip antennas.

Mohammad Soleimani received the B.S. degree in electrical engineering from the University of Shiraz, Shiraz, Iran, in 1978 and the M.S. and Ph.D. degrees from Pierre and Marie Curie University, Paris, France, in 1981 and 1983, respectively. He is working as a Professor with the Iran University of Sciences and Technology, Tehran, Iran. His research interests are in antennas, small satellites, electromagnetic, and radars. He has served in many executive and research positions including: Minister of ICT, Student Deputy of Ministry of Science, Research and Technology, Head of Iran Research Organization for Science and Technology, Head of Center for Advanced Electronics Research Center; and Technology Director for Space Systems in Iran Telecommunication Industries.
(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 9, No. 6, 2011
A Fuzzy Clustering Based Approach for Mining Usage Profiles from Web Log Data

Zahid Ansari (1), Mohammad Fazle Azeem (2), A. Vinaya Babu (3) and Waseem Ahmed (4)

(1,4) Dept. of Computer Science Engineering, P.A. College of Engineering, Mangalore, India
(2) Dept. of Electronics and Communication Engineering, P.A. College of Engineering, Mangalore, India
(3) Dept. of Computer Science Engineering, Jawaharlal Nehru Technological University, Hyderabad, India

(1) zahid.ansari@acm.org, (2) mf.azeem@gmail.com, (3) dravinayababu@jntuh.ac.in, (4) waseem@computer.org
Abstract- The World Wide Web continues to grow at an amazing rate in both the size and complexity of Web sites and is well on its way to being the main reservoir of information and data. Due to this increase in growth and complexity of the WWW, web site publishers are facing increasing difficulty in attracting and retaining users. To design popular and attractive websites publishers must understand their users' needs. Therefore analyzing users' behaviour is an important part of web page design. Web Usage Mining (WUM) is the application of data mining techniques to web usage log repositories in order to discover the usage patterns that can be used to analyze the user's navigational behavior [1]. WUM contains three main steps: preprocessing, knowledge extraction and results analysis. The goal of the preprocessing stage in Web usage mining is to transform the raw web log data into a set of user profiles. Each such profile captures a sequence or a set of URLs representing a user session.

This sessionized data can be used as the input for a variety of data mining tasks such as clustering [2], association rule mining [3], sequence mining [4] etc. If the data mining task at hand is clustering, the session files are filtered to remove very small sessions in order to eliminate the noise from the data [5]. But direct removal of these small sized sessions may result in loss of a significant amount of information, especially when the number of small sessions is large. We propose a "Fuzzy Set Theoretic" approach to deal with this problem. Instead of directly removing all the small sessions below a specified threshold, we assign weights to all the sessions using a "Fuzzy Membership Function" based on the number of URLs accessed by the sessions. After assigning the weights we apply a "Fuzzy c-Means Clustering" algorithm to discover the clusters of user profiles. In this paper, we discuss our methodology to preprocess the web log data including data cleaning, user identification and session identification. We also describe our methodology to perform feature selection (or dimensionality reduction) and session weight assignment tasks. Finally we compare our soft computing based approach of session weight assignment with the traditional hard computing based approach of small session elimination.

Keywords- web usage mining; data preprocessing; fuzzy clustering; knowledge discovery

I. INTRODUCTION

Due to the digital revolution and advancements in computer hardware and software technologies, digitized information is easy to capture and fairly inexpensive to store [6], [7]. As a result a huge amount of data has been collected and stored in databases, and the rate at which such data is stored is growing phenomenally. The fast growing, tremendous amount of data collected and stored in large and numerous data repositories has far exceeded our human ability for comprehension without powerful tools. The abundance of data, coupled with the need for powerful data analysis tools, has been described as a "data rich but information poor" situation. Hence, there is an urgent need for a new generation of computational techniques and tools to assist humans in extracting useful information (knowledge) from the rapidly growing volumes of data [8]. Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns or rules. It deals with the "knowledge in the database" [8]. The term KDD refers to the overall process of knowledge discovery in databases. Data mining is a particular step in this process, involving the application of specific algorithms for extracting patterns from data. The additional steps in the KDD process, such as data preparation, data selection, data cleaning, incorporation of appropriate prior knowledge, and proper interpretation of the results of mining, ensure that useful knowledge is derived from the data [9].
Data mining often builds on an interdisciplinary bundle of specialized techniques from fields such as statistics, artificial intelligence, machine learning, data bases, pattern recognition, computer-based visualization etc. The more common model functions in current data mining practice include classification, regression, clustering, rule generation, discovering association, summarization and sequence analysis [10]. The World Wide Web, as a large and dynamic information source that is structurally complex and ever growing, is a fertile ground for data mining principles, or Web Mining. Web mining is primarily aimed at deriving actionable knowledge from the Web through the application of various data mining techniques [11]. Web data is typically unlabelled, distributed, heterogeneous, semi-structured, time varying, and high dimensional. Web data can be grouped into the following categories [12]: i) Contents of actual Web pages, ii) Intra-page structures of the web pages, iii) Inter page structures specifying linkage structures between Web pages, iv) Web usage data describing how Web pages are accessed and v) User profiles which include demographic and registration information about users. Web Usage Mining is the discovery of user access patterns from Web servers [1]. Web Usage Mining analyzes results of user interactions with a Web server, including Web logs, click streams, and database transactions at a Web site or a group of related sites. Web usage mining includes clustering (e.g. finding natural groupings of users, pages etc.), associations (e.g. which URLs tend to be requested together), and sequential analysis (the order in which URLs tend to be accessed) [13]. As with any knowledge discovery and data mining (KDD) process, WUM performs three main steps: preprocessing, pattern extraction and results analysis. Figure 1 describes the WUM process.

Figure 1. Web Usage Mining Process.

The goal of the preprocessing stage in Web usage mining is to transform the raw click stream data into a set of user profiles. Each such profile captures a sequence or a set of URLs representing a user session. Web usage data preprocessing exploits a variety of algorithms and heuristic techniques for various preprocessing tasks such as data fusion and cleaning, user and session identification etc. Figure 2 depicts the primary tasks involved in web log data preprocessing in order to discover the user sessions.

Data fusion refers to the merging of log files from several Web servers. This requires global synchronization across these servers [14]. Data cleaning involves tasks such as removing extraneous references to embedded objects, style files, graphics, or sound files, and removing references due to spider navigations. Popular Web sites generate log files whose size is measured in gigabytes per hour. Manipulating such large files is a complicated task. By filtering out useless data, we can reduce the log file size to enhance the upcoming mining tasks.

Figure 2. Web Log Processing to Discover Weighted Sessions.

User identification refers to the process of identifying unique users from the user activity logs. Usually the log file in Extended Common Log format provides only the computer's address and the user agent. For Web sites requiring user registration, the log file also contains the user login. In such cases this information can be used for user identification. For those cases where user login information is not available, we consider each IP as a user. User session identification is the process of segmenting the user activity log of each user into sessions, each representing a single visit to the site. Identification of user sessions from the web log file is a complicated task, due to the existence of proxy servers, dynamic addresses, and cases of multiple users accessing the same computer [23][2][25][26]. It is also possible that one user might be using multiple browsers or computers.

Once user sessions are discovered, this sessionized data can be used as the input for a variety of data mining tasks such as clustering, association rule mining, sequence mining etc. If the data mining task at hand is clustering, the session files are filtered to remove very small sessions in order to eliminate the noise from the data. But direct removal of these small sized sessions may result in loss of a significant amount of information, especially when the number of small sessions is large. We propose a "Fuzzy Set Theoretic" approach to deal with this problem. Instead of directly removing all the small sessions below a specified threshold, we assign weights to all the sessions using a "Fuzzy Membership Function" based on the number of URLs accessed by the sessions. After assigning the weights we apply a "Fuzzy c-Means Clustering" algorithm to discover the clusters of user profiles. Fuzzy clustering techniques perform non-unique partitioning of the data items, where each data point is assigned a membership value for each of the clusters. This allows the clusters to grow into their natural shapes [15]. A membership value of zero indicates that the data point is not a member of that cluster. A non-zero membership value shows the degree to which the data point represents a cluster. Fuzzy clustering algorithms can handle the
outliers by assigning them a very small membership degree for the surrounding clusters. Thus fuzzy clustering is a more robust method for handling natural data with vagueness and uncertainty.

The rest of the paper is organized as follows: in Section II, we describe the techniques to preprocess the web log data including data cleaning, user and session identification. In Section III, we describe our methodology for feature selection (or dimensionality reduction) and session weight assignment. In that section we also discuss our work to apply Fuzzy c-Means Clustering algorithms to weighted user sessions. Section IV provides the experimental results of our methodology applied to the access logs of a real Web site. Finally Section V discusses the conclusion and future work.

II. PREPROCESSING OF WEB LOG DATA
The primary data sources used in Web usage mining are the server log files, which include Web server access logs and application server logs.

1212265085.247 741 192.168.23.62 TCP_MISS/200 10858 GET http://www.pace.edu.in/index.php - DEFAULT_PARENT/192.168.20.1 Mozilla/5.0

Figure 3. A Sample Web Log Entry.

A sample web server log file entry in Extended Common Log Format (ECLF) is given in Figure 3 and a description of the various fields is given in Table I.

TABLE I. DESCRIPTION OF LOG FIELDS

Field Value                         Description
1212265085.247                      The time of request, in coordinated universal time
741                                 The elapsed time for HTTP request
192.168.23.62                       IP address of the client
TCP_MISS/200                        HTTP reply status code
10858                               Bytes sent by the server in response to the request
GET                                 The requested action
http://www.pace.edu.in/index.php    URI of the object being requested
-                                   Client user name; if disabled, it is logged as -
DEFAULT_PARENT/192.168.20.1         Hostname of the machine where we got the object
-                                   Content Type of the object

A. Data Cleaning
A user's request to view a particular page often results in several log entries, since graphics and scripts are downloaded in addition to the HTML file. In most cases, only the log entry of the HTML file request is relevant and should be kept for the user session file. This is because, in general, a user does not explicitly request all of the graphics that are on a Web page; they are automatically downloaded due to the HTML tags. Since the main purpose of Web Usage Mining is to get a picture of the user's behavior, it does not make sense to include file requests that the user did not explicitly make. During the data cleaning process we removed the extraneous references to embedded objects, style files, graphics and sound files. Elimination of the irrelevant items was accomplished by checking the suffix of the URL name. All log entries with filename suffixes such as gif, jpeg, GIF, JPEG, jpg, JPG, and map were removed. A default list of suffixes was used to remove undesired files. Another main activity of the cleaning process is the removal of robots' requests. Web robots (WR), or spiders, scan a Web site to extract its content. Web robots automatically access all the hyperlinks from a Web page. The number of requests from a web robot is at least the number of the site's URLs. Removing WR-generated log entries removes uninteresting sessions from the log file and simplifies the subsequent mining tasks. In order to identify WR hosts we used a list of all user agents known as robots, as suggested by [16]. We obtained this list from the site "http://www.robotstxt.org". Figure 4 describes the algorithm for data cleaning and transformation.

Input: Access log file W
Output: Cleaned file C
For each line L ∈ W do
1) Split L and extract various fields
2) If the URL includes the query string then remove it
3) Remove all the irrelevant requests whose URL suffix is specified in the irrelevant suffix list
4) Remove all WR-generated requests
5) Encrypt IP address to hide user's identity
6) Store URL in a URL map along with corresponding URL number
7) Print required fields into the output file

Figure 4. Algorithm for data cleaning and transformation.

Table II describes the format of the output file C generated as a result of cleaning and transformation of the web logs. The output file shows that client IP addresses are replaced with aliases in order to hide the identity of the user. The URL column of the table shows that URL strings are replaced by numbers in order to ease further processing. We maintain a map of URL strings and corresponding URL numbers.

TABLE II. FILE FORMAT AFTER DATA CLEANING

Time             IP    User Agent   Elapsed Time   Bytes   URL
20080601014805   IP1   UA1          741            10858   1
20080601014806   IP1   UA1          1735           19247   2
20080601014808   IP2   UA2          239            209     1
20080601014809   IP1   UA3          674            156     3
20080601014813   IP2   UA2          680            179     4
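The cleaning steps of Figure 4 can be sketched in a few lines of Python. This is a minimal illustration under several assumptions, not the implementation used in the experiments: each log line is assumed to have the ten whitespace-separated fields of Table I with the user agent last, the robot check uses a short hypothetical marker list as a stand-in for the user agent list from robotstxt.org, client addresses are replaced by sequential aliases as in Table II rather than encrypted, and the timestamp is passed through unchanged.

```python
from urllib.parse import urlsplit

IRRELEVANT_SUFFIXES = {"gif", "jpeg", "jpg", "map"}  # compared case-insensitively
ROBOT_AGENT_MARKERS = ("bot", "crawler", "spider")   # hypothetical stand-in for the robots list


def clean_log(lines):
    """Return (cleaned records, URL map) for an iterable of raw log lines."""
    url_numbers = {}  # URL string -> URL number (step 6)
    ip_aliases = {}   # client IP -> alias such as "IP1" (step 5)
    cleaned = []
    for line in lines:
        fields = line.split()  # step 1: split the line into fields
        if len(fields) < 10:
            continue  # skip malformed lines
        timestamp, elapsed, ip, _status, size, _method, url = fields[:7]
        agent = " ".join(fields[9:])
        parts = urlsplit(url)
        page = f"{parts.scheme}://{parts.netloc}{parts.path}"  # step 2: drop the query string
        suffix = parts.path.rsplit(".", 1)[-1].lower() if "." in parts.path else ""
        if suffix in IRRELEVANT_SUFFIXES:  # step 3: embedded objects
            continue
        if any(marker in agent.lower() for marker in ROBOT_AGENT_MARKERS):  # step 4: robots
            continue
        alias = ip_aliases.setdefault(ip, f"IP{len(ip_aliases) + 1}")
        url_no = url_numbers.setdefault(page, len(url_numbers) + 1)
        cleaned.append((timestamp, alias, agent, int(elapsed), int(size), url_no))  # step 7
    return cleaned, url_numbers


# The sample entry of Figure 3.
sample = ("1212265085.247 741 192.168.23.62 TCP_MISS/200 10858 GET "
          "http://www.pace.edu.in/index.php - DEFAULT_PARENT/192.168.20.1 Mozilla/5.0")
records, url_map = clean_log([sample])
print(records, url_map)
```

Numbering URLs in order of first appearance keeps the URL map small and lets later steps work with integers instead of long strings.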
B. User Identification Time-oriented heuristic TOH1 uses an upper bound on the
Once web log files have been cleaned, next step in the data time spent in the entire site during a visit. The timestamp of
preparation is the identification of the user. Since the log files of the web server we are working on do not contain user login information, we consider each unique combination of IP address and User-Agent as a separate user. Next we separate out all the requests corresponding to each individual user. Figure 5 describes the algorithm to generate the requests corresponding to each individual user.

Input: File C, the cleaned access log file
Output: File U, which contains the user-wise list of accessed URLs
1) For each line L ∈ C do
   b) Split L to get the required fields
   c) Store them in a map M1 with (IP, UserAgent) as the key and another map M2 as the value. The key of the map M2 is the time and its value is the rest of the fields
2) Sort the inner map M2 based on the time key
3) Print the contents of the map M1 to the output file U

Figure 5. Algorithm to separate requests for each individual user

The format of the output file U generated after user identification is depicted in Table III below:

TABLE III. FILE FORMAT AFTER USER IDENTIFICATION

User  Time            Elapsed Time  Bytes  URL
U1    20080601014805  741           10858  1
      20080601014806  1735          19247  2
      …               …             …      …
U2    20080601014809  674           156    3
      …               …             …      …
U3    20080601014808  239           209    1
      20080601014813  680           179    4

C. User Session Identification
User session identification is the process of segmenting the user activity log of each user into sessions, each representing a single visit to the site. Web sites without user authentication information mostly rely on heuristic methods for sessionization. The sessionization heuristic helps in extracting the actual sequence of actions performed by one user during one visit to the site. In order to identify user sessions we experimented with two different time oriented heuristics (TOH) as described below:

• TOH1: The time duration of a session must not exceed a threshold β. Let the timestamp of the first URL request in a session be T1. A URL request with timestamp Ti is assigned to this session if and only if Ti - T1 ≤ β. The first URL request with a timestamp larger than T1 + β is considered as the first request of the next session.
• TOH2: The time spent on a page visit must not exceed a threshold β. Let Ti be the timestamp of the URL most recently assigned to a session. The next URL request with timestamp Ti+1 belongs to the same session if and only if Ti+1 - Ti ≤ β. Otherwise, this URL is considered to be the first of the next session.

… every URL access request is compared with that of the first access request of the current session. If the time difference is larger than β, this request becomes the first request of the new session; otherwise it belongs to the current session. On the other hand, the time-oriented heuristic TOH2 uses an upper bound on page-stay time. The timestamp of every URL access request is compared with that of the previous access request. If the time difference is larger than β, this request becomes the first request of the new session; otherwise it belongs to the current session. We have selected 30 minutes as the value of the threshold time β for both of the above schemes.

Input: File U, containing access logs of various users.
Output: File S, the file that contains different sessions based on TOH1
For each line L ∈ U do
1)  if L represents a user then
2)      UserId ← L
3)      Output L to file S
4)  else if L is the first accessed log of the user then
5)      T1 ← L.time
6)  else
7)      T2 ← L.time
        // Compare the timestamps of the current and the first request
8)      if T2 - T1 ≤ β then
9)          Output L to file S
10)     else
11)         Output UserId to file S
12)         Output L to file S
13)         T1 ← L.time

Figure 6. Algorithm to generate User Sessions based on TOH1

The algorithm to generate the user sessions based on the time oriented heuristic TOH1 is specified in Figure 6.

TABLE IV. FILE FORMAT AFTER USER SESSION IDENTIFICATION

User Session  Time            Elapsed Time  Bytes  URL
U1-S1         20080601014805  741           10858  1
              20080601014806  1735          19247  2
              …               …             …      …
U1-S2         …               …             …      …
              …               …             …      …
…
…
U2-S1         20080601014809  674           156    3
              …               …             …      …
…
U3-S1         20080601014808  239           209    1
              20080601014813  680           179    4

Table IV shows the format of the output file S containing user sessions. Once user sessions are generated we scan each session and remove the duplicate URLs from each session. For each unique URL within a user session a single copy of the URL is kept along with its frequency of occurrence. We also maintain the count of the total number of unique URLs in each session.
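The two time oriented heuristics can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions, not the implementation used in the experiments: the timestamps of a single user are assumed to have already been converted to seconds (the log format above appears to store them as date-time strings such as 20080601014805), and the function names and sample values are hypothetical.

```python
from typing import List

BETA_SECONDS = 30 * 60  # threshold β of 30 minutes, as selected in the text


def sessionize_toh1(timestamps: List[int], beta: int = BETA_SECONDS) -> List[List[int]]:
    """TOH1: a request joins the current session iff t - T1 <= beta,
    where T1 is the timestamp of the first request of that session."""
    sessions: List[List[int]] = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][0] <= beta:
            sessions[-1].append(t)
        else:
            sessions.append([t])  # first request of a new session
    return sessions


def sessionize_toh2(timestamps: List[int], beta: int = BETA_SECONDS) -> List[List[int]]:
    """TOH2: a request joins the current session iff the gap to the
    previous request of that session is at most beta."""
    sessions: List[List[int]] = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= beta:
            sessions[-1].append(t)
        else:
            sessions.append([t])  # first request of a new session
    return sessions


# Hypothetical request times of one user, in seconds since the first request.
requests = [0, 600, 1500, 2100, 4500]
toh1 = sessionize_toh1(requests)  # [[0, 600, 1500], [2100], [4500]]
toh2 = sessionize_toh2(requests)  # [[0, 600, 1500, 2100], [4500]]
```

On this sample, TOH1 closes the first session once 30 minutes have passed since its first request, whereas TOH2 keeps extending a session as long as consecutive requests are at most 30 minutes apart.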
III. DISCOVERY OF USER SESSION CLUSTERS

A. Feature Subset Selection of User Sessions
Each user session can be thought of as a single transaction of many URL references. We map the user sessions as vectors of URL references in an n-dimensional space. Let U be the set of n unique URLs appearing in the preprocessed log, U = {u_1, u_2, …, u_n}, and let S be the set of m user sessions discovered by preprocessing the web log data, S = {s_1, s_2, …, s_m}. Each user session s_i ∈ S can be represented as a bit vector s = {w_u1, w_u2, …, w_un}, where w_ui = 1 if u_i ∈ s, and w_ui = 0 otherwise.

Instead of binary weights, feature weights can also be used to represent a user session. These feature weights may be based on the frequency of occurrence of a URL reference within the user session, the time a user spends on a particular page or the number of bytes downloaded by the user from a page. However, the URLs appearing in the access logs could number in the thousands. Distance-based clustering methods often perform very poorly when dealing with very high dimensional data. Therefore filtering the logs by removing references to low support URLs (i.e. those that are not supported by a specified number of user sessions) can provide an effective dimensionality reduction method while improving clustering.

B. Assigning Weights to User Sessions
If the data mining task at hand is clustering, the session files can be filtered to remove very small sessions in order to eliminate the noise from the data [5]. But direct removal of these small sized sessions may result in the loss of a significant amount of information, especially when the number of small sessions is large. We propose a “Fuzzy Set Theoretic” approach to deal with this problem. Instead of directly removing all the small sessions below a specified threshold, we assign weights to all the sessions using a “Fuzzy Membership Function” based on the number of URLs accessed by the sessions.

Figure 7. Fuzzy membership function for session weight assignment

Figure 7 depicts a linear fuzzy membership function for session weight assignment. Here LB represents a lower bound on the number of URLs accessed in a session and UB represents an upper bound on the number of URLs accessed in a session. Let s_i denote the number of URLs accessed in the i-th session; then the fuzzy membership function takes the following values:

$$ W(s_i) = \begin{cases} 0, & \text{if } s_i \le LB \\ 1, & \text{if } s_i \ge UB \\ \dfrac{s_i - LB}{UB - LB}, & \text{otherwise} \end{cases} \tag{1} $$

C. Clustering the User Sessions
Once user sessions are represented in the form of vectors, a clustering algorithm can be run against them. The goal of this process is to discover session clusters that represent similar URL access patterns. For example, two session vectors are similar if the Euclidean distance between them is short enough. Clustering aims to divide a data set into groups or clusters where inter-cluster similarities are minimized while the intra-cluster similarities are maximized. Details of various clustering techniques can be found in survey articles [18][19][20]. The ultimate goal of clustering is to assign data points to a finite system of k clusters. The union of these clusters is equal to the full dataset, with the possible exception of outliers.

The k-means clustering algorithm is one of the most commonly used methods for partitioning the data. This algorithm partitions a set of m objects into k clusters. The algorithm proceeds by computing the distances between a data point and each cluster center in order to assign the data item to one of the clusters so that intra-cluster similarity is high but inter-cluster similarity is low. Euclidean distance can be used as a measure to calculate the distance between various data points and cluster centers.

$$ d(x_i, v_j) = \sqrt{\sum_{k=1}^{n} \left( x_k^{i} - v_k^{j} \right)^{2}} \tag{2} $$

where,
x_i is the i-th data point
v_j is the j-th cluster center
d(x_i, v_j) is the distance between x_i and v_j
n is the number of dimensions of each data point
x_k^i is the value of the k-th dimension of x_i
v_k^j is the value of the k-th dimension of v_j

The k-means clustering first initializes the cluster centers randomly. Then each data point x_i is assigned to the cluster v_j which has the minimum distance to this data point. Once all the data points have been assigned to clusters, cluster centers are updated by taking the weighted average of all data points in that cluster. This recalculation of cluster centers results in a better cluster center set. The process is continued until there is no change in cluster centers. Although the k-means clustering algorithm is efficient in handling crisp data which have clear cut boundaries, real world data clusters have ill defined boundaries and often overlap. This happens because natural data often suffer from Ambiguity, Uncertainty and Vagueness [21].

Fuzzy c-means clustering incorporates the fuzzy set theoretic concept of partial membership and may result in the formation
of overlapping clusters. The algorithm calculates the cluster centers and assigns a membership value to each data item corresponding to every cluster within a range of 0 to 1. The algorithm utilizes a fuzziness index parameter q, where q ∈ [1, ∞) [22], which determines the degree of fuzziness in the clusters. As the value of q approaches 1, the algorithm works like a crisp partitioning algorithm. An increase in the value of q results in more overlapping of the clusters.

Let X = {x_i | i = 1, …, m} be a set of n-dimensional data point vectors, where m is the number of data points and each x_i = {x_1^i, x_2^i, …, x_n^i} ∀ i = 1, …, m. Let V = {v_j | j = 1, …, c} represent a set of n-dimensional vectors corresponding to the centers of the c clusters, where each v_j = {v_1^j, v_2^j, …, v_n^j} ∀ j = 1, …, c. Let u_ij represent the grade of membership of data point x_i in cluster j, with u_ij ∈ [0, 1] ∀ i = 1, …, m and ∀ j = 1, …, c. The m × c matrix U = [u_ij] is a fuzzy c-partition matrix, which describes the allocation of the data points to the various clusters and satisfies the following conditions:

$$ \sum_{j=1}^{c} u_{ij} = 1, \ \forall i = 1, \dots, m; \qquad 0 < \sum_{i=1}^{m} u_{ij} < m, \ \forall j = 1, \dots, c \tag{3} $$

The performance index J(U,V,X) of fuzzy c-means clustering can be specified as the weighted sum of distances between the data points and the corresponding centers of the clusters. In general it takes on the form:

$$ J(U, V, X) = \sum_{j=1}^{c} \sum_{i=1}^{m} u_{ij}^{q} \, d_{ij}^{2}(x_i, v_j) \tag{4} $$

where,
q ∈ [1, ∞) is the fuzziness index of the clustering
d_ij^2(x_i, v_j) is the squared distance between x_i and v_j, given by

$$ d_{ij}^{2}(x_i, v_j) = \sum_{k=1}^{n} w(x_i) \left( x_k^{i} - v_k^{j} \right)^{2} $$

w(x_i) is the weight of the data point x_i

Minimization of the performance index J(U,V,X) is usually achieved by updating the grades of membership of the data points and the centers of the clusters in an alternating fashion until convergence. This performance index is based on the sum of the squares criterion. During each of the iterations, the cluster centers are updated as follows:

$$ v_j = \frac{\sum_{i=1}^{m} u_{ij}^{q} \, x_i}{\sum_{i=1}^{m} u_{ij}^{q}} \tag{5} $$

Membership values are calculated by the following formula:

$$ u_{ij} = \frac{\left( \dfrac{1}{d_{ij}^{2}(x_i, v_j)} \right)^{1/(q-1)}}{\sum_{k=1}^{c} \left( \dfrac{1}{d_{ik}^{2}(x_i, v_k)} \right)^{1/(q-1)}} \tag{6} $$

In order to decide the optimum number of clusters for the data set X we use a validity function S, which is the ratio of compactness to separation [22], as given below:

$$ S = \frac{\sum_{j=1}^{c} \sum_{i=1}^{m} u_{ij}^{2} \, \lVert x_i - v_j \rVert^{2}}{m \cdot \min_{l \ne k} \lVert v_l - v_k \rVert^{2}} \tag{7} $$

for each c = c_min, …, c_max.

Let Ω_c denote the optimal candidate at each c; then the solution to the following minimization problem yields the most valid fuzzy clustering of the data set.

$$ \min_{c_{\min} \le c \le c_{\max}} \left( \min_{\Omega_c} S \right) \tag{8} $$

Clusters formed by the application of clustering algorithms represent groups of user sessions that are similar based on co-occurrence patterns of URL references. Clustering of user sessions results in a set C = {c_1, c_2, …, c_k} of clusters, where each c_i is a subset of S, i.e., a set of user sessions. Each cluster represents a group of users with similar navigational patterns.

IV. EXPERIMENTAL RESULTS
In order to discover the clusters that exist in the user access sessions of a web site, we carried out a number of experiments. The Web access logs were taken from the P.A. College of Engineering, Mangalore web site, at URL http://www.pace.edu.in. The site hosts a variety of information, including departments, faculty members, research areas, and course information. The Web access logs covered a period of one month, from February 1, 2011 to March 1, 2011. There were 74,924 logged requests in total.

After performing the cleaning step the output file contains 30720 entries. The number of site URLs with an access count greater than or equal to 5 is 159. The total number of unique users identified is 24. Table V depicts the results of the cleaning and user identification steps.

TABLE V. RESULTS OF CLEANING AND USER IDENTIFICATION

Items                        Count
Initial No of Log Entries    74924
Log Entries after Cleaning   30720
No. of site URLs             159
No of Users Identified       24
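The alternating updates of equations (5) and (6) and the validity function of equation (7) in Section III can be sketched in plain Python as follows. This is a hedged illustration, not the code used for the experiments: the function names, the small constant eps that guards against division by zero, the fixed iteration count and the toy data are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((ak - bk) ** 2 for ak, bk in zip(a, b))


def update_memberships(X, V, weights, q=2.0, eps=1e-12) -> List[List[float]]:
    """Equation (6): u_ij proportional to (1 / d_ij^2)^(1/(q-1)); each row sums to 1.
    d_ij^2 is the weighted squared distance w(x_i) * ||x_i - v_j||^2."""
    U = []
    for x, w in zip(X, weights):
        inv = [(1.0 / (w * sq_dist(x, v) + eps)) ** (1.0 / (q - 1.0)) for v in V]
        total = sum(inv)
        U.append([value / total for value in inv])
    return U


def update_centers(X, U, q=2.0) -> List[List[float]]:
    """Equation (5): v_j = sum_i u_ij^q x_i / sum_i u_ij^q."""
    n_dims, n_clusters = len(X[0]), len(U[0])
    centers = []
    for j in range(n_clusters):
        coeffs = [row[j] ** q for row in U]
        denom = sum(coeffs)
        centers.append([sum(c * x[k] for c, x in zip(coeffs, X)) / denom
                        for k in range(n_dims)])
    return centers


def fuzzy_c_means(X, V, weights, q=2.0, iterations=50) -> Tuple[list, list]:
    """Alternate the two updates for a fixed number of iterations (an assumption;
    a convergence test on the centers could be used instead)."""
    U = update_memberships(X, V, weights, q)
    for _ in range(iterations):
        V = update_centers(X, U, q)
        U = update_memberships(X, V, weights, q)
    return U, V


def validity_index(X, U, V) -> float:
    """Equation (7): compactness divided by m times the minimum center separation."""
    m = len(X)
    compactness = sum(U[i][j] ** 2 * sq_dist(X[i], V[j])
                      for i in range(m) for j in range(len(V)))
    separation = min(sq_dist(V[l], V[k])
                     for l in range(len(V)) for k in range(len(V)) if l != k)
    return compactness / (m * separation)


# Toy data: two well separated groups of two points each, all weights set to 1.
X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
U, V = fuzzy_c_means(X, [[0.0, 0.0], [5.0, 5.0]], weights=[1.0] * 4)
S = validity_index(X, U, V)
```

With equation (6) as stated, a per-point weight scales all distances of that point equally and therefore cancels in the membership ratio; in this sketch it still enters the value of the performance index J through the weighted distance.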
Figure 8. Percentage of URLs versus URL Access Frequency

TABLE VI. RESULTS OF CLEANING AND USER IDENTIFICATION

Items                                                Count
No. of User Sessions                                 968
Minimum no. of URLs accessed in a session            1
Maximum no. of URLs accessed in a session            545
Average no. of URLs accessed in a session            26.12
Minimum no. of unique URLs accessed in a session     1
Maximum unique URLs accessed in a session            158
Average unique URLs accessed in a session            6.5

The total number of unique URLs of the Web site present in the log file entries is 6850. Figure 8 shows the percentage of the URLs against how many times they are accessed in the log file. It is clear from the graph that 78% of the URLs were accessed only once, 16% of them were accessed twice and only 6% of them were accessed three or more times. The maximum access count for a URL is 2234. On average each URL is accessed 4.47 times.

Figure 9. Sessionization results for TOH1 and TOH2

As far as clustering of the user sessions is concerned, those URLs which are accessed only once do not play any significant role in forming the clusters, since they appear in only one of the user sessions. Therefore we eliminate all such URL requests from our further analysis. This type of URL filtering is important in removing noise from the data. Since a user session is represented by an n-dimensional vector, where n is the number of site URLs accessed in the log files, a reduction in the number of URLs also reduces the dimension of the session vectors. The count of the URLs which are accessed only once is 5372. After eliminating them the total number of unique URLs for subsequent analysis is 1478. In order to identify the user sessions we applied two different kinds of time oriented heuristics, TOH1 and TOH2. Details of these results and the comparisons of these approaches can be found in our previous work [17]. The result of the application of TOH1 is given in Table VI. The graph in Figure 9 depicts the results of the application of the time oriented heuristics TOH1 and TOH2.

Figure 10 shows the number of URLs and their corresponding session support count. Our result shows that 396 URLs have a session support count of one. We eliminate these URLs since they cannot play any significant role in cluster formation. This type of session support filtering provides a form of dimensionality reduction in subsequent clustering tasks where URLs appearing in the session file are used as features. Table 4 shows the results of user session identification after the elimination of these low support URLs.

Figure 10. No. of URLs Versus No. of Sessions They are Associated with

Figure 11. No. of Sessions Versus No. of URLs
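The session support filtering described above can be sketched as follows. This is a minimal illustration assuming that each session is available as a list of URL strings; the function name, the min_support parameter and the sample URLs are hypothetical and are not taken from the log data.

```python
from collections import Counter
from typing import Dict, List, Tuple


def filter_low_support_urls(
    sessions: List[List[str]], min_support: int = 2
) -> Tuple[List[List[str]], Dict[str, int]]:
    """Drop URLs that appear in fewer than min_support sessions.

    The session support count of a URL is the number of sessions that
    contain it at least once, so repeated hits within a session count once.
    """
    support = Counter(url for session in sessions for url in set(session))
    kept = {url for url, count in support.items() if count >= min_support}
    filtered = [[url for url in session if url in kept] for session in sessions]
    return filtered, dict(support)


# Hypothetical sessions; "/faculty" and "/admissions" have a support count of one.
sessions = [
    ["/", "/dept", "/faculty"],
    ["/", "/dept"],
    ["/", "/admissions"],
]
filtered, support = filter_low_support_urls(sessions)
# filtered == [["/", "/dept"], ["/", "/dept"], ["/"]]
```

Every URL removed in this way also removes one dimension from the session vectors used for clustering.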
Figure 11 depicts the session counts against various URL counts. Our results show that there is quite a large number of user sessions containing only a few URLs. For example, there are 67 sessions containing only one URL, 134 containing two URLs and 56 sessions containing three URLs. User sessions with a smaller number of URLs are less significant for the purpose of clustering.

We are interested in only those sessions that access more than a certain number of URLs, say MinURLs. For example, it is not very useful to cluster user sessions which just access the URL of the home page and leave. Therefore we impose certain constraints desirable for better clustering performance and outcome by using a Fuzzy set theoretic approach to assign the weights to various user sessions based on the number of URLs they contain. Instead of directly removing all the small sessions below a specified threshold, we assign weights to all the sessions using a “Fuzzy Membership Function” based on the number of URLs accessed by the sessions.

Based on the sessionization result shown in the graph of Figure 11, we choose the lower bound on the number of URLs accessed in a session (LB) as 1 and the upper bound on the number of URLs accessed in a session (UB) as 6. The weights assigned to the various sessions using equation (1) are specified in Table VII.

TABLE VII. SESSION WEIGHTS BASED ON THE URL COUNT

Session URL Count    Session Weight
1                    0
2                    0.2
3                    0.4
4                    0.6
5                    0.8
6 or more            1

Once user sessions are assigned weights based on the URL count, the Fuzzy c-Means clustering algorithm is applied to discover session clusters that represent similar URL access patterns. Application of the Fuzzy c-Means clustering algorithm resulted in the formation of overlapping clusters. The performance index J(U,V,X) of fuzzy c-means clustering is calculated using equation (4). It is the weighted sum of distances between the data points and the corresponding centers of the clusters. Minimization of the performance index J(U,V,X) is achieved by updating the grades of membership of the data points and the centers of the clusters in an alternating fashion using equations (6) and (5) respectively, until convergence.

Fuzzy c-Means clustering is first applied by choosing the number of clusters as 4. During each of the iterations we increased the number of clusters by 1 until the number of clusters reached 60. We repeated the above process for weighted as well as non-weighted sessions. The graph in Figure 12 shows the performance index (J) versus the number of clusters for weighted as well as non-weighted sessions. From the graph it is clear that the “Fuzzy Set Theoretic” weighted session approach results in better minimization of the performance index than the non-weighted session approach.

Figure 12. No. of Clusters Versus Performance Index

In order to decide the optimum number of clusters we calculated the validity index (S), which is the ratio of compactness to separation, using equation (7).

Figure 13. Validity Index Versus No. of Clusters for Weighted Sessions

Figure 14. Validity Index Vs. No. of Clusters for Non-Weighted Sessions

Figures 13 and 14 provide the graphs of the validity index (S) versus the number of clusters for weighted and non-weighted sessions respectively. Our results show that for the weighted sessions the validity index is minimized when the value chosen for the number of clusters is 8. On the other hand, for the case of non-weighted sessions, the validity index is minimized when the number of clusters is 21. Thus the optimal number of clusters for weighted sessions is 8 and for non-weighted sessions it is 21.
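As a worked check of the session weights in Table VII, the linear fuzzy membership function of equation (1) can be evaluated with LB = 1 and UB = 6. The sketch below is only illustrative: it assumes that the weight equals 1 once the URL count reaches UB, which is what Table VII indicates, and the function name session_weight is hypothetical.

```python
def session_weight(url_count: int, lb: int = 1, ub: int = 6) -> float:
    """Linear fuzzy membership of equation (1) for a session's URL count."""
    if url_count <= lb:
        return 0.0
    if url_count >= ub:
        return 1.0
    return (url_count - lb) / (ub - lb)


# Weights for sessions with 1 to 7 URLs, using LB = 1 and UB = 6 as in the text.
weights = {count: session_weight(count) for count in range(1, 8)}
```

For URL counts of 1 to 6 this gives 0, 0.2, 0.4, 0.6, 0.8 and 1, matching Table VII, and any session with more than 6 URLs keeps the full weight of 1.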
V. CONCLUSION AND FUTURE WORK

In this paper, we discussed our methodology to preprocess the web log data, including data cleaning, user identification and session identification. We also discussed the details of how to apply the Fuzzy c-Means clustering algorithm in order to cluster the user sessions.

In order to improve the clustering results, we proposed a “Fuzzy Set Theoretic” approach for dealing with the sessions that contain very few URLs. Instead of directly removing all the small sessions below a specified threshold, we assign weights to all the sessions using a “Fuzzy Membership Function” based on the number of URLs accessed by the sessions. We described our methodology to perform feature subset selection of session vectors and session weight assignment. Finally we compared our soft computing based approach of session weight assignment with the traditional hard computing based approach of small session elimination. Our results show that the “Fuzzy Set Theoretic” approach of session weight assignment results in better minimization of the clustering performance index than clustering without session weight assignment.

We believe that the above results can be further improved if we use a fuzzy set theoretic approach for the inclusion of a URL in a user session instead of using a crisp time threshold β. In our current strategy a URL is not included in the current session if it comes even one second later than the specified time threshold. We can also apply a similar Fuzzy set theoretic approach to assign weights to the URLs based on how many times they are accessed.

REFERENCES

[1] R. Cooley, B. Mobasher, and J. Srivastava, “Web mining: Information and pattern discovery on the world wide web,” in Ninth IEEE International Conference on Tools with Artificial Intelligence, Proceedings, 1997, pp. 558–567.
[2] Y. Fu, K. Sandhu, and M. Shih, “A generalization-based approach to clustering of web usage sessions,” Lecture Notes in Computer Science, pp. 21–38, 2000.
[3] B. Mobasher, H. Dai, T. Luo, and M. Nakagawa, “Effective personalization based on association rule discovery from web usage data,” in Proceedings of the 3rd ACM Workshop on Web Information and Data Management (WIDM01), Atlanta, Georgia, November 2001.
[4] M. Spiliopoulou and L. C. Faulstich, “WUM: A web utilization miner,” in Proceedings of EDBT Workshop WebDB98, Valencia, Spain, LNCS 1590, Springer Verlag, 1999.
[5] B. Mobasher, R. Cooley, and J. Srivastava, “Automatic personalization based on web usage mining,” Commun. ACM, vol. 43, pp. 142–151, August 2000.
[6] U. Fayyad and R. Uthurusamy, “Data mining and knowledge discovery in databases,” Communications of the ACM, vol. 39, pp. 24–27, 1996.
[7] W. H. Inmon, “The data warehouse and data mining,” Communications of the ACM, vol. 39, pp. 49–50, 1996.
[8] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Academic Press, Morgan Kaufmann Publishers, 2001.
[9] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, “Advances in knowledge discovery and data mining,” CA: AAAI/MIT Press, 1996.
[10] M.-S. Chen, J. Han, and P. S. Yu, “Data mining: an overview from a database perspective,” IEEE Transactions on Knowledge and Data Engineering, vol. 8, 1996.
[11] P. Kolari and A. Joshi, “Web mining: research and practice,” Computing in Science and Engineering, vol. 6, no. 4, pp. 49–53, 2004.
[12] W. Tong and H. Pi-lian, “Web log mining by an improved aprioriall algorithm,” in Proceedings of World Academy of Science, Engineering, and Technology, 2005, pp. 97–100.
[13] A. Joshi and R. Krishnapuram, “Robust fuzzy clustering methods to support web mining,” 1998.
[14] D. Tanasa and B. Trousse, “Advanced data preprocessing for intersites web usage mining,” IEEE Intelligent Systems, vol. 19, no. 2, pp. 59–65, 2004.
[15] F. Klawonn and A. Keller, “Fuzzy clustering based on modified distance measures,” in Advances in Intelligent Data Analysis, ser. Lecture Notes in Computer Science, D. Hand, J. Kok, and M. Berthold, Eds. Springer Berlin / Heidelberg, 1999, vol. 1642, pp. 291–301.
[16] D. Tanasa and B. Trousse, “Data preprocessing for wum,” IEEE Intelligent Systems, vol. 23, no. 3, pp. 22–25, 2004.
[17] Z. Ansari, M. F. Azeem, A. V. Babu, and W. Ahmed, “Preprocessing users web page navigational data to discover usage patterns,” in The Seventh International Conference on Computing and Information Technology, Bangkok, Thailand, May 2011, proceeding vol. 1 pp. 18-189.
[18] P. Berkhin, “Survey of clustering data mining techniques,” Springer, 2002.
[19] P. Berkhin, “A survey of clustering data mining techniques,” in Grouping Multidimensional Data. Springer Berlin Heidelberg, 2006, pp. 25–71.
[20] R. Xu and D. Wunsch II, “Survey of clustering algorithms,” IEEE Transactions on Neural Networks, vol. 16, no. 3, pp. 645–678, May 2005.
[21] M. Chau, R. Cheng, B. Kao, and J. Ng, “Uncertain data mining: An example in clustering location data,” in Advances in Knowledge Discovery and Data Mining, ser. Lecture Notes in Computer Science, W. Ng, M. Kitsuregawa, J. Li, and K. Chang, Eds. Springer Berlin / Heidelberg, 2006, vol. 3918, pp. 199–204.
[22] X. L. Xie and G. Beni, “A validity measure for fuzzy clustering,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 13, pp. 841–847, 1991.
[23] R. Cooley, B. Mobasher, J. Srivastava et al., “Data preparation for mining world wide web browsing patterns,” Knowledge and Information Systems, vol. 1, no. 1, pp. 5–32, 1999.
[24] B. Berendt, B. Mobasher, M. Nakagawa, and M. Spiliopoulou, “The impact of site structure and user environment on session reconstruction in web usage analysis,” in WEBKDD 2002 - Mining Web Data for Discovering Usage Patterns and Profiles, ser. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2003, vol. 2703, pp. 159–179.
[25] L. D. Catledge and J. E. Pitkow, “Characterizing browsing strategies in the world-wide web,” Computer Networks and ISDN Systems, vol. 27, no. 6, pp. 1065–1073, 1995, Proceedings of the Third International World-Wide Web Conference.
[26] B. Berendt and M. Spiliopoulou, “Analysis of navigation behaviour in web sites integrating multiple information systems,” The VLDB Journal, vol. 9, pp. 56-75, 2000.

AUTHORS PROFILE

Zahid Ansari is a Ph.D. candidate in the Department of CSE, Jawaharlal Nehru Technical University, India. He received his ME from Birla Institute of Technology, Pilani, India. He has worked at Tata Consultancy Services (TCS) where he was involved in the development of cutting edge tools in the field of model driven software development. His areas of research include data mining, soft computing and model driven software development. He is currently with the P.A. College of Engineering, Mangalore as a Faculty. He is also a member of ACM.

Mohammad Fazle Azeem is working as Professor and Director of the department of Electronics and Communication Engineering, P.A. College of Engineering, Mangalore. He received his B.E. in electrical engineering from M.M.M. Engineering College, Gorakhpur, India,
M.S. from Aligarh Muslim University, Aligarh, India and Ph.D. from Indian Institute of Technology (IIT) Delhi, India. His interests include robotics, soft computing, evolutive computation, clustering techniques, application of neuro-fuzzy approaches for the modeling, and control of dynamic system such as biological and chemical processes.

A.Vinaya Babu is working as Director of Admissions and Professor of CSE at J.N.T. University Hyderabad, India. He received his M.Tech. and PhD in Computer Science Engineering from JNT University, Hyderabad. He is a life member of CSI, ISTE and member of FIE, IEEE, and IETE. He has published more than 35 research papers in International/National journals and Conferences. His current research interests are algorithms, information retrieval and data mining, distributed and parallel computing, Network security, image processing etc.

Waseem Ahmed is a Professor in CSE at P.A. College of Engineering, Mangalore. He obtained his BE from RVCE, Bangalore, MS from the University of Houston, USA and PhD from the Curtin University of Technology, Western Australia. His current research interests include multicore/multiprocessor development for HPC and embedded systems, and data mining. He has been exposed to academic/work environments in the USA, UAE, Malaysia, Australia and India where he has worked for more than a decade. He is a member of the IEEE.
Inception of Hybrid Wavelet Transform using Two Orthogonal Transforms and Its Use for Image Compression

Dr. H.B.Kekre
Senior Professor, Computer Engineering Department,
SVKM’s NMIMS (Deemed-to-be University),
Vile Parle(W), Mumbai, India.
hbkekre@yahoo.com

Dr.Tanuja K. Sarode
Assistant Professor, Computer Engineering Department,
Thadomal Shahani Engineering College,
Bandra(W), Mumbai, India.
tanuja_0123@yahoo.com

Sudeep D. Thepade
Associate Professor, Computer Engineering Department,
SVKM’s NMIMS (Deemed-to-be University),
Vile Parle(W), Mumbai, India.
sudeepthepade@gmail.com

Abstract-The paper presents a novel hybrid wavelet transform generation technique using two orthogonal transforms. The orthogonal transforms are used for analysis of the global properties of the data in the frequency domain. For studying the local properties of the signal, the concept of the wavelet transform is introduced, where the mother wavelet function gives the global properties of the signal and the wavelet basis functions, which are compressed versions of the mother wavelet, are used to study the local properties of the signal. In the wavelets of some orthogonal transforms the global characteristics of the data are extracted better, while some orthogonal transforms might give the local characteristics in a better way. The idea of the hybrid wavelet transform comes into the picture in view of combining the traits of two different orthogonal transform wavelets to exploit the strengths of both the transform wavelets.

The paper proves the worth of hybrid wavelet transforms for image compression, which can further be extended to other image processing applications like steganography, biometric identification, content based image retrieval etc. Here the hybrid wavelet transforms are generated using four orthogonal transforms, namely the Discrete Cosine transform (DCT), Discrete Hartley transform (DHT), Discrete Walsh transform (DWT) and Discrete Kekre transform (DKT). The comparison of the hybrid wavelet transforms is also done with the original orthogonal transforms and their wavelet transforms. The experimentation results have shown that the transform wavelets have given better quality of image compression than the respective original orthogonal transforms, but for hybrid transform wavelets the performance is best. Here the hybrid of DCT and DKT gives the best results among the combinations of the four mentioned image transforms used for generating hybrid wavelet transforms.

Keywords-Orthogonal transform; Wavelet transform; Hybrid Wavelet transform; Compression.

I. INTRODUCTION
Wavelets are mathematical tools that can be used to extract information from many different kinds of data, including images [4,5,6,7]. Sets of wavelets are generally needed to analyze data fully. A set of "complementary" wavelets will reconstruct data without gaps or overlap so that the deconstruction process is mathematically reversible and incurs minimal loss. The wavelets are results of the thought process of many people, starting with Haar's work in the early 20th century [19,20]. Generally, wavelets are purposefully crafted to have specific properties that make them useful for image processing. Wavelets can be combined, using a "shift, multiply and sum" technique called convolution, with portions of an unknown signal (data) to extract information from the unknown signal. Wavelet transforms are now being adopted for a vast number of applications, often replacing the conventional Fourier transform [23,24,25,26]. They have advantages over traditional Fourier methods in analyzing physical situations where the signal contains discontinuities and sharp spikes [27,28,29]. In Fourier analysis the local properties of the signal are not detected easily. The STFT (Short Time Fourier Transform) [29] was introduced to overcome this difficulty. However it gives local properties at the cost of global properties. Wavelets overcome this shortcoming of Fourier analysis [28,29] as well as of the STFT. Many areas of physics have seen this paradigm shift, including molecular dynamics, astrophysics, optics, quantum mechanics etc. This change has also occurred in image processing, blood-pressure, heart-rate and ECG analyses, DNA analysis, protein analysis, climatology, general signal processing, speech, face recognition, computer graphics and multifractal analysis. Wavelet transforms are also starting to be used for communication applications. One use of wavelet approximation is in data compression. Like other transforms, wavelet transforms can be used to transform data and then encode the transformed data, resulting in effective compression [24]. Wavelet compression can be either lossless or lossy. The wavelet compression methods are adequate for representing high-frequency components in two-dimensional images.

Earlier, wavelets of only the Haar transform have been studied. In recent work [4,7,11,13] the wavelets of a few orthogonal transforms, namely Walsh [16,17,18], DCT [14,15], Kekre [21,22] and Hartley [1,2,3], are proposed. The wavelet transforms in many applications are proven to be better than the respective orthogonal transforms [8,9,10,12]. The paper presents an innovative hybrid wavelet transform generation method, which generates the hybrid wavelet transform of any two orthogonal transforms. This concept of hybrid wavelet transform can acquire the positive traits from both the orthogonal transforms used to generate it. The hybrid wavelet generation concept opens up new avenues of selection of orthogonal transforms for hybrid and their use in particular image processing application
to gain an edge over the individual orthogonal transforms or their respective wavelet transforms. The paper presents the use of hybrid wavelet transforms generated using Discrete Walsh Transform (DWT), Discrete Kekre Transform (DKT), Discrete Hartley Transform (DHT) and Discrete Cosine Transform (DCT) for image compression. The experimental results show that the hybrid wavelet transforms are better than the respective orthogonal transforms as well as their wavelet transforms.

II. GENERATION OF HYBRID WAVELET TRANSFORM

$$
A = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1p} \\
a_{21} & a_{22} & \cdots & a_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
a_{p1} & a_{p2} & \cdots & a_{pp}
\end{bmatrix}
\qquad
B = \begin{bmatrix}
b_{11} & b_{12} & \cdots & b_{1q} \\
b_{21} & b_{22} & \cdots & b_{2q} \\
\vdots & \vdots & \ddots & \vdots \\
b_{q1} & b_{q2} & \cdots & b_{qq}
\end{bmatrix}
\tag{1}
$$

[Equation (2), the NxN hybrid wavelet transform matrix T_AB built from A and B, appears here in the original.]

The hybrid wavelet transform matrix of size NxN (say 'TAB') can be generated from two orthogonal transform matrices (say A and B, of sizes pxp and qxq respectively, where N = p*q = pq) as given by equations 1 and 2. Here the first 'q' rows of the hybrid wavelet transform matrix are calculated as the product of each element of the first row of the orthogonal transform A with each of the columns of the orthogonal transform B. For the next 'q' rows of the hybrid wavelet transform matrix, the second row of the orthogonal transform matrix A is shift rotated after being appended with zeros, as shown in equation 2. The other rows of the hybrid wavelet transform matrix are generated similarly (as a set of q rows each time for each of the 'p-1' rows of orthogonal transform matrix A, from the second row up to the last row).

III. PROPERTIES OF HYBRID WAVELET TRANSFORM
The crossbreed of two orthogonal transforms results in a hybrid wavelet transform, which itself satisfies the following properties.

A. Orthogonal
The transform matrix K is said to be orthogonal if the following condition is satisfied.

$$
[K] \ast [K]^t = [D] \tag{3}
$$

where D is a diagonal matrix. The hybrid wavelet transform of size NxN generated from any two orthogonal transforms satisfies this property and hence it is orthogonal.

B. Non Involutional
An involutionary function is a function that is its own inverse. So an involutional transform is a transform which is the inverse transform of itself. The hybrid wavelet transform is a non-involutional transform.

C. Transform on Vector
The hybrid wavelet transform (say 'K') of a one-dimensional vector q is given by

$$
Q = [K] \ast q \tag{4}
$$

and the inverse is given by

$$
q = [K]^t \ast \frac{Q_{ij}}{\mu_{T_i} \ast \mu_{T_j}} \tag{5}
$$

where Qij is the value at the ith row, jth column of matrix Q, and the term μ in the normalization factor can be computed as given below through equations 6, 7 and 8.

$$
\mu_T = T_{AB}\, T_{AB}^{\,t} \tag{6}
$$

such that

$$
\begin{aligned}
\mu_{T_1} &= \mu_{A_1}\,\mu_{B_1},\\
\mu_{T_2} &= \mu_{A_1}\,\mu_{B_2},\\
\mu_{T_3} &= \mu_{A_1}\,\mu_{B_3},\\
&\;\;\vdots\\
\mu_{T_q} &= \mu_{A_1}\,\mu_{B_q},\\
\mu_{T_{q+1}} &= \mu_{T_{q+2}} = \cdots = \mu_{T_{2q}} = \mu_{A_2},\\
\mu_{T_{2q+1}} &= \mu_{T_{2q+2}} = \cdots = \mu_{T_{3q}} = \mu_{A_3},\\
&\;\;\vdots\\
\mu_{T_{(p-1)q+1}} &= \mu_{T_{(p-1)q+2}} = \cdots = \mu_{T_{pq}} = \mu_{A_p}
\end{aligned}
\tag{7}
$$

where, with reference to equation 1, μA and μB can be given as in equation 8.

$$
A A^t = \mu_A = \begin{bmatrix}
\mu_{A_1} & 0 & \cdots & 0 \\
0 & \mu_{A_2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \mu_{A_p}
\end{bmatrix}
\tag{8}
$$
$$
B B^t = \mu_B = \begin{bmatrix}
\mu_{B_1} & 0 & \cdots & 0 \\
0 & \mu_{B_2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \mu_{B_q}
\end{bmatrix}
$$

D. Transform on Two-Dimensional Image
The hybrid wavelet transform of a two-dimensional image I is given by

$$
Q = [K] \ast I \ast [K]^t \tag{9}
$$

and the inverse is given by

$$
q = [K]^t \ast \left( \frac{I_{ij}}{\mu_{T_i} \ast \mu_{T_j}} \right) \ast [K] \tag{10}
$$

where Iij is the pixel intensity value at the ith row, jth column of image I, and the term μ in the normalization factor is calculated as given above through equations 6, 7 and 8.

IV. RESULTS AND DISCUSSION
The test bed used in experimentation for proving the worth of the hybrid wavelet transform consists of 11 color images of size 256x256x3 and is shown in figure 1. On each image all three, namely the orthogonal transform, the wavelet transform and the hybrid wavelet transform, are applied. In the transform domain the high frequency data is removed and the images are transformed inversely back to the spatial domain. To judge the performance of the orthogonal transform, wavelet transform and hybrid wavelet transform in compression, the original images are compared with these modified images (having the data loss as compression) using mean squared error (MSE). In all, six data compression percentages are considered: 95%, 90%, 85%, 80%, 75% and 70%. The average of such MSEs over all images for the respective transform and the considered percentage of data compression is taken for performance analysis.
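The procedure described above (transform, discard high-frequency coefficients, invert, measure MSE) can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. The row layout in `hybrid_wavelet_matrix` is one reading of the prose in Section II. The inverse treats the numerator of equation 10 as the transform-domain coefficients Q. "High frequency data" is assumed to mean everything outside the top-left block of Q. All helper names, the 4x4 DCT and Hadamard-ordered Walsh bases, and the synthetic 16x16 image are hypothetical choices made for the example.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix; each row is one basis vector."""
    k = np.arange(n).reshape(-1, 1)
    x = np.arange(n).reshape(1, -1)
    m = np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    m[0, :] *= np.sqrt(1.0 / n)
    m[1:, :] *= np.sqrt(2.0 / n)
    return m

def walsh_hadamard_matrix(n):
    """Unnormalized +1/-1 Walsh matrix in Hadamard order (n a power of two)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h

def hybrid_wavelet_matrix(a, b):
    """Assumed layout: first q rows combine row 1 of A with the columns of B;
    each later row of A is zero-padded and shifted to give q more rows."""
    p, q = a.shape[0], b.shape[0]
    blocks = [np.kron(b, a[0:1, :])]
    for r in range(1, p):
        blocks.append(np.kron(np.eye(q), a[r:r + 1, :]))
    return np.vstack(blocks)

def forward_2d(t, image):
    """Equation 9: Q = T * I * T^t."""
    return t @ image @ t.T

def inverse_2d(t, coeffs, mu):
    """Inverse with the per-coefficient normalization mu_Ti * mu_Tj."""
    return t.T @ (coeffs / np.outer(mu, mu)) @ t

def keep_low_block(coeffs, compression):
    """Assumption: zero everything outside a top-left square block so that
    roughly (1 - compression) of the coefficients survive."""
    n = coeffs.shape[0]
    keep = int(round(n * np.sqrt(1.0 - compression)))
    out = np.zeros_like(coeffs)
    out[:keep, :keep] = coeffs[:keep, :keep]
    return out

def mse(original, modified):
    """Mean squared error between two images of the same shape."""
    return float(np.mean((original - modified) ** 2))

a = dct_matrix(4)
b = walsh_hadamard_matrix(4)
t = hybrid_wavelet_matrix(a, b)          # 16 x 16, since N = p * q
mu = np.diag(t @ t.T)                    # diagonal of equation 6

xs, ys = np.meshgrid(np.arange(16.0), np.arange(16.0))
image = 8.0 * xs + 4.0 * ys              # smooth synthetic test image

coeffs = forward_2d(t, image)
for compression in (0.95, 0.90, 0.85, 0.80, 0.75, 0.70):
    restored = inverse_2d(t, keep_low_block(coeffs, compression), mu)
    print(f"{compression:.0%} compression: MSE = {mse(image, restored):.4f}")
```

Because the Walsh rows are left unnormalized here, the first q entries of `mu` are products of the form μA1·μBi and the remaining entries repeat μA2 to μAp, which is the pattern equation 7 describes. The division by `np.outer(mu, mu)` is what makes the round trip exact when no coefficients are discarded.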
Figure 1: The test bed of eleven original color images belonging to different categories and namely (from left to right and top to bottom) Aishwarya, Balls, Bird,
Boat, Flower, Dagdusheth-Ganesh, TajMahal, Strawberry, Scenery, Tiger and Viharlake-Powai.
Figure 2: Performance comparison of Image compression using Discrete Cosine transform (DCT), cosine wavelet transform (DCT wavelets) and the hybrid
wavelet transforms of DCT taken with Hartley (DCT_DHT) and Kekre transforms (DCT_DKT) with respect to 95% to 70% of data compression.
Figure 2 shows the average of mean squared error (MSE) differences of the original and respective compressed image pairs plotted against percentages of data compression from 95% to 70% for image compression done using Discrete Cosine transform (DCT), cosine wavelet transform (DCT wavelets) and the hybrid wavelet transforms of DCT taken with Hartley (DCT_DHT) and Kekre transforms (DCT_DKT). Here the performance of the hybrid wavelet transforms (DCT_DKT and DCT_DHT) is the best, as indicated by their minimum MSE values compared with the respective DCT and DCT wavelet transform.
Figure 3: Performance comparison of Image compression using Discrete Walsh transform (DWT), Walsh wavelet transform (Walsh wavelets) and the hybrid
wavelet transforms of Walsh transform taken with Hartley (DWT_DHT) and Cosine transforms (DWT_DCT) with respect to 95% to 70% of data compression
The average of mean squared error (MSE) differences of the original and respective compressed image pairs for image compression done using Discrete Walsh transform (DWT), Walsh wavelet transform (Walsh wavelets) and the hybrid wavelet transforms of Walsh transform taken with Hartley (DWT_DHT) and Cosine transforms (DWT_DCT) with respect to 95% to 70% of data compression is plotted in figure 3. Here the performance of the hybrid wavelet transforms (DWT_DHT and DWT_DCT) is better than that of the Walsh transform and almost similar to that of the Walsh wavelet transform. The DWT_DCT hybrid wavelet transform performs marginally better at 95% and 80% data compression.
Figure 4: Performance comparison of Image compression using Discrete Hartley transform (DHT), Hartley wavelet transform (Hartley wavelets) and the hybrid
wavelet transforms of Hartley transform taken with Walsh (DHT_DWT) and Cosine transforms (DHT_DCT) with respect to 95% to 70% of data compression.
Figure 5: Performance comparison of Image compression using Discrete Kekre transform (DKT), Kekre wavelet transform (Kekre wavelets) and the hybrid
wavelet transforms of Kekre transform taken with Cosine transforms (DKT_DCT) with respect to 95% to 70% of data compression
Figure 4 gives the average of mean squared error (MSE) differences of the original and respective compressed image pairs for image compression done using Discrete Hartley transform (DHT), Hartley wavelet transform (Hartley wavelets) and the hybrid wavelet transforms of Hartley transform taken with Walsh (DHT_DWT) and Cosine transforms (DHT_DCT) with respect to 95% to 70% of data compression. Here, at all percentages of data compression except 95%, the performance of the hybrid wavelet transforms (DHT_DWT and DHT_DCT) is better than that of the Hartley transform and almost similar to that of the Hartley wavelet transform, with DWT_DCT proving to be marginally better.
In case of image compression using the hybrid wavelet transform (DKT_DCT) generated using discrete Kekre transform (DKT) and discrete Cosine transform (DCT), the performance is almost similar to that of the Kekre wavelet transform but better than that of the Kekre transform, as shown in figure 5.
Figure 6: Overall performance analysis of Image compression using the orthogonal transforms, their respective wavelet transforms and newly introduced hybrid
wavelet transforms for Cosine, Kekre, Walsh and Hartley transforms with respect to 95% to 70% of data compression
Figure 6 gives the overall performance comparison of image compression using all the proposed hybrid wavelet transforms with the respective orthogonal transform and wavelet transform based compression methods for various percentages of data compression from 70% to as high as 95%. Overall the best performance is given by DCT_DKT (hybrid wavelet transform of Cosine transform with Kekre transform), followed by DCT_DWT and DCT_DHT (hybrid wavelet transforms of Cosine transform taken respectively with Walsh transform and Hartley transform). Compared with all the respective orthogonal transforms, the hybrid wavelet transforms have shown better quality of image compression.
Figure 7: The compression of flower image using the hybrid wavelet transform (DCT_DHT Wavelet) generated using Discrete Cosine transform and Discrete
Hartley transform with respect to 95% to 70% of data compression
Figure 8: The compression of flower image using the hybrid wavelet transform (DCT_DKT Wavelet) generated using Discrete Cosine transform and Discrete
Kekre transform with respect to 95% to 70% of data compression
Figure 9: The compression of flower image using the hybrid wavelet transform (DCT_DWT Wavelet) generated using Discrete Cosine transform and Discrete
Walsh transform with respect to 95% to 70% of data compression.
Figures 7, 8 and 9 show the compression of the flower image for various hybrid wavelet transforms with respect to 95% to 70% of data compression. The subjective quality of compression in all cases is quite acceptable, as negligible distortion is observed between the original and compressed images even at 95% data compression. Even the objective criterion (i.e. mean squared error) values of the differences between the original and compressed images are minimal.

V. CONCLUSION
The innovative concept of hybrid wavelet transform generation using any two orthogonal transforms is proposed in the paper. Here the hybrid wavelet transforms are generated using Discrete Walsh Transform (DWT), Discrete Kekre Transform (DKT), Discrete Hartley Transform (DHT) and Discrete Cosine Transform (DCT) for image compression. The experimental results show that the hybrid wavelet transforms are better than the respective orthogonal transforms as well as their wavelet transforms. Various orthogonal transforms can be considered for crossbreeding to generate the hybrid wavelet transform, based on the expected behavior of the hybrid wavelet transform for a particular application. With the worth of hybrid wavelet transforms shown for image compression, future work could extend the concept to other image processing applications like steganography, biometric identification, content based image retrieval etc.

VI. REFERENCES
[1] R. V. L. Hartley, "A more symmetrical Fourier analysis applied to transmission problems," Proceedings of IRE 30, pp. 144–150, 1942.
[2] R. N. Bracewell, "Discrete Hartley transform," Journal of Opt. Soc. America, Volume 73, Number 12, pp. 1832–183 , 1983.
[3] R. N. Bracewell, "The fast Hartley transform," Proc. of IEEE, Volume 72, Number 8, pp. 1010–1018, 1984.
[4] Dr. H.B.Kekre, Sudeep D. Thepade, Adib Parkar, "A Comparison of Haar Wavelets and Kekre's Wavelets for Storing Colour Information in a Greyscale Image", International Journal of Computer Applications (IJCA), Volume 1, Number 11, December 2010, pp 32-38.
[5] Dr. H.B.Kekre, Sudeep D. Thepade, Adib Parkar, "Storage of Colour Information in a Greyscale Image using Haar Wavelets and Various Colour Spaces", International Journal of Computer Applications (IJCA), Volume 6, Number 7, pp. 18-24, September 2010.
[6] Dr. H.B.Kekre, Sudeep D. Thepade, Juhi Jain, Naman Agrawal, "IRIS Recognition using Texture Features Extracted from Walshlet Pyramid", ACM-International Conference and Workshop on Emerging Trends in Technology (ICWET 2011), Thakur College of Engg. And Tech., Mumbai, 26-27 Feb 2011. Also will be uploaded on online ACM Portal.
[7] Dr. H.B.Kekre, Sudeep D. Thepade, Akshay Maloo, "Face Recognition using Texture Features Extracted from Walshlet Pyramid", ACEEE International Journal on Recent Trends in Engineering and Technology (IJRTET), Volume 5, Issue 1, www.searchdl.org/journal/IJRTET2010
[8] Dr. H.B.Kekre, Sudeep D. Thepade, Juhi Jain, Naman Agrawal, "Performance Comparison of IRIS Recognition Techniques using Wavelet Pyramids of Walsh, Haar and Kekre Wavelet Transforms", International Journal of Computer Applications (IJCA), Number 2, Article 4, March 2011, http://www.ijcaonline.org/proceedings/icwet/number2/2070-aca386
[9] Dr. H.B.Kekre, Sudeep D. Thepade, Akshay Maloo, "Face Recognition using Texture Features Extracted from Haarlet Pyramid", International Journal of Computer Applications (IJCA), Volume 12, Number 5, December 2010, pp 41-45. Available at www.ijcaonline.org/archives/volume12/number5/1672-2256
[10] Dr. H.B.Kekre, Sudeep D. Thepade, Juhi Jain, Naman Agrawal, "IRIS Recognition using Texture Features Extracted from Haarlet Pyramid", International Journal of Computer Applications (IJCA), Volume 11, Number 12, December 2010, pp 1-5. Available at www.ijcaonline.org/archives/volume11/number12/1638-2202.
[11] Dr. H.B.Kekre, Sudeep D. Thepade, Akshay Maloo, "Performance Comparison of Image Retrieval Techniques using Wavelet Pyramids of Walsh, Haar and Kekre Transforms", International Journal of Computer Applications (IJCA), Volume 4, Number 10, August 2010 Edition, pp 1-8, http://www.ijcaonline.org/archives/volume4/number10/866-1216
[12] Dr. H.B.Kekre, Sudeep D. Thepade, Akshay Maloo, "Query by image content using color texture features extracted from Haar wavelet pyramid", International Journal of Computer Applications (IJCA) for the special edition on "Computer Aided Soft Computing Techniques for Imaging and Biomedical Applications", Number 2,
Article 2, August 2010. http://www.ijcaonline.org/specialissues/casct/number2/1006-41
[13] Dr. H.B.Kekre, Sudeep D. Thepade, "Image Retrieval using Color-Texture Features Extracted from Walshlet Pyramid", ICGST International Journal on Graphics, Vision and Image Processing (GVIP), Volume 10, Issue I, Feb. 2010, pp. 9-18. Available online www.icgst.com/gvip/Volume10/Issue1/P1150938876.html
[14] N. Ahmed, T. Natarajan and K. R. Rao, "Discrete Cosine Transform", IEEE Transaction Computers, C-23, pp. 90-93, January 1974.
[15] W. Chen, C. H. Smith and S. C. Fralick, "A Fast Computational Algorithm For The Discrete Cosine Transform", IEEE Transaction Communications, Com-25, pp. 1004-1008, Sept. 1977.
[16] George Lazaridis, Maria Petrou, "Image Compression By Means of Walsh Transform", IEEE Transaction on Image Processing, Volume 15, Number 8, pp. 2343-2357, 2006.
[17] J. L. Walsh, "A Closed Set of Orthogonal Functions", American Journal of Mathematics, Volume 45, pp. 5-24, 1923.
[18] Zhibin Pan, Kotani K., Ohmi T., "Enhanced fast encoding method for vector quantization by finding an optimally-ordered Walsh transform kernel", ICIP 2005, IEEE International Conference, Volume 1, pp I - 573-6, Sept. 2005.
[19] Charles K. Chui, "An Introduction to Wavelets", Academic Press, 1992, San Diego, ISBN 0585470901.
[20] Ingrid Daubechies, "Ten Lectures on Wavelets", SIAM, 1992.
[21] Dr. H.B.Kekre, Sudeep D. Thepade, "Image Retrieval using Non-Involutional Orthogonal Kekre's Transform", International Journal of Multidisciplinary Research and Advances in Engineering (IJMRAE), Ascent Publication House, 2009, Volume 1, No. I, pp 189-203, 2009. Abstract available online at www.ascent-journals.com
[22] Dr. H.B.Kekre, Sudeep D. Thepade, Archana Athawale, Anant S., Prathamesh V., Suraj S., "Kekre Transform over Row Mean, Column Mean and Both using Image Tiling for Image Retrieval", International Journal of Computer and Electrical Engineering (IJCEE), Volume 2, Number 6, October 2010, pp 964-971. Available at www.ijcee.org/papers/260-E272.pdf
[23] K. P. Soman and K.I. Ramachandran, "Insight into WAVELETS From Theory to Practice", Prentice-Hall India, pp 3-7, 2005.
[24] Raghuveer M. Rao and Ajit S. Bopardikar, "Wavelet Transforms – Introduction to Theory and Applications", Addison Wesley Longman, pp 1-20, 1998.
[25] C.S. Burrus, R.A. Gopinath, and H. Guo, "Introduction to Wavelets and Wavelet Transform", Prentice-hall International, Inc., New Jersey, 1998.
[26] Amara Graps, "An Introduction to Wavelets", IEEE Computational Science and Engineering, vol. 2, num. 2, Summer 1995, USA.
[27] Julius O. Smith III and Xavier Serra, "An Analysis/Synthesis Program for Non-Harmonic Sounds Based on a Sinusoidal Representation", Proceedings of the International Computer Music Conference (ICMC-87, Tokyo), Computer Music Association, 1987.
[28] S. Mallat, "A Theory of Multiresolution Signal Decomposition: The Wavelet Representation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 11, pp. 674-693, 1989.
[29] Strang G., "Wavelet Transforms Versus Fourier Transforms," Bull. Amer. Math. Soc. 28, 288-305, 1993.

AUTHORS PROFILE
Dr. H. B. Kekre has received B.E. (Hons.) in Telecomm. Engineering from Jabalpur University in 1958, M.Tech (Industrial Electronics) from IIT Bombay in 1960, M.S.Engg. (Electrical Engg.) from University of Ottawa in 1965 and Ph.D. (System Identification) from IIT Bombay in 1970. He has worked as Faculty of Electrical Engg. and then HOD Computer Science and Engg. at IIT Bombay. For 13 years he was working as a professor and head in the Department of Computer Engg. at Thadomal Shahani Engineering College, Mumbai. Now he is Senior Professor at MPSTME, SVKM's NMIMS. He has guided 17 Ph.Ds, more than 100 M.E./M.Tech and several B.E./B.Tech projects. His areas of interest are Digital Signal processing, Image Processing and Computer Networking. He has more than 270 papers in National / International Conferences and Journals to his credit. He was Senior Member of IEEE. Presently he is Fellow of IETE and Life Member of ISTE. Recently 11 students working under his guidance have received best paper awards. Two of his students have been awarded Ph.D. from NMIMS University. Currently he is guiding ten Ph.D. students.

Dr. Tanuja K. Sarode has received Bsc. (Mathematics) from Mumbai University in 1996, Bsc.Tech. (Computer Technology) from Mumbai University in 1999, M.E. (Computer Engineering) degree from Mumbai University in 2004, and Ph.D. from Mukesh Patel School of Technology, Management and Engineering, SVKM's NMIMS University, Vile-Parle (W), Mumbai, INDIA. She has more than 12 years of experience in teaching. She is currently working as Assistant Professor in Dept. of Computer Engineering at Thadomal Shahani Engineering College, Mumbai. She is life member of IETE, member of International Association of Engineers (IAENG) and International Association of Computer Science and Information Technology (IACSIT), Singapore. Her areas of interest are Image Processing, Signal Processing and Computer Graphics. She has 90 papers in National/International Conferences/Journals to her credit.

Sudeep D. Thepade has received B.E. (Computer) degree from North Maharashtra University with Distinction in 2003 and M.E. in Computer Engineering from University of Mumbai in 2008 with Distinction, and has currently submitted his thesis for Ph.D. at SVKM's NMIMS, Mumbai. He has more than 08 years of experience in teaching and industry. He was Lecturer in Dept. of Information Technology at Thadomal Shahani Engineering College, Bandra (W), Mumbai for nearly 04 years. He is currently working as Associate Professor in Computer Engineering at Mukesh Patel School of Technology Management and Engineering, SVKM's NMIMS, Vile Parle (W), Mumbai, INDIA. He is member of International Association of Engineers (IAENG) and International Association of Computer Science and Information Technology (IACSIT), Singapore. He is member of International Advisory Committee for many International Conferences. He is reviewer for various International Journals. His areas of interest are Image Processing Applications and Biometric Identification. He has about 110 papers in National/International Conferences/Journals to his credit, with a Best Paper Award at International Conference SSPCCIN-2008, Second Best Paper Award at ThinkQuest-2009 National Level paper presentation competition for faculty, Best Paper Award at Springer international conference ICCCT-2010 and second best research project award at 'Manshodhan-2010'.
A Model for the Controlled Development of Software Complexity Impacts

Ghazal Keshavarz
Computer department, Science and Research Branch, Islamic Azad University, Tehran, Iran
ghazalkeshavarz@gmail.com

Nasser Modiri
Computer department, Islamic Azad University, Zanjan, Iran
nassermodiri@yahoo.com

Mirmohsen Pedram
Computer Engineering department, Tarbiat Mollem University, Karaj/Tehran, Iran
pedram@tmu.ac.ir

Abstract- Several studies have shown that software complexity affects different features of software. The most important ones are productivity, quality and maintenance of software. Thus, measuring and controlling complexity will have an important influence on improving these features. So far, most of the proposed approaches to controlling and measuring complexity apply to the code and design phases and are mainly based on code and cognitive methods; but measuring and controlling complexity in these phases (design and code) is too late. In this paper, with emphasis on the requirement engineering process, we analyze the factors affecting complexity in the early stages of the software life cycle and present a model. This model enables software engineers to identify the causes of complexity that are the origin of many costs in later phases (especially in the maintenance phase) and to prevent errors from propagating. We also specify the relationship between software complexity and important features of software, namely quality, productivity and maintainability, and present a model for it too.

Keywords- Requirement Engineering, Software Complexity, Software Quality Factors

I. INTRODUCTION
In recent decades, software complexity has created a new era in computer science.

Software complexity could be defined as the main driver of cost, reliability and performance of software systems. Nonetheless, there is no common agreement on the definition of software complexity, but most definitions are based on Zuse's view of software complexity [1]: "software complexity is the degree of difficulty in analyzing, maintaining, testing, designing and modifying software". In other words, software complexity is an issue that is present in the entire software development process and in every stage of the product life cycle.

In the development phases of software, complexity strongly influences the effort required to analyze and describe requirements, and to design, code, test and debug the system. In the maintenance phases, complexity determines the difficulty of error correction and the effort required to change different software modules.

Requirements form the foundation of the software development process. A loose foundation brings down the whole structure, and weak requirements documentation (the result of the Requirement Engineering process) results in project failure.

Recent surveys suggest that 44% to 80% of all defects are inserted in the requirements phase [2]. Thus, if errors are not identified in the requirements phase, they lead to mistakes, to development of the wrong product and to the loss of valuable resources.

However, it will not be possible to develop better quality requirements without a well-defined Requirement Engineering (RE) process. Since RE is the starting point of software engineering and later stages of software development rely heavily on the quality of requirements, there is good reason to pay close attention to it.

According to the CHAOS report published by the Standish Group [3], good RE practices contribute more than 42% towards the overall success of a project, much more than other factors (see Table 1).

TABLE I. PROJECT SUCCESS FACTOR

Project Success Factors           % of Responses
User involvement                  15.9%
Executive Management Support      13.9%
Clear Statement of Requirement    13.0%
Proper Planning                   9.6%
Realistic Expectation             8.2%
Smaller Project Milestones        7.7%
Competent Staff                   7.2%
Ownership                         5.3%
Clear Vision and Objectives       2.9%
Hard-working, focused Staff       2.4%
Other                             13.9%

[The original table has a third column, "Strongly Related to RE"; its marks are not preserved.]

In the following sections of this article, both complexity and the Requirement Engineering field are introduced, and we provide our proposed model to control complexity by identifying
influential factors on complexity in early stages of the life cycle.

II. REQUIREMENT ENGINEERING
Before discussing RE activities, it is worth having a definition of Requirement Engineering. Zave [4] provides one of the clearest definitions: "Requirement engineering is the branch of software engineering concerned with the real world goals for, functions of, and constraints on software systems. It is also concerned with the relationship of these factors to precise specifications of software behavior, and to their evolution over time and across software families."

Brooks F.P [5] said that, "The hardest single part of building a software system is deciding what to build. No other part of the conceptual work is as difficult as establishing the detailed technical requirements, including all the interfaces to people, to machines, and to other software systems".

Requirement Engineering consists of five sub processes: Requirement Elicitation, Requirement Analysis, Requirement Specification, Requirement Validation, and Requirement Management.

Capturing user requirements and analyzing them form the first two phases of the Requirement Engineering process. After elicitation, these requirements are categorized and prioritized in the requirements analysis phase. Grouping requirements into logical entities helps in planning, reporting, and tracking them. Prioritization specifies the relative importance and risk of each requirement, which helps in managing the project effectively. At the requirements specification stage, the information collected during requirements elicitation is structured into a set of functional and non-functional requirements for the system, and the SRS is provided as the output of the Requirement Engineering process. Note that a good SRS must have certain characteristics that are expressed in the IEEE standard [6], for example unambiguity, completeness, verifiability, adaptability, variability, traceability, etc. However, the customers cannot always specify accurate and complete requirements at the start of the process. Removing obsolete requirements, adding new ones, and changing them are part of a never ending process during the software development life cycle. Traceability aids in assessing the impact of changes and is a fundamental activity of the Requirements Management process. In turn, Requirements Management ensures that changes are maintained throughout the software development life cycle (SDLC).

III. CURRENT WORK IN THE SOFTWARE COMPLEXITY AREA
Software complexity is a broad topic in software engineering and has attracted many researchers since 1976 [7]. Complexity control and management have important roles in risk management, cost control, reliability prediction, and quality improvement. Complexity can be classified in two parts: problem complexity (or inherent complexity) and solution complexity (also referred to as added complexity). Solution complexity is added during the development stages following the requirements phase, mostly during the design and coding phases.

Many researchers believe that software complexity is made up of the following kinds of complexity [8]:

• Problem complexity, which measures the complexity of the underlying problem. This type of complexity can be traced back to the requirement phase, when the problem is defined.
• Algorithmic complexity, which reflects the complexity of the algorithm implemented to solve the problem.
• Structural complexity, which reflects the complexity of the algorithm implemented to solve the problem.
• Cognitive complexity, which measures the effort required to understand the software.

So far, most of the work has been on identifying and measuring algorithmic, structural and cognitive complexity. Algorithmic complexity measures the algorithm implemented to solve the problem and is based on mathematical methods. This complexity is measurable as soon as an algorithm of a solution is created, usually during the design phase.

Structural complexity is composed of data flow, control flow and data structure. Some metrics have been proposed to measure this type of complexity, for example the McCabe cyclomatic complexity [9] (which directly measures the number of linearly independent paths within a module and is considered a correct and reliable metric), the Henry and Kafura metric [10] (which measures the information flow to and from the module; a high value of information flow represents a lack of cohesion in the design, which will cause higher complexity) and the Halstead metric [11] (which is based on counting operators and operands and their respective occurrences in the code; among the primary metrics, it is the strongest indicator of code complexity).

There are some metrics based on cognitive methods, such as the KLCID [12] complexity metric (it defines identifiers as the programmer defined variables and is based on identifier density; to calculate it, the number of unique program lines is considered).

Identifying and controlling complexity in the code or design stages of development is too late and lets errors propagate through the whole system.

So, to avoid wasting valuable resources and adding complexity, it is better to focus on the early stages of the software life cycle. The result of identifying complexity factors is then lower cost and higher quality in software development, and especially in the maintenance stages of software. By knowing these factors, the project team can try to prevent them from occurring or establish suitable strategies in the design and implementation phases.

IV. MODEL OF SOFTWARE COMPLEXITY FACTORS
In the proposed model, we have provided software complexity factors according to their importance in the first phase of the SDLC (see Figure 1).

Based on this model, there are two main complexity factors in the requirements phase: human resource and the requirements document (the output of the Requirement Engineering process).
Figure 1. The Model of Software Complexity Factors

Human resources are considered at two levels: stakeholders and project team. Stakeholders are the most important complexity factor, because the results of the requirements extraction process and the items they request and desire form the basis of the system. Stakeholders are people with different backgrounds, organizational and personal goals, and social situations; each has their own way of understanding and expressing knowledge and communicates with other people in different ways. Complexity therefore depends heavily on the stakeholders, who are placed in the first level of the model.

The input data for the next phases of the software life cycle are documents derived from the requirements analysis phase. All of the stated items and requirements cause complexity, so documents are considered the second level of the model. This level of complexity results in the inherent complexity of the system.

Finally, the project team (as a subset of human resources) is considered another complexity factor, because of differences in cognitive, experiential, and subjective skills, and is placed in the third level of the model. The model is discussed in more detail below.

A. Inherent Complexity Factors

The output of the Requirements Engineering process is the Software Requirement Specification (SRS). The SRS contains the principles of software acceptance and addresses the software product, not the product development process. The SRS is composed of several items, such as functional requirements, non-functional requirements, design constraints, interfaces, users, inputs and outputs, etc. All these items are the basis for complexity identification.

Functional Requirements: Functional requirements should define the fundamental actions that must take place in the software. Most researchers claim that the size of the product is one of the main factors in determining its complexity. In other words, more functional requirements result in a larger and more complex system, which requires more effort and resources to build (especially in the maintenance phase).
o Stability Degree: Some systems are located in a dynamic and competitive environment or interact with evolving systems. The functional requirements of these systems are exposed to frequent changes. Systems which undergo frequent modification have higher error rates, because each modification represents an opportunity for new errors to be generated. It may also be the case that when systems are undergoing frequent changes, there is less opportunity and less interest in testing those changes thoroughly. All of these lead to complexity.
o Sub function: It may be appropriate to partition the functional requirements into sub functions or sub processes. This does not imply that the software design will also be partitioned that way. A larger number of sub functions in a functional requirement means a higher degree of complexity in that requirement.

Non-Functional Requirements: These are the qualitative requirements of the system; not fulfilling them leads to customer dissatisfaction. A larger number of non-functional requirements, and stronger pressure to satisfy them, lead to more complexity in the product. One way to rank requirements is to distinguish classes of requirements as essential, desirable, and optional.

Design constraints: This item should specify design constraints that can be imposed by other standards, hardware limitations, etc. Examples of constraints are the implementation language, database integrity policies, the operating environment, and the size of required resources; all of these limit the developers and add complexity to the system.

System Interfaces: There are hardware interfaces, software interfaces, user interfaces, communication interfaces, etc. These specify the logical characteristics of the connections between the software product and hardware components, other software products, users, and different communication protocols. A larger number of interfaces represents more complexity in the system.

Input, Output, Files: Functional requirements process the inputs and generate the outputs; files are data stored in the system. The number of files and of input and output parameters, and the relationships between them, are therefore very important. A large number of these parameters represents a high transaction volume and thus complexity in the system.

Users: These requirements are the number of supported terminals and the number of concurrent users. High
number of any of these items represents high complexity of the system.

B. Added Complexity Factors

Added complexity is introduced during different phases of the software life cycle because of various factors such as inappropriate use of standards, methodologies, methods, and tools, lack of coordination within the project team, lack of sufficient skills and experience, etc. Human resources are part of the project resources, and a lack of them is one of the causes of complexity. Human resources are investigated at the stakeholder and project team levels.

Stakeholder level: Stakeholders are individuals or organizations that are affected by the project and directly or indirectly affect system requirements. Requirement extraction is the process of identifying stakeholder needs, and the most common challenges during the requirements elicitation process are to ensure effective communication between various stakeholders and to elicit implicit knowledge. Effective communication is therefore an important factor in project success and in developing a good SRS. The challenges associated with stakeholders are described below.
o Heterogeneity of the Organization: When doing a project for an organization, there is a strong possibility that the stakeholders are not all in one geographical location. This means that requirements extraction is done from various stakeholders and in many different places. This problem occurs due to the heterogeneity of the organization.
Research has been done into a number of 'capability barriers' which prevent effective communication in geographically dispersed groups [13]. The three identified problems included not sharing a common first language, being separated by sixteen time zones, and differences in typing ability when communicating via a messaging program.
o Number of stakeholders: When conducting a project for a Virtual Organization, requirements extraction from numerous stakeholders wastes many resources (time and cost); furthermore, integrating the extracted requirements is time-consuming and hard.
o Stakeholder's skills: System users are a group of major stakeholders. Individually, they can enhance sustainability at the company they work for by bringing their personal skills and experiences to aid change and innovation. In contrast, an inexperienced user who provides irrelevant, contradictory, and confused requirements and changes requirements frequently may cause complexity and thus impose heavy costs on the software.

Project team level: The skill and experience of the project members have also been identified as a possible factor affecting the complexity of user requirements. The skill of members could be measured in a number of very complex ways, but for the experience of the project members, one factor has been highlighted as being important: has the project member worked on a similar project before, if so, how many similar projects has he/she been part of, and has the project member had experience in the same team before? Sub-contracting may also be considered a complexity factor. If the Requirements Engineering team is not present in the next phases of software development, the new team is not familiar with the initial SRS. Hence, changing or improving the software may ignore some aspects and may lead to errors and complexity.

V. MODEL OF SOFTWARE COMPLEXITY RESULTS

In this section, we provide a model of software complexity results and examine its impact on the main features of the system, namely quality and productivity.

Figure 2. The Model of Software Complexity Results

Cost, quality, and maintenance are among the topics most closely related to the software development process. Complexity is a determining factor that may affect them (see Figure 2).

Complexity affects two important aspects of the software: error-proneness and maintenance. The main idea behind the relationship between complexity and error-proneness is that, when comparing two different solutions, the more complex solution also generates the larger number of errors. This relationship is one of the most analyzed by software metrics researchers, and previous studies and experiments have found it to be statistically significant [14].
High levels of software complexity make software more difficult to understand and increase the probability of hidden errors. These errors incur both the cost of current disruptions and the cost of future activities to fix them.

Since a great deal of software development cost is directed to software integration testing, it is crucial for project performance to possess the instruments for predicting and identifying the type of errors that may occur in a specific module.

Error-proneness can influence quality. This impact is traceable through the "usability" and "reliability" of the software. The concept of usability is connected to what the customer expects from the product. If the customer feels that he can use it in the way that he intends to, he will more likely be satisfied and regard it as a product of high quality. Thus, a large number of errors in software is presumably something that would lower the usability of the program.

The reliability of a system is often measured by trying to determine the mean time elapsed between occurrences of faults (the result of the software errors) in a system. A more reliable product is more stable and has fewer unexpected interruptions than a less reliable product.

A defective product has a large number of errors and must undergo frequent changes to fix them. Frequent changes are not desirable to users and have a negative effect on product quality. Such a product also needs more resources to fix errors and thus has an indirect impact on productivity.

The relationship between complexity and maintainability is clear. According to Corbi's viewpoint [15], more maintenance cost is spent on understanding the system than on modifying and improving it. Therefore, higher levels of system complexity make it difficult to understand, so maintenance becomes time-consuming and costly.

On the other hand, as shown in Figure 3, the cost required to fix an error later in the life cycle increases exponentially: it costs 5-10 times more to repair errors during the coding phase and between 100-200 times more during the maintenance phase than during the requirements phase [16].

Figure 3. Relative Cost of Fixing Errors in Project Lifecycle

The idea behind the relation between error-proneness and maintenance is that the maintainer spends a lot of financial and human resources to identify and correct errors, and this means lower productivity.

In general, a complex system requires frequent maintenance. Software maintenance challenges are understanding the system, considering the side effects of changes, and testing the performed changes. Frequent changes may reduce interest in testing and surely lower product quality. On the other hand, repeated testing wastes resources, and this loss lowers productivity.

Therefore, lower product complexity means higher maintainability. According to many quality models, maintainability is a determining factor of the system.

Finally, we should consider that quality and productivity have a close relationship with each other. If we focus too hard on productivity, we may improve our efficiency and lower our project costs, but this gain is worthless if we are not building quality systems that meet the demands of our customers. Similarly, if we are aiming to create the perfect system, we may lose control of the costs. Moreover, the time it takes to improve and correct the system may cause a late delivery of the product.

VI. CONCLUSION AND FUTURE WORK

Software quality and productivity depend on several factors, such as on-time delivery, staying within budget, and fulfilling users' needs. Software complexity is one of the most important indicators that affect software quality and productivity. To achieve higher quality and better productivity, software complexity should be controlled from the initial phases of the SDLC.

In this article, with emphasis on the Requirements Engineering process, we have analyzed the influential factors in software complexity, particularly in the first phase of software development, and provided a model. We also proposed a model of complexity results. These models could be used as a roadmap to assist the manager in identifying complexity factors and avoiding them. In addition, it is useful to know the impact of complexity on important characteristics of the system.

In future work, we are going to complete the model of complexity factors and provide a requirement-based metric. This metric will be derived from all the factors mentioned in this article. By using both, we can measure and control software complexity well before the actual design and implementation, thus saving cost and time, especially in the maintenance phase.

REFERENCES

[1] Zuse, H., Software Complexity: Measures and Methods, Berlin: Walter de Gruyter & Co, 1991
[2] Eberlein A., Requirements Acquisition and Specification for Telecommunication Services, PhD Thesis, University of Wales, Swansea, UK, 1997
[3] The Chaos Report, The Standish Group International, http://www.standishgroup.com/sample_research/index.php, 1995
[4] Zave P. and Jackson M., Four Dark Corners of Requirements Engineering, ACM Transactions on Software Engineering and Methodology, pp. 1-30, 1997
[5] Brooks, F.P., Essence and Accidents of Software Engineering, IEEE Computer, Vol. , pp. 10-19, April 1987
[6] IEEE Std. 830-1984, IEEE Guide to Requirements Specification, 1984
[7] W. P. Stevens, G. J. Myers, and L. L. Constantine, "Structural Design", IBM Systems Journal, vol. 13, no. 2, Jun. 1976, pp. 113-129
[8] Fenton N., Pfleeger S., Software Metrics: A Rigorous and Practical Approach, London: International Thomson Computer Press, 1996
[9] Thomas J. McCabe, "A Complexity Measure", IEEE Transactions on Software Engineering, pp. 308-320, 1976
[10] Henry, S., Kafura, K., Software structure metrics based on information flow, IEEE Transactions on Software Engineering, pp. 510-518, 1982
[11] Halstead M., "Elements of Software Science", Amsterdam: Elsevier, 1977
[12] Kushwaha, D.S. and Misra, A.K., Improved Cognitive Information Complexity Measure: A metric that establishes program comprehension effort, ACM SIGSOFT Software Engineering, Volume 31 Number 5, September 2006
[13] Toomey, L., Smoliar, S., Adams, L., Trans-Pacific Meetings in a Virtual Space, FX Palo Alto Labs Technical Reports, 1998
[14] Curtis, B., Sheppard, B., Milliman, P., Third time charm: stronger prediction of programmer performance by software complexity metrics, In Proceedings of the 4th International Conference on Software Engineering, pp. 356-360, 1979
[15] Corbi, T. A., "Program Understanding: Challenge for the 1990s", IBM System Journal, pp. 294-306, 1989
[16] Wieringa R.J., Requirements Engineering - Frameworks for Understanding, John Wiley and Sons, 1995
AUTHORS PROFILE
Ghazal Keshavarz received her B.A. degree in Computer Science and Engineering from Shiraz Technical University, Shiraz, Iran, in 2006. She is currently pursuing an M.Sc. in Computer Science and Engineering at Islamic Azad University (Science and Research Branch), Tehran, Iran, under the guidance of Dr. Modiri. She is presently working on a requirement-based complexity metric and a software quality and complexity model.

Dr. Nasser Modiri received his M.Sc. and Ph.D. in Electronics Engineering from the University of Southampton (UK) and the University of Sussex (UK). He is an Assistant Professor in the Department of Computer Engineering at Islamic Azad University (Zanjan, Iran).

Dr. Mirmohsen Pedram is an Assistant Professor in the Department of Computer Engineering at Tarbiat Moallem University (Karaj, Iran).
(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 9, No. 6, June 2011
A Hierarchical Overlay Design for Peer to Peer and SIP Integration

Md. Safiqul Islam #1, Syed Ashiqur Rahman #2, Rezwan Ahmed *3, Mahmudul Hasan #4

# Computer Science and Engineering Department, Daffodil International University, Dhaka, Bangladesh
1 safiqul@daffodilvarsity.edu.bd
2 ashiq797@daffodilvarsity.edu.bd
4 mhraju@daffodilvarsity.edu.bd

* American International University - Bangladesh, Dhaka, Bangladesh
3 a.rezwan@aiub.edu
Abstract: Peer-to-Peer Session Initiation Protocol (P2PSIP) is the upcoming migration from the traditional client-server based SIP system. A traditional centralized, server-based SIP system is vulnerable to several problems such as performance bottlenecks and a single point of failure. Integrating a Peer-to-Peer (P2P) system with the Session Initiation Protocol (SIP) will therefore improve the performance of a conventional SIP system, because a P2P system is highly scalable, robust, and fault tolerant due to its decentralized nature and the self-organization of the network. However, the P2PSIP architecture faces several challenges, including trustworthiness of peers, resource lookup delay, Network Address Translation (NAT) traversal, etc. This paper focuses on understanding the need for integrating P2P and SIP. It also reviews the existing approaches to identify their advantages and shortcomings. Based on the existing approaches, it proposes a layered architecture to address the major challenges introduced by P2PSIP.

I. INTRODUCTION

The Session Initiation Protocol (SIP) is a signaling protocol [1] standardized by the IETF. It is also the default standard protocol for VoIP, and the majority of VoIP development is currently based on SIP. SIP is used to establish, modify, or tear down a multimedia session. Most VoIP systems rely on a fixed set of SIP servers, because of which they suffer from performance bottlenecks, a single point of failure, and Denial of Service (DoS) attacks. On the other hand, a Peer-to-Peer (P2P) system [2] is a technology which does not rely on central control and is very popular for resource sharing. Because a P2P system has no centralized server, it has greater robustness, scalability, and fault tolerance. If SIP can be made to work over P2P systems, it will improve the performance of traditional SIP systems and eliminate the problems of using centralized SIP servers. Integration of P2P technologies and SIP introduces several challenges, such as resource lookup delays, node heterogeneity, NAT traversal, and peer trustworthiness, which have to be addressed before its advantages can be enjoyed.

This paper describes the need to integrate P2P and SIP technologies and examines the costs and benefits of P2PSIP over the traditional fixed set of SIP servers. Further, it discusses how P2P can be integrated with a traditional SIP system. We also examine some existing approaches to achieving P2PSIP. The remainder of the paper is organized as follows. Section II gives an overview of traditional SIP systems and the terminology involved. P2P technology is presented in Section III. P2PSIP and its challenges are introduced in Sections IV and V. Section VI details the current approaches to P2PSIP. In Section VII, a proposal for the integration of P2PSIP is introduced based on the existing approaches. Finally, Section VIII concludes the paper.

II. SESSION INITIATION PROTOCOL (SIP)

SIP is an application-layer control protocol for initiating, terminating, and modifying multimedia sessions (for example video, voice, instant messaging, online games, and multimedia conferences). To establish a session, a traditional PSTN requires SS7 [3] signaling. In IP-based telephony, the signaling protocol is SIP. However, the Session Description Protocol (SDP) [4] and the Real-Time Transport Protocol (RTP) [5] should be used together with SIP to provide a complete IP telephony system. The traditional SIP architecture is a client-server architecture. SIP servers are classified into proxy, registrar, and redirect servers [1]. A proxy server is an intermediate entity that can act as a server to accept a SIP request or as a client to forward a SIP request. A redirect server performs redirection of SIP requests. A registrar server accepts REGISTER requests from clients and maintains location information in order to support mobility. SIP user agents are classified into user agent client (UAC) and user agent server (UAS). A user agent client [1] is a SIP entity that creates a new request. It uses client state machinery to send that request. A user agent server is a SIP entity that receives SIP requests on behalf of users and responds to these requests.

Each user agent is identified by a SIP uniform resource identifier (URI), for instance sip:username@somedomain.com. In order to initiate a session with another user, the caller first needs to know the SIP URI of that user. A caller can either send an INVITE request to a locally configured SIP server
or directly send an INVITE to the IP address and port of the user's address. A user agent registers its location with the registrar server before initiating a session.

III. PEER-TO-PEER (P2P)

Peer-to-peer (P2P) technology, by definition, is a mesh network as opposed to the star network of a client/server model. In a peer-to-peer network, all nodes act simultaneously as client and server. Some of the advantages of P2P systems are:
• Scalability
• Robustness
• No single point of failure

There are two types of P2P systems [6], structured and unstructured. Structured networks impose techniques to tightly control the data placement and topology within the network, and currently only support search by identifier. Unstructured networks, on the other hand, rely on flooding techniques. In many scenarios, the increased search efficiency makes structured networks preferable to the widely deployed unstructured networks. The most widely used structured P2P system is the Distributed Hash Table (DHT). There are several flavors of DHT, each with some advantages over the others. It is very important to choose a proper DHT algorithm to obtain good performance from SIP running on P2P.

A DHT is a decentralized and distributed system where all the peer nodes and resources are identified by unique keys. A DHT table is introduced to provide efficient location and retrieval operations. The most popular DHT algorithms are Chord [7], Bamboo [8], Pastry [10], and Tapestry [11].

Most P2PSIP architectures use the Chord algorithm to maintain the P2P overlay. The logical structure of Chord is a ring in which each node is identified by a numeric identity. For each peer, the peer with the nearest lower identifier is called the predecessor and the peer with the nearest higher identifier is known as the successor. Figure 1 illustrates a Chord ring where node 23 is responsible for objects O6 to O22. Each node maintains a finger table that contains information about half of the nodes clockwise from that node. Each node is responsible for objects whose associated key lies between the node's own ID and its predecessor's ID.

Like Chord, Pastry is another second-generation large P2P routing network [10]. Pastry forms a self-organizing, robust overlay network on the Internet. The major challenge is again to form an efficient algorithm for routing. In a Pastry network, each peer or node has a unique numeric 128-bit identifier which is assigned randomly when the peer joins the network. The peers form the overlay network on top of the hash table, and each peer maintains a list of leaf nodes, a routing table, and a neighborhood list. These three tables are organized based on the existing nodes of the network; this self-organization is very similar to the Chord algorithm, except that Pastry also updates its routing table. The leaf node set contains the L/2 closest nodes, whereas the routing table contains the prefix-based identifiers, where each node shares a prefix with the data key. There is also a unique identifier for each key. The algorithm chooses the node whose identifier is numerically closest to the key to route messages. During routing, the sender node sends the message to a node whose identifier shares a prefix with the key that is at least one digit longer than the sender's. If no such node exists, it forwards the message to another node with the same prefix length whose identifier is numerically closer to the data key. The expected number of routing steps is O(log N). Pastry provides a more flexible mechanism than Chord because the successor of a Pastry identifier is not so strictly defined and nodes can be adjusted in its routing table.

The Bamboo DHT algorithm conceptualizes the namespace as a circle, like the Chord DHT, which means that the peer with the lowest Peer-ID is located next to the peer with the largest possible Peer-ID [9]. Unlike Chord but like Pastry, Bamboo uses prefix routing to converge on the peer responsible for the search key. It uses the Pastry geometry, where the term geometry refers to the pattern of neighbor links in the DHT, independent of the routing and neighbor management algorithms used. The Bamboo algorithms are more incremental than Pastry's. In bandwidth-limited environments, Bamboo tolerates continuous churn in membership and better accommodates large changes in DHT membership.

Fig. 1. Chord Ring (nodes shown: N4 as predecessor, N23, and N35 as successor)

A node searches for a target node by looking up the node in its finger table which is nearest to the target node. Since each node knows about nearby nodes, the target node will be discovered after a repeated number of searches. Choosing the proper DHT algorithm will improve the performance of P2PSIP. Table I compares the features of the different flavors of DHT algorithms, adapted from [12]; from this comparison a suitable algorithm for a P2PSIP implementation could be chosen.
TABLE I
COMPARISON OF DIFFERENT DHT ALGORITHMS, ADAPTED FROM [12]

Algorithm  Lookup Methods                        Parallel Lookups    Complexity     Bandwidth Consumption  Node Join and Departure
Chord      Recursive, Semi-Recursive, Iterative  Not suitable        Simple         Moderate               Quite simple
CAN        Recursive, Semi-Recursive, Iterative  No                  Simple         Moderate               Very simple
Pastry     Recursive, Semi-Recursive, Iterative  Not suitable        Quite complex  High                   Complex join
Bamboo     Recursive, Semi-Recursive, Iterative  Yes (on Iterative)  Quite complex  Moderate               Quite simple
Tapestry   Recursive, Semi-Recursive, Iterative  No                  Not complex    Quite high             Complex join
Kademlia   Iterative                             Yes                 Simple         Moderate               Simple
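The prefix routing that Section III describes for Pastry (and that Bamboo reuses) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: identifiers are short hexadecimal strings rather than 128-bit values, each node's routing state is collapsed into a single list of known nodes, numeric closeness ignores wrap-around of the identifier space, and the names shared_prefix_len, next_hop, and route are invented for this sketch.

```python
from typing import Dict, List, Optional


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading digits that two identifiers have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def distance(a: str, b: str) -> int:
    """Numeric distance between two hex identifiers (no wrap-around, by assumption)."""
    return abs(int(a, 16) - int(b, 16))


def next_hop(current: str, known: List[str], key: str) -> Optional[str]:
    """Pick the next node for key, or None if current is the best node it knows."""
    own = shared_prefix_len(current, key)
    # Rule 1: prefer a node that shares a longer prefix with the key.
    longer = [n for n in known if shared_prefix_len(n, key) > own]
    if longer:
        return max(longer, key=lambda n: (shared_prefix_len(n, key), -distance(n, key)))
    # Rule 2: otherwise take a node with the same prefix length that is numerically closer.
    closer = [n for n in known
              if shared_prefix_len(n, key) >= own
              and distance(n, key) < distance(current, key)]
    if closer:
        return min(closer, key=lambda n: distance(n, key))
    return None


def route(start: str, tables: Dict[str, List[str]], key: str) -> List[str]:
    """Follow next_hop from start until no better node is known; returns the path."""
    path = [start]
    while True:
        hop = next_hop(path[-1], tables.get(path[-1], []), key)
        if hop is None:
            return path
        path.append(hop)
```

For example, with the hypothetical tables {"65a1": ["d13d", "9c00"], "d13d": ["d421", "65a1"], "d421": ["d462", "d13d"], "d462": ["d467", "d421"], "d467": ["d462"]}, routing key d46a from node 65a1 visits 65a1, d13d, d421, d462, d467: the first three hops each match one more digit of the key, and the last hop moves to a numerically closer node with the same prefix length.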
IV. PEER-TO-PEER SESSION INITIATION PROTOCOL (P2PSIP)

P2PSIP is the combination of a P2P network and SIP, where the traditional fixed set of servers is replaced by a distributed mechanism. A DHT is one of the possible distributed mechanisms that can be used. All address-of-record to contact-URI mappings are distributed among the peers in the P2P overlay. Currently, the P2PSIP working group of the Internet Engineering Task Force (IETF) has defined the terminology and concepts in [13] and use cases in [14]. Moreover, this working group is trying to standardize the P2PSIP peer protocol. P2PSIP can be implemented in two ways: one is SIP on top of P2P, and the other implements P2P over SIP. SIP on top of P2P uses a P2P protocol to implement the SIP location service, while the other approach uses SIP messages to transport P2P traffic. Traditionally, a P2P node search uses flooding to locate the node. However, to find a target node using P2PSIP for a multimedia session, flooding should be avoided.

V. PINPOINTING CHALLENGES FOR P2PSIP

In this section, we discuss some common requirements for implementing P2PSIP. In an IETF draft, Bryan et al. define a set of requirements for P2PSIP [15]. We have taken some important points from that draft.

First, P2PSIP peers should be capable of performing operations such as joining, leaving, storing information on behalf of the overlay, or transporting messages. Secondly, the peers must provide the functions offered by a traditional SIP network. For example, P2PSIP should support the establishment, modification, and termination of multimedia sessions. Thirdly, the implementation should not prevent the use of existing protocols like SSL or TLS as used in the P2P or SIP network. NAT and firewall traversal should also be supported for P2PSIP. Finally, the functionality of the fixed set of centralized SIP servers should be distributed over the peers. Some of the other challenges are described in the following sections.

A. Resource Lookup Delay

Locating a peer or a resource in a P2PSIP network takes much more time than in a traditional SIP-based network. Reducing this delay is a significant challenge for P2PSIP.

B. Network Address Translation (NAT) Traversal

Most P2P nodes may be behind a NAT or firewall. There must be some relay with a public IP address in between them in order to establish communication with other peers. This is one of the most important challenges for a P2PSIP network.

C. Node Heterogeneity

In order to maintain the scalability and service availability of P2PSIP, node heterogeneity should be handled appropriately. Node heterogeneity can be a difference in the bandwidth, CPU, storage, and uptime of the peers.

D. Security Issues and Trustworthiness of Peers

Security of a distributed P2P communication system is another of the major challenges. Security issues concern user identification, authentication, and trustworthiness.

VI. EXISTING APPROACHES

Integration of P2P and SIP will improve the performance of the traditional SIP system, as P2P has several advantages. There are several requirements that should be met to integrate P2P and SIP. Those requirements have already been described in Section V.

Singh and Schulzrinne propose a hybrid architecture for the integration of P2P and SIP that introduces two additional advantages besides P2P scalability and reliability: interoperability with existing SIP servers and no maintenance cost [16]. Chord was used as the underlying DHT algorithm. Their architecture is based on the concept of super nodes
and ordinary nodes, where super nodes are powerful nodes having high bandwidth, lots of CPU power, plenty of memory, long uptimes and a public IP address compared to the ordinary nodes. These authors have also implemented a P2PSIP adaptor [17] which allows existing or new user agents to connect to the P2PSIP network. NAT/firewall detection is based on the ICE [18] algorithm; a super node is then used as a relay to help ordinary nodes establish and participate in calls. Node heterogeneity is captured by the distinction between super nodes and ordinary nodes. Offline messaging services are provided by combining the storage of the sender and intermediate DHT nodes [19].

Their architecture fails to reduce call setup delay, which is higher than in traditional SIP networks. Peer trustworthiness and security issues are not described in their report. The paper does not propose any super node selection mechanism. Node heterogeneity is introduced, but they do not describe what will happen if a more powerful node than the current super node joins the existing network. Ordinary nodes always have to pay a high maintenance cost to maintain their Chord finger table and periodically send refresh messages to update their predecessors and successors.

SOSIMPLE [20] is a P2PSIP architecture where nodes are organized using the Chord DHT algorithm. SIP messages with a newly defined header are used to maintain the DHT, register users, locate resources and establish sessions. The SOSIMPLE architecture has two levels of REGISTER operations. One is user registration, which is the traditional use of registration, and the other is node registration, which is for DHT operation. The SOSIMPLE paper mentions several security and user authentication mechanisms, such as user certificates and email verification, that can be implemented on their architecture. However, this paper fails to describe node heterogeneity, bootstrap node selection, node maintenance cost for DHT operation, and resource lookup delay.

In order to alleviate the problems in [16] and [20], such as call setup delay, node heterogeneity, and the high maintenance cost paid by the ordinary nodes, Le and Kuo [21] propose a hierarchical and breathing overlay based network. In a hierarchical overlay, nodes form different sub-overlays based on node heterogeneity. Session setup delay is reduced by introducing two types of lookups based on knowledge of the destination sub-overlay: oriented lookup and un-oriented lookup. In oriented lookup, a node relays its request to its father node to establish the session. In the case of un-oriented lookup, a father node will receive the request from a son node and this (and if necessary other) upper level father node will do a DHT lookup in their sub-overlays. Node heterogeneity is introduced by forming a hierarchical overlay, and the DHT maintenance cost is lower. The paper failed to address peer trustworthiness and other security issues. The following subsections will describe their three techniques briefly.

Dhara et al. [22] propose a layered architecture for P2PSIP that separates the P2P related issues from the underlying voice or transport layer. The advantage of this system is that it allows dynamic changes of overlays based on the requirements and properties of users and devices. Their paper focuses on a layered architecture that allows the choice of P2P overlay based on specific parameters. The architecture is a Public Key Infrastructure (PKI) [23] based trust management system where SIP is the transport protocol and Chord is the DHT algorithm for the P2P structure. They have mentioned that the device overlay will focus on NAT and firewall traversal, but no such implementation or methods were introduced. This paper failed to address DHT maintenance cost on an ordinary node and bootstrap node selection.

Besides the above approach, the Scalable Application-Layer Mobility Protocol (SAMP) [24] deals with session setup latency by introducing two optimization techniques: Hierarchical Registration (HR) and Two-Tier Caching (TTC). In HR, when the mobile node (MN) is in a foreign domain it will register its Care of Address with an anchor SIP server instead of the home SIP server, in order to reduce session setup delay. TTC, on the other hand, introduces a two phase cache lookup where the first phase lookup is based on the MN's cache and the second phase lookup is based on the anchor SIP server's cache. If the target is not found in either cache then a traditional P2P lookup occurs. Hierarchical P2PSIP [25] is introduced to address the connectivity problem of heterogeneous P2P overlays and the overhead of extra SIP messages when SIP is used to maintain the overlays.

Another approach to P2PSIP is described in [26]. This approach deals with the manageability of a P2P system. In this approach, the system architecture is divided into three layers: the signal control layer, the management layer, and the media transportation layer. In this architecture signaling is used to initiate the system, maintain the system topology, and search for resources. The management layer deals with grouping, playing, and media uploading management. Inter-domain registration, resource location, and DHT creation and maintenance are the functions of the SIP signal control layer. The media transport layer deals with storage management and media uploading. In Cooperative SIP (CoSIP) [27], server based and P2PSIP networking work together.

VII. LAYERED ARCHITECTURE FOR P2PSIP

In this section, we propose a new P2PSIP architecture in order to address all the challenges. Our P2PSIP architecture is a two layered architecture based on [25], where the top layer consists of powerful super nodes and the bottom layers consist of the ordinary nodes, similar to HiLO Peer and LoLO Peer [25]. Figure 2 shows our proposed architecture for P2PSIP. The top overlay uses Bamboo as an underlying algorithm. The reason for choosing Bamboo is its parallel lookup capability.
Several challenges remain to our proposed approach. However, our proposed architecture has the following properties:

[Fig. 2. Layered Architecture of P2PSIP. Figure labels: Inter Domain Overlay (using Bamboo as an underlying algorithm); Intra Domain Overlays (Chord, Pastry, CAN, Tapestry); legend: Super Node, Ordinary Node.]

A. Node Heterogeneity

A joining node will contact the regional super node and share its capabilities with the super node. If it has higher capabilities it will become the super node. So, here we select the super node based on node heterogeneity.

B. High Maintenance Cost Problem

As the ordinary node has to pay a high maintenance cost for the DHT, we use the concept of a breathing layer with a minor modification. Ordinary nodes will have two states: sleep and active. After a long idle time an ordinary node will go to sleep mode by giving a sleep indication to its super node.

C. Node Look Up

We can use the two-tier caching scheme [24] to locate the target node, along with the parallel lookup feature. If the target node information is not cached in either the cache of the ordinary node or the cache of the super node, then the super node will search using the parallel lookup feature of Bamboo in order to locate the node. That will take less time to find the target node's responsible super node.

D. NAT Traversal Problem

Super nodes are the most powerful nodes in terms of available bandwidth and CPU processing power, and they have a public IP address. In order to solve the NAT traversal problem, a super node can act as a relay for an ordinary node to establish a session.

E. Connectivity Problem

Super nodes will use Bamboo to maintain the inter domain overlay with other super nodes, with a regional algorithm like Chord, Pastry, etc. running in the intra domain overlay. Thus, each super node will maintain two DHT tables based on two different algorithms. As a result, different domains running different DHT algorithms can communicate.

F. Peer's Trustworthiness and Security Issues

We can implement a login server module on the super node. When a peer wants to join a P2P network, it will send a SIP REGISTER message along with its public key, and then the login server will authenticate the node and provide it with a signed certificate. The node can then present this certificate to build trust.

VIII. CONCLUSION

This paper gave an overview of SIP and P2P and examined the integration of a P2P system with SIP. Moreover, it described some advantages of their integration, and discussed the challenges and implications. Several current approaches to P2PSIP were discussed along with their approaches to mitigating the challenges of P2PSIP. Finally, we proposed a layered architecture to address the major challenges of P2PSIP. Future work will focus on implementing our proposed architecture on a simulator.

REFERENCES

[1] J. Rosenberg, H. Schulzrinne, G. Camarillo, A. Johnston, J. Peterson, R. Sparks, M. Handley, and E. Schooler. SIP: Session Initiation Protocol, RFC 3261, Internet Engineering Task Force, 2002.
[2] D.S. Milojicic, V. Kalogeraki, R. Lukose, K. Nagaraja, J. Pruyne, B. Richard, S. Rollins, and Z. Xu. Peer-to-Peer Computing, HP Laboratories Palo Alto, 2002.
[3] Performance Technologies, Inc. SS7 Tutorial, 2006. Available at: http://www.pt.com/tutorials/ss7/, Last Visited April 2008.
[4] M. Handley, V. Jacobson. SDP: Session Description Protocol, RFC 2327, Internet Engineering Task Force, April 1998. Available at: http://www.ietf.org/rfc/rfc2327.txt
[5] H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson. RTP: A Transport Protocol for Real-Time Applications, RFC 1889, Internet Engineering Task Force, January 1996. Available at: http://www.ietf.org/rfc/rfc1889.txt
[6] D. Bryan, B. Lowekamp. Decentralizing SIP, ACM Portal, Vol. 5, Issue 2, pages 34-41, March 2007.
[7] I. Stoica, R. Morris, D. Liben-Nowell, D.R. Karger, M.F. Kaashoek, F. Dabek, and H. Balakrishnan. Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications, IEEE/ACM Transactions on Networking, page 17, 2003.
[8] S. Rhea, D. Geels, T. Roscoe, and J. Kubiatowicz. Handling Churn in a DHT, In Proc. of USENIX Annual Technical Conference, 2004.
[9] Sean Rhea, Byung-Gon Chun, John Kubiatowicz, and Scott Shenker. Fixing the Embarrassing Slowness of OpenDHT on PlanetLab, Proceedings of USENIX WORLDS 2005, December 2005.
[10] A. Rowstron and P. Druschel. Pastry: Scalable, Distributed Object Location and Routing for Large-Scale Peer-to-Peer Systems, In IFIP/ACM International Conference on Distributed Systems Platforms, 2001.
[11] B.Y. Zhao, L. Huang, J. Stribling, S.C. Rhea, A.D. Joseph, and J.D. Kubiatowicz. Tapestry: A Resilient Global-scale Overlay for Service Deployment, Selected Areas in Communications, IEEE Journal on, 2004.
[12] J. Hautakorpi, G. Camarillo. Evaluation of DHTs from the viewpoint of interpersonal communications, ACM International Conference Proceeding Series; Vol. 284, Proceedings of the 6th international conference on Mobile and ubiquitous multimedia, pages 74-83, 2007.
[13] D. Bryan, P. Matthews, E. Shim, and D. Willis. Concepts and Terminology for Peer to Peer SIP, draft-ietf-p2psip-concepts-01, Nov. 2007. Expires: May 18, 2008.
[14] D. Bryan, E. Shim, and B. Lowekamp. Use Cases for Peer-to-Peer Session Initiation Protocol (P2P SIP), draft-bryan-p2psip-usecases-00.txt, July 2007. Expires: January 3, 2008.
[15] D. Bryan, S. Baset, M. Matuszewski, and H. Sinnreich. P2PSIP Protocol Framework and Requirements, draft-bryan-p2psip-requirements-00.txt, June 2007. Expires: Jan 2008.
[16] K. Singh and H. Schulzrinne. Peer-to-peer Internet Telephony Using SIP,
In NOSSDAV ’05: Proceedings of the international workshop on Network
and operating systems support for digital audio and video, pages 63-68.
ACM Press, June 2005.
[17] K. Singh and H. Schulzrinne. SIPpeer: a session initiation protocol
(SIP)-based peer-to-peer Internet telephony client adaptor, White paper,
Computer Science Department, Columbia University, New York, NY, Jan
2005. http://www.cs.columbia.edu/ kns10/publication/sip-p2pdesign. pdf.
[18] J. Rosenberg. Interactive Connectivity Establishment (ICE): A Method-
ology for Network Address Translator (NAT) Traversal for Offer/Answer
Protocols, draft-ietf-mmusic-ice-08. 2006.
[19] K. Singh and H. Schulzrinne. Peer-to-peer Internet telephony using
SIP, Technical Report CUCS-044-04, Department of Computer Science,
Columbia University, New York, NY, Oct. 2004.
[20] D. A. Bryan, B. B. Lowekamp, and C. Jennings. SOSIMPLE: A Server-
less, Standards-based, P2P SIP Communication System In Proceedings
of the 2005 International Workshop on Advanced Architectures and
Algorithms for Internet Delivery and Applications (AAA-IDEA 2005),
June 2005.
[21] L. Le, G. Kuo. Hierarchical and Breathing Peer-to-Peer SIP Sys-
tem Communications, 2007, ICC ’07, IEEE International Conference,
Pages:1887 - 1892, June 2007.
[22] K.K.Dhara, V. Krishnaswamy, S. Baset. Dynamic peer-to-peer overlays
for voice systems, Pervasive Computing and Communications Workshops,
2006. PerCom Workshops 2006. Fourth Annual IEEE International Con-
ference, March 2006.
[23] Public Key Infrastructure(PKI) tutorial. available at:
http://www.cs.gmu.edu/ hfoxwell/EC511/pki.pdf. Last Visited- April
2008.
[24] S. Pack, K. Park, T. Kwon, Y. Choi. SAMP: scalable application-
layer mobility protocol, IEEE Communications Magazine, Vol. 44, Issue
6,Page(s):86 - 92. JUNE 2006.
[25] J. Shi; Y. Wang; L. Gu; L. Li; W. Lin; Y. Li; Y. Ji; P. Zhang.
A Hierarchical Peer-to-Peer SIP System for Heterogeneous Overlays
Interworking, IEEE Global Telecommunications Conference. Page(s):93
-97, November 2007.
[26] H. Jie, H. Yongfeng, L. Xing. MSPnet: Manageable SIP P2P media
distribution system, Journal of electronics(China). Volume 24, November,
2007
[27] A. Fessi, H. Niedermayer, H. Kinkelin, G. Carle. A Cooperative SIP In-
frastructure for Highly Reliable Telecommunication Services, IPTCOMM,
Proceedings of the 1st international conference on Principles, systems and
applications of IP telecommunications, July 2007.
Evaluation of CPU Consuming, Memory
Utilization and Time Transfering Between Virtual
Machines in Network by using HTTP and FTP
techniques.
Igli TAFA, Elinda KAJO, Elma ZANAJ, Ariana BEJLERI, Aleksandër XHUVANI
Polytechnic University of Tirana, Information Technology Faculty
Computer Engineering Department
Tiranë, Albania
itafaj@gmail.com, e_kajo@yahoo.com, ezanaj@gmail.com, arianabejleri@yahoo.com,
axhuvani@yahoo.com
Abstract: In this paper we evaluate Transfer Time, Memory Utilization and CPU consumption between virtual machines in a network by using FTP and HTTP benchmarks. As a virtualization platform for running the benchmarks we have used the Xen hypervisor in para-virtualization mode. Virtual machine technology offers benefits such as live migration, fault tolerance, security, resource management etc. The experiments performed show that virtual machines above the hypervisor consume more CPU and memory and have bigger transfer times than a non-virtualized environment.

Keywords: Transfer Time, Memory Utilization, CPU Consuming, Virtual Machines, Xen-Hypervisor.

I. INTRODUCTION

Virtual machine technology offers a lot of benefits, as shown in previous research [1],[2],[3],[4],[5],[6],[7],[8], such as live migration, fault tolerance, security, resource management and reduced energy consumption. Some virtual machine technologies are based on a software layer called the Hypervisor [9]. There are three main types of virtualization [10]: full virtualization, OS virtualization and para-virtualization. The para-virtualization approach gives more flexibility than the others. Based on this approach we can use ESX-Server [11] or Xen []. Because Xen is free and open source and implements the ballooning method [12], we have used this Hypervisor.

However, virtualization technology also has some "black holes"; for example, the Hypervisor introduces a slight delay during a transfer from one machine to another [13]. Also, more CPU is consumed on machines which include the Hypervisor than on machines without it. Another problem is physical memory utilization and data overhead during the live migration phase. Some researchers [13], [14] have presented methods of memory overbooking and memory compression in virtual machines, in order to improve memory utilization and migration performance.

In this paper we analyze Transfer Time, CPU consumption and Memory Utilization between virtual machines and physical machines by using FTP [15] and HTTP requests [13]. All results are presented in the respective tables.

This paper is organized as follows. Section II describes the experimental architecture. Section III presents the experimental evaluation. Section IV presents conclusions and outlines areas of future work.

II. EXPERIMENTAL ARCHITECTURE

Figure 1 and Figure 2 present the basic experimental architecture. In Figure 1 there are 2 computers which are connected with a UTP cat 7 cable using the Twisted Pair technique. Communication between the two computers is full duplex. Both computers communicate with each other through a fast-ethernet network interface (100/1000 Mbit/sec). On each computer we have set up the virtual machine environment with Xen 4.1 and CentOS 5.5 as the Dom0 operating system.
Fig. 1. Communication between 2 Physical Machines with Twisted Pair. Above those hosts are Guest Virtual Machines.

In figure 2 we have installed 3 computers connected with a Gigabit switch. The topology of routing is Bus. Communication is full duplex. We used a managed Gigabit Cisco switch, but a simple switch could be used too.

Fig. 2. Three Computers connected with a Gigabit Switch. Above the Host computers are set up Virtual Machines.

The architecture of all the machines is x86 64-bit, with 4 GB of RAM and a quad-core CPU supporting VT and Hyper-threading technology.

III. EXPERIMENTAL EVALUATION

The evaluation is separated into three phases:

- Evaluation of transfer time for FTP and Web servers
- Evaluation of CPU consumption for FTP and Web servers
- Evaluation of memory utilization for FTP and Web servers

Evaluation of transfer time in FTP server

- Evaluation of transfer time between 2 VM on the same Host
- Evaluation of transfer time between 2 VM on different Hosts
- Evaluation of transfer time between 2 physical machines connected by twisted pair cable
- Evaluation of transfer time between 3 Physical Machines connected by Gigabit Switch

Initially we have used 2 physical machines which support 2 virtual machines, using the para-virtualization approach (Xen 4.1). On both machines we have installed CentOS 5.5 as Dom0. On the first machine we have installed Scientific Linux 6.0 in DomU1 and Ubuntu 10.04 Server in DomU2. On the second machine Ubuntu 10.04 Server is installed in DomU1. First we want to test the transfer time from a client to a server between 2 VM on the same physical host by using FTP. We will repeat the test by using 2 virtual machines on different physical hosts, and finally we will evaluate this time between 2 physical machines connected by twisted pair. We want to transfer an ISO image (XP.ISO with SP2 = 557 MB). On the machine with Scientific Linux 6.0 we run an FTP client and on Ubuntu 10.04 an FTP server. To transfer the XP.ISO file from one machine to another we have used the Samba FTP tool (which is part of Scientific Linux or Ubuntu Server). We measure the file transfer time from the start moment at the source machine until the file reaches the destination machine.

The experiment is then repeated with the Scientific Linux 6.0 client machine as DomU1 on host 1 and Ubuntu 10.04 Server as DomU1 on host 2, which is used as the FTP server.

Finally we evaluate the transfer time between 2 physical machines, host 1 and host 2. Host 2 serves as the FTP server and Host 1 as the FTP client. The results are presented in table 1:

TABLE 1. THE EVALUATION OF TIME TRANSFERRED OF XP ISO IMAGE BY USING SAMBA FTP TOOL

Time transferred between 2 VM on the same Host (Host 1) | Time transferred between 2 VM on different Hosts | Time transferred between 2 Physical Hosts
48 sec | 86 sec | 36 sec
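As a small worked example, the transfer times in Table 1 can be converted into average throughput for the 557 MB image. The sketch below only restates the reported measurements; the function name and dictionary layout are illustrative assumptions and are not part of the authors' tooling.

```python
# Sketch only: transfer times are copied from Table 1; the helper name and
# the dictionary layout are illustrative assumptions, not the authors' tools.
FILE_SIZE_MB = 557  # size of the XP ISO image used in the experiment

TABLE_1_SECONDS = {
    "2 VM on the same host": 48,
    "2 VM on different hosts": 86,
    "2 physical hosts": 36,
}


def throughput_mb_per_s(size_mb: float, seconds: float) -> float:
    """Average throughput implied by a measured transfer time."""
    if seconds <= 0:
        raise ValueError("transfer time must be positive")
    return size_mb / seconds


if __name__ == "__main__":
    for case, seconds in TABLE_1_SECONDS.items():
        rate = throughput_mb_per_s(FILE_SIZE_MB, seconds)
        print(f"{case}: {rate:.1f} MB/s")
```

With these figures, the transfer between two VMs on the same host averages roughly 11.6 MB/s, between VMs on different hosts roughly 6.5 MB/s, and between the two physical hosts roughly 15.5 MB/s.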
In Figure 2 we have 3 computers connected with a Gigabit Switch. All computers communicate with each other by Fiber channel. This communication increases the performance of the network (as is known, communication over fiber channel offers more bandwidth and speed than UTP cable). The third computer is a clone of the second computer. Now we want to evaluate the transfer time between 2 VM on different hosts (i.e. between DomU1 on host 1 and DomU1 on host 3). Then we transfer the image from computer 1 to computer 3. The results are presented in table 2.

TABLE 2. THE EVALUATION OF TIME TRANSFERRED OF XP ISO IMAGE BY USING SAMBA FTP

Time transferred between 2 VM on different Hosts (DomU1 in Host 1 and DomU1 in Host 3) | Time transferred between 2 Physical Hosts (Host 1 and Host 3)
83 sec | 32 sec

Let us analyze Tab. 1 and Tab. 2.

In Tab. 1 the transfer time between 2 VM on the same host is small. This is because the transfer of the 557 MB image file from one VM to another is performed over the interfaces of the same computer architecture (in reality we are on the same computer). If we look at the transfer time between 2 VM on different hosts in table 1, the total transfer time is 86 sec. The reasons are:

- The transfer speed of the file between 2 computers over the network is slower than over the internal interface of the computer architecture (ISA interface).
- The communication medium between hosts is UTP cat 7, which is slower than other media (i.e. fiber, which is presented in table 2).

Table 1 also shows that the transfer time between 2 physical hosts is 36 sec (< 86 sec and < 48 sec). In both cases the main reason is the delay introduced by the Hypervisor.

In table 2, the times decrease slightly. The reason is the communication medium. In table 2 we are using a Gigabit Switch and Fiber channel communication, while in table 1 we are using only UTP cable.

Evaluation of CPU consuming in FTP Server

We want to evaluate the CPU consumption of the FTP server, that is, the percentage of CPU dedicated to our experiment. It is presented as average values during the tests based on figure 1 and figure 2. To monitor the CPU consumption we have used the xentop command in /proc and System Monitor in the System Administrator menu. Both of them report CPU consumption for all processes located on the computer; we calculate the average value (DomU1 + DomU2 + ...)/n of these processes, where n is the number of virtual machines (the same applies to physical machines) in our experiment. To calculate the total CPU consumption we have built a script in C which applies the formula:

Total CPU Consuming = Running process rate x nr of active processes + Sleeping process rate x nr of sleeping processes (1)

TABLE 3. AVERAGE OF CPU CONSUMING DURING THE TRANSFER OF 557 MB BASED ON FIGURE 1 AND FORMULA 1

CPU consuming between 2 VM on the same Host (Host 1) | CPU consuming between 2 VM on different Hosts | CPU consuming between 2 Physical Hosts
61.4 % | 62.2 % | 55.1 %

TABLE 4. AVERAGE OF CPU CONSUMING DURING THE TRANSFER OF 557 MB BASED ON FIGURE 2 AND FORMULA 1

CPU consuming between 2 VM on different Hosts (DomU1 in Host 1 and DomU1 in Host 3) | CPU consuming between 2 Physical Hosts (Host 1 and Host 3)
61.45 % | 55.16 %

As table 3 shows, CPU consumption between 2 VM is higher than between 2 physical machines. The reason is that part of the CPU is consumed to maintain the Hypervisor (61.4 % > 55.1 %). CPU consumption is not affected by the communication medium of the network. This is why, in table 4, CPU consumption has approximately the same values as in table 3.
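Formula (1) and the per-machine average (DomU1 + DomU2 + ...)/n can be sketched as follows. The authors' C script is not reproduced in the paper, so this is only an illustration written in Python; the function names and the sample rates and process counts are hypothetical and are not measurements from the experiment.

```python
# Illustrative sketch of Formula (1); names and sample inputs are assumptions.
from typing import Sequence


def total_cpu_consumption(running_rate: float, n_running: int,
                          sleeping_rate: float, n_sleeping: int) -> float:
    """Formula (1): running rate x active processes
    + sleeping rate x sleeping processes."""
    return running_rate * n_running + sleeping_rate * n_sleeping


def average_cpu(per_machine: Sequence[float]) -> float:
    """Average over machines: (DomU1 + DomU2 + ...) / n."""
    if not per_machine:
        raise ValueError("at least one machine is required")
    return sum(per_machine) / len(per_machine)


if __name__ == "__main__":
    # Hypothetical inputs: 10 running processes at 5.0 % each and
    # 50 sleeping processes at 0.1 % each.
    dom_u1 = total_cpu_consumption(5.0, 10, 0.1, 50)
    dom_u2 = total_cpu_consumption(5.5, 10, 0.1, 40)
    print(f"DomU1: {dom_u1:.1f} %  DomU2: {dom_u2:.1f} %")
    print(f"Average: {average_cpu([dom_u1, dom_u2]):.1f} %")
```

Keeping the per-machine total and the cross-machine average as separate functions mirrors the two steps the text describes: first a total per domain, then an average over the n domains.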
Evaluation of memory utilization in FTP Server

We want to evaluate the physical memory utilization during the transfer of the .iso image from one virtual machine to another inside a physical host. Then we want to evaluate the physical memory utilization while the .iso image is transferred between two virtual machines on different physical hosts. We have used the Mem_Access tool [13], here applied to evaluate memory utilization during the file transfer. We have integrated this tool into a script written in C, called MemC. For each 10 MB transferred from the Server Machine to the Client Machine, it calculates the memory utilization by activating this tool. The result is stored in a record. This record is part of a MySQL database which is installed on the Server Machine (mysql Ver 12.21 Distrib 4.0.14, for pc-linux). The average memory utilization is the total sum of the records divided by the number of records. As before, we use the Samba FTP server for transferring data from one machine to another. Finally we repeat the experiment by using a transfer between 2 physical hosts (fig. 1). Table 5 gives the results for memory utilization between 2 virtual machines.

TABLE 5. PHYSICAL MEMORY UTILIZATION DURING THE .ISO FILE TRANSFER BY USING SAMBA FTP BASED ON FIGURE 1

Average Memory Utilization between 2 VM on the same Host (Host 1) | Average Memory Utilization between 2 VM on different Hosts (DomU1 in Host 1 is the FTP Client and DomU1 in Host 2 is the FTP Server) | Average Memory Utilization between 2 Physical Hosts (Host 1 is the FTP Client and Host 2 is the FTP Server)
MAX = 1.06 GB of RAM (≈41 % of RAM) | Host 1 MAX = 1.06 GB of RAM; Host 2 MAX = 1.08 GB of RAM; Both (≈41 % of RAM) | Host 1 MAX = 687 MB of RAM (≈17 % of RAM); Host 2 MAX = 711 MB of RAM (≈17 % of RAM)

If we repeat the experiment based on figure 2, we obtain the results in table 6.

TABLE 6. PHYSICAL MEMORY UTILIZATION DURING THE .ISO FILE TRANSFER BY USING SAMBA FTP BASED ON FIGURE 2

Average Memory Utilization between 2 VM on different Hosts (DomU1 in Host 1 and DomU1 in Host 3) | Average Memory Utilization between 2 Physical Hosts (Host 1 and Host 3)
Host 1 MAX = 1.06 GB of RAM; Host 2 MAX = 1.08 GB of RAM; Both (≈41 % of RAM) | Host 1 MAX = 687 MB of RAM (≈17 % of RAM); Host 2 MAX = 711 MB of RAM (≈17 % of RAM)

Table 5 and table 6 show that memory utilization is not affected by the communication medium of the network, but it is affected by the presence of the Hypervisor (from 41 % to 17 %).

Evaluation of memory utilization and CPU consuming in Web Server

We repeat the above experiment by using a Web server instead of the FTP server. Initially we carry out the experiment inside a physical machine between 2 VM. Then we repeat the test between 2 Physical Machines (Fig. 1).

We have installed LAMP (Apache 2, MySQL client and MySQL server) on the Web Server Virtual Machine or Web Server Physical Machine.

We have built another script in C++, called MemCP, which gets information from the MemAccess benchmark for every request (from the Client machine to the Server machine) and from the previous script, which was located in /proc. This script performs its calculation by adding up the results obtained in a module of Apache 2 installed on our machine. This module is implemented in order to invoke the MemAccess tool for each request. The results obtained present the memory utilization in each Virtual Machine for each process located on the host computer. Total results can be presented as a percentage. Each request from the Client Machine to the Server Machine is performed by using the Httperf benchmark, which can generate 10 requests per second. The duration of the experiment is 1 minute. One request is equal to 10 MB, which corresponds to an .html file located in /home. Results are presented in Table 7.

TABLE 7. MEMORY UTILIZATION IN WEB SERVER BY USING HTTPERF BENCHMARK

Memory Utilization between 2 VM on the same Host (Host 1) | Memory Utilization between 2 VM on different Hosts (DomU1 in Host 1 is the Client and DomU1 in Host 2 is the Web Server) | Memory Utilization between 2 Physical Hosts (Host 1 is the Client and Host 2 is the Web Server)
(≈37 % of RAM) | Web Server (≈37.7 % of RAM) | Web Server (≈16.6 % of RAM)

TABLE 8. CPU CONSUMING IN WEB SERVER BY USING HTTPERF BENCHMARK

CPU consuming between 2 VM on the same Host (Host 1) in Web Server | CPU consuming between 2 VM on different Hosts in Web Server | CPU consuming between 2 Physical Hosts in Web Server
57.4 % | 57.5 % | 53.6 %

If we compare Table 7 and Table 5, memory utilization during the file (557 MB) transfer with the FTP server is bigger than memory utilization for HTTP requests to the Web server (41 % > 37 %). The reasons are:

- FTP uses 2 connections in Client/Server communication, one for synchronization and one for data transmission, while the Web server uses only 1 connection. Each request reserves a small amount of memory.
- The introduction of the Hypervisor has a smaller effect on the Web server during the requests from Client to Server than on the FTP server.
- If the computers do not use the Hypervisor during transfers, memory utilization is approximately the same.

If we compare Table 8 and Table 3, CPU consumption during the file transfer with the FTP server is bigger than for HTTP requests to the Web server. The reasons are the same as for memory utilization.

Evaluation of Time Transferring in Web Server

Finally, we repeat the experiment by using the Httperf benchmark. We want to evaluate the time taken by 56 requests from the client machine to the Web Server machine. Every request is 10 MB. So we compare, under approximately the same conditions, the results of table 9 with table 1 (10 x 56 requests = 560 MB in Web requests ≈ 557 MB in FTP).

TABLE 9. TIME REQUEST IN WEB SERVER BY USING HTTPERF BENCHMARK

Time request in Web Server between 2 VM on the same Host (Host 1) | Time request in Web Server between 2 VM on different Hosts | Time request in Web Server between 2 Physical Hosts
38 sec | 66 sec | 33 sec

If we compare Tab. 9 and Tab. 1, the request time for the Web server is smaller than the transfer time in FTP (38 sec < 48 sec, 66 sec < 86 sec). The time is affected much less in the third case, where we are not using the Hypervisor (33 sec < 36 sec).

The reasons are:

- Requests in Httperf are smaller, so they do not cause any degradation of the CPU.
- Memory utilization of Httperf (Tab. 7) is smaller than memory utilization of FTP (which uses the Samba FTP tool).

IV. CONCLUSIONS

- The FTP approach consumes more CPU, time and memory than the Web approach.
- The Hypervisor adds time in both approaches, but the added time is more significant in the FTP approach.
- The transfer process between 2 Physical Machines offers better performance than between 2 Virtual Machines.
- A better communication environment decreases the time consumed between machines and increases CPU performance (Fiber channel is a better environment than UTP).

In the future we want to test the performance and the CPU and memory utilization by using:

- Live Migration between 2 machines
- Memory Compaction Algorithms and their effects on Memory Utilization
- The Memory Ballooning approach and its effect on CPU consumption
- Extension of these experiments to a WAN
Authors Profile
Igli TAFA. He is a pedagogue in Polytechnic University, in Computer Engineering Department. In 2008 he finished the Master Thesis and now is a PhD student. His PhD topic is in the Virtual Machines direction.
Elinda KAJO (MECE). She is a pedagogue in Polytechnic University, in Computer Engineering Department. She finished the PhD thesis in 2004 in the Object Oriented Programming direction.
Elma ZANAJ. She is a pedagogue in Polytechnic University, in Computer Engineering Department. She finished the PhD thesis in 2009 in the Computer Sensor Network direction at the Polytechnic University of Ancona, Italy.
Aleksandër XHUVANI. He is chief of the Computer Software Department in Polytechnic University. He finished his PhD study at Bordeaux in France. In 2004 he graduated as Prof. Dr.
A Proposal for Common Vulnerability Classification Scheme Based on Analysis of Taxonomic Features in Vulnerability Databases
Anshu Tripathi, Department of Information Technology, Mahakal Institute of Technology, Ujjain, India, anshu_tripathi@yahoo.com
Umesh Kumar Singh, Institute of Computer Science, Vikram University, Ujjain, India, umeshsingh@rediffmail.com
Abstract- A proper vulnerability classification scheme aids in improving the system security evaluation process. Many vulnerability classification schemes exist, but a standard classification scheme is lacking. The focus of this work is to devise a common classification scheme by effectively combining characteristics derived from the classification schemes of prominent vulnerability databases. To identify a balanced set of characteristics for the proposed scheme, a comparative analysis of existing classification schemes is carried out on five major vulnerability databases. A set of taxonomic features and classes is extracted as a result of this analysis. A common vulnerability classification scheme is then proposed by harmonizing the extracted set of taxonomic features and classes. A mapping of the proposed scheme to the existing classification schemes is also presented to eliminate inconsistencies across the selected set of databases.
Keywords- Vulnerability; Classification scheme; Vulnerability databases; Taxonomy; Security evaluation.
I. INTRODUCTION
Proper assessment and mitigation of vulnerabilities is essential to ensure system security. Vulnerabilities are "design and implementation errors in information systems that can result in a compromise of the confidentiality, integrity or availability of information stored upon or transmitted over the affected system" [1]. In view of the increasing population of vulnerabilities [2], it is necessary to prioritize them and first remediate those that pose the greatest risk. Vulnerability prioritization requires evaluation of the risk levels posed by the presence of vulnerabilities. Quantitative evaluation of system security in terms of these risk levels is gaining importance because it generates objective and timely results. One way to achieve fast security evaluation is to find the potential weak areas of the system. To meet budget and time constraints, it is essential to focus mitigation efforts on the areas that have a greater number of vulnerabilities. These areas can be identified by proper vulnerability classification, which in turn leads to identification of the root causes of the weaknesses. Vulnerabilities share common properties and similar characteristics in generic aspects like causes, impacts and locations [3]. Results from previous research [3-6] clearly indicate that quantitative security evaluation of risks on vulnerability datasets partitioned into well defined classes is a meaningful metric. In [4], the results of categorized vulnerability analysis showed that some vulnerability classes are more severe; this fact can be used to design an optimal security solution by prioritizing severe classes. A proper classification scheme facilitates the distribution of vulnerabilities into classes and helps in prioritizing mitigation efforts according to severity level. The efficiency of a security evaluation process can be measured by its objectivity and vulnerability coverage. A proper classification scheme plays a major role in this regard by increasing both objectivity and vulnerability coverage.
Taxonomy is a way to classify vulnerabilities in a well formed structure so that categorization and generalization can be achieved [7]. In our previous work [8], we analyzed prominent published vulnerability taxonomies with respect to standard criteria and highlighted issues that limit their usability in today's scenario. That study of past efforts at developing such a taxonomy indicates that these efforts are insufficient to address the security issues associated with current software products, either because of a theoretical approach or because of a focus on a limited domain.
There are many different vulnerability databases, set up with different standards and capabilities, that record vulnerabilities and characterize them by several attributes. These databases serve the need for an updated collection of vulnerability data for research. Some of the most popular databases include the National Vulnerability Database (NVD) [9], the Open Source Vulnerability Database (OSVDB) [10], and IBM ISS X-Force [11]. However, there are many challenges in extracting common patterns from these vulnerability databases because of discrepancies in the way the information is kept. Databases use many different classification schemes to classify vulnerabilities, and a common classification scheme is lacking. A detailed study of the issues involved in this regard can be found in [12]. The objective of this work is to analyze the vulnerability classification schemes in some of the most popular databases and devise a common classification scheme. The main aim of
proposing a common classification scheme is to provide a stepping stone in security risk analysis by strategically mitigating risks.
The paper is organized as follows. Section 2 provides an overview of vulnerability classification schemes in major vulnerability databases. Section 3 compares, under taxonomic features, the classification schemes of the prominent vulnerability databases introduced in section 2. Section 4 presents a proposal for a common classification scheme, based on the comparison in section 3, by extracting appropriate taxonomic features and classes. A mapping of the proposed scheme to the existing ones is also given. Finally, section 5 concludes the work with directions for future work.
II. RELATED WORK
A number of vulnerability classification schemes have been adopted by different vulnerability databases maintained by various organizations. In this part we introduce the classification schemes of five major vulnerability databases: IBM ISS X-Force, NVD, SecurityFocus, OSVDB and Secunia.
The IBM ISS X-Force database [11] is one of the world's most comprehensive threat and vulnerability databases. At the end of 2010, there were 54,604 vulnerabilities in the X-Force Database, covering 24,607 distinct software products from 12,562 vendors. The IBM ISS X-Force database does not include any class or category information explicitly; in other words, it does not specify any classification scheme. It does, however, inherently support the taxonomic features impact and severity level. In all, eleven categories are provided under impact, and risk levels are assigned in three categories: High, Medium and Low.
The National Vulnerability Database [9] is managed by the National Institute of Standards and Technology of the United States and is associated with the CVE [13]. It has recorded vulnerabilities since 1999, with a total of 46176 vulnerabilities listed under CVE names. NVD uses CWE [14] as its classification mechanism; each individual CWE represents a single vulnerability type. There are 23 vulnerability types in total in the NVD classification scheme, which are based on the taxonomic features vulnerability cause and vulnerability impact.
The SecurityFocus vulnerability database [15] is a vendor neutral vulnerability database managed by Symantec Corporation since 2002. It contains more than 40,000 recorded vulnerabilities (spanning more than two decades) affecting more than 105,000 technologies from more than 14,000 vendors. SecurityFocus supports a classification scheme under the taxonomic feature cause. In total, eleven vulnerability categories are specified, based on the taxonomy of security faults in the Unix operating system by Taimur Aslam [16]. The other taxonomic feature supported by SecurityFocus is exploitation location, with two categories: remote and local.
The Open Source Vulnerability Database [10] is an open source database created in 2002 by people from the Black Hat Conference. It currently covers 70,789 vulnerabilities, spanning 32,272 products from 4,735 researchers, over 46 years. OSVDB provides a two tier vulnerability classification scheme. The first tier includes the categories Location, Attack Type, Impact, Solution, Exploit, Disclosure and OSVDB. Location includes nine subcategories, Attack Type includes ten subcategories, Impact includes four subcategories, Solution includes seven subcategories, Disclosure includes eight subcategories and OSVDB includes six subcategories. OSVDB supports a rich search feature under every category for trend analysis.
Secunia [17] is a private organization that provides services in security company defense and vulnerability analysis. Secunia categorizes vulnerabilities under the features Impact, Critical Levels, and Exploitation Location. Vulnerabilities under impact are associated with twelve classes. There are five criticality levels, ranging from extremely critical to not critical, and the attack vector classification includes three classes.
As we can see, the classification schemes supported by these major vulnerability databases are disparate in terms of classification criteria and dimensionality. Moreover, there is no interoperability among them. It is therefore challenging to compare or combine information across these databases. A common classification scheme can help in this regard. In the next section these databases are compared and analyzed with respect to generic taxonomic features in order to extract pertinent information for the development of a common classification scheme.
III. EXTRACTION OF TAXONOMIC FEATURES AND CLASSES
One of the objectives of this work is to identify a set of characteristics for a very specific classification scheme, one that can be used effectively in quantitative security evaluation of a system. This goal requires analysis of existing schemes to deduce possible common features that will aid in security evaluation. A comparative study provides insight into the pros and cons of the different kinds of classification schemes. This section compares, under generic taxonomic features, the classification schemes of the major vulnerability databases introduced in the previous section. The taxonomic features identified for analysis are: cause, impact, exploitation location and severity level. The comparisons of features, done under various heads, are summarized in Tables II to V. These heads have been numbered for greater legibility, and their correspondence is shown in Table I.
TABLE I. TABLE SHOWING CORRESPONDENCE OF COMPARISON HEADS
No. of Head  Name of Head
1            Explicit
2            Dimensionality
3            Class Code
4            Class Details
5            Multivariate
6            Approximate Population Percentage
A. Vulnerability cause
Vulnerabilities grouped under the taxonomic feature cause help in understanding the common types of errors and conditions that are the reason for the existence of the majority of vulnerabilities.
Paying attention to common errors and mistakes results in mitigating multiple vulnerabilities and also avoids future vulnerabilities caused by the same reason. SecurityFocus classifies vulnerabilities explicitly under the feature cause, while NVD and OSVDB incorporate this feature in their classification schemes only partially. Table II provides the results of the comparative study of classification under the feature cause in these three databases.
B. Vulnerability Impact
Exploitation of vulnerabilities results in degradation of system performance. Different vulnerabilities have different kinds of impact on system performance, so classification of vulnerabilities under the feature impact can provide useful insights. The taxonomic feature vulnerability impact is used as a classification criterion in the X-Force, Secunia, NVD and OSVDB databases. Table III provides the results of the comparative study of classification under the feature impact in these databases.
TABLE II. COMPARISON OF CLASSIFICATION SCHEMES UNDER TAXONOMIC FEATURE VULNERABILITY CAUSE
VDB            1  2   3       4                                           5  6
SecurityFocus  Y  11  C-SF1   Configuration Error                         N  1.19
                      C-SF2   Boundary Condition Error                       16.70
                      C-SF3   Environment Error                              0.31
                      C-SF4   Input Validation Error                         45.59
                      C-SF5   Design Error                                   18.81
                      C-SF6   Race Condition Error                           1.10
                      C-SF7   Origin Validation Error                        0.50
                      C-SF8   Access Validation Error                        5.60
                      C-SF9   Failure to Handle Exceptional Conditions       10.09
                      C-SF10  Atomicity Error                                0.03
                      C-SF11  Unknown                                        0.08
NVD            P  16  C-N1    Authentication Issues                       N  2.48
                      C-N2    Credentials Management                         1.01
                      C-N3    Buffer Errors                                  11.65
                      C-N4    Cryptographic Issues                           1.23
                      C-N5    Path Traversal                                 5.38
                      C-N6    Code Injection                                 6.05
                      C-N7    Format String Vulnerability                    0.53
                      C-N8    Configuration                                  0.89
                      C-N9    Input Validation                               6.79
                      C-N10   Numeric Errors                                 3.01
                      C-N11   OS Command Injections                          0.24
                      C-N12   Race Conditions                                0.56
                      C-N13   Resource Management Errors                     4.94
                      C-N14   SQL Injections                                 13.17
                      C-N15   Link Following                                 1.28
                      C-N16   Design Error                                   2.45
OSVDB          P  04  C-O1    Authentication Management                   N  2.18
                      C-O2    Cryptographic                                  1.62
                      C-O3    Misconfiguration                               0.89
                      C-O4    Race Condition                                 1.39
C. Exploitation Location
Exploitation location is a main feature affecting the risk level of a system, as it determines the attacker community and, in turn, the mitigation strategies. The databases SecurityFocus, OSVDB and Secunia explicitly classify vulnerabilities under this feature, with between 2 and 9 classes. Table IV provides the results of the comparative study of classification under the feature exploitation location in these databases.
D. Severity level
Different vulnerabilities have different levels of impact on the CIA (confidentiality, integrity and availability) of the system, which is measured by the severity level. Databases provide severity level information qualitatively or quantitatively. The number of classes for the feature severity level is inconsistent across databases, varying from 3 to 5 in the qualitative case, as shown in column 3 of Table V. OSVDB provides severity ratings only in terms of CVSS scores [18], while SecurityFocus does not include this information. Table V provides the results of the comparative study of classification under the feature severity level in these databases.
TABLE III. COMPARISON OF CLASSIFICATION SCHEMES UNDER TAXONOMIC FEATURE VULNERABILITY IMPACT
VDB            1  2   3       4                                           5  6
X-Force        N  11  I-X1    Gain Access                                 N  49.25
                      I-X2    Gain Privileges                                4.0
                      I-X3    Bypass Security                                5.75
                      I-X4    File Manipulation                              1.25
                      I-X5    Data Manipulation                              16.42
                      I-X6    Obtain Information                             9.0
                      I-X7    Denial of Service                              12.0
                      I-X8    Configuration                                  0.08
                      I-X9    Informational                                  0.05
                      I-X10   Other                                          1.5
                      I-X11   None                                           0.7
Secunia        Y  12  I-S1    Brute force                                 Y  0.21
                      I-S2    Cross site scripting                           17.5
                      I-S3    Denial of Service                              13.0
                      I-S4    Exposure of sensitive information              14.23
                      I-S5    Exposure of system information                 2.67
                      I-S6    Hijacking                                      0.40
                      I-S7    Manipulation of data                           15.87
                      I-S8    Privilege escalation                           5.82
                      I-S9    Security bypass                                5.88
                      I-S10   Spoofing                                       1.56
                      I-S11   System Access                                  21.46
                      I-S12   Unknown                                        1.40
NVD            P  04  I-N1    Permissions, Privileges and Access Control  N  7.49
                      I-N2    Cross Site Request Forgery                     1.49
                      I-N3    Cross site scripting                           12.60
                      I-N4    Information leak/disclosure                    3.22
OSVDB          P  04  I-O1    Denial of Service                           N  11.44
                      I-O2    Information disclosure                         18.66
                      I-O3    Infrastructure                                 0.15
                      I-O4    Input manipulation                             60.64
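As an illustration only, the rows of Table II could be held in a small record structure keyed by the comparison heads of Table I. The sketch below uses Python; the names `ClassEntry` and `SchemeSummary`, the reading of "P" as "partial", and the 1% cut-off for a "minor" class are assumptions made for this example and are not defined by the databases or by the analysis above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClassEntry:
    code: str          # head 3: class code
    details: str       # head 4: class details
    population: float  # head 6: approximate population percentage


@dataclass
class SchemeSummary:
    vdb: str
    explicit: str        # head 1: "Y", "N" or "P" (assumed to mean partial)
    dimensionality: int  # head 2: number of classes
    multivariate: str    # head 5: "Y" or "N"
    classes: List[ClassEntry] = field(default_factory=list)

    def total_population(self) -> float:
        """Sum of the population percentages over all classes."""
        return sum(c.population for c in self.classes)

    def minor_classes(self, threshold: float) -> List[str]:
        """Codes of classes whose population is below `threshold` percent."""
        return [c.code for c in self.classes if c.population < threshold]


# SecurityFocus rows of Table II (feature: vulnerability cause).
securityfocus_cause = SchemeSummary(
    vdb="SecurityFocus", explicit="Y", dimensionality=11, multivariate="N",
    classes=[
        ClassEntry("C-SF1", "Configuration Error", 1.19),
        ClassEntry("C-SF2", "Boundary Condition Error", 16.70),
        ClassEntry("C-SF3", "Environment Error", 0.31),
        ClassEntry("C-SF4", "Input Validation Error", 45.59),
        ClassEntry("C-SF5", "Design Error", 18.81),
        ClassEntry("C-SF6", "Race Condition Error", 1.10),
        ClassEntry("C-SF7", "Origin Validation Error", 0.50),
        ClassEntry("C-SF8", "Access Validation Error", 5.60),
        ClassEntry("C-SF9", "Failure to Handle Exceptional Conditions", 10.09),
        ClassEntry("C-SF10", "Atomicity Error", 0.03),
        ClassEntry("C-SF11", "Unknown", 0.08),
    ],
)

if __name__ == "__main__":
    # Dimensionality should match the number of listed classes.
    print(len(securityfocus_cause.classes) == securityfocus_cause.dimensionality)
    # Hypothetical 1% cut-off, used here only to show the query.
    print(securityfocus_cause.minor_classes(1.0))
```

With the figures of Table II, the SecurityFocus percentages add up to roughly 100, which is consistent with the feature being listed as univariate (each vulnerability counted in one class).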
TABLE IV. COMPARISON OF CLASSIFICATION SCHEMES UNDER TAXONOMIC FEATURE EXPLOITATION LOCATION
VDB            1  2   3       4                               5  6
SecurityFocus  Y  02  E-SF1   Remote                          N  80
                      E-SF2   Local                              20
OSVDB          Y  09  E-O1    Physical access                 N  0.45
                      E-O2    Local access                       8.50
                      E-O3    Remote/network access              77.42
                      E-O4    Local/Remote                       4.37
                      E-O5    Context dependent                  6.36
                      E-O6    Dial up access                     0.06
                      E-O7    Wireless Vector                    0.27
                      E-O8    Mobile Phone/Hand held device      0.15
                      E-O9    Unknown                            2.39
Secunia        Y  03  E-S1    Remote                          N  84.0
                      E-S2    Local network                      8.0
                      E-S3    Local system                       8.0
TABLE V. COMPARISON OF CLASSIFICATION SCHEMES UNDER TAXONOMIC FEATURE SEVERITY LEVEL
VDB            1  2   3       4                               5  6
X-Force        Y  03  S-X1    High                            N  34
                      S-X2    Medium                             59
                      S-X3    Low                                07
NVD            Y  03  S-N1    High                            N  44.6
                      S-N2    Medium                             48.0
                      S-N3    Low                                7.40
Secunia        Y  05  S-S1    Extremely critical              N  4.0
                      S-S2    Highly critical                    19.0
                      S-S3    Moderately critical                39.0
                      S-S4    Less critical                      35.0
                      S-S5    Not critical                       3.0
IV. PROPOSED CLASSIFICATION SCHEME AND MAPPING
The analysis of classification schemes in major vulnerability databases in section III suggests that the main taxonomic features at the highest level of the classification hierarchy should be vulnerability cause, vulnerability impact, exploitation location and severity level. Based on this observation, we propose a two level vulnerability classification scheme. A mapping of the classes in the proposed scheme to the classes in the analyzed vulnerability databases is also presented, to resolve the discrepancies. A summary of the complete scheme is presented below.
A. Overview of scheme
The vulnerability cause feature is considered univariate and is classified into eleven classes, specified in column 4 of Table VI. Classes under the feature vulnerability cause are based on SecurityFocus's classification scheme. The vulnerability impact feature is considered multivariate and is classified into nine classes, listed in column 4 of Table VI. Classes under the feature vulnerability impact are based on the classification scheme of the Secunia database and the vulnerability consequence information provided by the X-Force database. Exploitation location is considered univariate and is classified into three classes: remote, local network and local. Classes under the feature exploitation location are based on the Secunia database's vulnerability information under the head 'where'. The severity level feature is considered univariate and is classified into four classes: critical, high, medium and low. Classes under the severity level feature are based on CVSS scores and can be associated with the severity levels defined by other databases only on the basis of CVSS scores.
During the harmonization process, we merged classes with a minor vulnerability population into the nearest relevant classes, with the objective of producing a classification that helps in focusing on the main causes and impact areas. The class Other is included to keep scope for expansion.
Table VI presents a concise view of the proposed classification scheme and the mapping information.
B. Discussion
To proactively secure any system it is crucial to focus on the root causes of vulnerabilities. This underlines the significance of classification under the feature vulnerability cause. However, as we can see in Table II, only SecurityFocus classifies vulnerabilities explicitly under this feature. NVD and OSVDB include a few classes associated with cause in their classification schemes. It is obvious from the class details given in column 5 of Table II that the information across these three databases is highly incompatible. Moreover, many taxonomies exist [16, 19-22] that classify vulnerabilities under the feature cause. The classification in these taxonomies is based on the classification given by Landwehr et al. in [19]. SecurityFocus's vulnerability classification scheme is based on the taxonomy of security faults in the Unix operating system by Taimur Aslam [16]. In the proposed scheme we therefore opted to select the classification given by SecurityFocus as the basis. In all, eleven classes are selected, as specified in column 4 of Table VI. The feature is listed as univariate. The classes of NVD and OSVDB are mapped only approximately because of this incompatibility, as specified in column 5 of Table VI. A few of the classes cannot be mapped (see column 6 of Table VI).
The taxonomic feature impact is used by most of the databases, but Secunia is the only database that uses it explicitly as a classification criterion (see Table III). The proposed scheme classifies vulnerabilities under the feature vulnerability impact based on Secunia. X-Force's information about consequences is compatible with Secunia's, but the two differ in treating classes as multivariate or univariate. NVD and OSVDB include a few classes under the feature impact. After the impact classes from both databases are identified, they are mapped onto the classification based on Secunia and X-Force. In all, nine main categories are identified, as listed in column 4 of Table VI, which cover the main impact classes included in these four databases. The mapping information given in column 5 of Table VI specifies that the classes File manipulation and Data manipulation in X-Force are merged into a single category and mapped onto the Manipulation of data class of Secunia.
TABLE VI. PROPOSED CLASSIFICATION SCHEME AND MAPPING
Taxonomic Feature    2   5  4                                         Mapping                                  Comments
Vulnerability Cause  11  N  Configuration Error                       C-SF1, C-N8, C-O3
                            Boundary Condition Error                  C-SF2, C-N13, C-N10, C-N3
                            Environment Error                         C-SF3
                            Input Validation Error                    C-SF4, C-N5, C-N9, C-N11, C-N14
                            Design Error                              C-SF5, C-N16
                            Race Condition Error                      C-SF6, C-N12, C-O4
                            Origin Validation Error                   C-SF7, C-N1, C-N2, C-N6, C-N7, C-O1
                            Access Validation Error                   C-SF8, C-N15
                            Failure to Handle Exceptional Conditions  C-SF9
                            Atomicity Error                           C-SF10
                            Other                                     C-SF11
                            Comments: Cryptographic issues cannot be mapped directly because they can be associated with different causes, for example numeric errors, boundary conditions etc. They therefore need to be mapped according to the associated reason.
Vulnerability Impact 09  Y  Gain system access                        I-X1, I-S6, I-S11
                            Gain privileges                           I-X2, I-S8, I-N1
                            Bypass security                           I-X3, I-S9
                            Data manipulation                         I-X4, I-X5, I-S7, I-O4
                            Exposure of information                   I-X6, I-S1, I-S4, I-S5, I-N4, I-O2
                            Denial of service                         I-X7, I-S3, I-O1
                            Cross site scripting                      I-S2, I-N2, I-N3
                            Spoofing                                  I-S10
                            Other                                     I-X8, I-X9, I-X10, I-S12, I-O3
                            Comments: Classes with a minor population are merged into the relevant class. The purpose is to focus objectively on the main impact areas.
Attack Vector        03  N  Remote                                    E-S1, E-O3, E-O6, E-O7, E-O8, E-SF1
                            Local Network                             E-S2, E-O4, E-SF1
                            Local                                     E-S3, E-O1, E-O2, E-SF2
                            Comments: E-SF1 needs to be categorized depending on network type (LAN/WAN). The definition of E-O5 is ambiguous.
Severity Level       04  N  Critical                                  S-X1, S-N1, S-S1, S-S2
                            High                                      S-X1, S-N1, S-S3
                            Medium                                    S-X2, S-N2, S-S4
                            Low                                       S-X3, S-N3, S-S5
                            Comments: Mapping based on CVSS scores. Critical (9-10), High (7-8.9), Medium (4-6.9), Low (0-3.9).
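The mapping in Table VI can be read as two simple lookups: a CVSS score to one of the four proposed severity levels, and a database-specific cause code to one of the eleven proposed cause classes. The following Python sketch shows one possible encoding under the ranges and codes listed in the table; the function and variable names are hypothetical, and treating any score of at least 9.0, 7.0 or 4.0 as the lower bound of its band is an assumption about how scores between the printed ranges would be handled.

```python
from typing import Dict, List, Optional


def severity_from_cvss(score: float) -> str:
    """Map a CVSS base score to the proposed four-level grading.

    Bands follow the comment column of Table VI:
    Critical (9-10), High (7-8.9), Medium (4-6.9), Low (0-3.9).
    """
    if not 0.0 <= score <= 10.0:
        raise ValueError("CVSS base score must lie between 0 and 10")
    if score >= 9.0:
        return "Critical"
    if score >= 7.0:
        return "High"
    if score >= 4.0:
        return "Medium"
    return "Low"


# Proposed cause classes and the database codes mapped onto them (Table VI).
CAUSE_MAPPING: Dict[str, List[str]] = {
    "Configuration Error": ["C-SF1", "C-N8", "C-O3"],
    "Boundary Condition Error": ["C-SF2", "C-N13", "C-N10", "C-N3"],
    "Environment Error": ["C-SF3"],
    "Input Validation Error": ["C-SF4", "C-N5", "C-N9", "C-N11", "C-N14"],
    "Design Error": ["C-SF5", "C-N16"],
    "Race Condition Error": ["C-SF6", "C-N12", "C-O4"],
    "Origin Validation Error": ["C-SF7", "C-N1", "C-N2", "C-N6", "C-N7", "C-O1"],
    "Access Validation Error": ["C-SF8", "C-N15"],
    "Failure to Handle Exceptional Conditions": ["C-SF9"],
    "Atomicity Error": ["C-SF10"],
    "Other": ["C-SF11"],
}

# Inverted index: database code -> proposed cause class.
CODE_TO_CAUSE: Dict[str, str] = {
    code: proposed
    for proposed, codes in CAUSE_MAPPING.items()
    for code in codes
}


def proposed_cause(code: str) -> Optional[str]:
    """Return the proposed cause class for a database code.

    Returns None for codes that Table VI leaves unmapped, such as the
    cryptographic classes C-N4 and C-O2.
    """
    return CODE_TO_CAUSE.get(code)


if __name__ == "__main__":
    print(severity_from_cvss(7.5))   # falls in the 7-8.9 band
    print(proposed_cause("C-N14"))   # SQL Injections in NVD
    print(proposed_cause("C-N4"))    # unmapped cryptographic class
```

A lookup that returns no class for the cryptographic codes mirrors the comment in Table VI that such entries have to be assigned case by case according to the underlying reason.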
Similarly, Exposure of sensitive information and Exposure of system information of Secunia are merged and mapped onto Obtain information of X-Force. After reviewing the population distribution given in column 7 of Table III, the Configuration and Informational classes of X-Force are mapped onto Other because of their minor population (0.05-0.08) and because these classes are also covered under the feature cause. Hijacking in Secunia is basically gaining access, and brute force is obtaining information. Spoofing is kept as a separate category because it bypasses security as well as escalates privileges. A single vulnerability can have an impact in multiple ways, so the feature is listed as multivariate, as specified in column 3 of Table VI.
The attack vector not only determines the exploitation location but also informs about the attacker class, which in turn reflects attack techniques. It is an important feature that affects the severity level and also guides the design of security solution plans. Although all databases cover this feature, there is high disparity among them (see Table IV). The number of classes ranges from 2 to 9, as listed in column 5 of Table IV. The proposed scheme suggests three levels: local, local network and remote (see column 4 of Table VI). These three levels cover all the significant attack dimensions, and security plans can be developed accordingly, depending on the exposure of the machine to LAN, WAN or physical access.
Vulnerabilities listed under the same cause or impact class can still damage a system at different severity levels. Classification under the feature severity level is therefore necessary in order to understand the impact level of different classes and, further, of the vulnerabilities in them. Severity level information is included in almost all databases, but the number of levels varies from 3 to 5 (see Table V). To remove this inconsistency in a balanced way, a four level grading is proposed after analyzing the population distribution specified in column 7 of Table V. Further, the mapping of severity levels to CVSS scores given in column 6 of Table VI resolves the disparity and gives both a qualitative and a quantitative classification. Severity level can be univariate only.
V. CONCLUSION AND FUTURE WORK
The efficiency of a security evaluation process depends on its objectivity and vulnerability coverage. A proper vulnerability classification scheme can be helpful in this regard by increasing both objectivity and vulnerability coverage. Moreover, a proper classification scheme is also helpful in the categorization of newly discovered vulnerabilities and in trend analysis. The effectiveness of a classification scheme mainly depends on the taxonomic features selected as the basis for classification. Different classification schemes exist in different vulnerability databases, based on a variety of criteria, and a standard classification scheme is lacking. Five major vulnerability databases are selected in this work and the classification schemes adopted by them are analyzed.
ACKNOWLEDGMENT
We would like to thank the anonymous reviewers who provided helpful feedback on our manuscript.
REFERENCES
[1] D. Turner, M. Fossi, E. Johnson, T. Mack, J. Blackbird, S. Entwisle, M. K. Low, D. McKinney, and C. Wueest, "Symantec global internet security threat report: Trends for July to December 2007," Symantec, Tech. Rep., 2008.
[2] Secunia, Secunia yearly report 2010, http://secunia.com/gfx/pdf/Secunia_Yearly_Report_2010.pdf, 2011.
[3] Zhongqiang Chen, Yuan Zhang, Zhongrong Chen, "A Categorization Framework for Common Vulnerabilities and Exposures," The Computer Journal, Advance Access published online on May 7, 2009, http://comjnl.oxfordjournals.org, doi:10.1093/comjnl/bxp040.
[4] Alhazmi, O. H., Woo, S-W., Malaiya, Y. K., "Security Vulnerability Categories in Major Software Systems", Proc. Third IASTED International Conference Proceedings Communication, Network, and Information Security, 2006, pp. 138-143.
[5] Lutz Lowis, Rafael Accorsi, "On a Classification Approach for SOA Vulnerabilities," Proc. 33rd Annual IEEE International Computer Software and Applications Conference, vol. 2, 2009, pp. 439-444.
[6] Somak Bhattacharya, S.K. Ghosh, "Security Threat Prediction in a Local Area Network Using Statistical Model," Proc. IEEE International Parallel and Distributed Processing Symposium, 2007, pp. 425-432.
[7] Aslam, T., Krsul, I. and Spafford, E.H., "Use of A Taxonomy of Security Faults", Proc. 19th National Information Systems Security Conf., Baltimore, USA, 1996, pp. 551–560.
[8] Tripathi, A., Singh, U.K., "Towards Standardization of Vulnerability Taxonomy", Proc. 2nd International Conference on Computer Technology and Development, Cairo, Egypt, 2010, pp. 379-384.
[9] NHS and NIST, National Vulnerability Database (NVD), automating vulnerability management, security measurement, and compliance checking, http://nvd.nist.gov/scap.cfm (Accessed on 10-05-2011).
[10] http://osvdb.org/ (Accessed on 10-05-2011).
[11] Internet Security Services Online database X-Force, 2008. [Online] Available: http://www.iss.net/xforce/ (Accessed on 10-05-2011)
[12] Tripathi, A., Singh, U.K., "Taxonomic Analysis of Classification Schemes in Vulnerability Databases" (Communicated)
[13] The MITRE Corporation (2008) Common vulnerabilities and exposures. [Online] Available: http://cve.mitre.org/ (Accessed on 10-05-2011)
[14] R.A. Martin, Common Weakness Enumeration (CWE v1.8), 2010, National Cyber Security Division of the U.S. Department of Homeland Security.
[15] Security Focus Vulnerability Database. [Online] Available: http://www.securityfocus.com (Accessed on 10-05-2011)
[16] T. Aslam, "A Taxonomy of Security Faults in the Unix Operating System," M.S. thesis, Dept. of Comp. Sci., Purdue Univ., Coast TR 95-09, 1995
[17] Secunia. [Online] Available: http://secunia.com (Accessed on 10-05-2011)
[18] Forum Of Incident Response And Security Teams (FIRST), "Common vulnerability scoring system 2.0," 2007. [Online]. Available: http://www.first.org/cvss
[19] C. E. Landwehr et al., "A Taxonomy of Computer Program Security Flaws," ACM Comp. Surveys, vol. 26, no. 3, Sept. 1994, pp. 211–254.
[20] K. Jiwnani and M. Zelkowitz, "Maintaining Software with a Security Perspective," Proc. Int'l Conf. Software Maintenance, 3–6 Oct. 2002, pp. 194–203.
[21] W. Du and A. P. Mathur, "Categorization of Software Errors that Led to Security Breaches," Proc. 21st Nat'l Info. Sys. Sec. Conf., 1998.
[22] S. Kamara et al., "Analysis of Vulnerabilities in Internet Firewalls," Comp. & Sec., vol. 22, no. 3, 2003, pp. 214–232
AUTHORS PROFILE
Anshu Tripathi holds M.Tech. degree in Computer Science from Banasthali Vidyapith, Banasthali-INDIA. She is currently pursuing Ph.D. in Computer Science from Institute of Computer Science, Vikram University, Ujjain-INDIA. Her research interest includes proactive network security, security measurement, and risk analysis.
Umesh Kumar Singh received his Ph.D. in Computer Science from Devi Ahilya University, Indore-INDIA. Presently he is Director in Institute of Computer Science, Vikram University, Ujjain-INDIA. He served as Professor in Computer Science and Principal in Mahakal Institute of Computer Sciences (MICS-MIT), Ujjain. He has served as Engineer (E&T) in education and training division of CMC Ltd., New Delhi in initial years of his career. He has authored a book on "Internet and Web technology" and his various research papers are published in national and international journals of repute. He is reviewer of International Journal of Network Security (IJNS), IJCSIS, reviewer and member of conference committee of European Conference of Knowledge Management (ECKM) since 2007. He is also reviewer of 4th IEEE International Conference on Computer Science and Information Technology and 2011 3rd International Conference on Machine Learning and Computing. His research interest includes Computer Networks, Network Security, Internet & Web Technology, Client-Server Computing and IT based education.
111 http://sites.google.com/site/ijcsis/
ISSN 1947-5500
(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 9, No. 6, June 2011
Abrupt Change Detection of Fault in Power System Using Independent Component Analysis

SATYABRATA DAS (1), SOUMYA RANJAN MOHANTY (2), SABYASACHI PATTNAIK (3)
(1) Asstt. Prof., Department of CSE, College of Engineering Bhubaneswar, Orissa, India-751024
(2) Asstt. Prof., Department of EE, Motilal Nehru National Institute of Technology, Allahabad, India-211004
(3) Prof., Department of I&CT, Fakir Mohan University, Balasore, India-756019
E-mail: satya.das73@gmail.com, soumya@mnnit.ac.in, spattnaik40@yahoo.co.in
Abstract: This paper proposes a novel approach for fault detection in a power system based on Independent Component Analysis (ICA). The index for detection of a fault is derived from independent components of faulty current samples. The proposed approach is tested on simulated data obtained from MATLAB/Simulink for a typical power system. The proposed approach is compared with existing approaches available in the literature for fault detection in time-series data. The comparison demonstrates the accuracy and consistency of the proposed approach under the considered changing conditions of a typical power system. By virtue of its accuracy and consistency, the proposed approach can also be used in real-time applications.

Index Terms: Digital relaying, distance relay, fault detection, independent component analysis.

I. INTRODUCTION

Every power system is provided with a protective relay which ensures better performance while keeping disturbance and damage to a minimum. In the last few years, digital relays have replaced their solid-state-device counterparts due to their fast, accurate and reliable operation. The fault diagnosis unit of digital relays contains a fault detector (FD) unit in addition to fault classification and fault localization units [1]-[2].

In recent years, a number of methods have become available in the literature for detection of power system faults. A fault can be detected when the difference between the current samples of two consecutive cycles exceeds a threshold value, or by a phasor comparison scheme [3]–[4]. However, these schemes are limited by the difficulty of modeling the fault resistance. A Kalman filter-based approach [5]-[7] has been proposed to detect power system faults. A wavelet-based approach [8] is used to detect the abrupt change in the signal. Synchronized segmentation is applied for disturbance recognition [9]. An adaptive whitening filter combined with the wavelet transform has also been used to detect the abrupt change in the signal [10]. However, these methods are sensitive to frequency deviation and to the presence of noise and harmonics.

In this paper, an algorithm for abrupt change detection is proposed where the index for detection is derived from the independent components of current samples. The performance of the proposed method has been tested in the presence of noise, harmonics and frequency variation and found to be accurate. Independent component analysis (ICA) is selected for feature extraction because of its reliability in extracting relevant and useful features. Further, the proposed approach is compared with three existing approaches available in the literature. The first is a detector based on comparison of a sample value with the value one cycle earlier. The second is a differential approach based on phasor estimation [3], while the third is a moving-sum based detector where the sum over one cycle of faulty current samples is chosen as the index for detection [11].

The rest of the paper is arranged as follows. Section II gives a brief description of the three approaches used for comparative assessment of the proposed approach. Section III gives a brief description of the independent component analysis technique. Section IV presents the proposed approach based on independent component analysis, while Section V presents the testing of the proposed approach. Finally, conclusions are given in Section VI.

II. FAULT DETECTION TECHNIQUES USED FOR POWER SYSTEM BASED ON TIME-SERIES DATA

This section gives a brief description of the fault detection techniques used for power systems based on time-series data. These three techniques are used to carry out the comparative assessment of the proposed approach under changing conditions of the system. All these approaches are based on deterministic modeling of the faulty current signal obtained from a typical power system.

A. Sample Comparison (SC)

The first approach for fault detection is the conventional one. Here, the decision is made by computing the difference between the current sample of the signal and the corresponding sample one cycle earlier. Under normal conditions, the computed difference comes out to be zero. When there is a fault in the system, the current signal gets distorted and consequently the computed difference becomes significant. If the computed
difference remains greater than a threshold value for three consecutive samples, a fault is reported by the FD unit. Let the discrete current signal be

$$ i(k) = I_m \sin(\omega k + \phi) \qquad (1) $$

where $I_m$ is the peak of the signal, $\omega$ is the discrete angular frequency and $\phi$ is the phase angle. Then the index is derived as follows:

$$ i_{change}(k) = i(k) - i(k - N) \qquad (2) $$

$$ Index_{sc}(k) = |i_{change}(k)| \qquad (3) $$

where $k$ is the time instant and $N$ is the window size of one period. If

$$ Index_{sc}(k) > (Index_{sc})_{threshold} \qquad (4) $$

a fault is reported by the FD unit.

B. Phasor Comparison (PC)

This approach for fault detection is based on estimation of the phasor [3]. It is a relatively fast algorithm based on the derivative of the current signal. If the discrete current signal is

$$ i(k) = I_m \sin(\omega k + \phi) \qquad (5) $$

where $I_m$ is the peak of the signal, $\omega$ is the discrete angular frequency and $\phi$ is the phase angle, then at any instant $k$ the peak value of the signal can be estimated as

$$ \hat{I}_m^{2}(k) = \left(\frac{i''(k)}{\omega^{2}}\right)^{2} + \left(\frac{i'(k)}{\omega}\right)^{2} \qquad (6) $$

where $\hat{I}_m(k)$ is the peak estimate of the signal, and $i'(k)$ and $i''(k)$ are the first and second derivatives of the discrete current signal respectively. The peak estimate is the magnitude of the fundamental phasor at the k-th instant. The magnitude of the current phasor obtained at the k-th instant is compared with that at the (k-3)-th instant. If the difference is more than the threshold value for three successive samples, a fault is reported by the FD. The derivation of the index is as follows:

$$ I_{change}(k) = \hat{I}_m(k) - \hat{I}_m(k - 3) \qquad (7) $$

$$ Index_{pc}(k) = |I_{change}(k)| \qquad (8) $$

If

$$ Index_{pc}(k) > (Index_{pc})_{threshold} \qquad (9) $$

then the FD detects the fault. As the method is derivative based, it is found to be sensitive to noise and signal distortions.

C. One-Cycle-Moving-Sum (OCMS)

This approach involves the computation of the one-cycle sum of current samples obtained from the power system [11]. It is based on the symmetrical nature of the current waveforms in a power system. In the absence of a fault, the computed sum comes out to be zero. However, on occurrence of a fault in the power system, the corresponding sum will be non-zero or, equivalently, greater than a chosen threshold. For on-line implementation, once a new sample is obtained, the oldest sample is discarded and the sum is recalculated for the new window. Thus, only one addition and one subtraction are required at each step of computation. For large variations in system conditions, such as frequency variations, the window size needs to be made adaptive to generate zero sums in normal conditions. However, such variations are not common in large power systems. Let the discrete current signal be

$$ i(k) = I_m \sin(\omega k + \phi) \qquad (10) $$

Then, the derivation of the index is as follows:

$$ i_{sum}(k) = \sum_{l = k - N + 1}^{k} i_l \qquad (11) $$

where $N$ is the window size for one cycle.

$$ Index_{ocms}(k) = |i_{sum}| \qquad (12) $$

If

$$ Index_{ocms}(k) > (Index_{ocms})_{threshold} \qquad (13) $$

a fault is reported by the FD unit. Also,

$$ i_{sum}(k) = i_{sum}(k - 1) + i(k) - i(k - N) \qquad (14) $$

Eqn. (14) shows that, for on-line computation, only one addition and one subtraction are required at each step.

III. INDEPENDENT COMPONENT ANALYSIS

Since ICA is based on the statistical properties of signals, it works accurately in non-deterministic modeling of the signals [12]. For ICA to be applied, the following assumptions for the mixing and demixing models need to be satisfied:
1. The source signals $s(t_i)$ are statistically independent.
2. At most one of the source signals is Gaussian distributed.
3. The number of observations M is greater than or equal to the number of sources N ($M \geq N$).

In addition to blind separation of sources, ICA is also used for representing data as a linear combination of latent variables. There are different approaches for estimating the ICA model, all based on the statistical properties of signals. Some of the methods used for ICA estimation are:
1. by maximization of nongaussianity
2. by minimization of mutual information
3. by maximum likelihood estimation
4. by tensorial methods

A blind source separation algorithm estimates the source signals from observed mixtures. The word 'blind' emphasizes that the source signals and the way the sources are mixed, i.e. the mixing model parameters, are unknown or known very imprecisely. Independent component analysis is a blind source separation (BSS) algorithm, which transforms the observed signals into mutually statistically independent signals. The ICA algorithm has many technical applications including signal processing, brain imaging, telecommunications and audio signal separation [12]–[14].
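To make the blind source separation idea concrete, the following is a minimal sketch under stated assumptions, and it is not the Fast ICA algorithm. Two independent non-Gaussian sources are mixed by a matrix that the separation routine never sees, the mixtures are centered and whitened, and the remaining rotation is found by a plain grid search that maximizes nongaussianity, measured here by the absolute excess kurtosis instead of negentropy. The source waveforms, the mixing coefficients, the function names and the grid resolution are all illustrative assumptions.

```python
import math
import random


def whiten_2d(x1, x2):
    """Center two observed mixtures and whiten them (uncorrelated, unit variance)."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    x1 = [v - m1 for v in x1]
    x2 = [v - m2 for v in x2]
    c11 = sum(a * a for a in x1) / n
    c22 = sum(b * b for b in x2) / n
    c12 = sum(a * b for a, b in zip(x1, x2)) / n
    # Closed-form principal-axis angle of the 2x2 covariance matrix.
    th = 0.5 * math.atan2(2.0 * c12, c11 - c22)
    c, s = math.cos(th), math.sin(th)
    lam1 = c11 * c * c + 2.0 * c12 * s * c + c22 * s * s  # variance along axis 1
    lam2 = c11 * s * s - 2.0 * c12 * s * c + c22 * c * c  # variance along axis 2
    z1 = [(c * a + s * b) / math.sqrt(lam1) for a, b in zip(x1, x2)]
    z2 = [(-s * a + c * b) / math.sqrt(lam2) for a, b in zip(x1, x2)]
    return z1, z2


def excess_kurtosis(y):
    """Excess kurtosis E[y^4] - 3 of a zero-mean, unit-variance signal."""
    return sum(v ** 4 for v in y) / len(y) - 3.0


def separate_2d(x1, x2, steps=180):
    """After whitening, pick the rotation that maximizes |kurtosis| (grid search)."""
    z1, z2 = whiten_2d(x1, x2)
    best = None
    for i in range(steps):
        a = 0.5 * math.pi * i / steps  # a quarter turn covers all solutions
        c, s = math.cos(a), math.sin(a)
        y1 = [c * p + s * q for p, q in zip(z1, z2)]
        y2 = [-s * p + c * q for p, q in zip(z1, z2)]
        score = abs(excess_kurtosis(y1)) + abs(excess_kurtosis(y2))
        if best is None or score > best[0]:
            best = (score, y1, y2)
    return best[1], best[2]


def correlation(u, v):
    """Sample correlation coefficient of two equally long signals."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return num / den


# Illustrative sources: a 50 Hz sinusoid sampled at 1 kHz and uniform noise.
rng = random.Random(0)
n = 2000
s1 = [math.sin(2.0 * math.pi * 50.0 * k / 1000.0 + 0.3) for k in range(n)]
s2 = [rng.uniform(-1.0, 1.0) for _ in range(n)]
# Mixing matrix (assumed for the demo only; the separation never uses it).
x1 = [1.0 * a + 0.6 * b for a, b in zip(s1, s2)]
x2 = [0.4 * a + 1.0 * b for a, b in zip(s1, s2)]

y1, y2 = separate_2d(x1, x2)
match = [max(abs(correlation(y, s)) for y in (y1, y2)) for s in (s1, s2)]
print("best |correlation| with each true source:", match)
```

The recovered components agree with the sources only up to order, sign and scale, which is why the check uses the absolute correlation. A grid search is practical only for two mixtures; it is used here purely because it keeps the example short.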
A. ICA estimation by maximization of nongaussianity

A measure of nongaussianity is the negentropy J(y), which is the normalized differential entropy. By maximizing the negentropy, the mutual information of the sources is minimized. Mutual information is, in turn, a measure of the independence of random variables. Negentropy is always non-negative and is zero for Gaussian variables [12].

$$ J(y) = H(y_{gauss}) - H(y) \qquad (15) $$

The differential entropy H of a random vector y with density $p_y(\eta)$ is defined as

$$ H(y) = -\int p_y(\eta) \log p_y(\eta) \, d\eta \qquad (16) $$

In equations (15) and (16), the estimation of negentropy requires the estimation of the probability functions of the source signals, which are unknown. Instead, the following approximation of negentropy is used, with $y_i = w_i^{T} x$:

$$ J(y_i) \approx \left[ E\{G(w_i^{T} x)\} - E\{G(y_{gauss})\} \right]^{2} \qquad (17) $$

Here, E denotes the statistical expectation and G is chosen as a non-quadratic function. Assume that we observe n linear mixtures $x_1, \ldots, x_n$ of n independent components:

$$ x_j = a_{j1} s_1 + a_{j2} s_2 + \ldots + a_{jn} s_n \quad \text{for all } j \qquad (18) $$

We assume that each mixture $x_j$ as well as each independent component $s_k$ is a random variable, instead of a time-dependent signal. Without loss of generality, we can assume that both the mixture variables and the independent components have zero mean. If this is not true, then the observable variables $x_j$ can always be centered by subtracting the sample mean, which makes the model zero mean. It is convenient to use vector-matrix notation instead of sums like the one in the previous equation. Let us denote by x the random vector whose elements are the mixtures $x_1, \ldots, x_n$ and likewise by s the random vector with elements $s_1, \ldots, s_n$. Let us denote by A the matrix with elements $a_{ij}$. All vectors are taken as column vectors; thus $x^{T}$, the transpose of x, is a row vector. With this vector-matrix notation, the above mixing model becomes:

$$ x = A s \qquad (19) $$

Denoting the columns of matrix A by $a_i$, the model can also be written as

$$ x = \sum_{i=1}^{n} a_i s_i \qquad (20) $$

The statistical model in eqn. (20) is called the independent component analysis, or ICA, model. The ICA model is a generative model and the independent components are latent variables, meaning that they cannot be directly observed. The mixing matrix is also assumed to be unknown. All we observe is the random vector x, and we must estimate both A and s using it. The starting point for ICA is the very simple assumption that the components $s_i$ are statistically independent. We also assume that the independent components have non-Gaussian distributions. Then, after estimating the matrix A, we can compute its inverse, say W, and obtain the independent components simply by:

$$ s = W x \qquad (21) $$

Fast ICA is an efficient algorithm based on fixed-point iteration used for estimation of ICs in time-series data [15]. This approach for IC estimation is 10-100 times faster than the other methods that are used to reduce data dimension.

IV. PROPOSED FAULT DETECTION METHOD

This section presents the algorithm of the proposed method for detection of abrupt changes due to the occurrence of a fault in the power system. An abrupt change detector based on independent components of current samples is proposed. The index for detection is derived from independent components of the current samples obtained from the data acquisition system. The proposed algorithm has been tested on simulation data and is explained below:

(i) Data has been obtained from the MATLAB/Simulink model of the interconnected power system considered in this work. In this study, the pre-fault signal is taken as the non-faulty signal. The signal is first passed through the detection block, followed by the classification block and finally through the localization block, which decides the logic for the trip signal. Together these constitute the fault diagnosis system.

(ii) The simulated signal is passed through the first block, where the removal of the mean and de-correlation (for removal of second-order dependencies) are done. This constitutes the first level of pre-processing. The output of this block is fed to the next block.

(iii) In this block, whitening of the data followed by dimension reduction is performed to reduce redundancy in the data. The output of this block is fed to the third block.

(iv) Now, the principal components (PC) of the data are determined and fed to the next block.

(v) Here, the independent components of the data are calculated using the fixed-point iteration of the Fast ICA algorithm [12], [15].

(vi) The ICA block returns the demixing or separating matrix $W_f$ along with the independent components $s_f$ of the real-time signal. For the calculation of these variables, the matrix $x_f$ is constructed from real-time signal samples.

(vii) The stored signals or the pre-fault signals are used to construct the matrix $x_n$ for the derivation of the index.

The index is derived as

$$ Index_{proposed}(k) = \left( \mathrm{normalised}\left( \mathrm{abs}\left( W_f(k) \, x_n(k) - s_f(k) \right) \right) \right)^{2} \qquad (22) $$

The fault is detected when $Index_{proposed}(k)$ is greater than a certain threshold $(Index_{proposed})_{threshold}$. The comparison against $(Index_{proposed})_{threshold}$ is evaluated by the decision block and appropriate actions are taken. This information is then passed to the classification block. The flowchart of the proposed algorithm is illustrated in Fig. 1.
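The index expression at the end of Section IV can be evaluated mechanically once $W_f$, $x_f$ and $x_n$ are available. The sketch below is only one possible reading, since the text does not spell out several details. It assumes that each column of the data matrix is the pair [i(k), i(k-1)], that the vector residual is reduced to a scalar by its largest absolute entry, and that "normalised" means division by the maximum over the record. The demixing matrix used in the demo is an arbitrary placeholder and not an ICA estimate, and all function names are hypothetical.

```python
import math

N = 20  # samples per cycle (1 kHz sampling, 50 Hz nominal frequency)


def embed(i):
    """Assumed data-matrix construction: one column [i(k), i(k-1)] per k >= 1."""
    return [[i[k], i[k - 1]] for k in range(1, len(i))]


def matvec(W, v):
    """Multiply a small matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def proposed_index(W_f, x_f, x_n):
    """(normalised |W_f x_n(k) - s_f(k)|)^2 with s_f(k) = W_f x_f(k).

    Assumptions: residual vector reduced by its largest absolute entry,
    normalisation by the maximum over the whole record.
    """
    raw = []
    for xf_k, xn_k in zip(x_f, x_n):
        s_f = matvec(W_f, xf_k)    # components of the real-time signal
        proj = matvec(W_f, xn_k)   # stored pre-fault data through the same W_f
        raw.append(max(abs(p - s) for p, s in zip(proj, s_f)))
    peak = max(raw)
    if peak == 0.0:
        return [0.0] * len(raw)
    return [(r / peak) ** 2 for r in raw]


# Synthetic demo: the peak of the sinusoid jumps at sample 65 (0.065 s at 1 kHz).
w = 2.0 * math.pi / N
fault_at = 65
live = [(1.0 if k < fault_at else 3.0) * math.sin(w * k + 0.4) for k in range(200)]
stored = [math.sin(w * k + 0.4) for k in range(200)]  # pre-fault wave, extended
W_f = [[1.2, -0.5], [0.3, 0.9]]  # placeholder demixing matrix, NOT from Fast ICA

index = proposed_index(W_f, embed(live), embed(stored))
# index[j] belongs to sample k = j + 1 because embed() starts at k = 1.
first = next(j + 1 for j, v in enumerate(index) if v > 0.05)
print("index first exceeds 0.05 at sample", first)
```

Under these assumptions the index stays at zero while the real-time and stored data agree and rises at the first faulty sample. With a demixing matrix actually estimated by Fast ICA the numerical values would differ.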
Fig. 1 Flowchart of the proposed algorithm.

V. COMPARATIVE ASSESSMENT AND TESTING OF THE PROPOSED ALGORITHM

A three-phase transmission line (200 km, 230 kV, 50 Hz) connecting two systems, with an MOV and a series capacitor placed at the middle of the line as shown in Fig. 2, has been considered for comparative assessment of the performance of the existing and proposed algorithms. For clarity of the results, the comparative assessment of the various algorithms is carried out with the fault occurring at the same instant under different conditions.

The typical power system model in MATLAB/Simulink is used to obtain the simulation data. At the receiving end, a combination of linear and non-linear loads is used. Depending on the switching of the non-linear load, harmonics appear in the current signal. The testing data is obtained through simulation of the considered power system under different changing system conditions. A sampling rate of 1 kHz and a full-cycle window of N = 20 (50 Hz nominal frequency) have been chosen for testing.

Fig. 2 A 230 kV, 50 Hz Power System.

TABLE 1 PARAMETERS OF THE POWER SYSTEM MODEL
Parameter                    Value
System voltage               230 kV
System frequency             50 Hz
Voltage of source 1          1.0 ∠0 degree pu
Voltage of source 2          1.0 ∠10 degree pu
Transmission line length     200 km
Positive sequence            R = 0.0321 (ohm/km), L = 0.57 (mH/km), C = 0.021 (µF/km)
Zero sequence                R = 0.0321 (ohm/km), L = 1.711 (mH/km), C = 0.021 (µF/km)
Series compensated           70 %, C = 176.34 µF
MOV                          Vref 40 kV, 5 MJ
Current transformer (CT)     230 kV, 50 Hz, 2000:1 (turns ratio)

Faults of various types are simulated at different locations and the performance of the algorithms is assessed. To demonstrate the potential of the approach, only a few cases of fault occurrence towards the farther end of the line are presented here. Nevertheless, the proposed method responds similarly to other types of power system faults. Single line-to-ground faults (AG-type) at 80% of the line have been created at different inception angles, and the corresponding phase current, tapped at Bus 2, is processed through the different algorithms; the detection indices are computed and normalized for comparison. As the FD is expected to be fast enough to detect the inception of a fault within a few milliseconds, the first few sampling periods are important for judging the performance of the algorithm.

A. Abrupt change detection without noise

The interconnected power system shown in Fig. 2 is simulated in MATLAB/Simulink. An L-G fault has been created at 0.065 s with the system frequency at 50 Hz. The comparative assessment of the proposed algorithm against the existing algorithms is shown in Fig. 3. Here, all the indices indicate the fault with a minimal delay of 1-2 samples after the inception of the fault. In the post-fault region, almost all the algorithms exhibit consistent results.
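The noise-free behaviour of the two window-based reference detectors can be mimicked on a synthetic signal. The sketch below implements the sample comparison index of Eqs. (2)-(3) and the one-cycle moving sum with the recursion of Eq. (14), at the 1 kHz sampling rate and N = 20 window stated above. The fault is modelled as a simple step in the peak of a sinusoid at sample 65 (0.065 s), which is an assumption made for illustration and not the simulated power system. The threshold of 0.1 and the function names are likewise hypothetical.

```python
import math

FS = 1000      # sampling rate in Hz
F0 = 50        # nominal frequency in Hz
N = FS // F0   # samples per cycle, N = 20


def make_current(n_samples, fault_at, pre_peak=1.0, post_peak=3.0, phase=0.4):
    """Synthetic current: a sinusoid whose peak jumps at sample `fault_at`."""
    w = 2.0 * math.pi * F0 / FS
    return [(pre_peak if k < fault_at else post_peak) * math.sin(w * k + phase)
            for k in range(n_samples)]


def index_sc(i):
    """Sample comparison, Eqs. (2)-(3): |i(k) - i(k-N)|, zero until k >= N."""
    return [abs(i[k] - i[k - N]) if k >= N else 0.0 for k in range(len(i))]


def index_ocms(i):
    """One-cycle moving sum using the add-one, drop-one recursion of Eq. (14)."""
    out, running = [], 0.0
    for k, sample in enumerate(i):
        running += sample           # add the newest sample
        if k >= N:
            running -= i[k - N]     # discard the oldest sample
        out.append(abs(running) if k >= N - 1 else 0.0)
    return out


def first_detection(index, threshold, run=3):
    """First sample at which the index has exceeded the threshold `run` times in a row."""
    count = 0
    for k, value in enumerate(index):
        count = count + 1 if value > threshold else 0
        if count == run:
            return k
    return None


current = make_current(200, fault_at=65)
k_sc = first_detection(index_sc(current), threshold=0.1)
k_ocms = first_detection(index_ocms(current), threshold=0.1)
print("SC reports at sample", k_sc, "| OCMS reports at sample", k_ocms)
```

Both indices stay at zero (up to rounding) before the step and rise with the first faulty sample; the report itself is delayed by the three-consecutive-samples confirmation rule.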
Fig. 3 Fault detection without noise.

B. Abrupt change detection with noise

A phase-to-ground fault is created at 0.065 s with the system operating at the nominal frequency of 50 Hz, and noise at 20 dB SNR is added to the original signal for performance assessment. The normalized indices are shown in Fig. 4. It is observed that the values of the indices determined from the three algorithms discussed in Section II are significant even before the fault inception. However, the index of the proposed method identifies the instant of fault inception correctly. Since Index_sc crosses the threshold before the occurrence of the fault, i.e. in the steady state, it may be misinterpreted as a fault even though there is none. Index_ocms also exhibits a non-zero variation although it sums the noisy signal over a fixed data window, i.e. 20 samples per cycle. Thus, these indices exhibit variation in the pre-fault region and are not consistent in the post-fault region either. In contrast, the proposed method shows an almost zero index value in the pre-fault region and a consistent index in the post-fault region.

Fig. 4 Fault detection with 20 dB noise.

C. Abrupt change detection with DC offset

A phase-to-ground fault has been created at 0.065 s in the system. The current signal is processed through the different methods described in Sections II and IV. The normalized indices are given in Fig. 5. It is observed that the existing methods fail to sense the change immediately. In the presence of a DC offset, the algorithm based on the accumulated sum of samples in a fixed data window, i.e. the moving sum, deviates from its nominal value of zero or from a very small threshold value. Because it deviates even before the occurrence of the fault, the moving-sum algorithm is not a suitable approach for describing the fault occurrence in this case. Similarly, the other existing algorithms also perform poorly in the pre-fault period. On the other hand, the proposed approach does not deviate from the nominal threshold in the pre-fault region. Although the algorithm based on the PC approach shows comparable performance, it is inconsistent in the post-fault region since its value becomes equal to the threshold value, which is misinterpreted as fault inception.

Fig. 5 Fault detection in presence of dc-offset.

D. Abrupt change detection with harmonics

With the incorporation of a non-linear load at Bus 2, harmonics are generated in addition to the fundamental component of the signal. A phase-to-ground fault has been created at 0.065 s in the system. The current signal is processed through the different methods described in earlier sections. The normalized indices have been plotted in Fig. 6.

Fig. 6 Fault detection in presence of harmonics.

A favorable detection of the fault by the proposed algorithm over the existing ones is observed.
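The pre-fault sensitivity of the window-based indices to noise and DC offset can be reproduced on a steady-state sinusoid with no fault at all. The sketch below adds white Gaussian noise at a 20 dB SNR and, separately, a constant DC offset, and reports the largest sample comparison and one-cycle moving sum values. The constant offset of 0.2, the unit-peak sinusoid, the random seed and the helper names are illustrative assumptions; the simulated power system signals are not reproduced here.

```python
import math
import random

N = 20  # samples per cycle: 1 kHz sampling, 50 Hz nominal frequency


def sine(n_samples, peak=1.0, phase=0.4):
    """Steady-state (fault-free) sinusoid with N samples per cycle."""
    w = 2.0 * math.pi / N
    return [peak * math.sin(w * k + phase) for k in range(n_samples)]


def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise scaled to a target signal-to-noise ratio in dB."""
    p_signal = sum(v * v for v in signal) / len(signal)
    sigma = math.sqrt(p_signal / 10.0 ** (snr_db / 10.0))
    return [v + rng.gauss(0.0, sigma) for v in signal]


def max_index_sc(i):
    """Largest sample comparison index |i(k) - i(k-N)| over the record."""
    return max(abs(i[k] - i[k - N]) for k in range(N, len(i)))


def max_index_ocms(i):
    """Largest one-cycle moving sum |sum of the last N samples| over the record."""
    return max(abs(sum(i[k - N + 1:k + 1])) for k in range(N - 1, len(i)))


rng = random.Random(1)
clean = sine(2000)
noisy = add_noise(clean, 20.0, rng)
offset = [v + 0.2 for v in clean]  # constant DC offset (assumed value)

for name, sig in (("clean", clean), ("20 dB noise", noisy), ("dc offset", offset)):
    print(name, round(max_index_sc(sig), 3), round(max_index_ocms(sig), 3))
```

In this toy setting a constant offset d leaves the sample comparison index at zero but shifts the moving sum to N times d, while noise makes both indices non-zero without any fault, in line with the observations reported for Figs. 4 and 5.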
E. Abrupt change detection with change in frequency

Frequency variations are common in power systems. Thus, frequency estimation is indispensable for demonstrating the performance of the existing algorithms such as sample comparison, the phasor approach and the moving-sum approach. In this study, the frequency is first estimated by the variable leaky least mean square (VL-LMS) algorithm, which tracks the original frequency change faster than the complex LMS algorithm [16]-[20]. After frequency estimation, the assessment of the existing algorithms is demonstrated. A phase-to-ground fault has been created at 0.065 s in the system operating at an off-nominal frequency of 52 Hz. The normalized indices have been plotted in Fig. 7. As indicated, the proposed approach remains consistent, compared with the indices of the existing algorithms, in tracking the point of change.

Fig. 7 Fault detection with change in frequency.

VI. CONCLUSION

Fault detection for relaying applications is a challenging task in the presence of noise, harmonics and frequency change of the signal. Traditional methods are based on deterministic modeling, i.e. sinusoidal behavior of current/voltage, and are therefore sensitive to noise. In this paper, a novel fault detection algorithm was proposed based on the independent components of the current signal. The proposed technique does not assume sinusoidal behavior of the current/voltage signal. The performance of the method was assessed through simulation with different fault data and compared with existing techniques. It has been found that this method provides very consistent results under all the fault conditions considered. The method is compatible with any sampling frequency conventionally used for relaying applications.

REFERENCES

[1] G. Phadke and J. S. Thorp, Computer Relaying for Power Systems, New York: John Wiley, 1988.
[2] P. K. Dash, A. K. Pradhan, and G. Panda, "A novel fuzzy neural network based distance relaying scheme," IEEE Trans. on Power Delivery, vol. 15, pp. 902–907, 2000.
[3] T. S. Sidhu, D. S. Ghotra, and M. S. Sachdev, "An adaptive distance relay and its performance comparison with a fixed data window distance relay," IEEE Trans. on Power Delivery, vol. 17, pp. 691–697, 2002.
[4] M. S. Sachdev and M. Nagpal, "A recursive least square algorithm for power system relaying and measurement applications," IEEE Trans. on Power Delivery, vol. 6, pp. 1008–1015, 1991.
[5] F. N. Chowdhury, J. P. Christensen, and J. L. Aravena, "Power system fault detection and state estimation using Kalman filter with hypothesis testing," IEEE Trans. on Power Delivery, vol. 6, pp. 1025–1030, 1991.
[6] A. Girgis and D. G. Hart, "Implementation of Kalman and adaptive Kalman filtering algorithms for digital distance protection on a vector signal processor," IEEE Trans. on Power Delivery, vol. 4, pp. 141–156, 1989.
[7] A. Girgis, "A new Kalman filtering based digital distance relaying," IEEE Trans. on Power Apparatus and Systems, vol. 101, pp. 3471–3480, 1982.
[8] Abhishek Ukil, Rastko Živanović, "Abrupt change detection in power system fault analysis using wavelet transform," International Conference on Power Systems Transients (IPST'05), Montreal, Canada.
[9] Abhisek Ukil and Rastko Zivanovic, "Application of Abrupt Change Detection in Power Systems Disturbance Analysis and Relay Performance Monitoring," IEEE Trans. on Power Delivery, vol. 22, no. 1, 2007.
[10] Abhisek Ukil and Rastko Zivanovic, "Abrupt change detection in power system fault analysis using adaptive whitening filter and wavelet transform," Electric Power Systems Research, vol. 76, pp. 815–823, 2006.
[11] A. K. Pradhan, A. Routray, and S. R. Mohanty, "A Moving Sum Approach for Fault Detection of Power Systems," Electric Power Components and Systems, vol. 34, no. 4, pp. 385–399, 2005.
[12] Aapo Hyvärinen, Juha Karhunen, Erkki Oja, Independent Component Analysis, A Wiley Interscience Publication, John Wiley & Sons, Inc., 2001.
[13] Sanna Pöyhönen, Pedro Jover, Heikki Hyötyniemi, "Independent component analysis of vibrations for fault diagnosis of an induction motor," Proceedings of IASTED International Conference Circuits, Signals, and Systems, Cancun, Mexico, May 19-21, 2003.
[14] G. Gele, M. Colas, C. Serviere, "Blind source separation: A tool for rotating machine monitoring by vibration analysis," Journal of Sound and Vibration, vol. 248, no. 5, pp. 865-885, 2001.
[15] Hyvärinen, "Fast and robust fixed-point algorithms for independent component analysis," IEEE Trans. on Neural Networks, vol. 10, no. 3, pp. 626-634, 1999.
[16] F. Gustafson, Adaptive Filtering and Change Detection, New York: John Wiley, 2000.
[17] A. K. Pradhan, A. Routray and Abir Basak, "Power System Frequency Estimation Using Least Mean Square Technique," IEEE Trans. Power Delivery, vol. 20, no. 3, pp. 1812-1816, 2005.
[18] Orlando J. Tobias and Rui Seara, "On the LMS Algorithm with Constant and Variable Leakage Factor in a Nonlinear Environment," IEEE Trans. on Signal Processing, vol. 54, no. 9, pp. 3448-3458, 2006.
[19] Max Kemenetsky and Bernard Widrow, "A Variable Leaky LMS Adaptive Algorithm," IEEE conf. on signals, systems and computers, (1), pp. 125-128, November, 2004.
[20] Scott C. Douglas, "Performance Comparison of Two Implementations of the Leaky LMS Adaptive Filter," IEEE Trans. on Signal Processing, vol. 45, no. 8, pp. 2125-212, August, 1997.

AUTHORS BIBLIOGRAPHY

Mr. Satyabrata Das received the degree in Computer Sc & Engineering from Utkal University in 1996. He received the M.Tech. degree in CSE from ITER, Bhubaneswar. He is a research student of Fakir Mohan University, Balasore, in the dept. of I&CT. Currently, he is an Asst. Professor at College of Engineering Bhubaneswar, Orissa. His interests are in AI, Soft Computing, Data Mining, DSP and Neural Networks.
Soumya Ranjan Mohanty received the Ph.D. degree from Indian Institute of
Technology (IIT), Kharagpur, India. Currently he is an Assistant Professor in
the Department of Electrical Engineering, Motilal Nehru National Institute of
Technology (MNNIT), Allahabad, India. His research areas include digital
signal processing applications in power system relaying and power quality,
pattern recognition and distributed generation.
Dr. Sabyasachi Pattnaik received the M.Tech. degree in CSE from IIT,
Delhi. He received the Ph.D. degree in Computer Sc from Utkal University.
Currently, he is a professor at Fakir Mohan University, Balasore. His research
interests include Data Mining, AI and Soft Computing.
Modeling and Analyze the Deep Web: Surfacing
Hidden Value
SUNEET KUMAR
Associate Professor; Computer Science Dept.
Dehradun Institute of Technology, Dehradun, India
suneetcit81@gmail.com

ANUJ KUMAR YADAV
Assistant Professor; Computer Science Dept.
Dehradun Institute of Technology, Dehradun, India
anujbit@gmail.com

RAKESH BHARATI
Assistant Professor; Computer Science Dept.
Dehradun Institute of Technology, Dehradun, India
goswami.rakesh@gmail.com

RANI CHOUDHARY
Sr. Lecturer; Computer Science Dept.
BBDIT, Ghaziabad, India
ranichoudhary04@gmail.com
Abstract: Focused web crawlers have recently emerged as an alternative to the well-established web search engines. While the well-known focused crawlers retrieve relevant web-pages, there are various applications which target whole websites instead of single web-pages. For example, companies are represented by websites, not by individual web-pages. To answer queries targeted at websites, web directories are an established solution. In this paper, we introduce a novel focused website crawler to employ the paradigm of focused crawling for the search of relevant websites. The proposed crawler is based on a two-level architecture and corresponding crawl strategies with an explicit concept of websites. The external crawler views the web as a graph of linked websites, selects the websites to be examined next and invokes internal crawlers. Each internal crawler views the web-pages of a single given website and performs focused (page) crawling within that website. Our experimental evaluation demonstrates that the proposed focused website crawler clearly outperforms previous methods of focused crawling which were adapted to retrieve websites instead of single web-pages.

Keywords- Deep Web; Link references; Searchable Databases; Site page-views.

I. INTRODUCTION

A. The Deep Web

Internet content is considerably more diverse and the volume certainly much larger than commonly understood. First, though sometimes used synonymously, the World Wide Web (HTTP protocol) is but a subset of Internet content. Other Internet protocols besides the Web include FTP (file transfer protocol), e-mail, news, Telnet, and Gopher (most prominent among pre-Web protocols). This paper does not consider further these non-Web protocols [1]. Second, even within the strict context of the Web, most users are aware only of the content presented to them via search engines such as Excite, Google, AltaVista, or Northern Light, or search directories such as Yahoo!, About.com, or LookSmart. Eighty-five percent of Web users use search engines to find needed information, but nearly as high a percentage cite the inability to find desired information as one of their biggest frustrations.[2] According to a recent survey of search-engine satisfaction by market-researcher NPD, search failure rates have increased steadily since 1997.[3] The importance of information gathering on the Web and the central and unquestioned role of search engines -- plus the frustrations expressed by users about the adequacy of these engines -- make them an obvious focus of investigation. Our key findings include:
• Public information on the deep Web is currently 400 to 550 times larger than the commonly defined World Wide Web.
• The deep Web contains 7,500 terabytes of information compared to 19 terabytes of information in the surface Web.
• The deep Web contains nearly 550 billion individual documents compared to the one billion of the surface Web.
• More than 200,000 deep Web sites presently exist.
• Sixty of the largest deep-Web sites collectively contain about 750 terabytes of information -- sufficient by themselves to exceed the size of the surface Web forty times.
• On average, deep Web sites receive 50% greater monthly traffic than surface sites and are more highly linked to than surface sites; however, the typical (median) deep Web site is not well known to the Internet-searching public.
• The deep Web is the largest growing category of new information on the Internet.
• Deep Web sites tend to be narrower, with deeper content, than conventional surface sites.
• Total quality content of the deep Web is 1,000 to 2,000 times greater than that of the surface Web.
119 http://sites.google.com/site/ijcsis/
ISSN 1947-5500
(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 9, No. 6, June 2011
• Deep Web content is highly relevant to every information need, market, and domain.
• More than half of the deep Web content resides in topic-specific databases.

A full ninety-five per cent of the deep Web is publicly accessible information -- not subject to fees or subscriptions.

B. How Search Engines Work
Search engines obtain their listings in two ways: authors may submit their own Web pages, or the search engines "crawl" or "spider" documents by following one hypertext link to another. The latter returns the bulk of the listings. Crawlers work by recording every hypertext link in every page they index while crawling. Like ripples propagating across a pond, search-engine crawlers are able to extend their indices further and further from their starting points. The surface Web contains an estimated 2.5 billion documents, growing at a rate of 7.5 million documents per day.[4a] The largest search engines have done an impressive job in extending their reach, though Web growth itself has exceeded the crawling ability of search engines.[5][6] Today, the three largest search engines in terms of internally reported documents indexed are Google with 1.35 billion documents (500 million available to most searches),[7] Fast, with 575 million documents,[8] and Northern Light with 327 million documents.[9]

Moreover, return to the premise of how a search engine obtains its listings in the first place, whether adjusted for popularity or not. That is, without a linkage from another Web document, the page will never be discovered. But the main failing of search engines is that they depend on the Web's linkages to identify what is on the Web. Figure 1 is a graphical representation of the limitations of the typical search engine. The content identified is only what appears on the surface and the harvest is fairly indiscriminate. There is tremendous value that resides deeper than this surface content. The information is there, but it is hiding beneath the surface of the Web.

FIGURE 1. SEARCH ENGINES: DRAGGING A NET ACROSS THE WEB'S SURFACE

Additional aspects of this representation will be discussed throughout this study. For the moment, however, the key points are that content in the deep Web is massive -- approximately 500 times greater than that visible to conventional search engines -- with much higher quality throughout.

II. SEARCHABLE DATABASES: HIDDEN VALUE ON THE WEB
How does information appear and get presented on the Web? In the earliest days of the Web, there were relatively few documents and sites. It was a manageable task to post all documents as static pages. Because all pages were persistent and constantly available, they could be crawled easily by conventional search engines. In July 1994, the Lycos search engine went public with a catalog of 54,000 documents.[10] Since then, the compound growth rate in Web documents has been on the order of more than 200% annually.[11] Sites that were required to manage tens to hundreds of documents could easily do so by posting fixed HTML pages within a static directory structure. However, beginning about 1996, three phenomena took place. First, database technology was introduced to the Internet through such vendors as Bluestone's Sapphire/Web (Bluestone has since been bought by HP) and later Oracle. Second, the Web became commercialized, initially via directories and search engines, but rapidly evolved to include e-commerce. And, third, Web servers were adapted to allow the "dynamic" serving of Web pages (for example, Microsoft's ASP and the Unix PHP technologies). Figure 2 represents, in a non-scientific way, the improved results that can be obtained by Bright-Planet technology. By first identifying where the proper searchable databases reside, a directed query can then be placed to each of these sources simultaneously to harvest only the results desired -- with pinpoint accuracy.

FIGURE 2. HARVESTING THE DEEP AND SURFACE WEB WITH A DIRECTED QUERY ENGINE

III. STUDY OBJECTIVES
To perform the study discussed, we used our technology in an iterative process. Our goal was to:
• Quantify the size and importance of the deep Web.
• Characterize the deep Web's content, quality, and relevance to information seekers.
• Discover automated means for identifying deep Web search sites and directing queries to them.
• Begin the process of educating the Internet-searching public about this heretofore hidden and valuable information storehouse.

A. What Has Not Been Analyzed or Included in Results
This paper does not investigate non-Web sources of Internet content. This study also purposely ignores private intranet information hidden behind firewalls. Many large companies have internal document stores that exceed terabytes of information. Since access to this information is restricted, its scale cannot be defined nor can it be characterized. Also, while on average 44% of the "contents" of a typical Web document reside in HTML and other coded information (for example, XML or JavaScript),[12] this study does not evaluate specific information within that code. We do, however, include those codes in our quantification of total content. Finally, the estimates for the size of the deep Web include neither specialized search engine sources -- which may be partially "hidden" to the major traditional search engines -- nor the contents of major search engines themselves. This latter category is significant. Simply accounting for the three largest search engines and average Web document sizes suggests search-engine contents alone may equal 25 terabytes or more [13], or somewhat larger than the known size of the surface Web.

B. A Common Denominator for Size Comparisons
All deep-Web and surface-Web size figures use both total number of documents (or database records in the case of the deep Web) and total data storage. Data storage is based on "HTML included" Web-document size estimates.[11] This basis includes all HTML and related code information plus standard text content, exclusive of embedded images and standard HTTP "header" information. Use of this standard convention allows apples-to-apples size comparisons between the surface and deep Web. The HTML-included convention was chosen because:
• Most standard search engines that report document sizes do so on this same basis.
• When saving documents or Web pages directly from a browser, the file size byte count uses this convention.

All document sizes used in the comparisons use actual byte counts (1024 bytes per kilobyte).

IV. ANALYSIS OF LARGEST DEEP WEB SITES
Site characterization required three steps:
• Estimating the total number of records or documents contained on that site.
• Retrieving a random sample of a minimum of ten results from each site and then computing the expressed HTML-included mean document size in bytes. This figure, times the number of total site records, produces the total site size estimate in bytes.
• Indexing and characterizing the search-page form on the site to determine subject coverage.

Estimating total record count per site was often not straightforward. A series of tests was applied to each site; they are listed in descending order of importance and confidence in deriving the total document count:
• E-mail messages were sent to the webmasters or contacts listed for all sites identified, requesting verification of total record counts and storage sizes (uncompressed basis); about 13% of the sites provided direct documentation in response to this request.
• Total record counts as reported by the site itself. This involved inspecting related pages on the site, including help sections, site FAQs, etc.
• Documented site sizes presented at conferences, estimated by others, etc. This step involved comprehensive Web searching to identify reference sources.
• Record counts as provided by the site's own search function. Some site searches provide total record counts for all queries submitted. For others that use the NOT operator and allow its stand-alone use, a query term known not to occur on the site such as "NOT ddfhrwxxct" was issued. This approach returns an absolute total record count. Failing these two options, a broad query was issued that would capture the general site content; this number was then corrected for an empirically determined "coverage factor," generally in the 1.2 to 1.4 range [14].
• A site that failed all of these tests could not be measured and was dropped from the results listing.

V. ANALYSIS OF STANDARD DEEP WEB SITES
Analysis and characterization of the entire deep Web involved a number of discrete tasks:
• Estimation of total number of deep Web sites.
• Deep Web size analysis.
• Content and coverage analysis.
• Site page views and link references.
• Growth analysis.
• Quality analysis.

A. Estimation of Total Number of Sites
The basic technique for estimating total deep Web sites uses "overlap" analysis, the accepted technique chosen for two
of the more prominent surface Web size analyses.[5b][15] We used overlap analysis based on search engine coverage and the deep Web compilation sites noted above. The technique is illustrated in the diagram below:

FIGURE 3. SCHEMATIC REPRESENTATION OF "OVERLAP" ANALYSIS

Overlap analysis involves pair-wise comparisons of the number of listings individually within two sources, na and nb, and the degree of shared listings or overlap, n0, between them. Assuming random listings for both na and nb, the total size of the population, N, can be estimated. The estimate of the fraction of the total population covered by na is n0/nb; an estimate for the total population size can then be derived by dividing this fraction into the total size of na. These pair-wise estimates are repeated for all of the individual sources used in the analysis.

To illustrate this technique, assume, for example, we know our total population is 100. Then if two sources, A and B, each contain 50 items, we could predict on average that 25 of those items would be shared by the two sources and 25 items would not be listed by either. According to the formula above, this can be represented as: 100 = 50 / (25/50).

There are two keys to overlap analysis. First, it is important to have a relatively accurate estimate for total listing size for at least one of the two sources in the pair-wise comparison. Second, both sources should obtain their listings randomly and independently from one another. This second premise is in fact violated for our deep Web source analysis. Compilation sites are purposeful in collecting their listings, so their sampling is directed. And, for search engine listings, searchable databases are more frequently linked to because of their information value, which increases their relative prevalence within the engine listings.[4b] Thus, the overlap analysis represents a lower bound on the size of the deep Web since both of these factors will tend to increase the degree of overlap, n0, reported between the pair-wise sources.

B. Deep Web Size Analysis
In order to analyze the total size of the deep Web, we need an average site size in documents and data storage to use as a multiplier applied to the entire population estimate. We randomized our listing of 17,000 search site candidates. We then proceeded to work through this list until 100 sites were fully characterized. We followed a less intensive process than in the large-sites analysis for determining total record or document count for the site. Exactly 700 sites were inspected in their randomized order to obtain the 100 fully characterized sites. All sites inspected received characterization as to site type and coverage; this information was used in other parts of the analysis.

C. Content Coverage and Type Analysis
Content coverage was analyzed across all 17,000 search sites in the qualified deep Web pool (results shown in Table 1); the type of deep Web site was determined from the 700 hand-characterized sites. Broad content coverage for the entire pool was determined by issuing queries for twenty top-level domains against the entire pool. Because of topic overlaps, total occurrences exceeded the number of sites in the pool; this total was used to adjust all categories back to a 100% basis.

TABLE 1. DISTRIBUTION OF DEEP SITES BY SUBJECT AREA
Subject area          Deep Web coverage
Agriculture           2.7%
Arts                  6.6%
Business              5.9%
Computing/Web         6.9%
Education             4.3%
Employment            4.1%
Engineering           3.1%
Government            3.9%
Health                5.5%
Humanities            13.5%
Law/Politics          3.9%
Lifestyle             4.0%
News/Media            12.2%
People, companies     4.9%
Recreation, Sports    3.5%
References            4.5%
Science, Math         4.0%
Travel                3.4%
Shopping              3.2%

Hand characterization by search-database type resulted in assigning each site to one of twelve arbitrary categories that captured the diversity of database types. These twelve categories are:
• Topic Databases -- subject-specific aggregations of information, such as SEC corporate filings, medical databases, patent records, etc.
• Internal site -- searchable databases for the internal pages of large sites that are dynamically created, such as the knowledge base on the Microsoft site.
• Publications -- searchable databases for current and archived articles.
• Shopping/Auction.
• Classifieds.
• Portals -- broader sites that included more than one of these other categories in searchable databases.
• Library -- searchable internal holdings, mostly for university libraries.
• Yellow and White Pages -- people and business finders.
• Calculators -- while not strictly databases, many do include an internal data component for calculating results. Mortgage calculators, dictionary look-ups, and translators between languages are examples.
• Jobs -- job and resume postings.
• Message or Chat.
• General Search -- searchable databases most often relevant to Internet search topics and information.

D. Site Page-views and Link References
Netscape's "What's Related" browser option, a service from Alexa, provides site popularity rankings and link reference counts for a given URL.[17] About 71% of deep Web sites have such rankings. The universal power function (a logarithmic growth rate or logarithmic distribution) allows page-views per month to be extrapolated from the Alexa popularity rankings.[18] The "What's Related" report also shows external link counts to the given URL. A random sample of 100 deep and 100 surface Web sites for which complete "What's Related" reports could be obtained was used for the comparisons.

E. Growth Analysis
The best method for measuring growth is with time-series analysis. However, since the discovery of the deep Web is so new, a different gauge was necessary. Whois [19] searches associated with domain-registration services [16] return records listing the domain owner, as well as the date the domain was first obtained (and other information). Using a random sample of 100 deep Web sites [17b] and another sample of 100 surface Web sites [20], we issued the domain names to a Whois search and retrieved the date the site was first established. These results were then combined and plotted for the deep vs. surface Web samples.

F. Quality Analysis
Quality comparisons between the deep and surface Web content were based on five diverse subject areas: agriculture, medicine, finance/business, science, and law. The queries were specifically designed to limit total results returned from any of the six sources to a maximum of 200 to ensure complete retrieval from each source.[21] The specific technology configuration settings are documented in the endnotes.[22] The "quality" determination was based on an average of our technology's VSM and mEBIR computational linguistic scoring methods.[23][24] The "quality" threshold was set at our score of 82, empirically determined as roughly accurate from millions of previous scores of surface Web documents.

VI. CONCLUSION
This study is the first known quantification and characterization of the deep Web. Very little has been written or known of the deep Web. Estimates of size and importance have been anecdotal at best and certainly underestimate scale. For example, Intelliseek's "invisible Web" says that, "In our best estimates today, the valuable content housed within these databases and searchable sources is far bigger than the 800 million plus pages of the 'Visible Web.'" They also estimate total deep Web sources at about 50,000 or so.[25] A mid-1999 survey by About.com's Web search guide concluded the size of the deep Web was "big and getting bigger."[26] A paper at a recent library science meeting suggested that only "a relatively small fraction of the Web is accessible through search engines."[27] The deep Web is about 500 times larger than the surface Web, with, on average, about three times higher quality based on our document scoring methods on a per-document basis. On an absolute basis, total deep Web quality exceeds that of the surface Web by thousands of times. Total number of deep Web sites likely exceeds 200,000 today and is growing rapidly.[28] Content on the deep Web has meaning and importance for every information seeker and market. More than 95% of deep Web information is publicly available without restriction. The deep Web also appears to be the fastest growing information component of the Web.

REFERENCES
[1]. A couple of good starting references on various Internet protocols can be found at http://wdvl.com/Internet/Protocols/ and http://www.webopedia.com/Internet_and_Online_Services/Internet/Internet_Protocols/.
[2]. Tenth edition of GVU's (graphics, visualization and usability) WWW User Survey, May 14, 1999. [formerly http://www.gvu.gatech.edu/user_surveys/survey-1998-10/tenthreport.html.]
[3]. 3a, 3b. "4th Q NPD Search and Portal Site Study," as reported by Search Engine Watch [formerly http://searchenginewatch.com/reports/npd.html]. NPD's Web site is at http://www.npd.com/.
[4]. 4a, 4b. "Sizing the Internet," Cyveillance [formerly http://www.cyveillance.com/web/us/downloads/Sizing_the_Internet.pdf].
[5]. 5a, 5b. S. Lawrence and C.L. Giles, "Searching the World Wide Web," Science 280:98-100, April 3, 1998.
[6]. S. Lawrence and C.L. Giles, "Accessibility of Information on the Web," Nature 400:107-109, July 8, 1999.
[7]. See http://www.google.com.
[8]. See http://www.alltheweb.com and quoted numbers on entry page.
[9]. Northern Light is one of the engines that allows a "NOT meaningless" query to be issued to get an actual document count from its data stores. See http://www.northernlight.com. NL searches used in this article exclude its "Special Collections" listing.
[10]. See http://www.wiley.com/compbooks/sonnenreich/history.html.
[11]. 11a, 11b. This analysis assumes there were 1 million documents on the Web as of mid-1994.
[12]. Empirical Bright-Planet results from processing millions of documents provide an actual mean value of 43.5% for HTML and related content. Using a different metric, NEC researchers found HTML and related content with white space removed to account for 61% of total page content (see 7). Both measures ignore images and so-called HTML header content.
[13]. Rough estimate based on 700 million total documents indexed by AltaVista, Fast, and Northern Light, at an average document size of 18.7 KB (see reference 7) and a 50% combined representation by these three sources for all major search engines. Estimates are on an "HTML included" basis.
[14]. For example, the query issued for an agriculture-related database might be "agriculture." Then, by issuing the same query to Northern Light and comparing it with a comprehensive query that does not mention the term "agriculture" [such as "(crops OR livestock OR farm OR corn OR rice OR wheat OR vegetables OR fruit OR cattle OR pigs OR poultry OR sheep OR horses) AND NOT agriculture"] an empirical coverage factor is calculated.
[15]. K. Bharat and A. Broder, "A Technique for Measuring the Relative Size and Overlap of Public Web Search Engines," paper presented at the Seventh International World Wide Web Conference, Brisbane, Australia, April 14-18, 1998. The full paper is available at http://www7.scu.edu.au/1937/com1937.htm.
[16]. See, for example, http://www.surveysystem.com/sscalc.htm, for a sample size calculator.
[17]. 17a, 17b. See [formerly http://cgi.netscape.com/cgi-bin/rlcgi.cgi?URL=www.mainsite.com./dev-scripts/dpd].
[18]. See reference 38. Known page-views for the logarithmic popularity rankings of selected sites tracked by Alexa are used to fit a growth function for estimating monthly page-views based on the Alexa ranking for a given URL.
[19]. See, for example among many, BetterWhois at http://betterwhois.com.
[20]. The surface Web domain sample was obtained by first issuing a meaningless query to Northern Light, 'the AND NOT ddsalsrasve' and obtaining 1,000 URLs. This 1,000 was randomized to remove (partially) ranking prejudice in the order Northern Light lists results.
[21]. An example specific query for the "agriculture" subject areas is "agriculture* AND (swine OR pig) AND 'artificial insemination' AND genetics."
[22]. The Bright-Planet technology configuration settings were: max. Web page size, 1 MB; min. page size, 1 KB; no date range filters; no site filters; 10 threads; 3 retries allowed; 60 sec. Web page timeout; 180 minute max. download time; 200 pages per engine.
[23]. The vector space model, or VSM, is a statistical model that represents documents and queries as term sets, and computes the similarities between them. Scoring is a simple sum-of-products computation, based on linear algebra. See further: Salton, Gerard, Automatic Information Organization and Retrieval, McGraw-Hill, New York, N.Y., 1968; and, Salton, Gerard, Automatic Text Processing, Addison-Wesley, Reading, MA, 1989.
[24]. See, as one example among many, CareData.com, at [formerly http://www.citeline.com/pro_info.html].
[25]. See the Help and then FAQ pages at [formerly http://www.invisibleweb.com].
[26]. C. Sherman, "The Invisible Web," [formerly http://websearch.about.com/library/weekly/aa061199.htm].
[27]. I. Zachery, "Beyond Search Engines," presented at the Computers in Libraries 2000 Conference, March 15-17, 2000, Washington, DC; [formerly http://www.pgcollege.org/library/zac/beyond/index.htm].
[28]. The initial July 26, 2000, version of this paper stated an estimate of 100,000 potential deep Web search sites. Subsequent customer projects have allowed us to update this analysis, again using overlap analysis, to 200,000 sites. This site number is updated in this paper, but overall deep Web size estimates have not. In fact, still more recent work with foreign language deep Web sites strongly suggests the 200,000 estimate is itself low.
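
As a closing illustration of two calculations described in the paper, the overlap estimate of Section V.A and the per-site size estimate of Section IV can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the function names, argument names, and input checks are hypothetical and are not taken from the study's own tooling.

```python
def overlap_population_estimate(n_a: int, n_b: int, n_0: int) -> float:
    """Estimate the total population N from two listings and their overlap.

    The fraction of the population covered by source A is approximated by
    n_0 / n_b; dividing n_a by that fraction gives N = n_a * n_b / n_0.
    """
    if n_0 <= 0:
        raise ValueError("overlap n_0 must be positive to form an estimate")
    if n_0 > min(n_a, n_b):
        raise ValueError("overlap cannot exceed either listing size")
    coverage_of_a = n_0 / n_b
    return n_a / coverage_of_a


def site_size_bytes(record_count: int, sample_doc_sizes: list[int]) -> float:
    """Total site size: mean sampled document size times the record count."""
    if len(sample_doc_sizes) < 10:
        # Assumption: mirrors the stated minimum of ten sampled results.
        raise ValueError("at least ten sampled document sizes are expected")
    mean_size = sum(sample_doc_sizes) / len(sample_doc_sizes)
    return mean_size * record_count


if __name__ == "__main__":
    # Worked example from the text: A and B each list 50 items, 25 shared.
    print(overlap_population_estimate(50, 50, 25))  # 100.0
```

Because directed sampling inflates the overlap n_0, the first function returns a lower bound on N under the paper's own caveat, not an unbiased estimate.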
Instigation of Orthogonal Wavelet Transforms using
Walsh, Cosine, Hartley, Kekre Transforms and their
use in Image Compression
Dr. H. B. Kekre, Sr. Professor, MPSTME, SVKM's NMIMS (Deemed-to-be University), Vileparle (W), Mumbai-56, India.
Dr. Tanuja K. Sarode, Asst. Professor, Thadomal Shahani Engg. College, Bandra (W), Mumbai-50, India.
Sudeep D. Thepade, Associate Professor, MPSTME, SVKM's NMIMS (Deemed-to-be University), Vileparle (W), Mumbai-56, India.
Ms. Sonal Shroff, Lecturer, Thadomal Shahani Engg. College, Bandra (W), Mumbai-50, India.
Abstract: In this paper a novel orthogonal wavelet transform generation method is proposed. To check the advantage of wavelet transforms over the respective orthogonal transform in image compression, the generated wavelet transforms are applied to color images of size 256x256x3 on each of the color planes R, G, and B separately, and thus the transformed R, G, and B planes are obtained. From each of these transformed color planes, 70% to 95% of the data (in the form of coefficients having lower energy values) is removed and the image is reconstructed. The orthogonal transforms Discrete Cosine Transform (DCT), Walsh Transform, Hartley Transform and Kekre Transform are used for the generation of DCT Wavelets, Walsh Wavelets, Hartley Wavelets, and Kekre Wavelets respectively. From the results it is observed that the respective wavelet transform outperforms the original orthogonal transform.

I. INTRODUCTION
The development of wavelets can be linked to several separate trains of thought, starting with Haar's work in the early 20th century [16,17]. Wavelets are mathematical tools that can be used to extract information from many different kinds of data, including images [21,22,24]. Sets of wavelets are generally needed to analyze data fully. A set of "complementary" wavelets will reconstruct data without gaps or overlap so that the deconstruction process is mathematically reversible and is with minimal loss. Generally, wavelets are purposefully crafted to have specific properties that make them useful for image processing. Wavelets can be combined, using a "shift, multiply and sum" technique called convolution, with portions of an unknown signal (data) to extract information from the unknown signal. Wavelet transforms are now being adopted for a vast number of applications, often replacing the conventional Fourier transform. They have advantages over traditional Fourier methods in analyzing physical situations where the signal contains discontinuities and sharp spikes [1-4]. In Fourier analysis the local properties of the signal are not detected easily. The STFT (Short Time Fourier Transform) [5] was introduced to overcome this difficulty. However, it gives local properties at the cost of global properties. Wavelets overcome this shortcoming of Fourier analysis [6,7] as well as of the STFT. Many areas of physics have seen this paradigm shift, including molecular dynamics, astrophysics, optics, quantum mechanics etc. This change has also occurred in image processing, blood-pressure, heart-rate and ECG analyses, DNA analysis, protein analysis, climatology, general signal processing, speech, face recognition, computer graphics and multifractal analysis. Wavelet transforms are also starting to be used for communication applications. One use of wavelet approximation is in data compression. Like other transforms, wavelet transforms can be used to transform data and then encode the transformed data, resulting in effective compression [8]. Wavelet compression can be either lossless or lossy. The wavelet compression methods are adequate for representing high-frequency components in two-dimensional images.

So far, wavelets of only the Haar transform have been studied. This paper presents the generation of wavelets from other transforms, namely the Walsh transform, DCT, Hartley transform and Kekre transform. The use of these transform wavelets is also proposed and studied for image compression. The experimental results show that better data compression can be achieved with the transform wavelets than with the image transforms themselves.

II. EXISTING TRANSFORMS
This section discusses some of the existing transforms: Walsh, DCT, Hartley and Kekre.

A. DCT
A discrete cosine transform (DCT) expresses a sequence of finitely many data points in terms of a sum of cosine functions oscillating at different frequencies. In particular, a DCT is a Fourier-related transform similar to the discrete Fourier transform (DFT), but using only real numbers. DCTs are equivalent to DFTs of roughly twice the length, operating on real data with even symmetry. There are eight standard DCT variants, of which four are common. The DCTs are important to numerous applications in science and engineering, from lossy compression of audio and images to spectral methods for the numerical solution of partial differential equations. For compression, the cosine functions are much more efficient, whereas for differential equations the cosines express a particular choice of boundary conditions.
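
To make the DCT description concrete, the following is a minimal sketch that builds an orthonormal DCT matrix in plain Python and applies it to a short signal. It assumes the DCT-II variant with orthonormal scaling; the paper does not say which of the standard variants it uses, and the helper names `dct_matrix` and `apply_transform` are hypothetical.

```python
import math


def dct_matrix(n: int) -> list[list[float]]:
    """Return the n x n orthonormal DCT-II matrix (row k = k-th cosine basis)."""
    rows = []
    for k in range(n):
        # Row 0 is the constant (DC) basis vector and uses a smaller scale
        # factor so that every row has unit length.
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def apply_transform(matrix, signal):
    """Multiply the transform matrix by a 1-D signal (list of numbers)."""
    return [sum(m * s for m, s in zip(row, signal)) for row in matrix]


if __name__ == "__main__":
    coeffs = apply_transform(dct_matrix(4), [1.0, 1.0, 1.0, 1.0])
    # A constant signal puts all of its energy in the first (DC) coefficient;
    # the remaining coefficients are zero up to floating-point rounding.
    print(coeffs)
```

Because the rows are orthonormal, the inverse transform is simply the transpose, which is what makes such a matrix a convenient starting point for the wavelet construction in Section III.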
B. Walsh Transform
The Walsh matrix was proposed by Joseph Leonard Walsh in 1923 [18,19]. Each row of a Walsh matrix corresponds to a Walsh function. A Walsh matrix is a square matrix, with dimensions a power of 2. The entries of the matrix are either +1 or −1. It has the property that the dot product of any two distinct rows (or columns) is zero [20,23,25]. The sequency ordering of the rows of the Walsh matrix can be derived from the ordering of the Hadamard matrix by first applying the bit-reversal permutation and then the Gray code permutation [9]. The Walsh matrix (and Walsh functions) are used in computing the Walsh transform and have applications in the efficient implementation of certain signal processing operations.

C. Hartley Transform
The Hartley transform was proposed by R. V. L. Hartley in 1942 as an alternative to the Fourier transform [10]. It is one of many known Fourier-related transforms. Compared to the Fourier transform, the Hartley transform has the advantages of transforming real functions to real functions (as opposed to requiring complex numbers) and of being its own inverse.

D. Kekre Transform
The Kekre transform [11] matrix is the generic version of Kekre's LUV color space matrix [12-15]. Most other transform matrices must have dimensions that are powers of 2. This condition is not required in the Kekre transform. Any term in the Kekre transform is generated as

\[
K_{x,y} =
\begin{cases}
1, & x \le y \\
-N + (x + 1), & x = y + 1 \\
0, & x > y + 1
\end{cases}
\tag{1}
\]

All diagonal elements and the upper diagonal elements are one, while lower diagonal elements except the one exactly below the diagonal are zero.

III. GENERATING WAVELET FROM ANY ORTHOGONAL TRANSFORM
A wavelet transform matrix of size P^2 x P^2 can be generated from any orthogonal transform M of size PxP. For example, if we have an orthogonal transform matrix of size 9x9, then its corresponding wavelet transform matrix will have size 81x81; i.e., for an orthogonal matrix of size P, the wavelet transform matrix size will be Q, such that Q = P^2.

Consider an orthogonal transform M of size PxP as shown below.
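\[
M =
\begin{bmatrix}
M_{11} & M_{12} & \cdots & M_{1(P-1)} & M_{1P} \\
M_{21} & M_{22} & \cdots & M_{2(P-1)} & M_{2P} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
M_{P1} & M_{P2} & \cdots & M_{P(P-1)} & M_{PP}
\end{bmatrix}
\]
Figure 1: PxP orthogonal transform matrix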
1st column of M 2nd column of M pth column of M
Repeated P times Repeated P times Repeated P times
M11 M11 ... M11 M12 M12 ... M12 ... M1P M1P ... M1P
M21 M21 ... M21 M22 M22 ... M22 ... M2P M2P ... M2P
. . ... . . . ... . ... . . ... .
. . ... . . . ... . ... . . ... .
MP1 MP1 ... MP1 MP2 MP2 ... MP2 ... MPP MPP ... MPP
M21 M22 ... M2P 0 0 ... 0 ... 0 0 ... 0
0 0 ... 0 M21 M22 ... M2P ... 0 0 ... 0
. . . . . . . . . . . . .
. . . . . . . . . . . . .
0 0 ... 0 0 0 ... 0 ... M21 M22 ... M2P
... ... ...
... ... ...
MP1 MP2 ... MPP 0 0 ... 0 ... 0 0 ... 0
0 0 ... 0 MP1 MP2 ... MPP ... 0 0 ... 0
. . . . . . . . . . . . .
. . . . . . . . . . . . .
0 0 ... 0 0 0 ... 0 ... MP1 MP2 ... MPP
Figure 2: QxQ wavelet transform generated from PxP orthogonal transform (Q = P²)
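The layout of Figure 2 can be sketched with Kronecker products. This is a hedged NumPy illustration rather than the authors' code: `wavelet_from_orthogonal` is a hypothetical name, "translated" is read as shifting a row by P columns at a time as the figure suggests, and a 16x16 Hadamard matrix is used only as a convenient stand-in for M.

```python
import numpy as np

def wavelet_from_orthogonal(M):
    """Build the P^2 x P^2 matrix of Figure 2 from a P x P matrix M (hypothetical helper)."""
    M = np.asarray(M, dtype=float)
    P = M.shape[0]
    # First P rows: every column of M is repeated P times.
    blocks = [np.kron(M, np.ones((1, P)))]
    # Rows 2..P of M: each row is placed at P successive offsets of P columns.
    for r in range(1, P):
        blocks.append(np.kron(np.eye(P), M[r:r + 1, :]))
    return np.vstack(blocks)

# Stand-in example: a 16x16 Hadamard matrix gives a 256x256 wavelet matrix.
H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
H16 = np.kron(np.kron(H2, H2), np.kron(H2, H2))
W = wavelet_from_orthogonal(H16)
gram = W @ W.T
print(W.shape, np.allclose(gram, np.diag(np.diag(gram))))
```

For a 2x2 matrix [[1, 1], [1, -1]] this construction yields the unnormalized 4x4 Haar matrix. In this sketch the rows of the result stay mutually orthogonal when the first row of M is constant, so that every other row of M sums to zero.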
Figure 2 shows the QxQ wavelet transform matrix generated from the PxP orthogonal transform matrix, such that Q = P². To generate the wavelet matrix, every column of the orthogonal transform matrix is repeated P times, which gives the first P rows. Then the second row is translated P times to generate the next P rows. Similarly, each of the remaining rows is translated to generate P rows corresponding to that row. Finally we get the wavelet matrix of size QxQ, where Q = P².

IV. PROPOSED METHOD
In this section, the application of the wavelet transform to image compression is proposed.
Step 1. Consider an image of size 256x256. The wavelet transform matrix of size 256x256 is generated from an orthogonal matrix of size 16x16.
Step 2. The wavelet transform is applied on each image plane, i.e. the R-plane, G-plane and B-plane, separately. Thus, the transformed R-plane, G-plane and B-plane are obtained.
Step 3. From each of the transformed R-plane, G-plane and B-plane separately, the 70% to 95% of coefficients having the lowest energy values are removed, and then the image is reconstructed.
Step 4. The mean square error (MSE) between the reconstructed image and the original image is computed.

V. RESULTS AND DISCUSSION
The proposed method is implemented using MatLab 7.0 on a Core 2 Duo processor. The DCT, Walsh, Hartley and Kekre wavelets were generated by the method discussed in Section III. Eleven color images of size 256x256, belonging to different categories, were compressed using the proposed method. Figure 3 shows the eleven color test images of size 256x256x3.

Figure 3: Eleven original color test images, namely Aishwariya, Balls, Bird, Boat, Flower, Ganesh, Scenary, Strawberry, Tajmahal, Tiger and Viharlake (from left to right and top to bottom), belonging to different categories

Tables 1, 3, 5 and 7 show the MSE values obtained from data compressed using the DCT, Walsh, Hartley and Kekre transforms, respectively, applied on all eleven test images.
Tables 2, 4, 6 and 8 show the MSE values obtained from data compressed using the DCT wavelet, Walsh wavelet, Hartley wavelet and Kekre wavelet transforms, respectively, applied on all eleven test images.
Figure 4 compares the average MSE for 95% to 70% data compression using the DCT wavelet, Walsh wavelet, Hartley wavelet and Kekre wavelet and the DCT, Walsh, Hartley and Kekre transforms.
Figures 5, 6, 7 and 8 show the results for the Balls image obtained from the DCT wavelet, Walsh wavelet, Hartley wavelet and Kekre wavelet, respectively, for 70% to 95% data compression.
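Steps 2 to 4 of the proposed method in Section IV can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the MatLab implementation used for the results: the function names are hypothetical, the transform matrix T is assumed to have orthonormal rows and to be applied to each plane as T X Tᵀ, "energy" is taken to be the squared coefficient value, and a small 8x8 normalized Hadamard matrix and a random image stand in for the 256x256 wavelet matrices and test images.

```python
import numpy as np

def compress_plane(plane, T, fraction_removed):
    """Transform one plane, zero the lowest-energy coefficients, reconstruct."""
    coeffs = T @ plane @ T.T                      # forward 2-D transform (assumed form)
    k = int(round(fraction_removed * coeffs.size))
    if k > 0:
        # Indices of the k coefficients with the smallest squared magnitude.
        smallest = np.argsort(coeffs ** 2, axis=None)[:k]
        coeffs.flat[smallest] = 0.0
    return T.T @ coeffs @ T                       # inverse of an orthonormal transform

def compress_image(image, T, fraction_removed):
    """Apply compress_plane to the R, G and B planes separately."""
    planes = [compress_plane(image[..., c], T, fraction_removed)
              for c in range(image.shape[-1])]
    return np.stack(planes, axis=-1)

def mse(a, b):
    """Mean square error between two arrays of equal shape."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.mean(diff ** 2))

# Stand-in data: an 8x8 orthonormal Hadamard matrix and a random 8x8x3 image.
H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
T = np.kron(np.kron(H2, H2), H2) / np.sqrt(8.0)
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(8, 8, 3)).astype(float)

for fraction in (0.70, 0.80, 0.95):
    print(fraction, mse(image, compress_image(image, T, fraction)))
```

Because T is orthonormal in this sketch, the MSE equals the discarded coefficient energy divided by the number of pixels, so it cannot decrease as a larger fraction of coefficients is removed.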
Table 1: Comparison of MSE values obtained for 95% to 70% data compressed using DCT applied on all eleven images.
%data compressed 95 90 85 80 75 70
%data retained 5 10 15 20 25 30
Aishwariya 16.0803 8.1392 4.542 2.7457 1.7756 1.2044
Balls 75.5739 62.2747 50.7583 40.078 30.6593 22.6352
Bird 23.4414 19.7856 17.1511 14.846 12.6658 10.6395
Boat 63.2849 56.6238 49.9363 43.0231 36.116 29.7334
Flower 23.2196 13.3896 8.2092 5.158 3.3155 2.1642
Ganesh 66.9069 60.1663 53.5089 47.1965 41.0909 34.991
Scenary 32.4582 26.3064 21.5738 17.7571 14.5049 11.6967
Strawberry 42.358 30.1477 21.656 15.8716 11.6467 8.5902
Tajmahal 49.7616 39.457 30.907 23.5085 17.5614 12.7884
Tiger 67.5201 53.9452 42.909 33.8272 26.4423 20.1247
Viharlake 42.4999 35.5583 29.7518 24.3079 19.42 15.3702
Average 45.7368 36.89035 30.08213 24.39269 19.56349 15.4489
Table 2: Comparison of MSE values obtained for 95% to 70% data compressed using DCT wavelets applied on all eleven images.
%data Compressed 95 90 85 80 75 70
%data retained 5 10 15 20 25 30
Aishwariya 13.9138 5.4578 2.5735 1.3751 0.7916 0.4853
Balls 67.3254 51.4344 38.2372 27.1908 18.4033 11.9991
Bird 15.1267 7.2307 3.5948 1.975 1.2121 0.7956
Boat 54.5218 44.1746 35.3552 27.2747 19.6998 13.4656
Flower 18.9017 7.4338 3.2424 1.5961 0.8935 0.5555
Ganesh 65.2921 56.5384 48.5545 40.7238 33.151 25.9447
Scenary 29.7195 20.5452 13.5051 8.2269 4.8059 2.7274
Strawberry 40.4291 27.01 17.4447 10.6523 6.2435 3.5483
Tajmahal 41.9902 29.0375 19.7188 12.8007 8.0717 5.0573
Tiger 65.7406 49.9408 37.7845 27.6822 19.5049 13.2931
Viharlake 38.3256 29.412 21.8512 15.5274 10.4845 6.8427
Average 41.02605 29.83775 21.98745 15.91136 11.20562 7.701327
Table 3: Comparison of MSE values obtained for 95% to 70% data compressed using Walsh transform applied on all eleven images.
%data Compressed 95 90 85 80 75 70
%data retained 5 10 15 20 25 30
Aishwariya 28.1012 18.5335 13.2878 9.6173 6.9535 4.9937
Balls 81.3445 72.1406 63.6927 55.4022 47.672 39.9922
Bird 29.0555 24.9949 21.8778 19.0824 16.3321 13.764
Boat 66.874 60.8236 55.0104 49.0244 42.8836 36.6938
Flower 36.1558 26.6554 20.6424 15.9753 12.2055 9.2632
Ganesh 71.3259 65.8241 60.7583 55.4545 49.8489 43.9028
Scenary 36.995 30.8505 26.1749 22.1124 18.3582 14.9854
Strawberry 50.8574 42.5104 35.597 29.5755 23.963 18.9102
Tajmahal 56.1151 46.753 39.0347 32.1825 26.1253 20.74
Tiger 76.8846 67.4124 59.7075 52.2118 44.8674 37.3638
Viharlake 46.1636 40.4746 35.3193 30.1336 25.1187 20.408
Average 52.71569 45.17936 39.19116 33.70654 28.57529 23.72883
Table 4: Comparison of MSE values obtained for 95% to 70% data compressed using Walsh wavelets applied on all eleven images.
%data compressed 95 90 85 80 75 70
%data retained 5 10 15 20 25 30
Aishwariya 24.2798 14.1835 8.4923 5.1467 3.2017 2.0188
Balls 71.2873 59.9048 50.0267 40.7423 31.9771 24.201
Bird 21.37 12.2772 6.7889 3.7529 2.1759 1.3296
Boat 57.1472 47.5557 39.4294 31.806 24.5572 17.8547
Flower 29.8175 18.8151 11.6577 6.9575 4.0902 2.3961
Ganesh 68.7866 61.5798 54.7242 47.9489 41.1375 34.1747
Scenary 33.2859 24.4968 17.4245 11.8686 7.8334 5.1252
Strawberry 47.5166 37.5645 29.5413 22.3788 16.4636 11.6458
Tajmahal 46.4039 34.44 25.3372 18.217 12.5737 8.4354
Tiger 74.0328 62.677 52.7802 43.7662 35.2198 27.521
Viharlake 41.6009 33.4509 26.2753 19.7348 14.0675 9.508
Average 46.86623 36.99503 29.31615 22.93815 17.57251 13.11003
Table 5: Comparison of MSE values obtained for 95% to 70% data compressed using Hartley transform applied on all eleven images.
%data compressed 95 90 85 80 75 70
%data retained 5 10 15 20 25 30
Aishwariya 17.5702 9.2244 5.3507 3.3036 2.144 1.4455
Balls 76.1777 62.8777 51.2288 40.6618 31.174 23.0542
Bird 24.0468 19.8922 17.1642 14.8753 12.6649 10.6047
Boat 63.743 56.9175 50.3089 43.4315 36.585 30.0655
Flower 23.245 13.3901 8.1816 5.132 3.3002 2.1621
Ganesh 67.4761 60.5399 54.0034 47.5902 41.4778 35.3603
Scenary 33.3664 26.7334 22.0444 18.2306 14.8923 11.9873
Strawberry 43.5429 31.7365 23.165 17.109 12.8251 9.5586
Tajmahal 49.833 39.5094 30.807 23.4858 17.5716 12.8477
Tiger 67.9142 54.7827 44.2022 35.0442 27.6414 21.3417
Viharlake 43.0531 35.9083 30.1151 24.6871 19.7647 15.5462
Average 46.36076 37.41019 30.59739 24.86828 20.00373 15.8158
Table 6: Comparison of MSE values obtained for 95% to 70% data compressed using Hartley wavelets applied on all eleven images.
%data compressed 95 90 85 80 75 70
%data retained 5 10 15 20 25 30
Aishwariya 25.213 13.1629 7.0521 4.0416 2.4411 1.5258
Balls 71.1273 57.2092 45.1996 34.8136 25.8726 18.5723
Bird 23.6101 13.4562 7.3995 4.0684 2.3198 1.4062
Boat 57.524 47.4436 38.7197 30.4678 22.8547 16.2068
Flower 30.4708 17.2874 9.5996 5.2322 2.9442 1.6909
Ganesh 68.6207 59.8072 51.8952 44.4754 36.9332 29.7399
Scenary 33.7757 24.0033 16.4779 10.8463 6.9661 4.4146
Strawberry 47.5595 35.0583 25.241 17.6 12.0036 8.0276
Tajmahal 45.9356 33.1194 23.4711 16.1064 10.6663 6.9495
Tiger 71.4263 57.522 46.3082 36.2968 27.6191 20.1696
Viharlake 40.7793 31.5152 23.8878 17.1808 11.82 7.7949
Average 46.91294 35.41679 26.84106 20.10266 14.76734 10.59074
Table 7: Comparison of MSE values obtained for 95% to 70% data compressed using Kekre transform applied on all eleven images.
%data compressed 95 90 85 80 75 70
%data retained 5 10 15 20 25 30
Aishwariya 104.7137 98.3782 89.9261 79.8792 71.1881 63.4727
Balls 95.4663 94.6395 91.7457 85.9287 78.2736 70.4866
Bird 71.0262 66.9461 61.3682 53.14 44.6532 36.7741
Boat 96.431 91.469 85.2311 77.9438 70.1285 62.199
Flower 75.3232 73.098 70.8086 66.3523 61.4276 54.4408
Ganesh 89.5643 85.6352 79.994 73.6673 66.8255 59.6006
Scenary 69.4835 66.4225 60.9814 54.9584 48.8438 42.7133
Strawberry 91.5023 86.9626 82.2165 76.2722 69.2823 61.9791
Tajmahal 87.7596 81.797 74.3466 66.3947 58.3154 50.4829
Tiger 103.1722 96.723 90.382 84.1157 77.8424 71.2272
Viharlake 65.3572 60.0596 54.1362 48.2416 42.1389 36.1324
Average 86.34541 82.01188 76.46695 69.71763 62.62903 55.40988
Table 8: Comparison of MSE values obtained for 95% to 70% data compressed using Kekre wavelets applied on all eleven images.
%data compressed 95 90 85 80 75 70
%data retained 5 10 15 20 25 30
Aishwariya 41.1364 30.8323 22.567 15.7857 10.5192 6.8614
Balls 78.2216 69.5023 60.7559 51.6477 42.7972 34.186
Bird 30.9089 18.9243 11.0784 6.3362 3.7573 2.3177
Boat 61.2107 51.5525 43.1938 35.063 27.1555 20.0594
Flower 43.864 33.5713 24.2075 16.0388 9.7355 5.4288
Ganesh 73.9496 67.8945 61.3198 54.3184 46.8956 39.3169
Scenary 40.3419 30.1718 22.2297 15.6323 10.7463 7.3131
Strawberry 58.15 49.6929 41.7717 34.4112 27.2761 20.7789
Tajmahal 53.9875 41.6384 32.1871 23.958 16.8148 11.1398
Tiger 86.4872 76.9422 68.3485 60.3057 51.9975 43.7026
Viharlake 46.5487 39.6689 32.3033 25.1071 18.4207 12.622
Average 55.8915 46.39922 38.17843 30.78219 24.19234 18.5206
Table 9: Comparison of average MSE values obtained for 95% to 70% data compressed using DCT, Walsh, Hartley, Kekre transforms and their
corresponding wavelets applied on all eleven images.
%data compressed 95 90 85 80 75 70
%data retained 5 10 15 20 25 30
DCT Wavelets 41.02605 29.83775 21.98745 15.91136 11.20562 7.701327
Walsh Wavelets 46.86623 36.99503 29.31615 22.93815 17.57251 13.11003
Hartley Wavelets 46.91294 35.41679 26.84106 20.10266 14.76734 10.59074
Kekre Wavelets 55.8915 46.39922 38.17843 30.78219 24.19234 18.5206
DCT 45.7368 36.89035 30.08213 24.39269 19.56349 15.4489
Walsh 52.71569 45.17936 39.19116 33.70654 28.57529 23.72883
Hartley 46.36076 37.41019 30.59739 24.86828 20.00373 15.8158
Kekre 86.34541 82.01188 76.46695 69.71763 62.62903 55.40988
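The comparison in Table 9 can be checked mechanically. The short Python sketch below copies the average MSE values from Table 9 and lists every compression level at which a wavelet does not have a lower average MSE than its base transform; the variable names are illustrative only.

```python
levels = [95, 90, 85, 80, 75, 70]  # % data compressed, as in Table 9

wavelet = {
    "DCT":     [41.02605, 29.83775, 21.98745, 15.91136, 11.20562, 7.701327],
    "Walsh":   [46.86623, 36.99503, 29.31615, 22.93815, 17.57251, 13.11003],
    "Hartley": [46.91294, 35.41679, 26.84106, 20.10266, 14.76734, 10.59074],
    "Kekre":   [55.8915, 46.39922, 38.17843, 30.78219, 24.19234, 18.5206],
}
plain = {
    "DCT":     [45.7368, 36.89035, 30.08213, 24.39269, 19.56349, 15.4489],
    "Walsh":   [52.71569, 45.17936, 39.19116, 33.70654, 28.57529, 23.72883],
    "Hartley": [46.36076, 37.41019, 30.59739, 24.86828, 20.00373, 15.8158],
    "Kekre":   [86.34541, 82.01188, 76.46695, 69.71763, 62.62903, 55.40988],
}

# Collect (transform, level) pairs where the wavelet is not better.
exceptions = []
for name in wavelet:
    for level, w, p in zip(levels, wavelet[name], plain[name]):
        if w >= p:
            exceptions.append((name, level))
        # Relative reduction in average MSE obtained by the wavelet, in percent.
        print(f"{name:8s} {level}%: {100.0 * (p - w) / p:6.2f}% lower MSE")

print(exceptions)
```

With the values of Table 9, the only pair collected is the Hartley transform at 95% compression, where the wavelet's average MSE (46.91294) is slightly higher than that of the plain transform (46.36076).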
Figure 4: Comparison of average MSE with respect to 95% to 70% of data compressed using DCT wavelet, Walsh wavelet, Hartley wavelet, Kekre wavelet,
DCT, Walsh, Hartley and Kekre transform.
Figure 5: Results of Balls image obtained from DCT wavelet for 70% to 95% of data compressed.
Figure 6: Results of Balls image obtained from Walsh wavelet for 70% to 95% of data compressed.
Figure 7: Results of Balls image obtained from Hartley wavelet for 70% to 95% of data compressed.
Figure 8: Results of Balls image obtained from Kekre wavelet for 70% to 95% of data compressed.
From Table 9, it is observed that the wavelet transforms perform better than their respective orthogonal transforms, as indicated by lower average MSE values, with one exception: at 95% compression the Hartley wavelet (46.91) is marginally worse than the Hartley transform (46.36). Figure 4 compares the average MSE for 95% to 70% data compression using the DCT wavelet, Walsh wavelet, Hartley wavelet and Kekre wavelet and the DCT, Walsh, Hartley and Kekre transforms.

VI. CONCLUSION
In this paper, a novel orthogonal wavelet transform generation method is proposed. The proposed method can be used to generate a wavelet transform from any orthogonal transform. To test the efficiency of the wavelet transforms, they are applied on eleven different color images for the purpose of data compression. The orthogonal transforms used in this paper are DCT, Walsh, Hartley and Kekre. From the results, it can be concluded that the wavelet transforms generally outperform their respective orthogonal transforms, as indicated by lower MSE values.

REFERENCES
[1] K. P. Soman and K. I. Ramachandran, “Insight into Wavelets: From Theory to Practice”, Prentice-Hall India, pp. 3-7, 2005.
[2] Raghuveer M. Rao and Ajit S. Bopardikar, “Wavelet Transforms: Introduction to Theory and Applications”, Addison Wesley Longman, pp. 1-20, 1998.
[3] C. S. Burrus, R. A. Gopinath, and H. Guo, “Introduction to Wavelets and Wavelet Transform”, Prentice-Hall International, Inc., New Jersey, 1998.
[4] Amara Graps, “An Introduction to Wavelets”, IEEE Computational Science and Engineering, vol. 2, num. 2, Summer 1995, USA.
[5] Julius O. Smith III and Xavier Serra, “An Analysis/Synthesis Program for Non-Harmonic Sounds Based on a Sinusoidal Representation”, Proceedings of the International Computer Music Conference (ICMC-87, Tokyo), Computer Music Association, 1987.
[6] S. Mallat, “A Theory of Multiresolution Signal Decomposition: The Wavelet Representation”, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 11, pp. 674-693, 1989.
[7] G. Strang, “Wavelet Transforms Versus Fourier Transforms”, Bull. Amer. Math. Soc. 28, 288-305, 1993.
[8] P. P. Kanjilal, “Adaptive Prediction and Predictive Control”, IET, p. 210, 1995.
[9] C. Yuen, “Remarks on the Ordering of Walsh Functions”, IEEE Transactions on Computers, C-21: 1452, 1972.
[10] R. V. L. Hartley, “A more symmetrical Fourier analysis applied to transmission problems”, Proc. IRE 30, 144-150, 1942.
[11] H. B. Kekre, Sudeep D. Thepade, “Image Retrieval using Non-Involutional Orthogonal Kekre’s Transform”, International Journal of Multidisciplinary Research and Advances in Engineering (IJMRAE), Ascent Publication House, Volume 1, No. I, 2009. Abstract available online at www.ascent-journals.com
[12] H. B. Kekre, Sudeep D. Thepade, “Image Blending in Vista Creation using Kekre's LUV Color Space”, SPIT-IEEE Colloquium and Int. Conference, SPIT, Andheri, Mumbai, 04-05 Feb 2008.
[13] H. B. Kekre, Sudeep D. Thepade, “Boosting Block Truncation Coding using Kekre’s LUV Color Space for Image Retrieval”, WASET Int. Journal of Electrical, Computer and System Engineering (IJECSE), Vol. 2, Num. 3, Summer 2008. Available online at www.waset.org/ijecse/v2/v2-3-23.pdf
[14] H. B. Kekre, Sudeep D. Thepade, “Color Traits Transfer to Grayscale Images”, In Proc. of IEEE First International Conference on Emerging Trends in Engg. & Technology (ICETET-08), G.H.Raisoni COE, Nagpur, INDIA. Available on IEEE Xplore.
[15] H. B. Kekre, Sudeep D. Thepade, “Creating the Color Panoramic View using Medley of Grayscale and Color Partial Images”, WASET Int. Journal of Electrical, Computer and System Engg. (IJECSE), Volume 2, No. 3, Summer 2008. Available online at www.waset.org/ijecse/v2/v2-3-26.pdf
[16] Dr. H. B. Kekre, Sudeep D. Thepade, Adib Parkar, “A Comparison of Haar Wavelets and Kekre’s Wavelets for Storing Colour Information in a Greyscale Image”, International Journal of Computer Applications (IJCA), Volume 1, Number 11, December 2010, pp 32-38. Available at www.ijcaonline.org/archives/volume11/number11/1625-2186
[17] Dr. H. B. Kekre, Sudeep D. Thepade, Adib Parkar, “Storage of Colour Information in a Greyscale Image using Haar Wavelets and Various Colour Spaces”, International Journal of Computer Applications (IJCA), Volume 6, Number 7, pp. 18-24, September 2010. Available online at http://www.ijcaonline.org/volume6/number7/pxc3871421.pdf
[18] Dr. H. B. Kekre, Sudeep D. Thepade, Juhi Jain, Naman Agrawal, “IRIS Recognition using Texture Features Extracted from Walshlet Pyramid”, ACM-International Conference and Workshop on Emerging Trends in Technology (ICWET 2011), Thakur College of Engg. and Tech., Mumbai, 26-27 Feb 2011. Also will be uploaded on online ACM Portal.
[19] Dr. H. B. Kekre, Sudeep D. Thepade, Akshay Maloo, “Face Recognition using Texture Features Extracted from Walshlet Pyramid”, ACEEE International Journal on Recent Trends in Engineering and Technology (IJRTET), Volume 5, Issue 1, www.searchdl.org/journal/IJRTET2010
[20] Dr. H. B. Kekre, Sudeep D. Thepade, Juhi Jain, Naman Agrawal, “Performance Comparison of IRIS Recognition Techniques using Wavelet Pyramids of Walsh, Haar and Kekre Wavelet Transforms”, International Journal of Computer Applications (IJCA), Number 2, Article 4, March 2011, http://www.ijcaonline.org/proceedings/icwet/number2/2070-aca386
[21] Dr. H. B. Kekre, Sudeep D. Thepade, Akshay Maloo, “Face Recognition using Texture Features Extracted from Haarlet Pyramid”, International Journal of Computer Applications (IJCA), Volume 12, Number 5, December 2010, pp 41-45. Available at www.ijcaonline.org/archives/volume12/number5/1672-2256
[22] Dr. H. B. Kekre, Sudeep D. Thepade, Juhi Jain, Naman Agrawal, “IRIS Recognition using Texture Features Extracted from Haarlet Pyramid”, International Journal of Computer Applications (IJCA), Volume 11, Number 12, December 2010, pp 1-5. Available at www.ijcaonline.org/archives/volume11/number12/1638-2202.
[23] Dr. H. B. Kekre, Sudeep D. Thepade, Akshay Maloo, “Performance Comparison of Image Retrieval Techniques using Wavelet Pyramids of Walsh, Haar and Kekre Transforms”, International Journal of Computer Applications (IJCA), Volume 4, Number 10, August 2010 Edition, pp 1-8, http://www.ijcaonline.org/archives/volume4/number10/866-1216
[24] Dr. H. B. Kekre, Sudeep D. Thepade, Akshay Maloo, “Query by image content using color texture features extracted from Haar wavelet pyramid”, International Journal of Computer Applications (IJCA) for the special edition on “Computer Aided Soft Computing Techniques for Imaging and Biomedical Applications”, Number 2, Article 2, August 2010. http://www.ijcaonline.org/specialissues/casct/number2/1006-41
[25] Dr. H. B. Kekre, Sudeep D. Thepade, “Image Retrieval using Color-Texture Features Extracted from Walshlet Pyramid”, ICGST International Journal on Graphics, Vision and Image Processing (GVIP), Volume 10, Issue I, Feb. 2010, pp. 9-18. Available online www.icgst.com/gvip/Volume10/Issue1/P1150938876.html

AUTHORS PROFILE
Dr. H. B. Kekre received his B.E. (Hons.) in Telecomm. Engineering from Jabalpur University in 1958, M.Tech (Industrial Electronics) from IIT Bombay in 1960, M.S.Engg. (Electrical Engg.) from University of Ottawa in 1965 and Ph.D. (System Identification) from IIT Bombay in 1970. He has worked as Faculty of Electrical Engg. and then HOD Computer Science and Engg. at IIT Bombay. For 13 years he worked as a professor and head in the Department of Computer Engg. at Thadomal Shahani Engineering College, Mumbai. He is now Senior Professor at MPSTME, SVKM’s NMIMS. He has guided 17 Ph.Ds, more than 100 M.E./M.Tech and several B.E./B.Tech projects. His areas of interest are Digital Signal Processing, Image Processing and Computer Networking. He has more than 270 papers in National/International Conferences and Journals to his credit. He was a Senior Member of IEEE. Presently he is a Fellow of IETE and Life Member of ISTE. Recently 11 students working under his guidance have received best paper awards. Two of his students have been awarded Ph.D. from NMIMS University. Currently he is guiding ten Ph.D. students.

Dr. Tanuja K. Sarode received her B.Sc. (Mathematics) from Mumbai University in 1996, B.Sc.Tech. (Computer Technology) from Mumbai University in 1999, M.E. (Computer Engineering) degree from Mumbai University in 2004, and Ph.D. from Mukesh Patel School of Technology, Management and Engineering, SVKM’s NMIMS University, Vile-Parle (W), Mumbai, INDIA. She has more than 12 years of experience in teaching. She is currently working as Assistant Professor in the Dept. of Computer Engineering at Thadomal Shahani Engineering College, Mumbai. She is a life member of IETE and a member of the International Association of Engineers (IAENG) and the International Association of Computer Science and Information Technology (IACSIT), Singapore. Her areas of interest are Image Processing, Signal Processing and Computer Graphics. She has 90 papers in National/International Conferences/Journals to her credit.

Sudeep D. Thepade received his B.E. (Computer) degree from North Maharashtra University with Distinction in 2003 and M.E. in Computer Engineering from University of Mumbai in 2008 with Distinction, and has currently submitted his thesis for Ph.D. at SVKM’s NMIMS, Mumbai. He has more than 08 years of experience in teaching and industry. He was Lecturer in the Dept. of Information Technology at Thadomal Shahani Engineering College, Bandra (W), Mumbai for nearly 04 years. He is currently working as Associate Professor in Computer Engineering at Mukesh Patel School of Technology Management and Engineering, SVKM’s NMIMS, Vile Parle (W), Mumbai, INDIA. He is a member of the International Association of Engineers (IAENG) and the International Association of Computer Science and Information Technology (IACSIT), Singapore. He is a member of the International Advisory Committee for many International Conferences and a reviewer for various International Journals. His areas of interest are Image Processing Applications and Biometric Identification. He has about 110 papers in National/International Conferences/Journals to his credit, with a Best Paper Award at International Conference SSPCCIN-2008, Second Best Paper Award at ThinkQuest-2009 National Level paper presentation competition for faculty, Best Paper Award at Springer international conference ICCCT-2010 and second best research project award at ‘Manshodhan-2010’.

Ms. Sonal Shroff received her B.Sc. (Physics) from University of Mumbai in 1996 and B.Sc.Tech. (Computer Technology) from University of Mumbai in 1999. She has more than 10 years of experience in teaching. She is currently working as Lecturer in the Dept. of Computer Engineering at Thadomal Shahani Engineering College. She is a life member of ISTE. Her areas of interest are Image Processing, Signal Processing and Computer Graphics.
Analysing Assorted Window Sizes with LBG and
KPE Codebook Generation Techniques for Grayscale
Image Colorization
Dr. H. B. Kekre, Sr. Professor, MPSTME, NMIMS Deemed-to-be University, Vileparle (W), Mumbai-56, India.
Dr. Tanuja K. Sarode, Asst. Professor, Thadomal Shahani Engg. College, Bandra (W), Mumbai-50, India.
Sudeep D. Thepade, Asst. Professor, MPSTME, NMIMS Deemed-to-be University, Vileparle (W), Mumbai-56, India.
Ms. Supriya Kamoji, Sr. Lecturer, Fr. Conceicao Rodrigues College of Engg, Bandra (W), Mumbai-50, India.
Abstract: This paper presents the use of assorted window sizes and their impact on the colorization of grayscale images using Vector Quantization (VQ) codebook generation techniques. The problem of coloring a grayscale image has no exact solution. An attempt is made to minimize the human effort needed in manually coloring grayscale images. Here human interaction is needed only to find a reference image of similar type. The job of transferring color from the reference image to the grayscale image is done by the proposed techniques. The vector quantization algorithms Linde Buzo and Gray (LBG) and Kekre Proportionate Error (KPE) are used to generate the color palette in RGB and Kekre’s LUV color space. For colorization, the source color image is taken as the reference image, which is divided into non-overlapping pixel windows. Initial clusters are formed using the VQ algorithms LBG and KPE, which are used to generate the color palette. The grayscale image to be colored is also divided into non-overlapping pixel windows. Every pixel window of the gray image is compared with the color palette to get the nearest color values. The best match is found using the least mean squared error. To test the performance of these algorithms, a color image is converted into a grayscale image and the same grayscale image is recolored. Finally the MSE between the recolored image and the original image is computed. Experiments are conducted in both RGB and Kekre’s LUV color space for pixel windows of size 1x2, 2x1, 2x2, 2x3, 3x2, 3x3, 1x3, 3x1, 2x4, 4x2, 1x4 and 4x1. Kekre’s LUV color space gives outstanding performance. Among the pixel windows, KPE with a 1x2 window and LBG with a 2x1 window perform well with respect to image quality.

Keywords- Colorization, Pixel Window, Color Palette, Vector Quantization (VQ), LBG, KPE.

I. INTRODUCTION
Color images always provide clearer information than grayscale digital images. Colorization is the art of adding color to a monochrome image or movie. The colors we perceive in an object are determined by the nature of the light reflected from the object. Due to the structure of the human eye, all colors are seen as variable combinations of three basic colors: Red, Green and Blue (RGB). The task of coloring a grayscale image involves assigning RGB values to an image which varies along only the luminance value. Since different colors may have the same luminance but vary in hue and saturation, the problem of coloring a grayscale image needs human interaction [1].

A grayscale image is represented by only the luminance values that can be matched between the two images. Because a single luminance value could represent entirely different parts of an image, the remaining values within the pixel’s neighborhood are used to guide the matching process. Once the pixel is matched, the color information is transferred but the original luminance value is retained [2].

The details in a color image can be utilized for analysis and study of a particular image in applications like medical tomography, information security, image segmentation, etc. Coloring of old black-and-white movies and rare images of monuments and celebrities is one of the best applications, as it gives a good feel and understanding.

In the case of pseudo-coloring [3], where the mapping of luminance values to color values is automatic, the choice of color map is commonly determined by human decision. The main concept of colorization techniques exploits textural information. The work of Welsh et al., which is inspired by color transfer [4] and by image analogies [5], examines the luminance values in the neighborhood of each pixel in the target image and adds to its luminance the chromatic information of the pixel from a source image with the best neighborhood match. This technique works on images where differently colored regions give rise to distinct textures; otherwise, the user must specify rectangular swatches indicating corresponding regions in the two images.

Color traits transfer to grayscale images [6] presents novel coloring techniques where the color palette is prepared using pixel windows of some degree taken from a reference color image. For every window of the grayscale image, the palette is searched for equivalent color values which could be used to color the grayscale window [19].

In this paper, adjacent pixels are grouped together to form a grid (pixel window). The Vector Quantization algorithms LBG and KPE are applied on the pixel window sizes 1x2, 2x1, 2x2, 2x3, 3x2, 3x3, 1x3, 3x1, 2x4, 4x2, 1x4 and 4x1, and a codebook of size 512 is obtained. Depending on the minimum Euclidean distance, the LUV
134 http://sites.google.com/site/ijcsis/
ISSN 1947-5500
(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 9, No. 6 June 2011
components of reference image are transferred to input gray two vectors are generated by using constant error addition to the
image. codevector. Euclidean distances of all the training vectors are
computed with vectors v1 & v2 and two clusters are formed
II. KEKRE’S LUV COLOR SPACE [15,21] based on closest of v1 or v2. This modus operandi is replaced
for every cluster. The shortcoming of this algorithm is that the
In the proposed technique Kekre’s LUV color space is used. cluster elongation is +135O to horizontal axis in two dimensional
Where L gives luminance and U and V gives chromaticity cases resulting in inefficient clustering.
values of color image. Positive values of U indicate prominence
of red components in color image and negative value of V
indicates prominence of green component. The RGB-to LUV
and LUV-to-RGB conversion matrices are given in equation 1
and 2 respectively.
⎡L ⎤ ⎡ 1 1 1⎤ ⎡R ⎤
⎢U ⎥ = ⎢− 2 1 1⎥ * ⎢G ⎥ (1)
⎢ ⎥ ⎢ ⎥ ⎢ ⎥
⎢V ⎥
⎣ ⎦ ⎢ 0 1 1⎥
⎣ ⎦ ⎢B ⎥
⎣ ⎦
⎡ R ⎤ ⎡1 − 2 0⎤ ⎡ L / 3 ⎤ Figure1 LBG for Two dimensional case.
⎢G ⎥ = ⎢1 1 1 ⎥ * ⎢U / 6⎥ (2)
B. Kekre’s Proportionate Error (KPE) Algorithm [9,10]
⎢ ⎥ ⎢ ⎥ ⎢ ⎥
⎢ B ⎥ ⎢1 1 1 ⎥ ⎢V / 2 ⎥ Here to generate two vectors v1 & v2 proportionate error is
⎣ ⎦ ⎣ ⎦ ⎣ ⎦ added to the codevector. Magnitude of elements of the
codevector decides the error ratio. Hereafter the procedure is
same as that of LBG. While adding proportionate error a safe
III. VECTOR QUANTIZATION
Vector Quantization (VQ) [7], [8] is an efficient lossy technique for data compression and has been used successfully in applications such as pattern recognition [11], speech recognition and face detection [12], [13], image segmentation [14], speech data compression [16] and content-based image retrieval (CBIR) [17], [18].
Vector Quantization can be defined as a mapping function that maps a k-dimensional vector space to a finite set CB = {C1, C2, C3, ..., CN}. The set CB is called the codebook. It consists of N codevectors, and each codevector Ci = {ci1, ci2, ci3, ..., cik} is of dimension k. The key to VQ is a good codebook. The codebook can be generated in the spatial domain by clustering algorithms.
In the color transfer phase, the image is divided into non-overlapping blocks and each block is converted to a training vector Xi = (xi1, xi2, ..., xik). The codebook is then searched for the nearest codevector Cmin by computing the squared Euclidean distance between Xi and every codevector of the codebook CB, as given in equation (3). This method is called exhaustive search (ES).
$d(X_i, C_{min}) = \min_{1 \le j \le N} d(X_i, C_j)$ (3)
where $d(X_i, C_j) = \sum_{p=1}^{k} (x_{ip} - c_{jp})^2$
If the codebook size is increased to reduce the distortion, the search time also increases.
The following section describes the VQ codebook generation algorithms.
A. Linde Buzo and Gray Algorithm (LBG) [7], [8]
In this algorithm, the centroid of the training set is first calculated by averaging and is taken as the first codevector. In figure1
guard is also introduced so that neither v1 nor v2 goes beyond the training vector space, eliminating the disadvantage of LBG. Fig. 2 shows the cluster elongation after adding the proportionate error.
Figure 2. Orientation of the line joining two vectors v1 and v2 after addition of proportionate error to the centroid.
IV. PROPOSED COLORING TECHNIQUE
The coloring problem always requires human interaction, so a reference image of the same class and with the same features as the input grayscale image is needed. The color transfer algorithm is discussed for the LUV color space for different m x n pixel grid sizes. The main steps of the color transfer algorithm are:
• Convert the RGB components of the source color image into the respective Kekre’s LUV color components.
• Divide the image into blocks of m x n pixels. This yields a training vector set of dimension m x n x 3, corresponding to the LUV components of each pixel. The LBG and KPE algorithms are applied to this set to generate the color palette, i.e. a codebook of size 512.
• Divide the input gray image into m x n blocks of pixels. Each block (pixel window) is searched for the nearest codevector of the color palette. While searching, only luminance is compared.
• Once the nearest match is obtained, replace the gray image pixel window with the LUV codevector.
• Convert the final colored image from the LUV domain to the RGB plane and calculate the MSE between the original color image and the recolored image.
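The exhaustive search of equation (3) and the luminance-only matching step can be sketched as follows. This is a minimal illustration in Python, not the authors' MATLAB implementation. The function names, the representation of a palette entry as a list of (L, U, V) tuples (one per pixel of the window), and the assumption that gray values are already on the same scale as the L component are all hypothetical choices made for the example.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
LuvPixel = Tuple[float, float, float]  # assumed layout: (L, U, V)


def squared_distance(x: Vector, c: Vector) -> float:
    """Squared Euclidean distance between two vectors, as in equation (3)."""
    return sum((xp - cp) ** 2 for xp, cp in zip(x, c))


def nearest_codevector(x: Vector, codebook: Sequence[Vector]) -> int:
    """Exhaustive search: index of the codevector with the smallest distance to x."""
    return min(range(len(codebook)), key=lambda j: squared_distance(x, codebook[j]))


def colorize_block(gray_block: Vector,
                   palette: Sequence[Sequence[LuvPixel]]) -> Sequence[LuvPixel]:
    """Match a gray pixel window against the L components of every palette
    entry and return the full LUV entry of the best match.

    Assumption: gray values and L values share one scale.
    """
    luminance_only: List[List[float]] = [
        [pixel[0] for pixel in entry] for entry in palette
    ]
    return palette[nearest_codevector(gray_block, luminance_only)]


if __name__ == "__main__":
    # Toy palette with two entries for a 1x2 pixel window.
    palette = [
        [(10.0, 1.0, 2.0), (12.0, 1.5, 2.5)],
        [(200.0, -5.0, 3.0), (190.0, -4.0, 2.0)],
    ]
    print(colorize_block([195.0, 188.0], palette))
```

The search cost grows linearly with the codebook size, which is the trade-off between distortion and search time noted in Section III.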
V. RESULTS
The algorithms discussed above were implemented in MATLAB 7.0 on a Pentium IV, 1.66 GHz machine with 1 GB RAM. To test the performance of these algorithms, a color image is converted to grayscale and the same gray image is recolored. The MSE between the original image and the recolored image is then compared. Five color images of size 128x128x3, belonging to different classes, are used.
Figures 3 to 6 show the results of LBG and KPE for the Zebra, Book, Cartoon and Face images, using the same image as the reference image.
Figures 7 and 8 show the results of LBG and KPE for the Scenery and Dog images, using a different image as the reference image.
Figure 3. Reconstruction of the Zebra grayscale image using a similar source image for pixel window 1x2. 3(a) Original image; 3(b) Gray image; 3(c) 1x2 grid LBG, MSE 178.4; 3(d) 1x2 grid KPE, MSE 73.85.
Figure 4. Reconstruction of the Book grayscale image using a similar source image for pixel window 1x2. 4(a) Original image; 4(b) Gray image; 4(c) 1x2 grid LBG, MSE 73.8; 4(d) 1x2 grid KPE, MSE 53.32.
Figure 5. Reconstruction of the Cartoon grayscale image using a similar source image for pixel window 1x2. 5(a) Original image; 5(b) Gray image; 5(c) 1x2 grid LBG, MSE 1260; 5(d) 1x2 grid KPE, MSE 1023.
Figure 6. Reconstruction of the Face grayscale image using a similar source image for pixel window 1x2. 6(a) Original image; 6(b) Gray image; 6(c) 1x2 grid LBG, MSE 92; 6(d) 1x2 grid KPE, MSE 81.
Figure 7. Reconstruction of the Scenery grayscale image using a different source image. 7(a) Original image; 7(b) Reference image; 7(c) Gray image; 7(d) 1x2 grid LBG, MSE 990; 7(e) 1x2 grid KPE, MSE 709.
Figure 8. Reconstruction of the Dog grayscale image using a different source image. 8(a) Original image; 8(b) Reference image; 8(c) Gray image; 8(d) 1x2 grid LBG, MSE 303; 8(e) 1x2 grid KPE, MSE 285.
Various images, each of size 128x128 pixels, were used to build the color palette, and their grayscale equivalents were colored using the color palette for various pixel windows. Fig. 9 shows a bar chart of the average mean squared error obtained across all five images for the first few pixel windows, in the RGB and Kekre’s LUV color spaces. Kekre’s LUV color space gives a lower MSE than the RGB color space. Hence Table I gives only the Kekre’s LUV results for the different images, using 12 pixel window sizes (1x2, 2x1, 2x2, 2x3, 3x2, 3x3, 1x3, 3x1, 2x4, 4x2, 1x4, 4x1).
Figure 9. Average MSE across various grid sizes for different color spaces.
Table I. MSE of LBG and KPE for five color images of size 128x128x3 from different categories, for each grid size.
Input    VQ    Grid size
image    alg.  1x2     2x1     2x2     2x3     3x2     3x3     2x4     4x2     1x3     3x1     1x4     4x1
Image1   LBG   92.70   87.11   107.42  116.75  114.65  2302    5576    5652    107.32  106     122.29  114.
Image1   KPE   81.59   77.49   89.141  90.75   95.83   2122    5557    5663    87.32   89.09   91.41   87.3
Image2   LBG   1260    1244    1056    1493    1420    6350    6928    9595    1251    1243    1532    1392
Image2   KPE   1023    1153    1363    1233    1150    6243    6725    9359    1278    1138    1013    1093
Image3   LBG   178.4   147     440.90  721.73  675.15  2003    2486    4236    373     286     610.66  465.7
Image3   KPE   73.85   76.12   225.10  461.69  441.62  1823    2264    4663    143.5   142     273.38  226.0
Image4   LBG   73.89   76.64   107.12  123.82  131.86  2340    3400    3664    90      97      116.92  128.9
Image4   KPE   53.32   52.73   79.094  103.43  95.439  2813    3371    3701    65.3    65      73.93   76.09
Image5   LBG   1203    1244    1244    1406    1388    6340    7833    7916    1246    1240    1274    1266
Image5   KPE   1178    1174    1182    1414    1399    6384    7935    7968    1193    1175    1209    1212
Average  LBG   561.5   559.7   591     772.26  745.9   3867    5244    6212    613     594     731     673
Average  KPE   481.9   506.6   587.8   660.5   636.3   3877    5170    6270    553.4   521.8   532.1   538.8
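The entries of Table I are mean squared errors between an original color image and its recolored version. The paper does not spell out the averaging convention, so the sketch below assumes the error is averaged over every pixel and every color channel, with both images supplied as flat sequences of channel values. The function name and the input layout are hypothetical.

```python
from typing import Sequence


def mean_squared_error(original: Sequence[float], recolored: Sequence[float]) -> float:
    """Mean of squared differences between two equally sized value sequences.

    Assumption: each image is flattened to one sequence holding all channel
    values of all pixels, in the same order for both images.
    """
    if len(original) != len(recolored) or len(original) == 0:
        raise ValueError("images must be non-empty and of equal size")
    return sum((a - b) ** 2 for a, b in zip(original, recolored)) / len(original)


if __name__ == "__main__":
    # One RGB pixel of the original image against the recolored pixel.
    print(mean_squared_error([10.0, 20.0, 30.0], [12.0, 20.0, 27.0]))
```

A lower value indicates that the recolored image is closer to the original, which is how the LBG and KPE rows of the table are compared.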
From the data given in Table I, it is seen that the performance gradually decreases as the pixel window size increases. Further, the MSE for unidirectional pixel windows is lower than for bidirectional ones. Pixel window sizes 1x2 and 2x1 show better results than larger pixel window sizes.
Fig. 10 shows the comparison of the average mean squared error obtained across all images in Kekre’s LUV color space for the top five pixel windows. It can be seen from the chart that KPE performs well with respect to LBG. Performance also deteriorates as the pixel window size increases and becomes bidirectional.
Figure 10. Average MSE across various grid sizes.
VI. CONCLUSION
In this paper, colorization of grayscale images using VQ codebook generation techniques is presented with two well-known codebook generation algorithms, LBG and KPE. For both algorithms, 12 assorted pixel window sizes are considered for preparing the color palettes. As the quality of colorization depends on the source color image and on the grayscale image to be colorized, the grayscale versions of 5 color images are recolored using a total of 48 variations of the proposed techniques: 2 color spaces (RGB and Kekre’s LUV), 12 pixel window sizes and 2 codebook generation techniques (LBG and KPE). The comparison of the original color image and the recolored image shows that Kekre’s LUV color space outperforms the RGB color space. Further, the results show that unidirectional pixel windows give better colorization than bidirectional pixel windows. KPE performs better than LBG for colorization. Overall, the best performance is shown by KPE with the 1x2 window size in Kekre’s LUV color space.
REFERENCES
[1] V. Karthikeyani, K. Duraisamy, P. Kamalkakkannan, "Conversion of grayscale image to color image with and without texture synthesis", IJCSNS International Journal of Computer Science and Network Security, Vol. 7, No. 4, April 2007.
[2] E. Reinhard, M. Ashikhmin, B. Gooch and P. Shirley, "Colour Transfer between images", IEEE Transactions on Computer Graphics and Applications, 21, 5, pp. 34-41.
[3] Rafael C. Gonzalez and Paul Wintz, "Digital Image Processing", Addison Wesley Publications, May 1987.
[4] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless and D. H. Salesin, "Image Analogies", in Proceedings of ACM SIGGRAPH 2002, pp. 341-346.
[5] G. Di Blassi and R. D. Reforgiato, "Fast colourization of gray images", in Proceedings of Eurographics Italian Chapter, 2003.
[6] H. B. Kekre, Sudeep D. Thepade, "Color traits transfer to gray scale images", in Proc. of IEEE International Conference on Emerging Trends in Engineering and Technology, ICETET 2008, Raisoni College of Engg, Nagpur.
[7] R. M. Gray, "Vector quantization", IEEE ASSP Mag., pp. 4-29, Apr. 1984.
[8] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design", IEEE Trans. Commun., vol. COM-28, no. 1, pp. 84-95, 1980.
[9] H. B. Kekre, Tanuja K. Sarode, "New Fast Improved Codebook Generation Algorithm for Color Images using Vector Quantization", International Journal of Engineering and Technology, vol. 1, no. 1, pp. 67-77, September 2008.
[10] H. B. Kekre, Tanuja K. Sarode, "An Efficient Fast Algorithm to Generate Codebook for Vector Quantization", First International Conference on Emerging Trends in Engineering and Technology, ICETET-2008, held at Raisoni College of Engineering, Nagpur, India, July 2008. Available online at IEEE Xplore.
[11] Ahmed A. Abdelwahab, Nora S. Muharram, "A Fast Codebook Design Algorithm Based on a Fuzzy Clustering Methodology", International Journal of Image and Graphics, vol. 7, no. 2, pp. 291-302, 2007.
[12] Chin-Chen Chang, Wen-Chuan Wu, "Fast Planar-Oriented Ripple Search Algorithm for Hyperspace VQ Codebook", IEEE Transactions on Image Processing, vol. 16, no. 6, June 2007.
[13] C. Garcia and G. Tziritas, "Face detection using quantized skin color regions merging and wavelet packet analysis", IEEE Trans. Multimedia, vol. 1, no. 3, pp. 264-277, Sep. 1999.
[14] H. B. Kekre, Tanuja K. Sarode, Bhakti Raul, "Color Image Segmentation using Kekre's Fast Codebook Generation Algorithm Based on Energy Ordering Concept", ACM International Conference on Advances in Computing, Communication and Control (ICAC3-2009), 23-24 Jan 2009, Fr. Conceicao Rodrigues College of Engg., Mumbai. Available online at ACM portal.
[15] H. B. Kekre, Sudeep D. Thepade, "Image Blending in Vista Creation using Kekre's LUV Color Space", in Proc. of PIT-IEEE Colloquium, Mumbai, Feb 4-5, 2008.
[16] H. B. Kekre, Tanuja K. Sarode, "Speech Data Compression using Vector Quantization", WASET International Journal of Computer and Information Science and Engineering (IJCISE), vol. 2, no. 4, pp. 251-254, Fall 2008. Available: http://www.waset.org/ijcise.
[17] H. B. Kekre, Tanuja K. Sarode, Sudeep D. Thepade, "Image Retrieval using Color-Texture Features from DCT on VQ Codevectors obtained by Kekre's Fast Codebook Generation", ICGST-International Journal on Graphics, Vision and Image Processing (GVIP), Volume 9, Issue 5, pp. 1-8, September 2009. Available online at http://www.icgst.com/gvip/Volume9/Issue5/P1150921752.html.
[18] H. B. Kekre, Tanuja K. Sarode, Sudeep D. Thepade, "Color-Texture Feature based Image Retrieval using DCT applied on Kekre's Median Codebook", International Journal on Imaging (IJI). Available online at www.ceser.res.in/iji.html.
[19] H. B. Kekre, Sudeep D. Thepade, Nikita Bhandari, "Colorization of Greyscale images using Kekre's Bioorthogonal Color Spaces and Kekre's Fast Codebook Generation", CSC Advances in Multimedia: An International Journal (AMU), Volume 1, Issue 3, pp. 49-58. Available at www.cscjournals.org/csc/manuscript/journals/AMIJ/volume1/Issue3/AMU-13.pdf.
[20] H. B. Kekre, Sudeep D. Thepade, Adib Parkar, "A Comparison of Haar Wavelets and Kekre's Wavelets for Storing Color Information in a Greyscale Images", International Journal of Computer Applications (IJCA), Volume 1, Number 11, December 2010, pp. 32-38. Available at www.ijcaonline.org/archives/volume11/number11/1625-2186.
[21] H. B. Kekre, Sudeep D. Thepade, Archana Athawale, Adib Parkar, "Using Assorted Color Spaces and pixel window sizes for Colorization of Grayscale images", ACM International Conferences and Workshops on Emerging Trends in Technology (ICWET 2010), Thakur College of Engg. and Tech., Mumbai, 26-27 Feb 2010.
[22] H. B. Kekre, Sudeep Thepade, Adib Parkar, "A comparison of Kekre's Fast Search and Exhaustive Search for various grid sizes used for coloring a Grayscale Image", Second International Conference on Signal Acquisition and Processing (ICSAP2010), IACSIT, Bangalore, pp. 53-57, 9-10 Feb 2010.
Author Biographies
Dr. H. B. Kekre received a B.E. (Hons.) in Telecomm. Engineering from Jabalpur University in 1958, an M.Tech (Industrial Electronics) from IIT Bombay in 1960, an M.S.Engg. (Electrical Engg.) from the University of Ottawa in 1965 and a Ph.D. (System Identification) from IIT Bombay in 1970. He has worked as Faculty of Electrical Engg. and then HOD Computer Science and Engg. at IIT Bombay. For 13 years he worked as a professor and head of the Department of Computer Engg. at Thadomal Shahani Engineering College, Mumbai. He is now Senior Professor at MPSTME, SVKM’s NMIMS. He has guided 17 Ph.D.s, more than 100 M.E./M.Tech projects and several B.E./B.Tech projects. His areas of interest are Digital Signal Processing, Image Processing and Computer Networking. He has more than 270 papers in National/International Conferences and Journals to his credit. He was a Senior Member of IEEE. Presently he is a Fellow of IETE and a Life Member of ISTE. Recently 11 students working under his guidance have received best paper awards. Two of his students have been awarded a Ph.D. from NMIMS University. Currently he is guiding ten Ph.D. students.
Dr. Tanuja K. Sarode received a B.Sc. (Mathematics) from Mumbai University in 1996, a B.Sc.Tech. (Computer Technology) from Mumbai University in 1999, an M.E. (Computer Engineering) from Mumbai University in 2004 and a Ph.D. from Mukesh Patel School of Technology, Management and Engineering, SVKM’s NMIMS University, Vile-Parle (W), Mumbai, INDIA. She has more than 12 years of experience in teaching. She is currently working as Assistant Professor in the Dept. of Computer Engineering at Thadomal Shahani Engineering College, Mumbai. She is a life member of IETE and a member of the International Association of Engineers (IAENG) and the International Association of Computer Science and Information Technology (IACSIT), Singapore. Her areas of interest are Image Processing, Signal Processing and Computer Graphics. She has 90 papers in National/International Conferences/Journals to her credit.
Sudeep D. Thepade received a B.E. (Computer) degree from North Maharashtra University with Distinction in 2003 and an M.E. in Computer Engineering from the University of Mumbai in 2008 with Distinction, and has currently submitted his thesis for a Ph.D. at SVKM’s NMIMS, Mumbai. He has more than 8 years of experience in teaching and industry. He was a Lecturer in the Dept. of Information Technology at Thadomal Shahani Engineering College, Bandra (W), Mumbai for nearly 4 years. He is currently working as Associate Professor in Computer Engineering at Mukesh Patel School of Technology Management and Engineering, SVKM’s NMIMS, Vile Parle (W), Mumbai, INDIA. He is a member of the International Association of Engineers (IAENG) and the International Association of Computer Science and Information Technology (IACSIT), Singapore. He is a member of the International Advisory Committee for many International Conferences and a reviewer for various International Journals. His areas of interest are Image Processing Applications and Biometric Identification. He has about 110 papers in National/International Conferences/Journals to his credit, with a Best Paper Award at the International Conference SSPCCIN-2008, a Second Best Paper Award at the ThinkQuest-2009 National Level paper presentation competition for faculty, a Best Paper Award at the Springer international conference ICCCT-2010 and a second best research project award at ‘Manshodhan-2010’.
Supriya Kamoji received a B.E. in Electronics and Communication Engineering with Distinction from Karnataka University in 2001. She is currently pursuing an M.E. from Thadomal Shahani College of Engineering, Mumbai, India. She has more than 8 years of teaching experience. She is currently working as a Senior Lecturer at Fr. Conceicao Rodrigues College of Engineering, Mumbai, India. She is a lifetime member of the Indian Society of Technical Education (ISTE). Her areas of interest are Image Processing, Computer Organization and Architecture and Distributed Computing.
Evolving Fuzzy Classification Systems from Numerical Data
Pardeep Sandhu, Department of Electronics & Communication, Maharishi Markandeshwar University, Mullana, Haryana, INDIA, er.pardeepsandhu@gmail.com
Shakti Kumar, Computational Intelligence Laboratory, Institute of Science and Technology, Klawad, Haryana, INDIA, shaktik@gmail.com
Himanshu Sharma, Department of Electronics & Communication, Maharishi Markandeshwar University, Mullana, Haryana, INDIA, himanshu.zte@gmail.com
Parvinder Bhalla, Computational Intelligence Laboratory, Institute of Science and Technology, Klawad, Haryana, INDIA, parvinderbhalla@gmail.com
Abstract: Fuzzy classifiers are an important class of fuzzy systems. Evolving fuzzy classifiers from numerical data has assumed considerable significance in the recent past. This paper proposes a method of evolving fuzzy classifiers using a three step approach. In the first step, we applied a modified Fuzzy C–Means Clustering technique to generate membership functions. In the second step, we generated the rule base using the Wang and Mendel algorithm. The third step was used to reduce the size of the generated rule base. In this way the rule explosion issue was successfully tackled. The proposed method was implemented using MATLAB. The approach was tested on four well known multidimensional benchmark classification data sets: Iris Data, Wine Data, Glass Data and Pima Indian Diabetes Data. The performance of the proposed method was very encouraging. We further implemented our algorithm on a Mamdani type control model for a quick fuzzy battery charger data set. This integrated approach was able to evolve the model quickly.
Keywords: Linguistic rules, Fuzzy classifier, Fuzzy logic, Rule base.
I. INTRODUCTION
The theory of fuzzy sets and fuzzy logic was introduced by Lotfi A. Zadeh through his seminal paper in 1965 [1]. Fuzzy set theory and fuzzy logic together act as a powerful methodology for dealing with imprecision and nonlinearity in an efficient way [2], [3]. As far as the need for fuzzy set theory is concerned, there are numerous situations in which the classical set theory of 0's and 1's is not sufficient to describe human reasoning. For such situations we need a more appropriate theory that can also define membership grades between '0' and '1', thereby providing better results in terms of human reasoning. Fuzzy set theory attempts to do this.
Further, this theory of fuzzy logic leads to the development of fuzzy logic based systems, which are capable of making a decision on the basis of knowledge or intelligence provided to the system through linguistic rule bases. When a particular combination of inputs is given to the system, the system makes a decision and processes those inputs on the basis of the knowledge embedded in it in the form of linguistic rules. As the intelligence of these systems depends upon the linguistic rule base, these systems are also called Fuzzy Rule Based Systems (FRBSs) [4]. These systems have been successfully applied to a wide range of problems from different areas presenting uncertainty and vagueness in different ways [5], [6], [7]. FRBSs can be categorized as knowledge based systems and data driven systems, according to the two ways of providing knowledge to the system. In the first type, called knowledge driven modeling, the rule base is provided by an expert who has complete knowledge of the domain. In the second type, called data driven models, the rule base is generated from available numerical data [8].
For automatically generating the rule base in data driven systems, a number of classical approaches such as Hong and Lee's Algorithm [9], the Wang and Mendel Algorithm [4], [6], [10], [11], [12], the Online Learning Algorithm [13] and the Multiphase Clustering Approach [14], and soft computing techniques such as Artificial Neural Networks [15], [16], [17], Genetic Algorithms [18], [19], Swarm Intelligence based techniques [20], Ant Colony Optimization [21], Particle Swarm Optimization [22], Biogeography based Optimization [23] and the Big Bang – Big Crunch Optimization technique [24] are available in the literature [25].
This paper is based on an integrated approach that makes use of a modified Fuzzy C–Means Clustering approach (FCM) [26] and the Wang and Mendel method [6]. The approach was implemented in MATLAB for the fuzzy classification problems [27] of the Iris data of Fisher [28], Wine data, Glass data and Pima Indian Diabetes (PID) data, and for the Battery Charger data (control problem) [29]. A system was evolved using a set of training examples, and the system's performance was then evaluated using the test data set for the given system. The system performances were evaluated in terms of Average Classification Rate (for classification problems) and Mean Square Error (for the control problem).
The paper is organized as follows: Section II introduces Fuzzy Logic Based Systems. Section III discusses the proposed integrated approach and the WM method for rule base generation. Section IV presents the result analysis along with the comparative study for the above mentioned standard data sets, and Section V includes the conclusions.
II. FUZZY RULE BASED SYSTEMS
Fuzzy logic is a mathematical approach to emulate the human way of thinking and learning [30]. This logic is an extension of classical set theory in which a fuzzy set is a class of objects with a continuum of grades of membership. Such a set is characterized by a membership function mapping the elements of a domain, space or universe of discourse 'U' to the interval [0, 1]. If 'U' is a collection of objects denoted by x, then a fuzzy set 'A' in the universe of discourse 'U' can be defined as a set of ordered pairs as shown in equation (1) [5], [8]:
$A = \{(x_i, \mu_A(x_i)) \mid x_i \in U\}$ (1)
Here $x_i$ refers to the ith element of the set and $\mu_A(x_i)$ is the membership grade of $x_i$ in set 'A'.
Fuzzy Logic Based Systems or Fuzzy Rule Based Systems (FRBS) are intelligent systems that are based on a mapping of input spaces to output spaces, where this mapping is represented by fuzzy linguistic rules. These intelligent systems provide a framework for representing and processing information in a way that resembles human communication and reasoning.
Figure 1. Fuzzy Logic System
Each fuzzy rule based system typically possesses a fuzzy inference system (shown in Figure 1) composed of four major modules: Fuzzification module, Inference Engine, Knowledge Base and Defuzzification module [31]. The fuzzification module transforms crisp inputs into fuzzy domain values. This is mainly done to find the degree to which the data belong to different membership functions. The fuzzification can be performed either with the help of domain experts or directly from the available numerical data. These fuzzy domain values are then processed by the inference engine, which is composed of composition, implication and aggregation processes. The method of processing the inputs is supplied by the knowledge base and rule base module, as it contains the knowledge of the application domain and the procedural knowledge. Finally, the processed output of the inference engine is transformed from the fuzzy domain to the crisp domain by the defuzzification module.
One of the biggest challenges in modeling fuzzy rule based systems is designing the rule base, which is characterized by a set of IF–THEN linguistic rules. This rule base can either be defined by an expert or be extracted from numerical data using any of the computerized techniques mentioned in section I. A rule in the fuzzy domain can be represented by equation (2):
Rule: IF antecedent ... THEN consequent ... (2)
The antecedent part provides the input variable conditions using IF statements and the consequent provides the output using THEN statements. For example, if X and Y are the input and output universes of discourse of a fuzzy system with a rule base of size 'N', then the rule will be of the form shown by equation (3):
Rule i: IF x is $A_i$ THEN y is $B_i$ (3)
where x and y represent the input and output fuzzy linguistic variables respectively, and $A_i \in X$ and $B_i \in Y$ ($1 \le i \le N$) are fuzzy sets representing linguistic values of x and y [5].
In Mamdani type systems the consequent is represented using fuzzy sets, while in Sugeno type systems it is a fuzzy singleton. In TSK type systems, it is a function of the inputs [23].
III. PROPOSED APPROACH
We first broke the system identification problem into three sub–problems and solved these one by one as follows:
1. Classify all the relevant input and output domains into various membership functions using the modified FCM method [26].
2. Apply the Wang and Mendel algorithm [6] for creating a fuzzy rule base, evolved as a combination of rules generated from numerical examples and linguistic rules supplied by human experts.
3. Keep the number of rules to a bare minimum. We used a rule reduction technique as proposed in [32], [33] to keep the rule base as compact as possible.
The backbone of this approach is the Wang and Mendel algorithm [6], which has proved to be very effective.
Suppose the given set of desired input–output data pairs is:
$(x_1^{(1)}, x_2^{(1)}; y^{(1)}),\ (x_1^{(2)}, x_2^{(2)}; y^{(2)}),\ \ldots$ (4)
Here $x_1$, $x_2$ are inputs and y is the output. The problem consists of generating fuzzy rules and using these rules to determine a mapping from the inputs ($x_1$, $x_2$) to the output (y).
The following steps present our integrated approach:
Step 1: Divide the input output spaces into fuzzy regions:
We divide the input spaces into the desired number of membership functions using modified FCM [26]. Assume that the domain intervals of inputs $x_1$, $x_2$ and output y (equation (4)) are $[x_1^-, x_1^+]$, $[x_2^-, x_2^+]$ and $[y^-, y^+]$. Here, the domain interval means that the values of a particular variable will lie in this interval. Each of these input and output spaces is partitioned into (2N+1) regions. The number N can be different for each of the variables. E.g. if the value of N = 2, then there will be five
membership functions [6]. A number of other methods are also available to divide the input output spaces into fuzzy regions.
Step 2: Generate fuzzy rules from given input–output data pairs:
In this step, first the degrees of a given data set $(x_1^{(i)}, x_2^{(i)}; y^{(i)})$ in the different fuzzy membership functions are determined.
Second, assign the given data set $(x_1^{(i)}, x_2^{(i)}; y^{(i)})$ to the region with maximum degree and obtain one rule from one data set.
Step 3: Assign a degree to each rule:
A degree can be assigned to each generated rule using the formula of equation (5):
$D(\text{rule}) = \mu_A(x_1)\,\mu_B(x_2)\,\mu_C(y)$ (5)
That is, the product of the membership grade of input $x_1$ in fuzzy set 'A', the membership grade of input $x_2$ in fuzzy set 'B' and the membership grade of output y in fuzzy set 'C'. If an expert is available at this point and assigns a degree of belief 'm' in the correctness of a particular data set, then that degree 'm' must be multiplied with the above expression.
Step 4: Create a combined fuzzy rule base:
The combined fuzzy rule base is assigned rules from either those generated from numerical data or linguistic rules (we assume that a linguistic r
IV. RESULT ANALYSIS
This section presents the performances obtained by our integrated approach, which uses modified Fuzzy C–Means Clustering [26] and the Wang and Mendel algorithm [6] to evolve fuzzy rule based systems. We applied our approach to four well known classification data sets from the machine learning repository and one control data set. In each experiment, the input and output domain intervals are fuzzified using the modified FCM approach. The training data samples are selected from the available data sets in correspondence with the peaks of the input membership functions. This sequence is used to train the systems, which are then tested using the testing data sets.
A. Example 1: Iris Data Classification Problem
The proposed approach has been applied to the Iris data classification problem. The Iris data set is a widely used benchmark for classification and pattern recognition studies [27], [28]. The data set contains 150 samples (50 samples for each species) with four attributes as inputs (Sepal Length, Sepal Width, Petal Length and Petal Width) and three classes of iris plants as output: Iris Setosa, Iris Versicolor and Iris Virginica. All the input variables are measured in centimeters, while the output is the type of iris plant. The learning sequence includes 24 data samples, while the system is tested on all 150 data samples. By applying the proposed method to the learning sequence, a set of 24 classification rules (one rule per training data sample) is obtained. From this combined rule base, the redundant rules are then removed